Maintenance Mode

When you need to do hardware or operating system maintenance on a server that hosts an RS node, it is important that you move all of the shards on that node to another node to protect the data. You can use maintenance mode to handle this process simply and efficiently.

Turning Maintenance Mode ON

When you turn maintenance mode on, RS:

  1. Checks whether shutdown of the node causes quorum loss in the current cluster state. If so, maintenance mode is not turned on.
  2. Takes a snapshot of the node configuration as a record of which shards and endpoints are on the node at that time.
  3. Marks the node as a quorum node to prevent shards and endpoints from migrating into the node. The maintenance node entry in the rladmin status output is colored yellow to indicate that it cannot accept shard migration, just like a quorum_only node.
  4. Migrates shards to other nodes and binds endpoints to other nodes, if space is available on other nodes.
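
The quorum check in step 1 can be pictured as simple majority arithmetic. The following is a minimal sketch, assuming that quorum means a strict majority of all cluster nodes being up; the function name and its inputs are hypothetical, and RS performs its own check internally.

```python
def shutdown_keeps_quorum(total_nodes: int, nodes_up: int) -> bool:
    """Return True if taking one more node down still leaves a strict majority up.

    Assumption: quorum is a strict majority of all nodes in the cluster.
    """
    if not 0 <= nodes_up <= total_nodes:
        raise ValueError("nodes_up must be between 0 and total_nodes")
    remaining = nodes_up - 1
    # Integer comparison avoids floating-point division: remaining > total / 2.
    return 2 * remaining > total_nodes
```

Under this assumption, a 3-node cluster with all nodes up can lose one node to maintenance, but a second node cannot enter maintenance mode until the first one is back.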

Note

If the node is the master node in the cluster, maintenance mode does not demote the node. As usual, the cluster elects a new master node when the master node is restarted.

To turn maintenance mode on, run this command on one of the nodes in the cluster:

rladmin node <node_id> maintenance_mode on

After all of the shards and endpoints are moved from the node, it is safe to do maintenance on the server if there are enough nodes up to maintain the quorum.

Cluster status with maintenance

For example, when you have a 3-node cluster with 4 shards, the status of the cluster is:

redislabs@rp1_node1:/opt$ rladmin status
CLUSTER NODES:
NODE:ID   ROLE     ADDRESS       EXTERNAL_ADDRESS     HOSTNAME    SHARDS
*node:1   master   172.17.0.2                         rp1_node1   2/100
node:2    slave    172.17.0.4                         rp3_node1   2/100
node:3    slave    172.17.0.3                         rp2_node1   0/100

When you turn on maintenance mode for node 2, RS takes a snapshot and then moves the shards and endpoints from node 2 to another node. In our example, they are moved to node 3.

The node in maintenance mode shows that 0/0 shards are on the node because no shards can be accepted on node 2. A node in quorum_only mode also shows 0/0 shards.

redislabs@rp1_node1:/opt$ rladmin node 2 maintenance_mode on
Performing maintenance_on action on node:2: 0%
created snapshot NodeSnapshot<name=maintenance_mode_2019-03-14_09-50-59,time=None,node_uid=2>

node:2 will not accept any more shards
Performing maintenance_on action on node:2: 100%
OK
redislabs@rp1_node1:/opt$ rladmin status
CLUSTER NODES:
NODE:ID   ROLE     ADDRESS       EXTERNAL_ADDRESS     HOSTNAME    SHARDS
*node:1   master   172.17.0.2                         rp1_node1   2/100
node:2    slave    172.17.0.4                         rp3_node1   0/0
node:3    slave    172.17.0.3                         rp2_node1   2/100
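
If you script the maintenance procedure, you can confirm that the node is empty by reading the SHARDS column of the status output above. This is a rough sketch, assuming the CLUSTER NODES table has the layout shown in this example, with SHARDS as the last column; the function names are hypothetical and other RS versions may print additional columns.

```python
def shards_by_node(status_output):
    """Map node id -> (shards_used, shard_capacity) from `rladmin status` text.

    Assumption: node rows start with `node:<id>` (optionally prefixed by `*`)
    and end with the SHARDS value, as in the example output.
    """
    shards = {}
    for line in status_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        name = fields[0].lstrip("*")  # the current node is marked with '*'
        if not name.startswith("node:"):
            continue  # skips headers, prompts, and rows of other tables
        used, _, capacity = fields[-1].partition("/")
        if used.isdigit() and capacity.isdigit():
            shards[int(name.split(":", 1)[1])] = (int(used), int(capacity))
    return shards


def is_drained(status_output, node_id):
    """Return True if the node currently holds no shards.

    Raises KeyError if the node is not listed in the output.
    """
    used, _capacity = shards_by_node(status_output)[node_id]
    return used == 0
```

With the second status output in this example, is_drained(output, 2) is true, while node 1 and node 3 each report two shards.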

Prevent slave shard migration

If you do not have enough resources in other cluster nodes to migrate all of the shards to other nodes, you can turn maintenance mode on without migrating the slave shards.

Warning

If you prevent slave shard migration, the slave shards are kept on the node during maintenance. If the maintenance node fails, the master shards will not have slave shards for data redundancy and high availability.

To turn maintenance mode on and prevent slave shard migration, run this command on one of the nodes in the cluster:

rladmin node <node_id> maintenance_mode on keep_slave_shards

Turning Maintenance Mode OFF

When you turn maintenance mode off, RS:

  1. Loads the latest snapshot, unless a snapshot is specified.
  2. Unmarks the node as a quorum node to allow shards and endpoints to migrate into the node.
  3. Restores the shards and endpoints that were on the node at the time the snapshot was taken.
  4. Deletes the snapshot.

To turn maintenance mode off after you finish the server maintenance, run this command on one of the nodes in the cluster:

rladmin node <node_id> maintenance_mode off

Specifying a snapshot

Each time maintenance mode is turned on, a snapshot of the node configuration is saved. If there are multiple snapshots, you can restore a specified snapshot when you turn maintenance mode off.

To specify a snapshot when you turn maintenance mode off, run this command on one of the nodes in the cluster:

rladmin node <node_id> maintenance_mode off snapshot_name <snapshot_name>

Note

If an error occurs when you turn on maintenance mode, the snapshot is not deleted. When you re-run the command, we recommend that you use the snapshot from the initial attempt because it contains the original state of the node.

You can see the list of available snapshots with the command:

rladmin node <node_id> snapshot list
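
When several snapshots exist, the one from the initial attempt is the oldest. The sketch below picks it by the timestamp embedded in the snapshot name. It assumes that names follow the maintenance_mode_<date>_<time> pattern seen in the example output earlier on this page, and it takes a list of names that you have already extracted, because the output format of the snapshot list command is not shown here. The function names are hypothetical.

```python
from datetime import datetime

# Assumed naming pattern, taken from the example
# "maintenance_mode_2019-03-14_09-50-59".
NAME_FORMAT = "maintenance_mode_%Y-%m-%d_%H-%M-%S"


def snapshot_time(name):
    """Return the timestamp encoded in a snapshot name, or None if it does not match."""
    try:
        return datetime.strptime(name, NAME_FORMAT)
    except ValueError:
        return None


def earliest_snapshot(names):
    """Return the oldest snapshot name that matches the pattern, or None."""
    dated = [(snapshot_time(name), name) for name in names]
    dated = [(when, name) for when, name in dated if when is not None]
    if not dated:
        return None
    return min(dated)[1]
```

The returned name can then be passed as the snapshot_name value when you turn maintenance mode off.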

Skipping shard restoration

If you do not want to change the distribution of shards and endpoints in the cluster when you turn maintenance mode off, you can turn maintenance mode off and prevent the shards and endpoints from moving back to the node.

To turn maintenance mode off while skipping shard restoration, run:

rladmin node <node_id> maintenance_mode off skip_shards_restore
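
For automation, the rladmin variants described above can be assembled by one small helper. This is only a sketch: the function and its parameter names are hypothetical, and it uses just the options that appear on this page. Because the text does not say whether snapshot_name and skip_shards_restore can be combined, the sketch rejects that combination instead of guessing.

```python
def maintenance_command(node_id, state, keep_slave_shards=False,
                        snapshot_name=None, skip_shards_restore=False):
    """Build the argument list for `rladmin node <id> maintenance_mode ...`."""
    if state not in ("on", "off"):
        raise ValueError("state must be 'on' or 'off'")
    args = ["rladmin", "node", str(node_id), "maintenance_mode", state]
    if state == "on":
        if snapshot_name or skip_shards_restore:
            raise ValueError("snapshot_name and skip_shards_restore apply to 'off' only")
        if keep_slave_shards:
            args.append("keep_slave_shards")
    else:
        if keep_slave_shards:
            raise ValueError("keep_slave_shards applies to 'on' only")
        if snapshot_name and skip_shards_restore:
            # Not covered by the documentation, so this sketch refuses it.
            raise ValueError("combining snapshot_name and skip_shards_restore is not supported here")
        if snapshot_name:
            args += ["snapshot_name", snapshot_name]
        if skip_shards_restore:
            args.append("skip_shards_restore")
    return args
```

On a cluster node, the resulting list can be handed to subprocess.run(args, check=True). Returning a list instead of a single string avoids shell quoting problems.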

Toggling Maintenance Mode via API

You can also turn maintenance mode on and off with the REST API.

Maintenance Mode ON

This request triggers the maintenance_on action:

curl -X POST https://<hostname>:9443/v1/nodes/<node_id>/actions/maintenance_on -k -u <user>:<password> --data '{"keep_slave_shards":true}' -H "Content-Type: application/json"
{..."status":"queued","task_id":"38c7405b-26a7-4379-b84c-cab4b3db706d"}

You can choose to prevent slave shard migration using the keep_slave_shards boolean flag.

You can track the maintenance_on action using the API as well:

curl https://<hostname>:9443/v1/nodes/<node_id>/actions/maintenance_on -k -u <user>:<password>
{..."status":"completed","task_id":"38c7405b-26a7-4379-b84c-cab4b3db706d"}

The task_id in the response should match the one returned when you triggered the action. The status of the action moves from ‘queued’ to ‘running’ and ends as either ‘completed’ or ‘failed’.
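
The tracking request can be repeated in a loop until the action reaches a final status. Below is a minimal sketch using only the Python standard library. It assumes that the response is a JSON object with the status and task_id fields shown above; the function names, polling interval, and timeout are hypothetical choices. Certificate verification is disabled to mirror the -k option of the curl examples.

```python
import base64
import json
import ssl
import time
import urllib.request

FINAL_STATES = {"completed", "failed"}


def fetch_action(hostname, node_id, action, user, password):
    """GET the current state of a node action and return the parsed JSON."""
    url = f"https://{hostname}:9443/v1/nodes/{node_id}/actions/{action}"
    request = urllib.request.Request(url)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    # Equivalent of curl -k: skip certificate verification.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context) as response:
        return json.load(response)


def wait_for_action(fetch_status, task_id, interval=2.0, timeout=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until the task is 'completed' or 'failed' and return that status.

    fetch_status is any callable that returns the parsed response. Replies
    with a different task_id are ignored. Raises TimeoutError after `timeout`
    seconds.
    """
    deadline = clock() + timeout
    while True:
        reply = fetch_status()
        if reply.get("task_id") == task_id and reply.get("status") in FINAL_STATES:
            return reply["status"]
        if clock() >= deadline:
            raise TimeoutError(f"task {task_id} did not finish in {timeout} seconds")
        sleep(interval)
```

A call could look like wait_for_action(lambda: fetch_action(host, 2, "maintenance_on", user, password), task_id), where host, user, password, and task_id are your own values. Passing the fetch step as a callable keeps the polling logic testable without a cluster.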

Maintenance Mode OFF

This request triggers the maintenance_off action:

curl -X POST https://<hostname>:9443/v1/nodes/<node_id>/actions/maintenance_off -k -u <user>:<password> --data '{"skip_shards_restore":false}' -H "Content-Type: application/json"
{..."status":"queued","task_id":"6c3c0d03-fb6f-40ad-9eca-9d46aa6a8487"}

You can choose to skip shard restoration using the skip_shards_restore boolean flag.

You can track the maintenance_off action using the API as well:

curl https://<hostname>:9443/v1/nodes/<node_id>/actions/maintenance_off -k -u <user>:<password>
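
The two trigger requests in this section differ only in the action name and the boolean flag, so a script can build both with one helper. This is a sketch under the assumption that each action accepts only the flag shown in its curl example; the function name and the flag table are hypothetical, and authentication and TLS handling are left to the caller.

```python
import json
import urllib.request

# Flags taken from the curl examples; anything else is rejected by this sketch.
ALLOWED_FLAGS = {
    "maintenance_on": {"keep_slave_shards"},
    "maintenance_off": {"skip_shards_restore"},
}


def action_request(hostname, node_id, action, **flags):
    """Return a POST Request that triggers maintenance_on or maintenance_off."""
    if action not in ALLOWED_FLAGS:
        raise ValueError(f"unknown action: {action}")
    for name, value in flags.items():
        if name not in ALLOWED_FLAGS[action]:
            raise ValueError(f"{name} is not a flag of {action}")
        if not isinstance(value, bool):
            raise ValueError(f"{name} must be a boolean")
    url = f"https://{hostname}:9443/v1/nodes/{node_id}/actions/{action}"
    # Compact separators reproduce the body format used in the curl examples.
    body = json.dumps(flags, separators=(",", ":")).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

For example, action_request("cluster.example.com", 2, "maintenance_on", keep_slave_shards=True) produces a request with the same URL and JSON body as the first curl command, with a placeholder hostname. Send it with urllib.request.urlopen after adding your credentials.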