Recovery

After a node has been failed over, it can be recovered: that is, added back into the cluster from which it was failed over, by means of the rebalance operation.

Understanding Recovery

After failover has occurred, there are two principal options:

The node can be removed from the cluster by means of the rebalance operation.
The node can be recovered, and thereby added back into the cluster, again by means of the rebalance operation.

There are two options for recovery: full and delta. The remainder of this page explains the principles and architectural aspects of these options. For practical examples, see Recover a Node and Rebalance.
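
As a rough illustration of how the choice between the two options might be submitted programmatically, the following Python sketch builds (but does not send) a request to the cluster's REST API. It assumes a `POST /controller/setRecoveryType` endpoint that takes `otpNode` and `recoveryType` form parameters; these details should be checked against the REST reference for the server version in use. The cluster address and node name are hypothetical placeholders. Authentication is omitted, and a rebalance is still needed afterward for the recovery to take effect.

```python
from urllib.parse import urlencode
from urllib.request import Request

VALID_RECOVERY_TYPES = ("full", "delta")


def build_set_recovery_request(cluster_url, otp_node, recovery_type):
    """Build (without sending) a request that marks a failed-over node for recovery.

    Assumption: the endpoint path and the parameter names used here match the
    Couchbase Server REST API; verify them for your server version.
    """
    if recovery_type not in VALID_RECOVERY_TYPES:
        raise ValueError(f"recovery_type must be one of {VALID_RECOVERY_TYPES}")
    body = urlencode({"otpNode": otp_node, "recoveryType": recovery_type})
    return Request(
        url=f"{cluster_url}/controller/setRecoveryType",
        data=body.encode("utf-8"),
        method="POST",
    )


# Hypothetical cluster address and node name, for illustration only.
req = build_set_recovery_request("http://10.0.0.1:8091", "ns_1@10.0.0.2", "delta")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Building the request separately from sending it keeps the sketch free of network side effects; passing the result to `urllib.request.urlopen`, with credentials added, would be the step that actually contacts a cluster.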

Full Recovery

Full recovery involves removing all pre-existing data from, and assigning new data to, the node that is being recovered. Therefore, when this option is applied:

All existing vBuckets and documents are removed from the node.
If GSI Indexes reside on the node, they are left unmodified during the rebalance process.
A new set of vBuckets and documents is assigned to the node.
When the node’s vBuckets are all up to date, and the rebalance process concludes, the node recommences the serving of data. If GSI Indexes reside on the node, they become active, and are updated by the Index Service as appropriate.

Delta Recovery

Delta recovery maintains and resynchronizes a node’s pre-existing data. Therefore, when this option is applied:

No existing vBucket or document is removed from the node.
All existing data is loaded into memory.
The point at which mutations to the node’s data most recently stopped is determined. vBuckets are then updated from that point, based on the data changes that have since occurred elsewhere on the cluster.
If GSI Indexes reside on the node, they are left unmodified during the rebalance process.
When the node’s vBuckets are all up to date, and the rebalance process concludes, the node recommences the serving of data. If GSI Indexes reside on the node, they become active, and are updated by the Index Service as appropriate (this includes updates that correspond to whatever mutations were made by the Data Service while the node was in a failed over state).
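
The resynchronization idea described above can be sketched with a toy model. This is not the server's implementation: the `VBucket` class, the `delta_resync` function, and the use of a simple sequence number to mark the point at which mutations stopped are all assumptions made for illustration, and the real replication protocol is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class VBucket:
    """Toy vBucket: documents plus the sequence number of the last applied mutation."""
    docs: dict = field(default_factory=dict)
    high_seqno: int = 0


def delta_resync(stale, mutation_log):
    """Bring a stale vBucket copy up to date without discarding its data.

    mutation_log is a list of (seqno, key, value) tuples recorded elsewhere
    on the cluster; a value of None stands for a deletion.
    Returns the number of mutations applied.
    """
    applied = 0
    for seqno, key, value in sorted(mutation_log, key=lambda m: m[0]):
        if seqno <= stale.high_seqno:
            continue  # already present on the recovering node
        if value is None:
            stale.docs.pop(key, None)
        else:
            stale.docs[key] = value
        stale.high_seqno = seqno
        applied += 1
    return applied


# The node stopped receiving mutations after seqno 2; three changes followed.
node_copy = VBucket(docs={"a": 1, "b": 2}, high_seqno=2)
log = [(1, "a", 1), (2, "b", 2), (3, "b", 20), (4, "c", 3), (5, "a", None)]
applied = delta_resync(node_copy, log)
print(applied, node_copy.docs, node_copy.high_seqno)  # 3 {'b': 20, 'c': 3} 5
```

In this model only the three mutations after the stopping point are applied. A Full recovery, by contrast, would correspond to starting from an empty `VBucket` and receiving a complete copy of the data.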

Delta-Recovery Requirements

Delta recovery can only be performed when:

Both the node to be recovered and the cluster are healthy.
The node to be recovered is in a failed over state.

Delta recovery is possible for all Buckets on the node. Note that the correspondence of Buckets, vBuckets, and documents on the node remains entirely unchanged by Delta recovery.

Delta-Recovery Failures

In some cases, when Delta recovery is attempted, and all the requirements listed immediately above have been met, the operation nevertheless either fails or defaults to Full recovery. This indicates that one or more of the following conditions exist:

Cluster-topology has changed since the node was last available within the cluster.
The node was hard failed over, and is marked for removal.
The rebalance was configured to perform the Delta recovery while simultaneously moving other nodes in or out of the cluster, and the numbers of nodes intended respectively to leave and join the cluster were unequal. (Note that in this case, a Swap Rebalance can be performed instead.)
Bucket operations were performed while Delta recovery was pending; these changed the configuration and made Delta recovery impossible.
Either the node being recovered or another node in the cluster has crashed, or become otherwise unavailable, at some point during the process of recovery and rebalance.
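
The requirements and failure conditions above can be collected into a single pre-flight check. The sketch below is hypothetical: the `RecoveryContext` fields and the `delta_recovery_blockers` function are names invented for illustration, not part of any Couchbase API. It only reports which listed conditions apply, and does not attempt to predict whether the server would fail the operation or default to Full recovery.

```python
from dataclasses import dataclass


@dataclass
class RecoveryContext:
    """Hypothetical snapshot of the facts that matter for Delta recovery."""
    node_healthy: bool
    cluster_healthy: bool
    node_failed_over: bool
    topology_changed: bool = False
    hard_failover_marked_for_removal: bool = False
    nodes_joining: int = 0   # other nodes being added in the same rebalance
    nodes_leaving: int = 0   # other nodes being removed in the same rebalance
    bucket_ops_while_pending: bool = False
    node_became_unavailable: bool = False


def delta_recovery_blockers(ctx):
    """Return the listed conditions (possibly none) that stand in the way of Delta recovery."""
    reasons = []
    if not (ctx.node_healthy and ctx.cluster_healthy):
        reasons.append("node or cluster is not healthy")
    if not ctx.node_failed_over:
        reasons.append("node is not in a failed over state")
    if ctx.topology_changed:
        reasons.append("cluster topology changed since the node was last available")
    if ctx.hard_failover_marked_for_removal:
        reasons.append("node was hard failed over and is marked for removal")
    if ctx.nodes_joining != ctx.nodes_leaving:
        reasons.append("unequal numbers of nodes joining and leaving")
    if ctx.bucket_ops_while_pending:
        reasons.append("bucket operations were performed while recovery was pending")
    if ctx.node_became_unavailable:
        reasons.append("a node became unavailable during recovery and rebalance")
    return reasons


ok = RecoveryContext(node_healthy=True, cluster_healthy=True, node_failed_over=True)
print(delta_recovery_blockers(ok))  # []

uneven = RecoveryContext(True, True, True, nodes_joining=2, nodes_leaving=1)
print(delta_recovery_blockers(uneven))  # ['unequal numbers of nodes joining and leaving']
```

Returning a list of reasons rather than a single boolean reflects the statement that one or more of the conditions may hold at the same time.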

Recovery Performance

In many cases, Delta recovery is faster than Full recovery, since a significant quantity of usable data already resides on the node and therefore does not require network transfer. Only updates made since the node’s last-recorded mutation need to be retrieved from other nodes in the cluster.

However, where the node’s memory footprint is extremely large and data exceeds bucket memory quotas, the memory-management overhead that Delta recovery can entail may mean that Full recovery takes less time overall.

Note that in Couchbase Server 6.5 and later, a scan can be requested from each index as soon as the index has warmed up; the recovery does not need to be fully complete. This can reduce downtime during recovery.

Only scans where consistency is not_bounded, or scans for which a consistent snapshot is available, are allowed during recovery. In other cases, an error is returned, so that replicas can be tried.
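
The admission rule above, together with the fallback to replicas, can be modeled in a few lines. This is a simplified sketch under stated assumptions: the dictionary fields (`warmed_up`, `snapshot_available`), the `ScanRejected` exception, and both functions are hypothetical, and `request_plus` is used only as an example of a consistency level other than `not_bounded`.

```python
class ScanRejected(Exception):
    """Raised when an index cannot serve a scan (hypothetical error type)."""


def scan_index(index, consistency):
    """Serve a scan from one index copy, or raise if the rule above forbids it."""
    if not index["warmed_up"]:
        raise ScanRejected(f"{index['name']} has not warmed up")
    if consistency != "not_bounded" and not index["snapshot_available"]:
        raise ScanRejected(f"{index['name']} has no consistent snapshot")
    return f"scanned {index['name']}"


def scan_with_replica_fallback(replicas, consistency):
    """Try each index copy in turn, moving on whenever one rejects the scan."""
    for index in replicas:
        try:
            return scan_index(index, consistency)
        except ScanRejected:
            continue
    raise ScanRejected("no index copy could serve the scan")


# One copy is still recovering (warmed up, but no consistent snapshot yet).
recovering = {"name": "idx_recovering", "warmed_up": True, "snapshot_available": False}
replica = {"name": "idx_replica", "warmed_up": True, "snapshot_available": True}

print(scan_with_replica_fallback([recovering, replica], "not_bounded"))   # scanned idx_recovering
print(scan_with_replica_fallback([recovering, replica], "request_plus"))  # scanned idx_replica
```

The first call is served by the recovering copy because `not_bounded` scans do not need a snapshot; the second is rejected there and retried on the replica.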