
Add /{db}/_shard_sync endpoint to force internal sync of all shards #1807

@wohali

Description


Current and expected behavior

Right now, when replacing or introducing new nodes into a cluster, we have a defined process in our documentation that states:

After you complete the previous step, as soon as CouchDB receives a write request for a shard on the target node, CouchDB will check if the target node's shard(s) are up to date. If it finds they are not up to date, it will trigger an internal replication job to complete this task. You can observe this happening by triggering a write to the database (update a document, or create a new one), while monitoring the /_node/{node-name}/_system endpoint, which includes the internal_replication_jobs metric.

Once this metric has returned to the baseline from before you wrote the document, or is 0, the shard replica is ready to serve data and we can bring the node out of maintenance mode.

This is all well and good, but if you have a system with any number of idle databases (especially true for db-per-user setups or similar), or a number of databases that are mostly read-only, the sync will never be triggered. Should another node failure occur (assuming n=3), you would start failing quorum checks; lose a third node and you'd definitely experience data loss.

This isn't good.

By introducing a new /{db}/_shard_sync server-admin-only endpoint, a server admin should be able to force these internal replication jobs to trigger. One approach to making this work was provided by @chewbranca in this gist (untested!), which should immediately trigger internal shard replications for a given shard between all nodes. This code still needs to be passed a list of shards to replicate, so to work at the db level you'd want to wrap it in another comprehension that walks across all shards of a given DB.
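The gist itself isn't reproduced here, but a rough, untested sketch of that idea might look like the following. It assumes mem3:shards/1 hands back one #shard{} record per copy, that mem3_rep:go/2 takes (Source, Target) shard records and must run on the node holding the source copy, and that Copies is the list of #shard{} records for a single shard range; all of these are assumptions about mem3 internals, not verified behaviour.

```erlang
%% Untested sketch: force internal replication between every pair of copies
%% of ONE shard range. `Copies` is assumed to be the list of #shard{} records
%% (same range, different nodes) for that shard.
-include_lib("mem3/include/mem3.hrl").

sync_shard_copies(Copies) when is_list(Copies) ->
    [rpc:call(SrcNode, mem3_rep, go, [Source, Target])
        || #shard{node = SrcNode} = Source <- Copies,
           #shard{node = TgtNode} = Target <- Copies,
           SrcNode =/= TgtNode].
```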

Writing a test case to validate this given our current eunit framework will be a bit painful, but I bet it could be done using the without-quorum type tests we currently have in JS (which @jjrodrig is working to convert to Elixir).

We could then update our documentation to include the following text:

After you complete the previous step, as server admin, POST to /{db}/_shard_sync for each database moved, one database at a time. This will tell CouchDB to trigger internal replication jobs between all copies of all shards in that database. You can then monitor the /_node/{node-name}/_system endpoint, which includes the internal_replication_jobs metric, indicating how many of these jobs are currently running.

Once this metric has returned to the baseline from before you wrote the document, or is 0, the new node is ready to serve data for that database. We can then repeat for other databases, or, if done, we can bring the node out of maintenance mode and return the cluster to normal operation.

We may also want to include advice about allowing more sync concurrency temporarily while these operations occur:

During the sync operation, one internal sync job is started for each ordered pair of shard copies. For a default cluster (n=3, q=8) this means that syncing all shards for a given database will start 48 sync jobs (8 shards × P(3,2) = 48). However, by default, CouchDB will perform up to 10 simultaneous shard sync operations, which can create a bottleneck. You can temporarily increase the [mem3] sync_concurrency value to allow the cluster to consume more resources for the sync operation. Be sure to lower this value back to the default after synchronization is complete.
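For example, assuming the [mem3] sync_concurrency key named above and the standard config:set/4 call (the value 32 is only an illustration; the HTTP _config API would work equally well), each node could be adjusted from a remsh like this:

```erlang
%% Temporarily allow more simultaneous internal sync jobs on this node.
%% The final `false` means: don't persist the change to the .ini file.
config:set("mem3", "sync_concurrency", "32", false).

%% ...after internal_replication_jobs has drained back to its baseline...
config:set("mem3", "sync_concurrency", "10", false).  % 10 is the default
```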

Workaround

Today's (CouchDB 2.0.0 through 2.3.0) workaround is to follow this section in our documentation, which relies on copying shard files over from known-good nodes to the newly added nodes.

This process isn't automated, is error-prone, and requires additional tooling on the administrator's part that we don't currently provide. It also doesn't help if writes arrive during the copy but the DB is once again quiescent by the time the new node is brought out of maintenance mode; those writes never get synced to the new copy. We can do better.

Possible Solution

Start from this gist, and wrap another list comprehension around it to get all of the shard names for a given db, as in the sketch below. Expose this at the new /{db}/_shard_sync endpoint, and ensure that endpoint is accessible to server admins only.
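A minimal, untested sketch of that wrapper, with the same caveats as the per-shard sketch above (mem3:shards/1 returning one #shard{} per copy and mem3_rep:go/2 pushing a local source shard to a target are assumptions about mem3 internals):

```erlang
%% Untested sketch: walk every shard copy of a database and trigger internal
%% replication from each copy to every other copy of the same range.
-include_lib("mem3/include/mem3.hrl").

sync_db_shards(DbName) when is_binary(DbName) ->
    Shards = mem3:shards(DbName),
    [rpc:call(SrcNode, mem3_rep, go, [Source, Target])
        || #shard{range = R, node = SrcNode} = Source <- Shards,
           %% reusing R here skips targets whose range differs from the source
           #shard{range = R, node = TgtNode} = Target <- Shards,
           TgtNode =/= SrcNode],
    ok.
```

The HTTP handler for POST /{db}/_shard_sync would then call something like sync_db_shards/1 only after a server-admin check (chttpd's verify_is_server_admin, or equivalent) has passed.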

Context

Users of CouchDB with many databases who lose a node, and who don't want to scp shards around, have a hard time ensuring that a newly added node has fully "caught up" with all of the data it needs to serve.
