
consul-replicate syncs empty data during raft leader election in the master DC #82

@vaLski

Description


Hello.

We ran into a very strange issue with consul-replicate, three times now.

consul-replicate v0.3.1 (172f34b) - configured with **max_stale = "5m"**
consul v0.7.5

During an outage and raft leader election in the main data center, consul-replicate in the follower data centers synchronized blank or incomplete data from the main DC.

As a result, a huge amount of data was erased from the prefixes we were replicating in the follower data centers. We experienced this with a prefix holding ~150,000 subkeys/records.

UPDATE: My latest tests show that this issue can NOT be reproduced if I configure consul-replicate without max_stale.
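
For context, my understanding of max_stale is that it allows reads to be served by any server, including followers, with the staleness of a response judged from the query metadata. Here is a rough sketch of that pattern using the official Go API client; the fallback logic is my assumption, not consul-replicate's actual code:

package main

import (
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	maxStale := 5 * time.Minute

	// Stale-mode read: any server (leader or follower) may answer.
	opts := &api.QueryOptions{Datacenter: "main-dc", AllowStale: true}
	pairs, meta, err := client.KV().List("pub", opts)
	if err != nil {
		log.Fatal(err)
	}

	// The response metadata reports how far behind the answering server
	// is; a follower that lost contact with the leader (e.g. during an
	// election) can report a large LastContact and KnownLeader == false.
	if meta.LastContact > maxStale || !meta.KnownLeader {
		// Assumed fallback: retry as a default (leader-verified) read.
		opts.AllowStale = false
		pairs, _, err = client.KV().List("pub", opts)
		if err != nil {
			log.Fatal(err)
		}
	}

	log.Printf("got %d pairs", len(pairs))
}

If a LastContact / KnownLeader check like this is skipped or mishandled during an election, a lagging follower with an empty KV could be treated as authoritative, which would match what we saw.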

I suspect there might be something like a race condition, where the consul-replicate prefix watch is attached to a main DC follower that has not yet fully caught up with the master and whose KV DB was empty, which triggers a KV prefix watch event with blank data. In that case I am not really sure how it got past the following check:

// Get the prefix data
view, ok := r.get(prefix)
if !ok {
    log.Printf("[INFO] (runner) no data for %q", prefix.Dependency)
    doneCh <- struct{}{}
    return
}
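
Presumably the view exists in that scenario (the watch did fire) but simply contains zero pairs, so ok is true and the check passes. If so, a defensive sanity check before applying deletes might mitigate it; here is a hypothetical sketch (the function name and 50% threshold are mine, not consul-replicate code):

// Hypothetical guard: refuse to apply an update that would wipe out
// most of the previously replicated keys in a single pass.
func safeToApply(updates, deletes, existing int) bool {
	if existing == 0 {
		return true // nothing replicated yet, anything goes
	}
	// An empty snapshot from a lagging follower shows up as
	// "0 updates, N deletes" with N close to the existing key count.
	if updates == 0 && float64(deletes)/float64(existing) > 0.5 {
		return false // looks like blank source data; skip this pass
	}
	return true
}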

Here is a quick timeline of what happened, plus logs.

This was the last successful replication prior to the main DC outage:

20180313T084044.363696: 2018/03/13 08:40:44.363681 [INFO] (runner) replicated 0 updates, 23 deletes

The main data center goes down and consul-replicate in the follower data centers starts returning errors. Some identical log lines are skipped.

20180313T084122.659085: 2018/03/13 08:41:22.659073 [WARN] (view) kv.list(pub@main-dc): Unexpected response code: 500 (retry attempt 1 after "10s")

The retry was successful and some data was replicated:

20180313T084251.126752: 2018/03/13 08:42:51.126734 [INFO] (runner) quiescence timers starting
20180313T084251.276830: 2018/03/13 08:42:51.276821 [INFO] (runner) quiescence minTimer fired
20180313T084251.276842: 2018/03/13 08:42:51.276827 [INFO] (runner) running
20180313T084252.059906: 2018/03/13 08:42:52.059891 [INFO] (runner) replicated 49 updates, 53 deletes
20180313T084259.630092: 2018/03/13 08:42:59.630080 [INFO] (runner) quiescence timers starting
20180313T084259.780182: 2018/03/13 08:42:59.780168 [INFO] (runner) quiescence minTimer fired
20180313T084259.780193: 2018/03/13 08:42:59.780179 [INFO] (runner) running

The main data center goes down again:

20180313T084409.232366: 2018/03/13 08:44:09.232265 [WARN] (view) kv.list(pub@main-dc): Unexpected response code: 500 (retry attempt 4 after "1m20s")
20180313T085144.446896: 2018/03/13 08:51:44.446804 [WARN] (view) kv.list(pub@main-us): Unexpected response code: 500 (retry attempt 6 after "5m20s")

The very next log line shows consul-replicate synchronizing empty data from the main data center, resulting in all keys under the pub prefix being deleted:

20180313T090040.647493: 2018/03/13 09:00:40.647411 [INFO] (runner) replicated 0 updates, 127719 deletes

Also, after the outage in the main data center was resolved, consul-replicate refused to synchronize the proper data from the main DC KV store again: the stored LastReplicated index had advanced past the ModifyIndex of the (unchanged) source keys, so the following code skipped every key:

// Ignore if the modify index is old
if pair.ModifyIndex <= status.LastReplicated {
    log.Printf("[DEBUG] (runner) skipping because %q is already "+
        "replicated", key)
    continue
}
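
To make the failure mode concrete (numbers are illustrative, not from our logs):

// status.LastReplicated = 180000  <- recorded during the blank sync
// pair.ModifyIndex      = 150000  <- source key, unchanged since before the outage
// 150000 <= 180000 => key is skipped on every pass until the status is reset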

To force a re-synchronization I had to recursively erase all entries under the following prefix:

consul kv delete -recurse service/consul-replicate/statuses
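
For reference, the stored statuses can be inspected first to see what was recorded, before deleting them:

consul kv get -recurse service/consul-replicate/statuses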

I am not sure whether adjusting the following options could have prevented this (syncing blank data) from happening. I am still unsure how to read these timers exactly. Do they mean "force replication every N seconds", or am I getting this wrong? How do they relate to the cluster's consistent state?

# These are the quiescence timers; they define the minimum and maximum amount
# of time to wait for the cluster to reach a consistent state before
# replicating. This is useful to enable in systems that have a lot of flapping,
# because it will reduce the number of times a replication occurs.
wait {
  min = "5s"
  max = "10s"
}
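
My current guess at how such quiescence timers generally behave (a generic sketch using only the standard time package, not consul-replicate's actual code; please correct me if wrong): each change restarts the min timer, replication runs once things have been quiet for min, and max caps how long replication can be deferred while changes keep arriving. So they read as "wait for quiet", not "force replication every N seconds":

// Generic quiescence pattern (sketch): fire after "min" of silence,
// but never defer longer than "max" while changes keep streaming in.
func quiesce(changes <-chan struct{}, min, max time.Duration, run func()) {
	var minC, maxC <-chan time.Time // nil channels block in select
	for {
		select {
		case <-changes:
			minC = time.After(min) // every change restarts the min timer
			if maxC == nil {
				maxC = time.After(max) // armed once per burst of changes
			}
		case <-minC:
			run() // quiet for "min": presumed consistent, replicate
			minC, maxC = nil, nil
		case <-maxC:
			run() // churn never stopped, but "max" is the ceiling
			minC, maxC = nil, nil
		}
	}
}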

I am aware that the Consul version we are using is a bit old, but I suspect that the issue is more likely related to consul-replicate itself.
