raft: fix Campaign on restore if the node id is not in the progress list #1221
When restarting a node from its state, log entries are replayed to restore the cluster membership from `ConfChangeAddNode` entries. While the member list is still being restored, the code can reach the call to `Campaign`, which is there to restart a single-member cluster faster. In that case the node may not yet appear in the progress list, and `MaybeUpdate` may be called on the map entry for the node id without a sanity check, resulting in a panic.

To avoid that, check that the node is present in the Progress list before trying to Campaign on node restore.

Signed-off-by: Alexandre Beslic <alexandre.beslic@gmail.com>
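The guard described above can be sketched in isolation. This is a minimal illustration, not the code from this PR: the `progress` type, its `maybeUpdate` method, and `canCampaign` are hypothetical stand-ins, assuming the progress list behaves like a Go map of pointers, where indexing a missing id yields nil and calling a dereferencing method on it panics.

```go
package main

import "fmt"

// progress is a hypothetical stand-in for raft's per-peer replication state.
type progress struct{ match uint64 }

// maybeUpdate dereferences its receiver, so calling it on a nil *progress panics.
func (p *progress) maybeUpdate(n uint64) bool {
	if p.match < n {
		p.match = n
		return true
	}
	return false
}

// canCampaign reports whether id has a usable entry in the progress map.
// This is the sanity check: campaign only when it returns true.
func canCampaign(prs map[uint64]*progress, id uint64) bool {
	p, ok := prs[id]
	return ok && p != nil
}

func main() {
	const self uint64 = 1
	prs := map[uint64]*progress{} // membership not restored yet

	// Without the guard, prs[self].maybeUpdate(5) would panic here.
	fmt.Println(canCampaign(prs, self))

	prs[self] = &progress{} // the local node has been added back
	if canCampaign(prs, self) {
		fmt.Println(prs[self].maybeUpdate(5))
	}
}
```

When the check fails, the caller simply skips the fast path and lets the regular election proceed.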
Current coverage is 55.10% (diff: 100%)

| | master | #1221 | diff |
|---|---|---|---|
| Files | 77 | 77 | |
| Lines | 12368 | 12369 | +1 |
| Methods | 0 | 0 | |
| Messages | 0 | 0 | |
| Branches | 0 | 0 | |
| Hits | 6799 | 6816 | +17 |
| Misses | 4632 | 4620 | -12 |
| Partials | 937 | 933 | -4 |

I've tested and there is no panic.

Let's say the node does not exist in the

@aaronlehmann Not sure if I get your question right, but I'll try to answer: No, it will give up on the election "fast-track" process and wait for the full This is the easiest way, I think, to circumvent the panic without having to cherry-pick the patch doing the sanity check alongside

Well. The actual fix is here: etcd-io/etcd#6039

@abronan @aaronlehmann Before calling Campaign, we need to make sure all previous configuration changes are applied. You can ensure this at the application layer. But I feel we should ensure this at the raft layer, so applications do not need to repeat this logic. The one-line change was just a quick hack. If it is urgent, you can take it.

LGTM

LGTM. Filed #1229 to keep track of the vendoring update.

In moby#1221, the code was changed not to call Campaign on restart unless n.Node.Status().Progress[n.Config.ID] was set. The problem is that this is never set unless the node is already the leader, so this effectively disabled the quick leader election on restart for a single-node cluster.

Instead of checking Progress, figure out whether it's safe to call Campaign by setting a flag when we apply the config change that adds the local node to the raft state machine. This flag is set in registerNode, which is technically also called when loading a snapshot, but in this case the node is restored into the state machine through ConfState, so setting the flag in both cases is correct.

Also, only enable the calls to Campaign when the node was restarted from state on disk.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>

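A rough sketch of the flag approach described in this commit message follows. Only the name `registerNode` comes from the message; the `node` struct, its fields, and `canCampaign` are hypothetical simplifications, and the real method does considerably more than set a flag.

```go
package main

import "fmt"

// node is a hypothetical, heavily simplified model of a raft-backed node.
type node struct {
	id               uint64
	restoredFromDisk bool // true when the node was restarted from on-disk state
	selfRegistered   bool // set once the local node is in the raft state machine
}

// registerNode models applying a config change (or a snapshot's ConfState)
// that adds peer id. It flips the flag only for the local node.
func (n *node) registerNode(id uint64) {
	if id == n.id {
		n.selfRegistered = true
	}
}

// canCampaign is true only for a node restarted from disk whose own
// membership has already been applied.
func (n *node) canCampaign() bool {
	return n.restoredFromDisk && n.selfRegistered
}

func main() {
	n := &node{id: 1, restoredFromDisk: true}
	fmt.Println(n.canCampaign()) // membership not applied yet

	n.registerNode(2) // another peer; does not affect the flag
	fmt.Println(n.canCampaign())

	n.registerNode(1) // the local node is added
	fmt.Println(n.canCampaign())
}
```

Unlike the Progress check, this flag does not depend on the node already being the leader.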
In moby#1221, the code was changed not to call Campaign on restart unless n.Node.Status().Progress[n.Config.ID] was set. The problem is that this is never set unless the node is already the leader, so this effectively disabled the quick leader election on restart for a single-node cluster.

Instead of checking Progress, figure out whether it's safe to call Campaign by checking if we are in the member list. This list is kept in sync with the config changes we apply in raft.

Also, only enable the calls to Campaign when the node was restarted from state on disk.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>

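The member-list variant could look roughly like the sketch below. The `cluster` type and `shouldCampaign` are hypothetical names for illustration, and the `len(c.members) == 1` condition is an assumption drawn from the "single-node cluster" wording rather than something this message spells out.

```go
package main

import "fmt"

// cluster is a hypothetical member list kept in sync with applied config changes.
type cluster struct {
	members map[uint64]struct{}
}

func newCluster() *cluster {
	return &cluster{members: make(map[uint64]struct{})}
}

// addMember and removeMember model applying add/remove config changes.
func (c *cluster) addMember(id uint64)    { c.members[id] = struct{}{} }
func (c *cluster) removeMember(id uint64) { delete(c.members, id) }

func (c *cluster) isMember(id uint64) bool {
	_, ok := c.members[id]
	return ok
}

// shouldCampaign allows the fast election path only when the node was
// restarted from disk, is in the member list, and (assumed) is the sole member.
func shouldCampaign(c *cluster, self uint64, restoredFromDisk bool) bool {
	return restoredFromDisk && c.isMember(self) && len(c.members) == 1
}

func main() {
	c := newCluster()
	fmt.Println(shouldCampaign(c, 1, true)) // member list still empty

	c.addMember(1)
	fmt.Println(shouldCampaign(c, 1, true))  // restored single-node cluster
	fmt.Println(shouldCampaign(c, 1, false)) // not restarted from disk

	c.addMember(2)
	fmt.Println(shouldCampaign(c, 1, true)) // more than one member
}
```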
When restarting a node from its state, logs are reapplied to restore the cluster membership with `ConfChangeAddNode` log entries.

There is a chance to reach the portion calling `Campaign` to restart a single cluster member faster when we are still in the process of restoring the memberlist. In this case the node may not appear in the progress list and we might call `MaybeUpdate` on the node id from the map without sanity checking, thus resulting in a panic.

To avoid that, check if the node is present in the Progress list before trying to Campaign on node restore.

Fix #1196
/cc @aaronlehmann @LK4D4
Signed-off-by: Alexandre Beslic <alexandre.beslic@gmail.com>