Skip to content

raft: fix Campaign on restore if the node id is not in the progress list#1221

Merged
aaronlehmann merged 1 commit into
moby:masterfrom
abronan:fix_campaign_on_restore
Jul 26, 2016
Merged

raft: fix Campaign on restore if the node id is not in the progress list#1221
aaronlehmann merged 1 commit into
moby:masterfrom
abronan:fix_campaign_on_restore

Conversation

@abronan
Copy link
Copy Markdown
Contributor

@abronan abronan commented Jul 25, 2016

When restarting a node from its state, logs are reapplied to
restore the cluster membership with ConfChangeAddNode log
entries.

There is a chance to reach the portion calling Campaign to
restart a single cluster member faster when we are still
in the process of restoring the memberlist. In this case
the node may not appear in the progress list and we might
call MaybeUpdate on the node id from the map without
sanity checking thus resulting in a panic.

To avoid that, check if the node is present in the Progress
list before trying to Campaign on node restore.

Fix #1196

/cc @aaronlehmann @LK4D4

Signed-off-by: Alexandre Beslic alexandre.beslic@gmail.com

When restarting a node from its state, logs are reapplied to
restore the cluster membership with `ConfChangeAddNode` log
entries.

There is a chance to reach the portion calling `Campaign` to
restart a single cluster member faster when we are still
in the process of restoring the memberlist. In this case
the node may not appear in the progress list and we might
call `MaybeUpdate` on the node id from the map without
sanity checking thus resulting in a panic.

To avoid that, check if the node is present in the Progress
list before trying to Campaign on node restore.

Signed-off-by: Alexandre Beslic <alexandre.beslic@gmail.com>
@codecov-io
Copy link
Copy Markdown

Current coverage is 55.10% (diff: 100%)

Merging #1221 into master will increase coverage by 0.13%

@@             master      #1221   diff @@
==========================================
  Files            77         77          
  Lines         12368      12369     +1   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           6799       6816    +17   
+ Misses         4632       4620    -12   
+ Partials        937        933     -4   

Sunburst

Powered by Codecov. Last update 4d7e443...e786694

@LK4D4
Copy link
Copy Markdown
Contributor

LK4D4 commented Jul 25, 2016

I've tested and there is no panic.
LGTM

@aaronlehmann
Copy link
Copy Markdown
Collaborator

Let's say the node does not exist in the Progress map the first time we reach this code. Is there a guarantee that the Ready() channel will give us another chance to try it?

@abronan
Copy link
Copy Markdown
Contributor Author

abronan commented Jul 25, 2016

@aaronlehmann Not sure if I get your question right but I'll try to answer:

No it will give up on the election "fast-track" process and wait for the full ElectionTick because the node wasn't ready to be elected yet. I think it is more reasonable than panicking :)

This is the easiest way I think to circumvent the panic without having to cherry-pick the patch doing the sanity check alongside v2.3.2 and vendor. It has the same effect of avoiding the call to Campaign if the node is not in the progress map. Only difference is that we do the check upstream before even calling Campaign (which is sending a MsgHup, triggering a Step() and calling the private campaign method).

@xiang90
Copy link
Copy Markdown

xiang90 commented Jul 25, 2016

Well. The actual fix is here: etcd-io/etcd#6039

@xiang90
Copy link
Copy Markdown

xiang90 commented Jul 25, 2016

@abronan @aaronlehmann Before calling Campaign, we need to make sure all pervious configuration changes are applied. You can ensure this at application layer. But I feel we should ensure this at raft layer, so applications do not need to repeat this logic. The one line change was just a quick hack. If it is urgent, you can take it.

@aaronlehmann
Copy link
Copy Markdown
Collaborator

LGTM

@aaronlehmann aaronlehmann merged commit c3b3ca8 into moby:master Jul 26, 2016
@aluzzardi aluzzardi mentioned this pull request Jul 26, 2016
@aluzzardi
Copy link
Copy Markdown
Member

LGTM

Filed #1229 to keep track of vendoring update

@abronan abronan deleted the fix_campaign_on_restore branch July 26, 2016 01:45
aaronlehmann pushed a commit to aaronlehmann/swarmkit that referenced this pull request Sep 29, 2016
In moby#1221, the code was changed not to call Campaign on restart unless
n.Node.Status().Progress[n.Config.ID] was set. The problem is that this
is never set unless the node is already the leader, so this effectively
disabled the quick leader election on restart for a single-node cluster.

Instead of checking Progress, figure out whether it's safe to call
Campaign by setting a flag when we apply the config change that adds the
local node to the raft state machine. This flag is set in registerNode,
which is technically also called when loading a snapshot, but in this
case the node is restored into the state machine through ConfState, so
setting the flag in both cases is correct.

Also, only enable the calls to Campaign when the node was restarted from
state on disk.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
aaronlehmann pushed a commit to aaronlehmann/swarmkit that referenced this pull request Sep 29, 2016
In moby#1221, the code was changed not to call Campaign on restart unless
n.Node.Status().Progress[n.Config.ID] was set. The problem is that this
is never set unless the node is already the leader, so this effectively
disabled the quick leader election on restart for a single-node cluster.

Instead of checking Progress, figure out whether it's safe to call
Campaign by checking if we are in the member list. This list is kept in
sync with the config changes we apply in raft.

Also, only enable the calls to Campaign when the node was restarted from
state on disk.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants