[1.12] raft: Fix campaign on restart #1588
This brings the code on master in sync with the 1.12 changes in moby#1588. The code on master does not have the same problem (since a newer etcd/raft library lets it skip the problematic check), but the version being adopted by 1.12 should be a bit more robust. Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Current coverage is 55.06% (diff: 100%)
@@            bump_v1.12.2    #1588    diff @@
================================================
  Files                 78       78
  Lines              12559    12561      +2
  Methods                0        0
  Messages               0        0
  Branches               0        0
================================================
+ Hits                6905     6917     +12
+ Misses              4695     4686      -9
+ Partials             959      958      -1
Tested this patch on top of the 1.12.2-rc1 code, and the delay in cluster restart seems to be fixed:
wal         *wal.WAL
snapshotter *snap.Snapshotter
restored    bool
addedSelf   bool // true if the raft state machine knows about the local node
Can't we just check in n.cluster?
Oh, sorry, went directly to the code without reading description.
LGTM
This seems to break Maybe this just exposes flakiness in the test?
Passed after a second run. So maybe the test is just flaky on my setup. I can't understand how this code change would cause this failure. It shouldn't change the behavior when
Waiting for #1584 (or a similar fix) to land before merging this, as it would hide the underlying error.
In moby#1221, the code was changed not to call Campaign on restart unless n.Node.Status().Progress[n.Config.ID] was set. The problem is that this is never set unless the node is already the leader, so this effectively disabled the quick leader election on restart for a single-node cluster.

Instead of checking Progress, figure out whether it's safe to call Campaign by checking if we are in the member list. This list is kept in sync with the config changes we apply in raft.

Also, only enable the calls to Campaign when the node was restarted from state on disk.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
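For readers following along, here is a minimal sketch of the member-list check this commit message describes. The `node` type, its fields, and the `shouldCampaign`, `applyAddNode`, and `applyRemoveNode` helpers are hypothetical stand-ins for illustration only; they are not the actual SwarmKit code, which consults its own cluster member list and calls `Campaign` on the raft node.

```go
package main

// node is a hypothetical stand-in for the raft node wrapper; it is not the
// actual SwarmKit type.
type node struct {
	id       uint64
	restored bool                // true if state was loaded from disk at startup
	members  map[uint64]struct{} // member list, updated as config changes are applied
}

// newNode builds a node with an empty member list.
func newNode(id uint64, restored bool) *node {
	return &node{id: id, restored: restored, members: make(map[uint64]struct{})}
}

// applyAddNode models applying a config change that adds a member.
func (n *node) applyAddNode(id uint64) { n.members[id] = struct{}{} }

// applyRemoveNode models applying a config change that removes a member.
func (n *node) applyRemoveNode(id uint64) { delete(n.members, id) }

// shouldCampaign reports whether it looks safe to call Campaign on startup:
// the node was restarted from state on disk and is present in the member list.
func (n *node) shouldCampaign() bool {
	if !n.restored {
		return false
	}
	_, ok := n.members[n.id]
	return ok
}
```

Because the member list only changes when a config change is applied, the check does not depend on the node already being the leader, which is what made the Progress-based condition ineffective.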
65a09d7 to f56cc6d
Updated based on feedback in #1589.
Patch LGTM. I think it's probably really cool to have it in 1.12.1.
@LK4D4 you mean 1.12.2 I believe
@vieux yup
In #1221, the code was changed not to call Campaign on restart unless
n.Node.Status().Progress[n.Config.ID] was set. The problem is that this
is never set unless the node is already the leader, so this effectively
disabled the quick leader election on restart for a single-node cluster.
Instead of checking Progress, figure out whether it's safe to call
Campaign by setting a flag when we apply the config change that adds the
local node to the raft state machine. This flag is set in registerNode,
which is technically also called when loading a snapshot, but in this
case the node is restored into the state machine through ConfState, so
setting the flag in both cases is correct.
Also, only enable the calls to Campaign when the node was restarted from
state on disk.
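A rough sketch of the flag-based approach described above, for illustration. Only the field names `restored` and `addedSelf` and the function name `registerNode` come from this PR; the `node` type, `progressHasSelf`, `loadSnapshot`, and `canCampaign` are hypothetical and simplified, and the real code operates on the etcd/raft node rather than these stand-ins.

```go
package main

// node is a hypothetical stand-in, not the actual SwarmKit type.
type node struct {
	id        uint64
	restored  bool // true if state was loaded from disk at startup
	addedSelf bool // true if the raft state machine knows about the local node
}

// progressHasSelf models the earlier condition from #1221. Progress is only
// populated on the leader, so on a freshly restarted node the map is empty
// and this always reports false.
func progressHasSelf(progress map[uint64]struct{}, id uint64) bool {
	_, ok := progress[id]
	return ok
}

// registerNode is reached both when a config change adding a member is
// applied and when members are loaded from a snapshot.
func (n *node) registerNode(id uint64) {
	if id == n.id {
		n.addedSelf = true
	}
}

// loadSnapshot models restoring the member list from a snapshot.
func (n *node) loadSnapshot(memberIDs []uint64) {
	for _, id := range memberIDs {
		n.registerNode(id)
	}
}

// canCampaign reports whether calling Campaign on restart looks safe.
func (n *node) canCampaign() bool { return n.restored && n.addedSelf }
```

Setting the flag on both paths is intentional in this sketch: whether the local node arrives through an applied config change or through a snapshot, it ends up known to the state machine.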
This PR is against the 1.12.2 branch. The fix does not seem necessary on master. I'm going to open a separate PR there to simplify the code.
cc @mrjana @LK4D4