
[1.12] raft: Fix campaign on restart#1588

Merged
aluzzardi merged 1 commit into moby:bump_v1.12.2 from aaronlehmann:fix-campaign-on-restart on Sep 30, 2016

Conversation

@aaronlehmann (Collaborator)

In #1221, the code was changed not to call Campaign on restart unless
n.Node.Status().Progress[n.Config.ID] was set. The problem is that this
is never set unless the node is already the leader, so this effectively
disabled the quick leader election on restart for a single-node cluster.

Instead of checking Progress, figure out whether it's safe to call
Campaign by setting a flag when we apply the config change that adds the
local node to the raft state machine. This flag is set in registerNode,
which is technically also called when loading a snapshot, but in this
case the node is restored into the state machine through ConfState, so
setting the flag in both cases is correct.

Also, only enable the calls to Campaign when the node was restarted from
state on disk.
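
A minimal sketch of the flag-based approach described above, in Go. The names below (Node, registerNode, canCampaign, restoredFromDisk, members) are simplified stand-ins for illustration, not swarmkit's actual types or signatures:

```go
package main

import "fmt"

// Node is a pared-down stand-in for swarmkit's raft node; fields and
// method names are illustrative, not the real swarmkit code.
type Node struct {
	id               uint64
	members          map[uint64]bool // membership, updated as config changes apply
	restoredFromDisk bool            // true when WAL/snapshot state was found on start
	addedSelf        bool            // true once the raft state machine knows about us
}

// registerNode runs when a config change (or a snapshot's ConfState)
// adds a member to the raft state machine.
func (n *Node) registerNode(raftID uint64) {
	n.members[raftID] = true
	if raftID == n.id {
		// The state machine now includes the local node, so an
		// immediate Campaign call is safe.
		n.addedSelf = true
	}
}

// canCampaign reports whether to call Campaign right away instead of
// waiting out a full election timeout.
func (n *Node) canCampaign() bool {
	return n.restoredFromDisk && n.addedSelf && len(n.members) == 1
}

func main() {
	n := &Node{id: 1, members: map[uint64]bool{}, restoredFromDisk: true}
	n.registerNode(1) // apply the config change that adds the local node
	fmt.Println("campaign immediately:", n.canCampaign()) // true
}
```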

This PR is against the 1.12.2 branch. The fix does not seem necessary on master. I'm going to open a separate PR there to simplify the code.

cc @mrjana @LK4D4

@aaronlehmann aaronlehmann added this to the 1.12.2 milestone Sep 29, 2016
@aaronlehmann aaronlehmann changed the title raft: Fix campaign on restart [1.12] raft: Fix campaign on restart Sep 29, 2016
aaronlehmann pushed a commit to aaronlehmann/swarmkit that referenced this pull request Sep 29, 2016
This brings the code on master in sync with the 1.12 changes in moby#1588.
The code on master does not have the same problem (since a newer
etcd/raft library lets it skip the problematic check), but the version
being adopted by 1.12 should be a bit more robust.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
@codecov-io commented Sep 29, 2016

Current coverage is 55.06% (diff: 100%)

Merging #1588 into bump_v1.12.2 will increase coverage by 0.08%

@@           bump_v1.12.2      #1588   diff @@
==============================================
  Files                78         78          
  Lines             12559      12561     +2   
  Methods               0          0          
  Messages              0          0          
  Branches              0          0          
==============================================
+ Hits               6905       6917    +12   
+ Misses             4695       4686     -9   
+ Partials            959        958     -1   

Last update de507ff...f56cc6d

@mrjana (Contributor) commented Sep 29, 2016

Tested this patch on top of the 1.12.2-rc1 code, and the delay in cluster restart appears to be fixed:

Sep 29 13:36:29 ip-172-31-27-46.us-west-1.compute.internal dockerd[28350]: time="2016-09-29T13:36:29.222161100-04:00" level=info msg="Loading containers: done."
Sep 29 13:36:29 ip-172-31-27-46.us-west-1.compute.internal dockerd[28350]: time="2016-09-29T13:36:29.245711430-04:00" level=info msg="3f3e8717c9a69bbb became follower at term 2"
Sep 29 13:36:29 ip-172-31-27-46.us-west-1.compute.internal dockerd[28350]: time="2016-09-29T13:36:29.245742196-04:00" level=info msg="newRaft 3f3e8717c9a69bbb [peers: [], term: 2, commit: 19, applied: 0, lastindex: 19, lastterm: 2]"
Sep 29 13:36:29 ip-172-31-27-46.us-west-1.compute.internal dockerd[28350]: time="2016-09-29T13:36:29.246899745-04:00" level=info msg="Listening for connections" addr="[::]:2377" proto=tcp
Sep 29 13:36:29 ip-172-31-27-46.us-west-1.compute.internal dockerd[28350]: time="2016-09-29T13:36:29.247120263-04:00" level=info msg="Listening for local connections" addr="/var/lib/docker/swarm/control.sock" proto=unix
Sep 29 13:36:29 ip-172-31-27-46.us-west-1.compute.internal dockerd[28350]: time="2016-09-29T13:36:29.250285413-04:00" level=info msg="3f3e8717c9a69bbb is starting a new election at term 2"
Sep 29 13:36:29 ip-172-31-27-46.us-west-1.compute.internal dockerd[28350]: time="2016-09-29T13:36:29.250338902-04:00" level=info msg="3f3e8717c9a69bbb became candidate at term 3"
Sep 29 13:36:29 ip-172-31-27-46.us-west-1.compute.internal dockerd[28350]: time="2016-09-29T13:36:29.250349018-04:00" level=info msg="3f3e8717c9a69bbb received vote from 3f3e8717c9a69bbb at term 3"
Sep 29 13:36:29 ip-172-31-27-46.us-west-1.compute.internal dockerd[28350]: time="2016-09-29T13:36:29.250370675-04:00" level=info msg="3f3e8717c9a69bbb became leader at term 3"
Sep 29 13:36:29 ip-172-31-27-46.us-west-1.compute.internal dockerd[28350]: time="2016-09-29T13:36:29.250380316-04:00" level=info msg="raft.node: 3f3e8717c9a69bbb elected leader 3f3e8717c9a69bbb at term 3"

Review thread on manager/state/raft/raft.go (outdated):

    wal         *wal.WAL
    snapshotter *snap.Snapshotter
    restored    bool
    addedSelf   bool // true if the raft state machine knows about the local node

Contributor:

Can't we just check in n.cluster?

Contributor:

Oh, sorry, I went directly to the code without reading the description.

@LK4D4 (Contributor) commented Sep 29, 2016

LGTM

@aaronlehmann (Collaborator, Author)

This seems to break TestApiSwarmManagerRestore, but I don't understand why:

----------------------------------------------------------------------
FAIL: docker_api_swarm_test.go:661: DockerSwarmSuite.TestApiSwarmManagerRestore

[d79866457] waiting for daemon to start
[d79866457] daemon started
[d79866457] exiting daemon
[d79866457] waiting for daemon to start
[d79866457] daemon started
[d13403274] waiting for daemon to start
[d13403274] daemon started
[d13403274] exiting daemon
[d13403274] waiting for daemon to start
[d13403274] daemon started
docker_api_swarm_test.go:677:
    d2.getService(c, id)
github.com/docker/docker/integration-cli/_test/_obj_test/daemon_swarm.go:156:
    ...open github.com/docker/docker/integration-cli/_test/_obj_test/daemon_swarm.go: no such file or directory
... obtained int = 500
... expected int = 200
... output: "{\"message\":\"rpc error: code = 2 desc = raft: no elected cluster leader\"}\n"

[d79866457] exiting daemon
[d13403274] exiting daemon

----------------------------------------------------------------------

Maybe this just exposes flakiness in the test?

@aaronlehmann (Collaborator, Author)

Passed after a second run. So maybe the test is just flaky on my setup. I can't understand how this code change would cause this failure. It shouldn't change the behavior when d2 restarts (and even if it did, that shouldn't cause problems).

@aluzzardi (Member)

Waiting for #1584 (or a similar fix) to land before merging this, since merging now would hide the underlying error.

Commit f56cc6d:

In moby#1221, the code was changed not to call Campaign on restart unless
n.Node.Status().Progress[n.Config.ID] was set. The problem is that this
is never set unless the node is already the leader, so this effectively
disabled the quick leader election on restart for a single-node cluster.

Instead of checking Progress, figure out whether it's safe to call
Campaign by checking if we are in the member list. This list is kept in
sync with the config changes we apply in raft.

Also, only enable the calls to Campaign when the node was restarted from
state on disk.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
@aaronlehmann force-pushed the fix-campaign-on-restart branch from 65a09d7 to f56cc6d on September 29, 2016 22:51
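
Under the same simplified model as the earlier sketch, the updated approach drops the dedicated flag and consults the membership list directly (again with hypothetical names, not swarmkit's real types):

```go
package main

import "fmt"

// Node mirrors the earlier sketch, minus the addedSelf flag.
type Node struct {
	id               uint64
	members          map[uint64]bool // kept in sync with applied raft config changes
	restoredFromDisk bool
}

// canCampaign: membership itself tells us whether the raft state
// machine already knows about the local node.
func (n *Node) canCampaign() bool {
	return n.restoredFromDisk && len(n.members) == 1 && n.members[n.id]
}

func main() {
	n := &Node{
		id:               1,
		members:          map[uint64]bool{1: true},
		restoredFromDisk: true,
	}
	fmt.Println("campaign immediately:", n.canCampaign()) // true
}
```

Because the member list only changes when raft applies a config change, membership implies the state machine already contains the local node, which is exactly the safety condition the addedSelf flag encoded.
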
@aaronlehmann (Collaborator, Author)

Updated based on feedback in #1589.

@LK4D4 (Contributor) commented Sep 30, 2016

Patch LGTM. I think it's probably really cool to have it in 1.12.1

@vieux (Contributor) commented Sep 30, 2016

@LK4D4 you mean 1.12.2 I believe

@LK4D4 (Contributor) commented Sep 30, 2016

@vieux yup

@aluzzardi aluzzardi merged commit 5f934bd into moby:bump_v1.12.2 Sep 30, 2016
@aaronlehmann aaronlehmann deleted the fix-campaign-on-restart branch September 30, 2016 19:45
niau pushed a commit to vasil-yordanov/swarmkit that referenced this pull request Oct 21, 2016