Adding a recovery mechanism for a split gossip cluster #2134
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2134 +/- ##
=========================================
Coverage ? 40.49%
=========================================
Files ? 139
Lines ? 22467
Branches ? 0
=========================================
Hits ? 9097
Misses ? 12031
Partials ? 1339
Continue to review full report at Codecov.
networkdb/delegate.go
Outdated
```go
func (d *delegate) LocalState(join bool) []byte {
	if join {
		// Update all the local node/network state to a new time to
		// Update all the local node/network state to a new time to
```
extra space :D
How this happened... Fat finger :)
I'll fix it
networkdb/cluster.go
Outdated
```go
		if _, err := mlist.Join(members); err != nil {
			// In case of failure, keep retrying join until it succeeds or the cluster is shutdown.
			go nDB.retryJoin(members, nDB.stopCh)
			go nDB.retryJoin(nDB.ctx, members)
```
with this change I don't think launching this routine makes sense anymore. If there is a failure here, the next attempt will happen in 60s or less through the other codepath
Good point, clusterJoin is only called when the agent is initialized to join the bootstrap nodes; unless it's time critical and we can't wait 60 seconds, it's now redundant.
Happy to take it out.
networkdb/cluster.go
Outdated
```go
	ctx, cancel := context.WithTimeout(nDB.ctx, 10*time.Second)
	defer cancel()
	nDB.retryJoin(ctx, bootStrapIPs)

```
extra new line
networkdb/cluster.go
Outdated
```go
	nDB.RUnlock()
	logrus.Debugf("rejoinClusterBootStrap, calling cluster join with bootStrap %v", bootStrapIPs)
	// All bootStrap nodes are not in the cluster, call memberlist join
	ctx, cancel := context.WithTimeout(nDB.ctx, 10*time.Second)
```
we should move this 10s timeout to a constant at the top
roger that
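For reference, a minimal sketch of what "at the top" could look like; the constant names below are placeholders I'm assuming, not necessarily the identifiers that ended up in networkdb/cluster.go:

```go
package networkdb

import "time"

// Hypothetical constant names; the actual identifiers in cluster.go may differ.
const (
	// rejoinClientTimeout bounds each attempt to re-contact the bootstrap nodes.
	rejoinClientTimeout = 10 * time.Second
	// rejoinInterval is how often the periodic bootstrap check runs.
	rejoinInterval = 60 * time.Second
)
```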
```go
	}
	// If the node is not known from memberlist we cannot process save any state of it else if it actually
	// dies we won't receive any notification and we will remain stuck with it
	if _, ok := nDB.nodes[nEvent.NodeName]; !ok {
```
Is the nDB.findNode check necessary above [L28] given this check?
only to filter based on the lamport clock
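To make the intent concrete, here is a small, self-contained sketch of that kind of Lamport-clock filtering; the types and names are illustrative stand-ins, not the networkdb code itself:

```go
package main

import "fmt"

// LamportTime is a simple logical clock value (serf.LamportTime plays this
// role in networkdb).
type LamportTime uint64

// nodeState is a stand-in for the per-node state networkdb keeps.
type nodeState struct {
	name  string
	ltime LamportTime
}

// shouldProcess mirrors the idea behind the findNode check above: an event
// whose Lamport time is not newer than the state already recorded for that
// node is stale and gets dropped.
func shouldProcess(known map[string]*nodeState, name string, eventTime LamportTime) bool {
	n, ok := known[name]
	if !ok {
		return true // nothing recorded yet, accept the event
	}
	return eventTime > n.ltime
}

func main() {
	known := map[string]*nodeState{"node-1": {name: "node-1", ltime: 10}}
	fmt.Println(shouldProcess(known, "node-1", 9))  // false: stale event
	fmt.Println(shouldProcess(known, "node-1", 11)) // true: newer event
}
```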
```go
// rejoinClusterBootStrap is called periodically to check if all bootStrap nodes are active in the cluster,
// if not, call the cluster join to merge 2 separate clusters that are formed when all managers
// stopped/started at the same time
func (nDB *NetworkDB) rejoinClusterBootStrap() {
```
Would it make sense to also attempt to refresh nDB.bootStrapIP here through a call to something like GetRemoteAddressList, in case the list has changed?
@fcrisciani and I had a discussion about other use cases that can lead to the same "split cluster" and the possibility of re-checking/updating the bootStrap IPs (through GetRemoteAddressList):
1) Re-IP the managers.
2) Demote/promote managers/workers.
The first one is not an issue, as re-IPing all managers will force all nodes in the cluster to restart.
As for 2), customers will only hit it if they demote all managers in the cluster and then restart those managers without restarting the workers... We felt this is a very rare edge case.
That being said, if you guys think we should still refresh the bootStrapIP, we can add the logic to the newly introduced rejoinClusterBootStrap.
I think the 2 problems can be handled separately. Memberlist, with this PR, will honor the bootstrap IPs that got passed at the beginning.
The second issue will need a periodic check of GetRemoteAddressList; if the list changes, the routine has to call networkDB.Join with the new bootstrap IPs.
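Not part of this PR, but a rough sketch of what that second, separate check could look like; the function name and the resolve callback here are assumptions for illustration, only the Join(members []string) signature comes from networkdb:

```go
package sketch

import (
	"context"
	"reflect"
	"sort"
	"time"
)

// joiner is the small piece of the networkdb surface this sketch assumes:
// NetworkDB.Join(members []string) error fits it.
type joiner interface {
	Join(members []string) error
}

// watchBootstrapList illustrates the idea above: periodically re-resolve the
// manager/bootstrap addresses (e.g. via something like GetRemoteAddressList)
// and, if the list changed, call Join with the new list.
func watchBootstrapList(ctx context.Context, db joiner, resolve func() []string, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	last := resolve()
	sort.Strings(last)
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			current := resolve()
			sort.Strings(current)
			if !reflect.DeepEqual(current, last) {
				// The manager set changed (re-IP, promote/demote, ...):
				// hand the new bootstrap list to networkdb.
				_ = db.Join(current)
				last = current
			}
		}
	}
}
```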
0a10a03 to e0dd4ed (Compare)
ddebroy left a comment
LGTM
networkdb/cluster.go
Outdated
```go
		bootStrapIPs = append(bootStrapIPs, bootIP.String())
	}
	nDB.RUnlock()
	// Not all bootStrap nodes are in the cluster, call memberlist join
```
I think you mean "None of the bootStrap nodes are in the cluster" rather than "Not all ..." in the comment, right?
yep. I will reword it and push the update.
Thx
LGTM, if you fix the comment that @ddebroy mentioned we can merge
Signed-off-by: Dani Louca <dani.louca@docker.com>
Thx guys. PR is updated with the latest change request.
fcrisciani left a comment
LGTM
```go
	bootStrapIPs := make([]string, 0, len(nDB.bootStrapIP))
	for _, bootIP := range nDB.bootStrapIP {
		for _, node := range nDB.nodes {
			if node.Addr.Equal(bootIP) {
```
@dani-docker thinking more about this, I guess it's missing a check that the IP is different from the current node's IP, else this fix won't work for the managers. Every manager will see itself in the list and won't try to reconnect.
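For illustration, a sketch of the scan with that extra check folded in; the helper name and types here are mine, not the eventual fix:

```go
package sketch

import "net"

// clusterNode is a stand-in for the entries networkdb keeps in nDB.nodes.
type clusterNode struct {
	Name string
	Addr net.IP
}

// bootstrapPeerPresent sketches the check being discussed: is at least one
// bootstrap node other than the local node already a cluster member? Without
// the selfAddr exclusion, a manager always finds its own entry and never
// attempts to re-join the other half of a split cluster.
func bootstrapPeerPresent(bootStrapIPs []net.IP, nodes map[string]*clusterNode, selfAddr net.IP) bool {
	for _, bootIP := range bootStrapIPs {
		if bootIP.Equal(selfAddr) {
			// Skip ourselves: seeing our own IP in the bootstrap list
			// says nothing about being connected to the other managers.
			continue
		}
		for _, node := range nodes {
			if node.Addr.Equal(bootIP) {
				return true
			}
		}
	}
	return false
}
```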
Steps To Reproduce:
Expected Results:
Actual Results:
overlay and unpredicted networking issues between containers on the 2 different clusters.
Workaround:
or
/join endpoint
PR fix:
This PR adds a go routine that runs every minute and checks that at least one node from the bootStrap list (i.e. the manager nodes) is part of the cluster; if we couldn't find any, then we attempt a /join for 10 seconds.
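A minimal sketch of that loop follows; retryJoin with a context comes from this PR's diff above, while hasBootstrapMember and the interval/timeout constants are assumptions used only for illustration:

```go
package sketch

import (
	"context"
	"time"
)

// rejoiner is the slice of networkdb behaviour this sketch assumes: a way to
// tell whether any bootstrap node is currently a member, and a join that
// keeps retrying until its context expires.
type rejoiner interface {
	hasBootstrapMember() bool
	retryJoin(ctx context.Context, members []string)
}

// runBootstrapWatcher sketches the recovery loop described above: every
// minute, check whether at least one bootstrap (manager) node is in the
// cluster; if none is, attempt a join for up to 10 seconds.
func runBootstrapWatcher(ctx context.Context, db rejoiner, bootStrapIPs []string) {
	ticker := time.NewTicker(60 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if db.hasBootstrapMember() {
				continue // still connected to at least one manager
			}
			// The gossip clusters have drifted apart: try to merge them back.
			joinCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
			db.retryJoin(joinCtx, bootStrapIPs)
			cancel()
		}
	}
}
```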
Note:
While testing, I ran into an issue that I couldn't reliably reproduce: a leave event was followed by a false positive join event received by a worker when the manager node was down. This caused an inconsistency in nDB.nodes, which caused the logic in this PR to fail. @fcrisciani recommended the below change and he'll be looking into the problem in more detail.
Signed-off-by: Dani Louca dani.louca@docker.com