
Conversation

@fcrisciani commented May 31, 2018

Backport bugfix #2134

Signed-off-by: Dani Louca <dani.louca@docker.com>
(cherry picked from commit 744334d)
@fcrisciani requested a review from dani-docker on May 31, 2018 03:16
@fcrisciani (Author) commented:

Still WIP: the backport was not clean; at the moment I've only checked that it compiles.

@fcrisciani changed the title from "[WIP] Adding a recovery mechanism for a split gossip cluster" to "Adding a recovery mechanism for a split gossip cluster" on Jun 1, 2018
@fcrisciani (Author) commented:

@dani-docker can you take a look too? I did the backport of your fix onto 17.06; there were a bunch of conflicts, so it's better to have another pair of eyes on this.

@fcrisciani (Author) commented:

@dani-docker I cannot backport that part; here is why.
In event_delegate.go we have, on a join:

func (e *eventDelegate) NotifyJoin(mn *memberlist.Node) {
  logrus.Infof("Node %s/%s, joined gossip cluster", mn.Name, mn.Addr)
  e.broadcastNodeEvent(mn.Addr, opCreate)
  e.nDB.Lock()
  // In case the node is rejoining after a failure or leave,
  // wait until an explicit join message arrives before adding
  // it to the nodes just to make sure this is not a stale
  // join. If you don't know about this node add it immediately.
  _, fOk := e.nDB.failedNodes[mn.Name]
  _, lOk := e.nDB.leftNodes[mn.Name]
  if fOk || lOk {
    e.nDB.Unlock()
    return
  }

meaning that in this code base the rejoin of a node that loses connectivity but does not change identity has to happen through the logic in delegate.go. If I embed that change, it will break, and a node that temporarily gets disconnected won't be able to join back.
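
For illustration only, here is a minimal, self-contained Go sketch of the two join paths described above; it is not the actual libnetwork code, and all type, field, and function names are hypothetical. It shows NotifyJoin deferring the re-add of a previously failed/left node, while an explicit join event (the delegate.go path) is what moves the node back into the active table.

package main

import "fmt"

type node struct{ Name string }

type networkDB struct {
  nodes       map[string]*node // active cluster members
  failedNodes map[string]*node // members that failed
  leftNodes   map[string]*node // members that left gracefully
}

// notifyJoin mirrors the NotifyJoin logic quoted above: a node already
// tracked as failed/left is not re-added here; we wait for an explicit
// join message so a stale join cannot resurrect it.
func (db *networkDB) notifyJoin(n *node) {
  if _, failed := db.failedNodes[n.Name]; failed {
    return
  }
  if _, left := db.leftNodes[n.Name]; left {
    return
  }
  db.nodes[n.Name] = n // unknown node: add immediately
}

// handleExplicitJoin sketches the delegate.go side: the explicit join
// event moves a failed/left node back into the active table, which is
// why on 17.06 the rejoin of a temporarily disconnected node must keep
// flowing through this path.
func (db *networkDB) handleExplicitJoin(name string) {
  if n, ok := db.failedNodes[name]; ok {
    delete(db.failedNodes, name)
    db.nodes[name] = n
    return
  }
  if n, ok := db.leftNodes[name]; ok {
    delete(db.leftNodes, name)
    db.nodes[name] = n
  }
}

func main() {
  db := &networkDB{
    nodes:       map[string]*node{},
    failedNodes: map[string]*node{"worker-1": {Name: "worker-1"}},
    leftNodes:   map[string]*node{},
  }
  db.notifyJoin(&node{Name: "worker-1"})                     // deferred: node is still marked failed
  fmt.Println("active after NotifyJoin:", len(db.nodes))     // 0
  db.handleExplicitJoin("worker-1")                          // explicit join message re-adds it
  fmt.Println("active after explicit join:", len(db.nodes))  // 1
}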

@dani-docker (Contributor) left a comment:

LGTM

@fcrisciani changed the title from "Adding a recovery mechanism for a split gossip cluster" to "[backport 17.06] Adding a recovery mechanism for a split gossip cluster" on Jun 5, 2018
@tiborvass (Contributor) commented Jun 18, 2018

To test this PR:

1. Bring up 3 managers and 2 workers
2. docker network create -d overlay --attachable net1
3. docker service create --name test --network net1 --replicas 10 busybox sleep 10000
4. Bring down the 3 managers
5. Bring up the 3 managers again
6. docker run -dit --network net1 --name foo busybox sleep 10000
7. ssh to a worker and run: docker exec -it $someTaskContainer nslookup foo

The last step should fail without this PR, and pass with this PR; a rough automation sketch of steps 6 and 7 follows below.
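
As a convenience, here is a rough, hypothetical Go sketch that automates steps 6 and 7 above (it is not part of the PR). It assumes it runs on a node attached to the swarm, that net1 and the test service from the steps above already exist, and that the SOME_TASK_CONTAINER environment variable holds the ID of one of the service's task containers on the worker.

package main

import (
  "fmt"
  "os"
  "os/exec"
)

// run executes a command and streams its output, returning any error.
func run(name string, args ...string) error {
  cmd := exec.Command(name, args...)
  cmd.Stdout = os.Stdout
  cmd.Stderr = os.Stderr
  return cmd.Run()
}

func main() {
  // Step 6: attach a standalone container named "foo" to the overlay network.
  if err := run("docker", "run", "-dit", "--network", "net1", "--name", "foo", "busybox", "sleep", "10000"); err != nil {
    fmt.Println("failed to start foo:", err)
    os.Exit(1)
  }
  // Step 7: from a task container of the "test" service (run this on the
  // worker), check that "foo" resolves via the embedded DNS server.
  task := os.Getenv("SOME_TASK_CONTAINER")
  if err := run("docker", "exec", task, "nslookup", "foo"); err != nil {
    fmt.Println("nslookup foo failed (expected without this PR):", err)
    os.Exit(1)
  }
  fmt.Println("nslookup foo succeeded (expected with this PR)")
}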

Another way to test:

Everything is the same up to step 5, then:
6. Wait 5 minutes
7. On a manager, run: cat docker.log | grep "NetworkDB stats"  # look for netPeers
8. ssh to a worker and run: cat docker.log | grep "NetworkDB stats"  # look for netPeers

The two netPeers values should not match without this PR, and should match with this PR.

@tiborvass (Contributor) commented:

@fcrisciani I tested this PR, and nslookup still fails with this PR.

@tiborvass (Contributor) commented:

@fcrisciani I now managed to confirm this patch fixes the original issue.

@tiborvass (Contributor) commented:

LGTM (IANAM)

@tiborvass (Contributor) commented:

Ping @abhi

@abhi (Contributor) left a comment:

LGTM

@abhi merged commit fc60a75 into moby:bump_17.06 on Jun 27, 2018