
Conversation

@fcrisciani commented May 31, 2018

Backport bugfix #2134

Signed-off-by: Dani Louca <dani.louca@docker.com>
(cherry picked from commit 744334d)
@fcrisciani requested a review from dani-docker on May 31, 2018 03:16
@fcrisciani (Author) commented:

Still WIP: the backport was not clean; at the moment I've only checked that it compiles.

@fcrisciani changed the title from "[WIP] Adding a recovery mechanism for a split gossip cluster" to "Adding a recovery mechanism for a split gossip cluster" on Jun 1, 2018
@fcrisciani (Author) commented:

@dani-docker can you take a look too? I did the backport of your fix onto 17.06; there were a bunch of conflicts, so it's better to have another pair of eyes on this.

@fcrisciani (Author) commented:

@dani-docker I cannot backport that part; here is why.
In event_delegate.go we have, on a join:

func (e *eventDelegate) NotifyJoin(mn *memberlist.Node) {
  logrus.Infof("Node %s/%s, joined gossip cluster", mn.Name, mn.Addr)
  e.broadcastNodeEvent(mn.Addr, opCreate)
  e.nDB.Lock()
  // In case the node is rejoining after a failure or leave,
  // wait until an explicit join message arrives before adding
  // it to the nodes just to make sure this is not a stale
  // join. If you don't know about this node add it immediately.
  _, fOk := e.nDB.failedNodes[mn.Name]
  _, lOk := e.nDB.leftNodes[mn.Name]
  if fOk || lOk {
    e.nDB.Unlock()
    return
  }

meaning that in this code base the rejoin of a node that loses connectivity but does not change identity has to happen through the logic in delegate.go. If I embed that change, it will break, and a node that temporarily gets disconnected won't be able to join back.
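
For illustration only, here is a minimal, self-contained Go sketch of the two join paths described above; it is not the actual libnetwork code, and all type, field, and function names are hypothetical. It shows NotifyJoin deferring the re-add of a previously failed/left node, while an explicit join event (the delegate.go path) is what moves the node back into the active table.

package main

import "fmt"

type node struct{ Name string }

type networkDB struct {
  nodes       map[string]*node // active cluster members
  failedNodes map[string]*node // members that failed
  leftNodes   map[string]*node // members that left gracefully
}

// notifyJoin mirrors the NotifyJoin logic quoted above: a node already
// tracked as failed/left is not re-added here; we wait for an explicit
// join message so a stale join cannot resurrect it.
func (db *networkDB) notifyJoin(n *node) {
  if _, failed := db.failedNodes[n.Name]; failed {
    return
  }
  if _, left := db.leftNodes[n.Name]; left {
    return
  }
  db.nodes[n.Name] = n // unknown node: add immediately
}

// handleExplicitJoin sketches the delegate.go side: the explicit join
// event moves a failed/left node back into the active table, which is
// why on 17.06 the rejoin of a temporarily disconnected node must keep
// flowing through this path.
func (db *networkDB) handleExplicitJoin(name string) {
  if n, ok := db.failedNodes[name]; ok {
    delete(db.failedNodes, name)
    db.nodes[name] = n
    return
  }
  if n, ok := db.leftNodes[name]; ok {
    delete(db.leftNodes, name)
    db.nodes[name] = n
  }
}

func main() {
  db := &networkDB{
    nodes:       map[string]*node{},
    failedNodes: map[string]*node{"worker-1": {Name: "worker-1"}},
    leftNodes:   map[string]*node{},
  }
  db.notifyJoin(&node{Name: "worker-1"})                     // deferred: node is still marked failed
  fmt.Println("active after NotifyJoin:", len(db.nodes))     // 0
  db.handleExplicitJoin("worker-1")                          // explicit join message re-adds it
  fmt.Println("active after explicit join:", len(db.nodes))  // 1
}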

@dani-docker (Contributor) left a comment:

LGTM

@fcrisciani changed the title from "Adding a recovery mechanism for a split gossip cluster" to "[backport 17.06] Adding a recovery mechanism for a split gossip cluster" on Jun 5, 2018
@tiborvass (Contributor) commented Jun 18, 2018

To test this PR:

1. Bring up 3 managers and 2 workers
2. docker network create -d overlay --attachable net1
3. docker service create --name test --network net1 --replicas 10 busybox sleep 10000
4. Bring down the 3 managers
5. Bring up the 3 managers again
6. docker run -dit --network net1 --name foo busybox sleep 10000
7. ssh to a worker and run: docker exec -it $someTaskContainer nslookup foo

The last step should fail without this PR, and pass with this PR; a rough automation sketch of steps 6 and 7 follows below.
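
As a convenience, here is a rough, hypothetical Go sketch that automates steps 6 and 7 above (it is not part of the PR). It assumes it runs on a node attached to the swarm, that net1 and the test service from the steps above already exist, and that the SOME_TASK_CONTAINER environment variable holds the ID of one of the service's task containers on the worker.

package main

import (
  "fmt"
  "os"
  "os/exec"
)

// run executes a command and streams its output, returning any error.
func run(name string, args ...string) error {
  cmd := exec.Command(name, args...)
  cmd.Stdout = os.Stdout
  cmd.Stderr = os.Stderr
  return cmd.Run()
}

func main() {
  // Step 6: attach a standalone container named "foo" to the overlay network.
  if err := run("docker", "run", "-dit", "--network", "net1", "--name", "foo", "busybox", "sleep", "10000"); err != nil {
    fmt.Println("failed to start foo:", err)
    os.Exit(1)
  }
  // Step 7: from a task container of the "test" service (run this on the
  // worker), check that "foo" resolves via the embedded DNS server.
  task := os.Getenv("SOME_TASK_CONTAINER")
  if err := run("docker", "exec", task, "nslookup", "foo"); err != nil {
    fmt.Println("nslookup foo failed (expected without this PR):", err)
    os.Exit(1)
  }
  fmt.Println("nslookup foo succeeded (expected with this PR)")
}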

Another way to test:

Everything is the same up to step 5, then:
6. Wait 5 minutes
7. On a manager, run: cat docker.log | grep "NetworkDB stats"  # look for netPeers
8. ssh to a worker and run: cat docker.log | grep "NetworkDB stats"  # look for netPeers

The two netPeers values should not match without this PR, and should match with this PR.

@tiborvass (Contributor) commented:

@fcrisciani I tested this PR, and nslookup still fails with this PR.

@tiborvass (Contributor) commented:

@fcrisciani I now managed to confirm this patch fixes the original issue.

@tiborvass (Contributor) commented:

LGTM (IANAM)

@tiborvass (Contributor) commented:

Ping @abhi

@abhi (Contributor) left a comment:

LGTM

@abhi merged commit fc60a75 into moby:bump_17.06 on Jun 27, 2018