Conversation

@fcrisciani

A rapid network leave/join (within the 30 min networkReapTime) can corrupt the per-network node list with multiple copies of the same node.
The fix makes sure that each node is present only once.

Signed-off-by: Flavio Crisciani flavio.crisciani@docker.com
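For context, a minimal sketch of the kind of guard this implies, assuming a NetworkDB struct whose networkNodes field maps a network ID to node names (the method name addNetworkNode here is illustrative, not necessarily what the patch uses):

package networkdb

// Minimal stand-in for the NetworkDB fields referenced in this PR;
// the real struct lives in libnetwork's networkdb package and the map
// is expected to be initialized by its constructor.
type NetworkDB struct {
	networkNodes map[string][]string // network ID -> node names
}

// addNetworkNode appends nodeName to the list for nid only if it is
// not already present, so rapid leave/join cycles cannot create
// duplicate entries for the same node.
func (nDB *NetworkDB) addNetworkNode(nid, nodeName string) {
	for _, n := range nDB.networkNodes[nid] {
		if n == nodeName {
			return // already tracked for this network
		}
	}
	nDB.networkNodes[nid] = append(nDB.networkNodes[nid], nodeName)
}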

@fcrisciani
Author

Also, nDB.networks is not properly cleaned up, but I will take care of that in a separate PR.


 	logrus.Debugf("%s: joined network %s", nDB.config.NodeName, nid)
-	if _, err := nDB.bulkSync(networkNodes, true); err != nil {
+	if _, err := nDB.bulkSync(nDB.networkNodes[nid], true); err != nil {
Contributor

should nDB.bulkSync() be under a mutex?

Contributor

Also, I am not sure nDB.networkNodes[nid] needs to be passed explicitly. I guess the bulkSync function can use networkNodes from the nDB object directly, right? WDYT?

Author

Just checked, my bad: I thought that passing the element read from the map was safe, but it is not.

Definitely yes. I was actually thinking of creating helper functions like getNodesNetwork and getNetworkNodes and deleting all the duplicated code that is around. I will probably fix this and create a new PR for that. Thanks for catching it.
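As a rough sketch of what such a helper could look like (the name getNetworkNodes and the embedded RWMutex are assumptions based on this discussion, not the actual implementation), the idea is to copy the slice while holding the read lock, so that callers like bulkSync never read the map concurrently with writers:

package networkdb

import "sync"

// Sketch of the relevant NetworkDB fields only.
type NetworkDB struct {
	sync.RWMutex
	networkNodes map[string][]string // network ID -> node names
}

// getNetworkNodes returns a copy of the node list for nid, taken under
// the read lock, so the caller can use it without holding the lock.
func (nDB *NetworkDB) getNetworkNodes(nid string) []string {
	nDB.RLock()
	defer nDB.RUnlock()
	nodes := make([]string, len(nDB.networkNodes[nid]))
	copy(nodes, nDB.networkNodes[nid])
	return nodes
}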

@fcrisciani fcrisciani force-pushed the network-db-extra-nodes branch from b8bea36 to 88698b2 on July 11, 2017 at 16:06
@sanimej

sanimej commented Jul 15, 2017

@fcrisciani I tried a quick add/remove of containers from a node and also a quick daemon kill/start on a node, but I am not hitting this issue, so the exact sequence of events is not very clear to me.

What steps are you using exactly? And when you see the issue, does the same node's IP occur multiple times but with different names? We make the node name unique every time a node joins the cluster. If this is what is happening it should be a temporary issue, because of the Peers function we have:

for _, nodeName := range nDB.networkNodes[nid] {
	if node, ok := nDB.nodes[nodeName]; ok {

i.e. a node is printed as a peer only if it exists in nDB.nodes. After an ungraceful daemon restart, memberlist will quickly mark the old node name as no longer alive and will trigger a leave, so we should take it out of nDB.nodes.
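Written out as a small self-contained sketch (field names follow the snippet above; the value type of the nodes map is an assumption for illustration), the filtering works like this:

package networkdb

// nodeEntry is a placeholder for the per-node record kept in nDB.nodes.
type nodeEntry struct {
	Name string
	Addr string
}

// NetworkDB sketch: nodes holds every known node by name, while
// networkNodes tracks which node names joined each network.
type NetworkDB struct {
	nodes        map[string]*nodeEntry
	networkNodes map[string][]string
}

// peerNames reports a node for nid only if it still exists in nDB.nodes,
// so entries removed after an ungraceful restart are filtered out.
func (nDB *NetworkDB) peerNames(nid string) []string {
	peers := make([]string, 0, len(nDB.networkNodes[nid]))
	for _, nodeName := range nDB.networkNodes[nid] {
		if node, ok := nDB.nodes[nodeName]; ok {
			peers = append(peers, node.Name)
		}
	}
	return peers
}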

@fcrisciani
Author

@sanimej just run the test that is in this PR against the master code; that will show you the issue.

@mavenugo
Contributor

@fcrisciani trying the test on master indeed fails as you explained, and it passes with your fix.
The changes LGTM, with one minor change request.

maxRetry := 5
dbs := createNetworkDBInstances(t, 2, "node")

logrus.SetLevel(logrus.DebugLevel)
Contributor

Pls remove this

Author

done

@fcrisciani fcrisciani force-pushed the network-db-extra-nodes branch from 88698b2 to 297c3d4 on July 18, 2017 at 23:58
@mavenugo
Contributor

LGTM

@mavenugo mavenugo merged commit f81e09a into moby:master Jul 19, 2017
@fcrisciani fcrisciani deleted the network-db-extra-nodes branch August 3, 2017 00:30
@fcrisciani fcrisciani restored the network-db-extra-nodes branch August 3, 2017 00:31
@fcrisciani fcrisciani deleted the network-db-extra-nodes branch August 3, 2017 00:34