Conversation

@fcrisciani

A rapid network leave/join (within the 30 min networkReapTime) can corrupt the per-network node list with multiple copies of the same node.
The fix makes sure that each node is present only once.

Signed-off-by: Flavio Crisciani flavio.crisciani@docker.com
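For context, a minimal sketch of the kind of guard this implies, assuming a NetworkDB struct whose networkNodes field maps a network ID to node names (the method name addNetworkNode here is illustrative, not necessarily what the patch uses):

package networkdb

// Minimal stand-in for the NetworkDB fields referenced in this PR;
// the real struct lives in libnetwork's networkdb package and the map
// is expected to be initialized by its constructor.
type NetworkDB struct {
	networkNodes map[string][]string // network ID -> node names
}

// addNetworkNode appends nodeName to the list for nid only if it is
// not already present, so rapid leave/join cycles cannot create
// duplicate entries for the same node.
func (nDB *NetworkDB) addNetworkNode(nid, nodeName string) {
	for _, n := range nDB.networkNodes[nid] {
		if n == nodeName {
			return // already tracked for this network
		}
	}
	nDB.networkNodes[nid] = append(nDB.networkNodes[nid], nodeName)
}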

@fcrisciani
Author

Also, nDB.networks is not properly cleaned up, but I will take care of that in a separate PR.


 	logrus.Debugf("%s: joined network %s", nDB.config.NodeName, nid)
-	if _, err := nDB.bulkSync(networkNodes, true); err != nil {
+	if _, err := nDB.bulkSync(nDB.networkNodes[nid], true); err != nil {
Contributor

should nDB.bulkSync() be under a mutex?

Contributor

Also, I am not sure nDB.networkNodes[nid] needs to be passed explicitly. I guess the bulkSync function can use networkNodes from the nDB object directly, right? WDYT?

Author

Just checked, my bad: I thought that passing the element read from the map was safe, but it is not.

Definitely yes. I was actually thinking of creating helper functions like getNodesNetwork and getNetworkNodes and deleting all the duplicated code that is around. I will probably fix this and create a new PR for that. Thanks for catching it.
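As a rough sketch of what such a helper could look like (the name getNetworkNodes and the embedded RWMutex are assumptions based on this discussion, not the actual implementation), the idea is to copy the slice while holding the read lock, so that callers like bulkSync never read the map concurrently with writers:

package networkdb

import "sync"

// Sketch of the relevant NetworkDB fields only.
type NetworkDB struct {
	sync.RWMutex
	networkNodes map[string][]string // network ID -> node names
}

// getNetworkNodes returns a copy of the node list for nid, taken under
// the read lock, so the caller can use it without holding the lock.
func (nDB *NetworkDB) getNetworkNodes(nid string) []string {
	nDB.RLock()
	defer nDB.RUnlock()
	nodes := make([]string, len(nDB.networkNodes[nid]))
	copy(nodes, nDB.networkNodes[nid])
	return nodes
}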

@fcrisciani fcrisciani force-pushed the network-db-extra-nodes branch from b8bea36 to 88698b2 on July 11, 2017 at 16:06
@sanimej

sanimej commented Jul 15, 2017

@fcrisciani I tried a quick add/remove of containers from a node and also a quick daemon kill/start on a node, but I am not hitting this issue, so the exact sequence of events is not very clear to me.

What steps are you using exactly? And when you see the issue, does the same node's IP occur multiple times but with different names? We make the node name unique every time a node joins the cluster. If this is what is happening it should be a temporary issue, because of the Peers function we have:

for _, nodeName := range nDB.networkNodes[nid] {
	if node, ok := nDB.nodes[nodeName]; ok {

i.e. a node is printed as a peer only if it exists in nDB.nodes. After an ungraceful daemon restart, memberlist will quickly mark the old node name as no longer alive and will trigger a leave, so we should take it out of nDB.nodes.
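Written out as a small self-contained sketch (field names follow the snippet above; the value type of the nodes map is an assumption for illustration), the filtering works like this:

package networkdb

// nodeEntry is a placeholder for the per-node record kept in nDB.nodes.
type nodeEntry struct {
	Name string
	Addr string
}

// NetworkDB sketch: nodes holds every known node by name, while
// networkNodes tracks which node names joined each network.
type NetworkDB struct {
	nodes        map[string]*nodeEntry
	networkNodes map[string][]string
}

// peerNames reports a node for nid only if it still exists in nDB.nodes,
// so entries removed after an ungraceful restart are filtered out.
func (nDB *NetworkDB) peerNames(nid string) []string {
	peers := make([]string, 0, len(nDB.networkNodes[nid]))
	for _, nodeName := range nDB.networkNodes[nid] {
		if node, ok := nDB.nodes[nodeName]; ok {
			peers = append(peers, node.Name)
		}
	}
	return peers
}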

@fcrisciani
Author

@sanimej just run the test that is in this PR against the master code; that will show you the issue.

@mavenugo
Contributor

@fcrisciani trying the test on master indeed fails as you explained, and it passes with your fix.
The changes LGTM, with one minor change request.

maxRetry := 5
dbs := createNetworkDBInstances(t, 2, "node")

logrus.SetLevel(logrus.DebugLevel)
Contributor

Pls remove this

Author

done

@fcrisciani fcrisciani force-pushed the network-db-extra-nodes branch from 88698b2 to 297c3d4 on July 18, 2017 at 23:58
@mavenugo
Contributor

LGTM

@mavenugo mavenugo merged commit f81e09a into moby:master Jul 19, 2017
@fcrisciani fcrisciani deleted the network-db-extra-nodes branch August 3, 2017 00:30
@fcrisciani fcrisciani restored the network-db-extra-nodes branch August 3, 2017 00:31
@fcrisciani fcrisciani deleted the network-db-extra-nodes branch August 3, 2017 00:34