NetworkDB incorrect number of entries in networkNodes #1836
Conversation
Also, nDB.networks is not properly cleaned up, but I will take care of that in a separate PR.
networkdb/networkdb.go (outdated)

```diff
 	logrus.Debugf("%s: joined network %s", nDB.config.NodeName, nid)
-	if _, err := nDB.bulkSync(networkNodes, true); err != nil {
+	if _, err := nDB.bulkSync(nDB.networkNodes[nid], true); err != nil {
```
Should nDB.bulkSync() be under a mutex?
Also, I am not sure nDB.networkNodes[nid] needs to be passed explicitly. I guess the bulkSync function can read networkNodes from the nDB object directly, right? WDYT?
Just checked; my bad, I thought that passing the element for a read was safe, but it is not.
Definitely yes; I was actually thinking of creating helper functions like getNodesNetwork and getNetworkNodes and deleting all the duplicate code that is around. I will probably fix this and create a new PR for that. Thanks for catching it.
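For reference, a minimal sketch of what such a helper could look like (the helper name, the embedded RWMutex, and the copy-under-read-lock approach are assumptions drawn from this discussion, not the final patch):

```go
// getNetworkNodes returns a copy of the node list for a network, taken
// under the read lock, so that callers such as bulkSync do not race with
// concurrent joins/leaves mutating nDB.networkNodes.
func (nDB *NetworkDB) getNetworkNodes(nid string) []string {
	nDB.RLock()
	defer nDB.RUnlock()
	nodes := make([]string, len(nDB.networkNodes[nid]))
	copy(nodes, nDB.networkNodes[nid])
	return nodes
}
```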
Force-pushed b8bea36 to 88698b2
@fcrisciani I tried quick add/remove of containers from a node and also quick daemon kill/start on a node, but I am not hitting this issue, so the exact sequence of events is not very clear to me. What steps are you using exactly? And when you see the issue, does the same node's IP occur multiple times but with different names? We make the node name unique every time a node joins the cluster. If this is what is happening, it should be a temporary issue, because a node is printed as a peer only if it exists in nDB.nodes. After an ungraceful daemon restart, memberlist will quickly mark the old node name as not alive and will trigger a leave, so we should take it out of nDB.nodes.
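(For illustration only: the cleanup path described above might look roughly like the sketch below; handleNodeLeave and the surrounding locking are assumptions, not libnetwork's actual handler.)

```go
// handleNodeLeave sketches the cleanup described above: once memberlist
// declares a node dead, it is removed from nDB.nodes and from every
// per-network node list, so it is no longer reported as a peer.
func (nDB *NetworkDB) handleNodeLeave(name string) {
	nDB.Lock()
	defer nDB.Unlock()
	delete(nDB.nodes, name)
	for nid, nodes := range nDB.networkNodes {
		for i, n := range nodes {
			if n == name {
				nDB.networkNodes[nid] = append(nodes[:i], nodes[i+1:]...)
				break
			}
		}
	}
}
```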
@sanimej Just run the test that is in this PR against the master code; that will show you the issue.
@fcrisciani Trying the test on master indeed fails as you explained, and it passes with your fix.
networkdb/networkdb_test.go (outdated)

```go
	maxRetry := 5
	dbs := createNetworkDBInstances(t, 2, "node")

	logrus.SetLevel(logrus.DebugLevel)
```
Pls remove this
done
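For context, the test under discussion is roughly of this shape (a sketch: createNetworkDBInstances comes from the snippet above; the rapid leave/join loop and the duplicate check are assumptions about the test's intent):

```go
func TestNetworkNodesNoDuplicates(t *testing.T) {
	maxRetry := 5
	dbs := createNetworkDBInstances(t, 2, "node")

	// Rapidly leave and re-join the same network: before the fix, a
	// re-join within networkReapTime could append a duplicate entry.
	for i := 0; i < maxRetry; i++ {
		if err := dbs[0].JoinNetwork("network1"); err != nil {
			t.Fatal(err)
		}
		if err := dbs[0].LeaveNetwork("network1"); err != nil {
			t.Fatal(err)
		}
	}
	if err := dbs[0].JoinNetwork("network1"); err != nil {
		t.Fatal(err)
	}

	// Each node must appear exactly once in the per-network list.
	seen := make(map[string]bool)
	for _, n := range dbs[0].networkNodes["network1"] {
		if seen[n] {
			t.Fatalf("node %s appears more than once in networkNodes", n)
		}
		seen[n] = true
	}
}
```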
Force-pushed 88698b2 to 297c3d4
LGTM
A rapid (within networkReapTime, 30 min) leave/join of a network
can corrupt the list of nodes per network with multiple copies
of the same nodes.
The fix makes sure that each node is present only once.

Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>
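The essence of the fix can be sketched as a de-duplicating insert (function and field names assumed from the discussion, not the exact patch):

```go
// addNetworkNode appends nodeName to the network's node list only if it
// is not already present, so a rapid leave/join within networkReapTime
// cannot leave multiple copies of the same node behind.
func (nDB *NetworkDB) addNetworkNode(nid string, nodeName string) {
	nodes := nDB.networkNodes[nid]
	for _, n := range nodes {
		if n == nodeName {
			return // already tracked for this network
		}
	}
	nDB.networkNodes[nid] = append(nodes, nodeName)
}
```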