Adding a recovery mechanism for a split gossip cluster #2134
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2134 +/- ##
=========================================
Coverage ? 40.49%
=========================================
Files ? 139
Lines ? 22467
Branches ? 0
=========================================
Hits ? 9097
Misses ? 12031
Partials ? 1339
Continue to review full report at Codecov.
networkdb/delegate.go
Outdated
```go
func (d *delegate) LocalState(join bool) []byte {
	if join {
		// Update all the local node/network state to a new time to
		// Update all the local node/network state to a new time to
```
extra space :D
How this happened... Fat finger :)
I'll fix it
networkdb/cluster.go
Outdated
```go
		if _, err := mlist.Join(members); err != nil {
			// In case of failure, keep retrying join until it succeeds or the cluster is shutdown.
			go nDB.retryJoin(members, nDB.stopCh)
			go nDB.retryJoin(nDB.ctx, members)
```
with this change I don't think launching this routine makes sense anymore. If there is a failure here, the next attempt will happen in 60s or less through the other codepath
Good point, clusterJoin is only called when the agent is initialized to join the bootstrap nodes; unless it's time critical and we can't wait 60 seconds, it's now redundant.
Happy to take it out.
networkdb/cluster.go
Outdated
```go
	ctx, cancel := context.WithTimeout(nDB.ctx, 10*time.Second)
	defer cancel()
	nDB.retryJoin(ctx, bootStrapIPs)

```
extra new line
networkdb/cluster.go
Outdated
```go
	nDB.RUnlock()
	logrus.Debugf("rejoinClusterBootStrap, calling cluster join with bootStrap %v", bootStrapIPs)
	// All bootStrap nodes are not in the cluster, call memberlist join
	ctx, cancel := context.WithTimeout(nDB.ctx, 10*time.Second)
```
we should move this 10s timeout to a constant at the top
roger that
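For reference, a minimal sketch of what "at the top" could look like; the constant names below are placeholders I'm assuming, not necessarily the identifiers that ended up in networkdb/cluster.go:

```go
package networkdb

import "time"

// Hypothetical constant names; the actual identifiers in cluster.go may differ.
const (
	// rejoinClientTimeout bounds each attempt to re-contact the bootstrap nodes.
	rejoinClientTimeout = 10 * time.Second
	// rejoinInterval is how often the periodic bootstrap check runs.
	rejoinInterval = 60 * time.Second
)
```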
```go
	}
	// If the node is not known from memberlist we cannot process save any state of it else if it actually
	// dies we won't receive any notification and we will remain stuck with it
	if _, ok := nDB.nodes[nEvent.NodeName]; !ok {
```
Is the nDB.findNode check necessary above [L28] given this check?
only to filter based on the lamport clock
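To make the intent concrete, here is a small, self-contained sketch of that kind of Lamport-clock filtering; the types and names are illustrative stand-ins, not the networkdb code itself:

```go
package main

import "fmt"

// LamportTime is a simple logical clock value (serf.LamportTime plays this
// role in networkdb).
type LamportTime uint64

// nodeState is a stand-in for the per-node state networkdb keeps.
type nodeState struct {
	name  string
	ltime LamportTime
}

// shouldProcess mirrors the idea behind the findNode check above: an event
// whose Lamport time is not newer than the state already recorded for that
// node is stale and gets dropped.
func shouldProcess(known map[string]*nodeState, name string, eventTime LamportTime) bool {
	n, ok := known[name]
	if !ok {
		return true // nothing recorded yet, accept the event
	}
	return eventTime > n.ltime
}

func main() {
	known := map[string]*nodeState{"node-1": {name: "node-1", ltime: 10}}
	fmt.Println(shouldProcess(known, "node-1", 9))  // false: stale event
	fmt.Println(shouldProcess(known, "node-1", 11)) // true: newer event
}
```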
```go
// rejoinClusterBootStrap is called periodically to check if all bootStrap nodes are active in the cluster,
// if not, call the cluster join to merge 2 separate clusters that are formed when all managers
// stopped/started at the same time
func (nDB *NetworkDB) rejoinClusterBootStrap() {
```
Would it make sense to also attempt to refresh nDB.bootStrapIP here through a call to something like GetRemoteAddressList, in case the list has changed?
@fcrisciani and I had a discussion about other use cases that can lead to the same "split cluster" and the possibility of re-checking/updating the bootStrap IPs (through GetRemoteAddressList):
1) Re-IP the managers.
2) Demote/promote managers/workers.
The first one is not an issue, as re-IPing all managers will force all nodes in the cluster to restart.
As for 2), customers will only hit it if they demote all managers in the cluster and then restart those managers without restarting the workers... We felt this is a very rare edge case.
That being said, if you guys think we should still refresh the bootStrapIP, we can add the logic to the newly introduced rejoinClusterBootStrap.
I think the 2 problems can be handled separately. Memberlist, with this PR, will honor the bootstrap IPs that got passed at the beginning.
The second issue will need a periodic check of GetRemoteAddressList; if the list changes, the routine has to call networkDB.Join with the new bootstrap IPs.
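Not part of this PR, but a rough sketch of what that second, separate check could look like; the function name and the resolve callback here are assumptions for illustration, only the Join(members []string) signature comes from networkdb:

```go
package sketch

import (
	"context"
	"reflect"
	"sort"
	"time"
)

// joiner is the small piece of the networkdb surface this sketch assumes:
// NetworkDB.Join(members []string) error fits it.
type joiner interface {
	Join(members []string) error
}

// watchBootstrapList illustrates the idea above: periodically re-resolve the
// manager/bootstrap addresses (e.g. via something like GetRemoteAddressList)
// and, if the list changed, call Join with the new list.
func watchBootstrapList(ctx context.Context, db joiner, resolve func() []string, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	last := resolve()
	sort.Strings(last)
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			current := resolve()
			sort.Strings(current)
			if !reflect.DeepEqual(current, last) {
				// The manager set changed (re-IP, promote/demote, ...):
				// hand the new bootstrap list to networkdb.
				_ = db.Join(current)
				last = current
			}
		}
	}
}
```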
0a10a03 to e0dd4ed (Compare)
ddebroy left a comment
LGTM
networkdb/cluster.go
Outdated
```go
		bootStrapIPs = append(bootStrapIPs, bootIP.String())
	}
	nDB.RUnlock()
	// Not all bootStrap nodes are in the cluster, call memberlist join
```
I think you mean "None of the bootStrap nodes are in the cluster" rather than "Not all ..." in the comment, right?
yep. I will reword it and push the update.
Thx
LGTM, if you fix the comment that @ddebroy mentioned we can merge
Signed-off-by: Dani Louca <dani.louca@docker.com>
Thx guys. PR is updated with the latest change request.
fcrisciani left a comment
LGTM
```go
	bootStrapIPs := make([]string, 0, len(nDB.bootStrapIP))
	for _, bootIP := range nDB.bootStrapIP {
		for _, node := range nDB.nodes {
			if node.Addr.Equal(bootIP) {
```
@dani-docker thinking more about this, I guess it's missing a check that the IP is different from the current node's IP, else this fix won't work for the managers. Every manager will see itself in the list and won't try to reconnect.
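For illustration, a sketch of the scan with that extra check folded in; the helper name and types here are mine, not the eventual fix:

```go
package sketch

import "net"

// clusterNode is a stand-in for the entries networkdb keeps in nDB.nodes.
type clusterNode struct {
	Name string
	Addr net.IP
}

// bootstrapPeerPresent sketches the check being discussed: is at least one
// bootstrap node other than the local node already a cluster member? Without
// the selfAddr exclusion, a manager always finds its own entry and never
// attempts to re-join the other half of a split cluster.
func bootstrapPeerPresent(bootStrapIPs []net.IP, nodes map[string]*clusterNode, selfAddr net.IP) bool {
	for _, bootIP := range bootStrapIPs {
		if bootIP.Equal(selfAddr) {
			// Skip ourselves: seeing our own IP in the bootstrap list
			// says nothing about being connected to the other managers.
			continue
		}
		for _, node := range nodes {
			if node.Addr.Equal(bootIP) {
				return true
			}
		}
	}
	return false
}
```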
Steps To Reproduce:
Expected Results:
Actual Results:
overlay and unpredicted networking issues between containers on the 2 different clusters.
Workaround:
or
/join endpoint
PR fix:
This PR adds a go routine that runs every minute and checks that at least one node from the bootStrap list (i.e. the manager nodes) is part of the cluster; if we couldn't find any, then we attempt a /join for 10 seconds.
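A minimal sketch of that loop follows; retryJoin with a context comes from this PR's diff above, while hasBootstrapMember and the interval/timeout constants are assumptions used only for illustration:

```go
package sketch

import (
	"context"
	"time"
)

// rejoiner is the slice of networkdb behaviour this sketch assumes: a way to
// tell whether any bootstrap node is currently a member, and a join that
// keeps retrying until its context expires.
type rejoiner interface {
	hasBootstrapMember() bool
	retryJoin(ctx context.Context, members []string)
}

// runBootstrapWatcher sketches the recovery loop described above: every
// minute, check whether at least one bootstrap (manager) node is in the
// cluster; if none is, attempt a join for up to 10 seconds.
func runBootstrapWatcher(ctx context.Context, db rejoiner, bootStrapIPs []string) {
	ticker := time.NewTicker(60 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if db.hasBootstrapMember() {
				continue // still connected to at least one manager
			}
			// The gossip clusters have drifted apart: try to merge them back.
			joinCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
			db.retryJoin(joinCtx, bootStrapIPs)
			cancel()
		}
	}
}
```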
Note:
While testing, I ran into an issue that I couldn't reliably reproduce: a leave event was followed by a false positive join event received by a worker when the manager node was down. This caused an inconsistency in nDB.nodes, which caused the logic in this PR to fail. @fcrisciani recommended the below change and he'll be looking into the problem in more detail.
Signed-off-by: Dani Louca dani.louca@docker.com