[bump_v17.06] backport fix deadlock in dispatcher #2753
Merged
There was a rare case where the dispatcher could end up deadlocked when calling stop, which would cause the whole leadership change procedure to go sideways, the dispatcher to pile up with goroutines, and the node to crash.

In a nutshell, calls to the Session RPC end up in a (*Cond).Wait(), waiting for a Broadcast that, once Stop is called, may never come. To avoid that case, Stop, after being called and canceling the Dispatcher context, does one final Broadcast to wake the sleeping waiters.

However, the rpcRW lock, which stops Stop from proceeding until all RPCs have returned, was previously obtained BEFORE the call to Broadcast. Stop would therefore never reach this final Broadcast call: it was waiting on the Session RPCs to release the rpcRW lock, which they could not do until Broadcast was called. Hence, deadlock.

To fix this, we simply have to move this final Broadcast to above the attempt to acquire the rpcRW lock, allowing everything to proceed correctly.

Signed-off-by: Drew Erny <drew.erny@docker.com>
@anshulpundir there's a race detector failure in the test...
Codecov Report
@@ Coverage Diff @@
## bump_v17.06 #2753 +/- ##
==============================================
- Coverage 61.28% 61.1% -0.19%
==============================================
Files 121 121
Lines 20215 20172 -43
==============================================
- Hits 12389 12326 -63
- Misses 6452 6491 +39
+ Partials 1374 1355 -19
wk8 approved these changes on Sep 20, 2018
Backport of #2744
cherry-pick was clean.