peer: make PingManager disconnect call async #8385
Conversation
|
I think before we merge this we should have an itest that showed the prior deadlock issue, which is then resolved by this PR. Rather than add some new build tag, I think we can instead spin up a skeleton server (just for p2p connections), then have it pass in a purposefully blocking call back into the ping manager. |
WalkthroughThe update to the Changes
Assessment against linked issues
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configration File (
|
|
@coderabbitai review |
There was a problem hiding this comment.
Review Status
Actionable comments generated: 0
Configuration used: CodeRabbit UI
Files selected for processing (1)
- peer/ping_manager.go (2 hunks)
Additional comments: 7
peer/ping_manager.go (7)
- 102-106: The initialization of
pingTicker,pingTimeout, and the addition of the goroutine forpingHandlerin theStartmethod are correctly moved from the oldstartmethod.- 112-197: The
pingHandlermethod has been refactored to handle the ping/pong protocol enforcement. The logic appears to be correctly structured to handle the ping cycle, timeout, and pong reception.- 98-201: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [198-217]
Verify that the
Stopmethod correctly interrupts the goroutines owned byPingManagerafter the refactoring. This is crucial to ensure that the deadlock issue the PR aims to resolve is indeed fixed.
- 13-18: > Note: This review was outside the patches, and no patch overlapping with it was found. Original lines [218-235]
The
setPingStatemethod's logic remains unchanged and is correctly used within thepingHandlermethod to track the state of outgoing pings.
- 13-18: > Note: This review was outside the patches, and no patch overlapping with it was found. Original lines [236-251]
The
resetPingStatemethod's logic remains unchanged and is correctly used within thepingHandlermethod to reset the state after a ping cycle completes or a pong is received.
- 13-18: > Note: This review was outside the patches, and no patch overlapping with it was found. Original lines [252-262]
The
GetPingTimeMicroSecondsmethod correctly reports the RTT calculated by thepingManager. No changes were made to this method in the PR.
- 13-18: > Note: This review was outside the patches, and no patch overlapping with it was found. Original lines [263-271]
The
ReceivedPongmethod correctly handles incoming pong messages and interacts with thepongChanchannel, which is used within thepingHandlermethod.
| "in unclean state", | ||
| ) | ||
| m.cfg.OnPongFailure(e) | ||
| go m.cfg.OnPongFailure(e) |
There was a problem hiding this comment.
could instead leave this as is but change the actual call back implementation to do things async. Since the caller defines the call-back, it feels like the caller should be responsible for making it run async if needed
There was a problem hiding this comment.
Sorta two ways about it: the core issue here is that we made an assumption about the call back that led to the deadlock in the first place. This also isn't the first time we've run into an issue like this. Last year we had a few instances of a circular call happening in the peer during init (eg: go to fetch latest node ann, calls into server, tries to grab mutex already held) that would've been resolved by minimizing assumptions w.r.t the synchronicity of a passed call back.
Callback aside, I think generally a better way to structure code like this would be to instead have the ping manager signal another sub-system that the expectation (pong not recv'd in time) has been violated in an async manner. This way, we don't need to care about disconnecting only that the signal to disconnect has been sent.
There was a problem hiding this comment.
Review Status
Actionable comments generated: 2
Configuration used: CodeRabbit UI
Files selected for processing (2)
- peer/brontide.go (1 hunks)
- peer/ping_manager.go (2 hunks)
Additional comments: 7
peer/ping_manager.go (6)
- 101-106: The initialization of
pingTicker,pingTimeout, and the call topingHandlerhave been moved from thestartmethod to theStartmethod. This change is consistent with the summary and seems to be a refactoring to improve the initialization process. Ensure that theStartmethod is called appropriately wherever thePingManageris instantiated.- 112-197: The
startmethod has been refactored intopingHandler. This method enforces the ping/pong protocol and contains logic for handling timeouts and pong responses. The refactoring appears to be done to address the deadlock issue by making the disconnection process asynchronous. Ensure that thepingHandlermethod is robust and correctly handles all possible cases, including timeouts and incorrect pong responses.- 146-148: The error handling for
setPingStateis correct, but ensure that theOnPongFailurecallback is designed to handle errors asynchronously to avoid the deadlock issue this PR is intended to solve.Verification successful
The
OnPongFailurecallback within theNewBrontidefunction inpeer/brontide.gois designed to handle errors asynchronously. It logs a warning and disconnects from the peer using a new goroutine, ensuring that the operation does not block the current execution flow. The tests inpeer/ping_manager_test.gofurther confirm the asynchronous behavior of theOnPongFailurecallback by checking that it is called correctly under various conditions.* 155-164: The timeout handling logic here is straightforward and correct. However, ensure that the `OnPongFailure` callback does not introduce any potential deadlocks when handling this timeout error. * 178-185: The check for the pong size matching the expected size is correct. However, ensure that the `OnPongFailure` callback is properly handling this error case asynchronously. * 191-192: Storing the RTT is a good practice for performance monitoring. Ensure that the atomic operation is thread-safe and that the stored RTT is used appropriately elsewhere in the codebase.Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Verify that OnPongFailure handles errors asynchronously. ast-grep --lang go --pattern $'OnPongFailure($_)'Length of output: 50
Script:
#!/bin/bash # Search for the string "OnPongFailure" to find references in the codebase. rg "OnPongFailure"Length of output: 563
Script:
#!/bin/bash # Examine the implementation of OnPongFailure in peer/brontide.go and peer/ping_manager_test.go cat peer/brontide.go cat peer/ping_manager_test.goLength of output: 131180
peer/brontide.go (1)
- 579-579: The change to make the
Disconnectmethod call asynchronous is in line with the PR's objectives to prevent deadlocks by avoiding circular waiting scenarios. However, ensure that the asynchronous disconnection is properly handled and does not lead to any race conditions or leaks of goroutines.
497e269 to
41ee461
Compare
There was a problem hiding this comment.
Review Status
Actionable comments generated: 0
Configuration used: CodeRabbit UI
Commits
Files that changed from the base of the PR and between 1bc1295 and 41ee461420144aabe3449b6d3b6cc2d1bc68efe1.Files selected for processing (2)
- peer/brontide.go (1 hunks)
- peer/ping_manager.go (2 hunks)
Files skipped from review as they are similar to previous changes (1)
- peer/brontide.go
Additional comments: 7
peer/ping_manager.go (7)
- 101-105: The
Startmethod now initializes thepingTicker,pingTimeout, and starts thepingHandlergoroutine. This change centralizes the initialization logic within theStartmethod, which is a good practice for maintainability and readability.- 111-196: The
pingHandlermethod has been refactored from the originalstartmethod. It enforces the ping/pong protocol and handles the ping cycle, timeout, and pong responses. The refactoring appears to be done to address the deadlock issue by making the disconnection process asynchronous. Ensure that the logic within this method correctly handles all cases and that the asynchronous behavior does not introduce new race conditions or logic errors.- 129-136: The error handling for an "impossible" state where a new ping begins while awaiting a pong response is to return an error and stop the handler. This is a change from the previous logic and should be verified to ensure it aligns with the intended error handling strategy for the application.
Verification successful
The error handling for the "impossible" state in
ping_manager.go, where an error is returned and the handler is stopped, is consistent with the application's overall strategy for handling unexpected situations. The term "impossible" is used across the codebase to describe scenarios that should not occur under normal operation, and the typical response includes logging a warning, returning an error, or bypassing certain checks, which is in line with the observed pattern inping_manager.go.* 144-146: The `setPingState` method is called to set up bookkeeping for a new Ping. It is important to ensure that the error handling within this method is consistent with the rest of the application and that the state is correctly managed to prevent any potential race conditions. * 174-184: The check for a pong response that does not match the expected size is a critical part of the protocol enforcement. Ensure that this logic is robust and that the error handling is appropriate for the application's needs.Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Verify that the error handling strategy for the "impossible" state is consistent with the application's overall strategy. # This script searches for similar error handling patterns to compare strategies. rg --type go "impossible" --files-with-matchesLength of output: 281
Script:
#!/bin/bash # Examine a few instances of the "impossible" error handling pattern to understand the typical strategy used in the application. # Selecting a few files from the list to check the context of "impossible" usage. rg --type go -C 5 "impossible" rpcperms/interceptor.go rg --type go -C 5 "impossible" routing/pathfind.go rg --type go -C 5 "impossible" peer/brontide.go rg --type go -C 5 "impossible" channeldb/channel.goLength of output: 3305
Verification successful
The verification process has confirmed that the pong size check within the
PingManageris robust and includes a specific test case for handling a bad pong size. Additionally, theOnPongFailurecallback is implemented to log a warning and disconnect the peer, which is an appropriate error handling strategy for the application. Based on this, the review comment has been verified as accurate.* 186-191: The calculation of RTT (round-trip-time) is an important metric for network performance. Ensure that the storage and retrieval of this metric are thread-safe and that the atomic operations are used correctly. * 193-195: The case for `<-m.quit` in the select statement is used to stop the `pingHandler` goroutine. Ensure that the quit channel is correctly managed across the application to prevent leaks or premature shutdowns.Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Verify that the pong size check is robust and error handling is appropriate. # This script searches for similar error handling patterns to compare strategies. rg --type go "pong response does not match expected size" --files-with-matchesLength of output: 80
Script:
#!/bin/bash # Find the definition and usage of the OnPongFailure callback to understand how errors are handled. ast-grep --lang go --pattern 'func ($_ *$_) OnPongFailure($_ error) { $$$ }' # Search for tests that might be covering the pong size check scenario. rg --type go "OnPongFailure" --files-with-matchesLength of output: 192
Script:
#!/bin/bash # Examine the test file for tests related to the pong size check. cat peer/ping_manager_test.go # Look into brontide.go to understand the implementation and usage of OnPongFailure. cat peer/brontide.goLength of output: 131180
Script:
#!/bin/bash # Examine the test file for tests related to the pong size check. cat peer/ping_manager_test.go # Look into brontide.go to understand the implementation and usage of OnPongFailure. cat peer/brontide.goLength of output: 131180
There was a problem hiding this comment.
Review Status
Actionable comments generated: 0
Configuration used: CodeRabbit UI
Commits
Files that changed from the base of the PR and between 41ee461420144aabe3449b6d3b6cc2d1bc68efe1 and d41a91f98681304e2c665065b1633614ebd60b69.Files selected for processing (1)
- docs/release-notes/release-notes-0.18.0.md (1 hunks)
Additional comments: 1
docs/release-notes/release-notes-0.18.0.md (1)
- 68-70: The addition to the release notes correctly reflects the changes made in PR #8385, which address the deadlock issue by making ping failures not deadlock the peer connection.
ellemouton
left a comment
There was a problem hiding this comment.
LGTM 🔥
main things to address: can squash 2 of the commits into 1 & need to move release notes to 0.17.4
In this commit, we make all calls to disconnect after a ping/pong violation is detected in the `PingManager` async. We do this to avoid circular waiting that may occur if the disconnect call back ends up waiting on the peer goroutine to be torn down. If this happens, then the peer goroutine will be blocked on the ping manager fully tearing down, which is blocked on the peer disconnect succeeding. This is a similar class of issue we've delt with recently as pertains to the peer and the server: sync all back execution must not lead to a circular waiting loop. Fixes lightningnetwork#8379
The error was never used as the init couldn't return an error, so we do away with that. We also modify the main event loop dispatch to more closely match other areas of the codebase.
d41a91f to
359f271
Compare
There was a problem hiding this comment.
Review Status
Actionable comments generated: 0
Configuration used: CodeRabbit UI
Files selected for processing (3)
- docs/release-notes/release-notes-0.17.4.md (2 hunks)
- peer/brontide.go (1 hunks)
- peer/ping_manager.go (2 hunks)
Files skipped from review as they are similar to previous changes (1)
- peer/brontide.go
Additional comments: 4
docs/release-notes/release-notes-0.17.4.md (1)
- 32-34: The release notes mention the PR that prevents ping failures from deadlocking the peer connection. This aligns with the PR objectives and the AI-generated summaries.
peer/ping_manager.go (3)
- 101-105: The
Startmethod has been correctly updated to initialize thepingTicker,pingTimeout, and start thepingHandlergoroutine within async.Onceblock to ensure it only runs once.- 111-195: The
pingHandlermethod has been refactored and now contains the logic previously in thestartmethod. It correctly handles the ping/pong protocol, including error handling and state resetting. The use of channels and timers appears to be correct and should prevent the deadlock issue as described in the PR objectives.- 13-18: > 📝 NOTE
This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [1-10]
The
PingManagerConfigstruct still contains theNewPingPayloadfunction, which contradicts the previous comment from the bot that it was removed. This was already clarified in the conversation with ProofOfKeags, so no further action is needed here.
|
@ProofOfKeags @ellemouton updated to squash change into first commit (ended up just modifying it as it was more pleasing to the Rebase Gods), and updated the release notes to point to v0.17.4. |
In this commit, we make all calls to disconnect after a ping/pong
violation is detected in the
PingManagerasync. We do this to avoidcircular waiting that may occur if the disconnect call back ends up
waiting on the peer goroutine to be torn down. If this happens, then the
peer goroutine will be blocked on the ping manager fully tearing down,
which is blocked on the peer disconnect succeeding.
This is a similar class of issue we've delt with recently as pertains to
the peer and the server: sync all back execution must not lead to
a circular waiting loop.
Alternatively, we could have the internal of the call back itself be
async, but I prefer this route as it minimizes assumptions.
Fixes #8379
Summary by CodeRabbit
Disconnectmethod call in theNewBrontidefunction to likely spawn a new goroutine.FilterKnownChanIDs.