peer: make PingManager disconnect call async by Roasbeef · Pull Request #8385 · lightningnetwork/lnd

Roasbeef · 2024-01-15T21:35:30Z

In this commit, we make all calls to disconnect after a ping/pong
violation is detected in the PingManager async. We do this to avoid
circular waiting that may occur if the disconnect callback ends up
waiting on the peer goroutine to be torn down. If this happens, then the
peer goroutine will be blocked on the ping manager fully tearing down,
which is blocked on the peer disconnect succeeding.

This is a similar class of issue to ones we've dealt with recently between
the peer and the server: sync callback execution must not lead to
a circular waiting loop.

Alternatively, we could have the internals of the callback itself be
async, but I prefer this route as it minimizes assumptions.

Fixes #8379

Summary by CodeRabbit

Refactor
- Improved the ping management system for enhanced protocol enforcement.
- Modified the Disconnect method call in the NewBrontide function to run behind a `go` statement, likely spawning a new goroutine.
Bug Fixes
- Prevented ping failures from deadlocking the peer connection.
- Fixed a mutex acquisition issue in FilterKnownChanIDs.

Roasbeef · 2024-01-15T23:04:52Z

I think before we merge this we should have an itest that reproduces the prior deadlock, which this PR then resolves.

Rather than add some new build tag, I think we can instead spin up a skeleton server (just for p2p connections), then have it pass a purposefully blocking callback into the ping manager.

coderabbitai · 2024-01-17T01:31:23Z

Walkthrough

The update to the PingManager addresses a critical deadlock issue by refining the shutdown process. It restructures the handling of pings and the management of goroutines and timers, aiming to ensure a graceful termination without leaving any routines hanging, thus preventing the deadlock scenario previously encountered.

Changes

| File(s) | Summary of Changes |
| --- | --- |
| `peer/ping_manager.go` | Refactored to include initialization of timers and goroutines, and introduced `pingHandler` for ping/pong protocol management. |
| `peer/brontide.go` | Modified `NewBrontide` function to add a `go` statement before the `Disconnect` method call, likely spawning a new goroutine for the operation. |
| `docs/release-notes/release-notes-0.18.0.md` | Updated to prevent ping failures from deadlocking the peer connection. |

Assessment against linked issues

| Objective | Addressed | Explanation |
| --- | --- | --- |
| Resolve deadlocks in `PingManager` (#8379) | ✅ | |
| Proper shutdown of `PingManager` to prevent waiting on `WaitGroup` (#8379) | ✅ | |
| Ability to consume the quit channel of `PingManager` (#8379) | ✅ | |
| Verify changes with @ProofOfKeags (#8379) | ❓ | The code changes seem to address the issue, but it's unclear if they have been verified with the user @ProofOfKeags as required. |

Poem

In the realm of code, a deadlock's a fright,
'Twas the PingManager's bane, both day and night.
But with refactor's charm and a coder's delight,
Now graceful shutdowns make all just right. 🐇💻✨


Roasbeef · 2024-01-17T01:31:26Z

@coderabbitai review

coderabbitai

Review Status

Actionable comments generated: 0

Configuration used: CodeRabbit UI

Commits

Files that changed from the base of the PR and between 76ae3e4 and 0e7583f.

Files selected for processing (1)

peer/ping_manager.go (2 hunks)

Additional comments: 7

peer/ping_manager.go (7)

102-106: The initialization of pingTicker, pingTimeout, and the addition of the goroutine for pingHandler in the Start method are correctly moved from the old start method.

112-197: The pingHandler method has been refactored to handle the ping/pong protocol enforcement. The logic appears to be correctly structured to handle the ping cycle, timeout, and pong reception.

98-201: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [198-217]

Verify that the Stop method correctly interrupts the goroutines owned by PingManager after the refactoring. This is crucial to ensure that the deadlock issue the PR aims to resolve is indeed fixed.

13-18: > Note: This review was outside the patches, and no patch overlapping with it was found. Original lines [218-235]

The setPingState method's logic remains unchanged and is correctly used within the pingHandler method to track the state of outgoing pings.

13-18: > Note: This review was outside the patches, and no patch overlapping with it was found. Original lines [236-251]

The resetPingState method's logic remains unchanged and is correctly used within the pingHandler method to reset the state after a ping cycle completes or a pong is received.

13-18: > Note: This review was outside the patches, and no patch overlapping with it was found. Original lines [252-262]

The GetPingTimeMicroSeconds method correctly reports the RTT calculated by the pingManager. No changes were made to this method in the PR.

13-18: > Note: This review was outside the patches, and no patch overlapping with it was found. Original lines [263-271]

The ReceivedPong method correctly handles incoming pong messages and interacts with the pongChan channel, which is used within the pingHandler method.

ellemouton · 2024-01-17T15:05:51Z

 						"in unclean state",
 					)
-					m.cfg.OnPongFailure(e)
+					go m.cfg.OnPongFailure(e)


could instead leave this as is but change the actual callback implementation to do things async. Since the caller defines the callback, it feels like the caller should be responsible for making it run async if needed

Sorta two ways about it: the core issue here is that we made an assumption about the callback that led to the deadlock in the first place. This also isn't the first time we've run into an issue like this. Last year we had a few instances of a circular call happening in the peer during init (e.g. go to fetch latest node ann, calls into server, tries to grab mutex already held) that would've been resolved by minimizing assumptions w.r.t. the synchronicity of a passed callback.

Callback aside, I think generally a better way to structure code like this would be to instead have the ping manager signal another sub-system, in an async manner, that the expectation (pong not recv'd in time) has been violated. This way, we don't need to care about disconnecting, only that the signal to disconnect has been sent.

coderabbitai

Review Status

Actionable comments generated: 2

Configuration used: CodeRabbit UI

Commits

Files that changed from the base of the PR and between 0e7583f and 497e269.

Files selected for processing (2)

peer/brontide.go (1 hunks)
peer/ping_manager.go (2 hunks)

Additional comments: 7

peer/ping_manager.go (6)
101-106: The initialization of pingTicker, pingTimeout, and the call to pingHandler have been moved from the start method to the Start method. This change is consistent with the summary and seems to be a refactoring to improve the initialization process. Ensure that the Start method is called appropriately wherever the PingManager is instantiated.

112-197: The start method has been refactored into pingHandler. This method enforces the ping/pong protocol and contains logic for handling timeouts and pong responses. The refactoring appears to be done to address the deadlock issue by making the disconnection process asynchronous. Ensure that the pingHandler method is robust and correctly handles all possible cases, including timeouts and incorrect pong responses.

146-148: The error handling for setPingState is correct, but ensure that the OnPongFailure callback is designed to handle errors asynchronously to avoid the deadlock issue this PR is intended to solve.

Verification successful

The OnPongFailure callback within the NewBrontide function in peer/brontide.go is designed to handle errors asynchronously. It logs a warning and disconnects from the peer using a new goroutine, ensuring that the operation does not block the current execution flow. The tests in peer/ping_manager_test.go further confirm the asynchronous behavior of the OnPongFailure callback by checking that it is called correctly under various conditions.
Scripts Executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Verify that OnPongFailure handles errors asynchronously.
ast-grep --lang go --pattern $'OnPongFailure($_)'
Length of output: 50

Script:
#!/bin/bash
# Search for the string "OnPongFailure" to find references in the codebase.
rg "OnPongFailure"
Length of output: 563

Script:
#!/bin/bash
# Examine the implementation of OnPongFailure in peer/brontide.go and peer/ping_manager_test.go
cat peer/brontide.go
cat peer/ping_manager_test.go
Length of output: 131180
155-164: The timeout handling logic here is straightforward and correct. However, ensure that the `OnPongFailure` callback does not introduce any potential deadlocks when handling this timeout error.

178-185: The check for the pong size matching the expected size is correct. However, ensure that the `OnPongFailure` callback is properly handling this error case asynchronously.

191-192: Storing the RTT is a good practice for performance monitoring. Ensure that the atomic operation is thread-safe and that the stored RTT is used appropriately elsewhere in the codebase.

peer/brontide.go (1)

579-579: The change to make the Disconnect method call asynchronous is in line with the PR's objectives to prevent deadlocks by avoiding circular waiting scenarios. However, ensure that the asynchronous disconnection is properly handled and does not lead to any race conditions or leaks of goroutines.

coderabbitai

Review Status

Actionable comments generated: 0

Configuration used: CodeRabbit UI

Commits

Files that changed from the base of the PR and between 1bc1295 and 41ee461420144aabe3449b6d3b6cc2d1bc68efe1.

Files selected for processing (2)

peer/brontide.go (1 hunks)
peer/ping_manager.go (2 hunks)

Files skipped from review as they are similar to previous changes (1)

peer/brontide.go

Additional comments: 7

peer/ping_manager.go (7)
101-105: The Start method now initializes the pingTicker, pingTimeout, and starts the pingHandler goroutine. This change centralizes the initialization logic within the Start method, which is a good practice for maintainability and readability.

111-196: The pingHandler method has been refactored from the original start method. It enforces the ping/pong protocol and handles the ping cycle, timeout, and pong responses. The refactoring appears to be done to address the deadlock issue by making the disconnection process asynchronous. Ensure that the logic within this method correctly handles all cases and that the asynchronous behavior does not introduce new race conditions or logic errors.

129-136: The error handling for an "impossible" state where a new ping begins while awaiting a pong response is to return an error and stop the handler. This is a change from the previous logic and should be verified to ensure it aligns with the intended error handling strategy for the application.

Verification successful

The error handling for the "impossible" state in ping_manager.go, where an error is returned and the handler is stopped, is consistent with the application's overall strategy for handling unexpected situations. The term "impossible" is used across the codebase to describe scenarios that should not occur under normal operation, and the typical response includes logging a warning, returning an error, or bypassing certain checks, which is in line with the observed pattern in ping_manager.go.
Scripts Executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Verify that the error handling strategy for the "impossible" state is consistent with the application's overall strategy.
# This script searches for similar error handling patterns to compare strategies.
rg --type go "impossible" --files-with-matches
Length of output: 281

Script:
#!/bin/bash
# Examine a few instances of the "impossible" error handling pattern to understand the typical strategy used in the application.
# Selecting a few files from the list to check the context of "impossible" usage.
rg --type go -C 5 "impossible" rpcperms/interceptor.go
rg --type go -C 5 "impossible" routing/pathfind.go
rg --type go -C 5 "impossible" peer/brontide.go
rg --type go -C 5 "impossible" channeldb/channel.go
Length of output: 3305
144-146: The `setPingState` method is called to set up bookkeeping for a new Ping. It is important to ensure that the error handling within this method is consistent with the rest of the application and that the state is correctly managed to prevent any potential race conditions.

174-184: The check for a pong response that does not match the expected size is a critical part of the protocol enforcement. Ensure that this logic is robust and that the error handling is appropriate for the application's needs.

Verification successful

The verification process has confirmed that the pong size check within the PingManager is robust and includes a specific test case for handling a bad pong size. Additionally, the OnPongFailure callback is implemented to log a warning and disconnect the peer, which is an appropriate error handling strategy for the application. Based on this, the review comment has been verified as accurate.
Scripts Executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Verify that the pong size check is robust and error handling is appropriate.
# This script searches for similar error handling patterns to compare strategies.
rg --type go "pong response does not match expected size" --files-with-matches
Length of output: 80

Script:
#!/bin/bash
# Find the definition and usage of the OnPongFailure callback to understand how errors are handled.
ast-grep --lang go --pattern 'func ($_ *$_) OnPongFailure($_ error) { $$$ }'

# Search for tests that might be covering the pong size check scenario.
rg --type go "OnPongFailure" --files-with-matches
Length of output: 192

Script:
#!/bin/bash
# Examine the test file for tests related to the pong size check.
cat peer/ping_manager_test.go

# Look into brontide.go to understand the implementation and usage of OnPongFailure.
cat peer/brontide.go
Length of output: 131180

186-191: The calculation of RTT (round-trip-time) is an important metric for network performance. Ensure that the storage and retrieval of this metric are thread-safe and that the atomic operations are used correctly.

193-195: The case for `<-m.quit` in the select statement is used to stop the `pingHandler` goroutine. Ensure that the quit channel is correctly managed across the application to prevent leaks or premature shutdowns.

coderabbitai

Review Status

Actionable comments generated: 0

Configuration used: CodeRabbit UI

Commits

Files that changed from the base of the PR and between 41ee461420144aabe3449b6d3b6cc2d1bc68efe1 and d41a91f98681304e2c665065b1633614ebd60b69.

Files selected for processing (1)

docs/release-notes/release-notes-0.18.0.md (1 hunks)

Additional comments: 1

docs/release-notes/release-notes-0.18.0.md (1)

68-70: The addition to the release notes correctly reflects the changes made in PR #8385, which address the deadlock issue by making ping failures not deadlock the peer connection.

ellemouton

LGTM 🔥

main things to address: can squash 2 of the commits into 1 & need to move release notes to 0.17.4

In this commit, we make all calls to disconnect after a ping/pong violation is detected in the `PingManager` async. We do this to avoid circular waiting that may occur if the disconnect callback ends up waiting on the peer goroutine to be torn down. If this happens, then the peer goroutine will be blocked on the ping manager fully tearing down, which is blocked on the peer disconnect succeeding. This is a similar class of issue to ones we've dealt with recently between the peer and the server: sync callback execution must not lead to a circular waiting loop. Fixes lightningnetwork#8379

The error was never used as the init couldn't return an error, so we do away with that. We also modify the main event loop dispatch to more closely match other areas of the codebase.

coderabbitai

Review Status

Actionable comments generated: 0

Configuration used: CodeRabbit UI

Commits

Files that changed from the base of the PR and between 41c167d and 359f271.

Files selected for processing (3)

docs/release-notes/release-notes-0.17.4.md (2 hunks)
peer/brontide.go (1 hunks)
peer/ping_manager.go (2 hunks)

Files skipped from review as they are similar to previous changes (1)

peer/brontide.go

Additional comments: 4

docs/release-notes/release-notes-0.17.4.md (1)

32-34: The release notes mention the PR that prevents ping failures from deadlocking the peer connection. This aligns with the PR objectives and the AI-generated summaries.

peer/ping_manager.go (3)

101-105: The Start method has been correctly updated to initialize the pingTicker, pingTimeout, and start the pingHandler goroutine within a sync.Once block to ensure it only runs once.

111-195: The pingHandler method has been refactored and now contains the logic previously in the start method. It correctly handles the ping/pong protocol, including error handling and state resetting. The use of channels and timers appears to be correct and should prevent the deadlock issue as described in the PR objectives.

13-18: > 📝 NOTE

This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [1-10]

The PingManagerConfig struct still contains the NewPingPayload function, which contradicts the previous comment from the bot that it was removed. This was already clarified in the conversation with ProofOfKeags, so no further action is needed here.

Roasbeef · 2024-01-23T02:24:26Z

@ProofOfKeags @ellemouton updated to squash change into first commit (ended up just modifying it as it was more pleasing to the Rebase Gods), and updated the release notes to point to v0.17.4.

Roasbeef added the p2p (Code related to the peer-to-peer behaviour) and bug fix labels Jan 15, 2024

guggero reviewed Jan 16, 2024


Comment thread peer/ping_manager.go Outdated

coderabbitai Bot reviewed Jan 17, 2024


Roasbeef added the llm-review (add to a PR to have an LLM bot review it) label Jan 17, 2024

ProofOfKeags suggested changes Jan 17, 2024


Comment thread peer/ping_manager.go

Comment thread peer/ping_manager.go Outdated

ellemouton reviewed Jan 17, 2024


coderabbitai Bot reviewed Jan 17, 2024


Comment thread peer/ping_manager.go

Comment thread peer/ping_manager.go

ProofOfKeags force-pushed the ping-async-dc branch from 497e269 to 41ee461 Compare January 18, 2024 01:59

coderabbitai Bot reviewed Jan 18, 2024


Roasbeef mentioned this pull request Jan 18, 2024

[bug?]: batchopenchannel hangs indefinitely #8362

Closed

Roasbeef added this to the v0.17.4 milestone Jan 18, 2024

saubyk assigned Roasbeef Jan 21, 2024

ellemouton approved these changes Jan 22, 2024


Comment thread peer/brontide.go Outdated

Comment thread docs/release-notes/release-notes-0.18.0.md Outdated

Comment thread peer/ping_manager.go Outdated

Roasbeef and others added 3 commits January 22, 2024 18:20

peer: refactor main event loop for ping handler

185119f

The error was never used as the init couldn't return an error, so we do away with that. We also modify the main event loop dispatch to more closely match other areas of the codebase.

docs: update release notes

359f271

Roasbeef force-pushed the ping-async-dc branch from d41a91f to 359f271 Compare January 23, 2024 02:22

coderabbitai Bot reviewed Jan 23, 2024


ProofOfKeags approved these changes Jan 23, 2024


ellemouton merged commit 51de320 into lightningnetwork:master Jan 23, 2024

Roasbeef mentioned this pull request Jan 23, 2024

release: create v0.17.4 rc1 release branch #8416

Merged
