peer: make PingManager disconnect call async #8385

Merged
ellemouton merged 3 commits into lightningnetwork:master from Roasbeef:ping-async-dc
Jan 23, 2024

Conversation

@Roasbeef
Member

@Roasbeef Roasbeef commented Jan 15, 2024

In this commit, we make all calls to disconnect after a ping/pong
violation is detected in the PingManager async. We do this to avoid
circular waiting that may occur if the disconnect call back ends up
waiting on the peer goroutine to be torn down. If this happens, then the
peer goroutine will be blocked on the ping manager fully tearing down,
which is blocked on the peer disconnect succeeding.

This is a similar class of issue to ones we've dealt with recently as pertains to
the peer and the server: sync call back execution must not lead to
a circular waiting loop.

Alternatively, we could have the internals of the call back itself be
async, but I prefer this route as it minimizes assumptions.

Fixes #8379
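
The circular wait described above can be reproduced in isolation. The following is a minimal sketch, not lnd's code: the `manager` type, its fields, and `stopsInTime` are hypothetical names chosen for illustration. The failure callback calls `stop`, which waits for the very goroutine that is running the callback, so the synchronous variant never finishes, while the `go` variant does.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// manager is a hypothetical stand-in for a ping manager: it owns one
// goroutine and reports a violation through a caller-supplied callback.
type manager struct {
	wg        sync.WaitGroup
	quit      chan struct{}
	stopOnce  sync.Once
	onFailure func(error)
	async     bool
}

// start launches the handler goroutine, which immediately reports a failure
// and then waits for the quit signal.
func (m *manager) start() {
	m.wg.Add(1)
	go func() {
		defer m.wg.Done()
		err := errors.New("pong timeout")
		if m.async {
			go m.onFailure(err) // callback runs outside the handler
		} else {
			m.onFailure(err) // callback runs inside the handler
		}
		<-m.quit
	}()
}

// stop signals the handler to exit and waits until it has done so.
func (m *manager) stop() {
	m.stopOnce.Do(func() { close(m.quit) })
	m.wg.Wait()
}

// stopsInTime wires up a callback that tears the manager down (as a peer
// disconnect might) and reports whether teardown completes within 200ms.
func stopsInTime(async bool) bool {
	m := &manager{quit: make(chan struct{}), async: async}
	done := make(chan struct{})
	m.onFailure = func(error) {
		m.stop()
		close(done)
	}
	m.start()
	select {
	case <-done:
		return true
	case <-time.After(200 * time.Millisecond):
		return false
	}
}

func main() {
	fmt.Println("sync callback shuts down:", stopsInTime(false))
	fmt.Println("async callback shuts down:", stopsInTime(true))
}
```

In the synchronous case the handler goroutine is deliberately left blocked, which is the deadlock being illustrated; the sketch only detects it with a timeout.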

Summary by CodeRabbit

  • Refactor
    • Improved the ping management system for enhanced protocol enforcement.
    • Modified the Disconnect call in the NewBrontide function to run in a new goroutine.
  • Bug Fixes
    • Prevented ping failures from deadlocking the peer connection.
    • Fixed a mutex acquisition issue in FilterKnownChanIDs.

@Roasbeef Roasbeef added the p2p (Code related to the peer-to-peer behaviour) and bug fix labels Jan 15, 2024
@Roasbeef
Member Author

I think before we merge this we should have an itest that showed the prior deadlock issue, which is then resolved by this PR.

Rather than add some new build tag, I think we can instead spin up a skeleton server (just for p2p connections), then have it pass in a purposefully blocking call back into the ping manager.
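
One way such a check could look, sketched outside of lnd's test framework: a stripped-down, hypothetical `pinger` is handed a callback that blocks until released, and the test asks whether `stop` still returns. All names here are assumptions for illustration; a real itest would drive an actual peer connection.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// pinger is a hypothetical, minimal ping manager used only to illustrate
// the test idea; it is not lnd's PingManager.
type pinger struct {
	onFailure func(error)
	async     bool
	reported  chan struct{}
	quit      chan struct{}
	wg        sync.WaitGroup
}

// start launches the handler, which reports a failure right away.
func (p *pinger) start() {
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		close(p.reported)
		err := errors.New("pong not received in time")
		if p.async {
			go p.onFailure(err)
		} else {
			p.onFailure(err)
		}
		<-p.quit
	}()
}

// stop signals the handler and waits for it to exit.
func (p *pinger) stop() {
	close(p.quit)
	p.wg.Wait()
}

// stopReturns passes in a purposefully blocking callback and reports
// whether stop completes within 200ms.
func stopReturns(async bool) bool {
	release := make(chan struct{})
	defer close(release) // unblock the callback once the check is over
	p := &pinger{
		onFailure: func(error) { <-release },
		async:     async,
		reported:  make(chan struct{}),
		quit:      make(chan struct{}),
	}
	p.start()
	<-p.reported

	stopped := make(chan struct{})
	go func() {
		p.stop()
		close(stopped)
	}()
	select {
	case <-stopped:
		return true
	case <-time.After(200 * time.Millisecond):
		return false
	}
}

func main() {
	fmt.Println("stop returns with sync dispatch:", stopReturns(false))
	fmt.Println("stop returns with async dispatch:", stopReturns(true))
}
```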

Comment thread peer/ping_manager.go Outdated
@coderabbitai
Contributor

coderabbitai Bot commented Jan 17, 2024

Walkthrough

The update to the PingManager addresses a critical deadlock issue by refining the shutdown process. It restructures the handling of pings and the management of goroutines and timers, aiming to ensure a graceful termination without leaving any routines hanging, thus preventing the deadlock scenario previously encountered.

Changes

File(s) | Summary of Changes
peer/ping_manager.go | Refactored to include initialization of timers and goroutines, and introduced pingHandler for ping/pong protocol management.
peer/brontide.go | Modified NewBrontide function to add a go statement before the Disconnect method call, so the operation runs in a new goroutine.
docs/release-notes/release-notes-0.18.0.md | Updated to prevent ping failures from deadlocking the peer connection.

Assessment against linked issues

Objective Addressed Explanation
Resolve deadlocks in PingManager (#8379)
Proper shutdown of PingManager to prevent waiting on WaitGroup (#8379)
Ability to consume the quit channel of PingManager (#8379)
Verify changes with @ProofOfKeags (#8379) The code changes seem to address the issue, but it's unclear if they have been verified with the user @ProofOfKeags as required.

Poem

In the realm of code, a deadlock's a fright,
'Twas the PingManager's bane, both day and night.
But with refactor's charm and a coder's delight,
Now graceful shutdowns make all just right. 🐇💻✨

@Roasbeef
Member Author

@coderabbitai review

Contributor

@coderabbitai coderabbitai Bot left a comment

Review Status

Actionable comments generated: 0

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 76ae3e4 and 0e7583f.
Files selected for processing (1)
  • peer/ping_manager.go (2 hunks)
Additional comments: 7
peer/ping_manager.go (7)
  • 102-106: The initialization of pingTicker, pingTimeout, and the addition of the goroutine for pingHandler in the Start method are correctly moved from the old start method.
  • 112-197: The pingHandler method has been refactored to handle the ping/pong protocol enforcement. The logic appears to be correctly structured to handle the ping cycle, timeout, and pong reception.
  • 98-201: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [198-217]

Verify that the Stop method correctly interrupts the goroutines owned by PingManager after the refactoring. This is crucial to ensure that the deadlock issue the PR aims to resolve is indeed fixed.

  • 13-18: > Note: This review was outside the patches, and no patch overlapping with it was found. Original lines [218-235]

The setPingState method's logic remains unchanged and is correctly used within the pingHandler method to track the state of outgoing pings.

  • 13-18: > Note: This review was outside the patches, and no patch overlapping with it was found. Original lines [236-251]

The resetPingState method's logic remains unchanged and is correctly used within the pingHandler method to reset the state after a ping cycle completes or a pong is received.

  • 13-18: > Note: This review was outside the patches, and no patch overlapping with it was found. Original lines [252-262]

The GetPingTimeMicroSeconds method correctly reports the RTT calculated by the pingManager. No changes were made to this method in the PR.

  • 13-18: > Note: This review was outside the patches, and no patch overlapping with it was found. Original lines [263-271]

The ReceivedPong method correctly handles incoming pong messages and interacts with the pongChan channel, which is used within the pingHandler method.
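
The shape of the handler discussed in this review (ticker, timeout, pong channel, quit channel) can be sketched as a single select loop. This is a simplified illustration under assumptions, not the code in peer/ping_manager.go: `pingLoop`, `run`, and the durations are hypothetical, and ping payloads, pong size checks, and RTT bookkeeping are left out.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// pingLoop is a hypothetical, reduced ping handler.
type pingLoop struct {
	interval  time.Duration
	timeout   time.Duration
	sendPing  func()
	onFailure func(error)
	pongChan  chan struct{}
	quit      chan struct{}
	wg        sync.WaitGroup
}

func (p *pingLoop) start() {
	p.wg.Add(1)
	go p.handler()
}

func (p *pingLoop) stop() {
	close(p.quit)
	p.wg.Wait()
}

// handler enforces: send a ping every interval, expect a pong before timeout.
func (p *pingLoop) handler() {
	defer p.wg.Done()
	ticker := time.NewTicker(p.interval)
	defer ticker.Stop()

	// A nil channel never becomes ready, so the timeout case stays
	// inactive until a ping is outstanding.
	var timeoutC <-chan time.Time
	for {
		select {
		case <-ticker.C:
			if timeoutC != nil {
				go p.onFailure(errors.New("ping due while pong outstanding"))
				return
			}
			p.sendPing()
			timeoutC = time.After(p.timeout)
		case <-timeoutC:
			// Dispatched asynchronously so teardown cannot wait on us.
			go p.onFailure(errors.New("pong not received in time"))
			return
		case <-p.pongChan:
			timeoutC = nil
		case <-p.quit:
			return
		}
	}
}

// run drives a loop for d and returns the reported failure, if any.
// When answer is true every ping is answered with an immediate pong.
func run(answer bool, d time.Duration) error {
	failures := make(chan error, 1)
	p := &pingLoop{
		interval:  20 * time.Millisecond,
		timeout:   15 * time.Millisecond,
		onFailure: func(err error) { failures <- err },
		pongChan:  make(chan struct{}, 1),
		quit:      make(chan struct{}),
	}
	p.sendPing = func() {
		if answer {
			p.pongChan <- struct{}{}
		}
	}
	p.start()
	time.Sleep(d)
	p.stop()
	select {
	case err := <-failures:
		return err
	default:
		return nil
	}
}

func failsWithoutPong() bool { return run(false, 150*time.Millisecond) != nil }
func healthyWithPong() bool  { return run(true, 150*time.Millisecond) == nil }

func main() {
	fmt.Println("failure without pongs:", failsWithoutPong())
	fmt.Println("healthy with pongs:", healthyWithPong())
}
```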

@Roasbeef Roasbeef added the llm-review (add to a PR to have an LLM bot review it) label Jan 17, 2024
Comment thread peer/ping_manager.go
Comment thread peer/ping_manager.go Outdated
Comment thread peer/ping_manager.go Outdated
"in unclean state",
)
- m.cfg.OnPongFailure(e)
+ go m.cfg.OnPongFailure(e)
Collaborator

could instead leave this as is but change the actual call back implementation to do things async. Since the caller defines the call-back, it feels like the caller should be responsible for making it run async if needed
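
The alternative suggested here, where the caller makes its own callback asynchronous, might look roughly like the sketch below. Only the `OnPongFailure` field name comes from the PR; `config`, `disconnect`, and the gate channel are hypothetical. The gate holds the disconnect open so it is observable that the callback returned first.

```go
package main

import (
	"errors"
	"fmt"
)

// config is a hypothetical callback-carrying config; only the field name
// OnPongFailure is taken from the PR.
type config struct {
	OnPongFailure func(error)
}

// disconnect stands in for a teardown that cannot finish until gate opens.
func disconnect(err error, gate <-chan struct{}, done chan<- error) {
	<-gate
	done <- err
}

// callbackReturnsBeforeDisconnect reports whether the callback handed
// control back to its invoker while the disconnect was still pending.
func callbackReturnsBeforeDisconnect() bool {
	gate := make(chan struct{})
	done := make(chan error, 1)
	cfg := config{
		OnPongFailure: func(err error) {
			// The caller, not the ping manager, chooses to go async.
			go disconnect(err, gate, done)
		},
	}

	// The manager side stays a plain synchronous call.
	cfg.OnPongFailure(errors.New("pong timeout"))

	select {
	case <-done:
		return false // disconnect finished although the gate was closed
	default:
	}
	close(gate)
	return (<-done) != nil
}

func main() {
	fmt.Println("callback returned first:", callbackReturnsBeforeDisconnect())
}
```

The trade-off raised in the reply still applies: this only works if every caller remembers to do it.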

Member Author

Sorta two ways about it: the core issue here is that we made an assumption about the call back that led to the deadlock in the first place. This also isn't the first time we've run into an issue like this. Last year we had a few instances of a circular call happening in the peer during init (eg: go to fetch latest node ann, calls into server, tries to grab mutex already held) that would've been resolved by minimizing assumptions w.r.t the synchronicity of a passed call back.

Callback aside, I think generally a better way to structure code like this would be to instead have the ping manager signal another sub-system that the expectation (pong not recv'd in time) has been violated in an async manner. This way, we don't need to care about disconnecting only that the signal to disconnect has been sent.
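
A rough sketch of that signalling structure, assuming a buffered channel as the signal and a separate goroutine as the other sub-system. `notifier`, `signal`, and `supervise` are hypothetical names, not part of lnd.

```go
package main

import (
	"errors"
	"fmt"
)

// notifier is a hypothetical replacement for a failure callback: the ping
// logic only records that a violation happened and never waits on teardown.
type notifier struct {
	violations chan error
}

func newNotifier() *notifier {
	return &notifier{violations: make(chan error, 1)}
}

// signal reports a violation without blocking. It returns false when a
// violation is already pending, since one disconnect is enough.
func (n *notifier) signal(err error) bool {
	select {
	case n.violations <- err:
		return true
	default:
		return false
	}
}

// supervise plays the role of the other sub-system: it waits for the first
// violation, runs disconnect, and closes the returned channel when done.
func supervise(n *notifier, disconnect func(error)) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		disconnect(<-n.violations)
	}()
	return done
}

func main() {
	n := newNotifier()
	done := supervise(n, func(err error) {
		fmt.Println("disconnecting:", err)
	})
	n.signal(errors.New("pong not received in time"))
	<-done
}
```

The sender only cares that the signal was queued; whether and when the disconnect completes is the supervisor's concern.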

Contributor

@coderabbitai coderabbitai Bot left a comment

Review Status

Actionable comments generated: 2

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 0e7583f and 497e269.
Files selected for processing (2)
  • peer/brontide.go (1 hunks)
  • peer/ping_manager.go (2 hunks)
Additional comments: 7
peer/ping_manager.go (6)
  • 101-106: The initialization of pingTicker, pingTimeout, and the call to pingHandler have been moved from the start method to the Start method. This change is consistent with the summary and seems to be a refactoring to improve the initialization process. Ensure that the Start method is called appropriately wherever the PingManager is instantiated.
  • 112-197: The start method has been refactored into pingHandler. This method enforces the ping/pong protocol and contains logic for handling timeouts and pong responses. The refactoring appears to be done to address the deadlock issue by making the disconnection process asynchronous. Ensure that the pingHandler method is robust and correctly handles all possible cases, including timeouts and incorrect pong responses.
  • 146-148: The error handling for setPingState is correct, but ensure that the OnPongFailure callback is designed to handle errors asynchronously to avoid the deadlock issue this PR is intended to solve.
Verification successful

The OnPongFailure callback within the NewBrontide function in peer/brontide.go is designed to handle errors asynchronously. It logs a warning and disconnects from the peer using a new goroutine, ensuring that the operation does not block the current execution flow. The tests in peer/ping_manager_test.go further confirm the asynchronous behavior of the OnPongFailure callback by checking that it is called correctly under various conditions.

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify that OnPongFailure handles errors asynchronously.
ast-grep --lang go --pattern $'OnPongFailure($_)'

Length of output: 50



Script:

#!/bin/bash
# Search for the string "OnPongFailure" to find references in the codebase.
rg "OnPongFailure"

Length of output: 563



Script:

#!/bin/bash
# Examine the implementation of OnPongFailure in peer/brontide.go and peer/ping_manager_test.go
cat peer/brontide.go
cat peer/ping_manager_test.go

Length of output: 131180

  • 155-164: The timeout handling logic here is straightforward and correct. However, ensure that the OnPongFailure callback does not introduce any potential deadlocks when handling this timeout error.
  • 178-185: The check for the pong size matching the expected size is correct. However, ensure that the OnPongFailure callback is properly handling this error case asynchronously.
  • 191-192: Storing the RTT is a good practice for performance monitoring. Ensure that the atomic operation is thread-safe and that the stored RTT is used appropriately elsewhere in the codebase.
peer/brontide.go (1)
  • 579-579: The change to make the Disconnect method call asynchronous is in line with the PR's objectives to prevent deadlocks by avoiding circular waiting scenarios. However, ensure that the asynchronous disconnection is properly handled and does not lead to any race conditions or leaks of goroutines.

Comment thread peer/ping_manager.go
Comment thread peer/ping_manager.go
Contributor

@coderabbitai coderabbitai Bot left a comment

Review Status

Actionable comments generated: 0

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 1bc1295 and 41ee461420144aabe3449b6d3b6cc2d1bc68efe1.
Files selected for processing (2)
  • peer/brontide.go (1 hunks)
  • peer/ping_manager.go (2 hunks)
Files skipped from review as they are similar to previous changes (1)
  • peer/brontide.go
Additional comments: 7
peer/ping_manager.go (7)
  • 101-105: The Start method now initializes the pingTicker, pingTimeout, and starts the pingHandler goroutine. This change centralizes the initialization logic within the Start method, which is a good practice for maintainability and readability.
  • 111-196: The pingHandler method has been refactored from the original start method. It enforces the ping/pong protocol and handles the ping cycle, timeout, and pong responses. The refactoring appears to be done to address the deadlock issue by making the disconnection process asynchronous. Ensure that the logic within this method correctly handles all cases and that the asynchronous behavior does not introduce new race conditions or logic errors.
  • 129-136: The error handling for an "impossible" state where a new ping begins while awaiting a pong response is to return an error and stop the handler. This is a change from the previous logic and should be verified to ensure it aligns with the intended error handling strategy for the application.
Verification successful

The error handling for the "impossible" state in ping_manager.go, where an error is returned and the handler is stopped, is consistent with the application's overall strategy for handling unexpected situations. The term "impossible" is used across the codebase to describe scenarios that should not occur under normal operation, and the typical response includes logging a warning, returning an error, or bypassing certain checks, which is in line with the observed pattern in ping_manager.go.

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify that the error handling strategy for the "impossible" state is consistent with the application's overall strategy.
# This script searches for similar error handling patterns to compare strategies.
rg --type go "impossible" --files-with-matches

Length of output: 281



Script:

#!/bin/bash
# Examine a few instances of the "impossible" error handling pattern to understand the typical strategy used in the application.
# Selecting a few files from the list to check the context of "impossible" usage.
rg --type go -C 5 "impossible" rpcperms/interceptor.go
rg --type go -C 5 "impossible" routing/pathfind.go
rg --type go -C 5 "impossible" peer/brontide.go
rg --type go -C 5 "impossible" channeldb/channel.go

Length of output: 3305

  • 144-146: The setPingState method is called to set up bookkeeping for a new Ping. It is important to ensure that the error handling within this method is consistent with the rest of the application and that the state is correctly managed to prevent any potential race conditions.
  • 174-184: The check for a pong response that does not match the expected size is a critical part of the protocol enforcement. Ensure that this logic is robust and that the error handling is appropriate for the application's needs.
Verification successful

The verification process has confirmed that the pong size check within the PingManager is robust and includes a specific test case for handling a bad pong size. Additionally, the OnPongFailure callback is implemented to log a warning and disconnect the peer, which is an appropriate error handling strategy for the application. Based on this, the review comment has been verified as accurate.

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify that the pong size check is robust and error handling is appropriate.
# This script searches for similar error handling patterns to compare strategies.
rg --type go "pong response does not match expected size" --files-with-matches

Length of output: 80



Script:

#!/bin/bash
# Find the definition and usage of the OnPongFailure callback to understand how errors are handled.
ast-grep --lang go --pattern 'func ($_ *$_) OnPongFailure($_ error) { $$$ }'

# Search for tests that might be covering the pong size check scenario.
rg --type go "OnPongFailure" --files-with-matches

Length of output: 192



Script:

#!/bin/bash
# Examine the test file for tests related to the pong size check.
cat peer/ping_manager_test.go

# Look into brontide.go to understand the implementation and usage of OnPongFailure.
cat peer/brontide.go

Length of output: 131180



  • 186-191: The calculation of RTT (round-trip-time) is an important metric for network performance. Ensure that the storage and retrieval of this metric are thread-safe and that the atomic operations are used correctly.
  • 193-195: The case for <-m.quit in the select statement is used to stop the pingHandler goroutine. Ensure that the quit channel is correctly managed across the application to prevent leaks or premature shutdowns.
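
For the RTT point, thread-safe storage and retrieval can be as small as one atomic integer. The sketch below is an assumption about the general technique, not the actual field layout in peer/ping_manager.go: `rttTracker` and its methods are hypothetical, and `atomic.Int64` requires Go 1.19 or later.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// rttTracker is a hypothetical holder for the last measured round-trip
// time, written by the ping handler and read by any other goroutine.
type rttTracker struct {
	rttMicros atomic.Int64
}

// record stores the time between sending a ping and receiving its pong.
func (r *rttTracker) record(sent, received time.Time) {
	r.rttMicros.Store(received.Sub(sent).Microseconds())
}

// pingTimeMicroSeconds returns the last recorded RTT in microseconds.
func (r *rttTracker) pingTimeMicroSeconds() int64 {
	return r.rttMicros.Load()
}

// rttForMillis records a synthetic round trip of ms milliseconds and
// returns what the tracker reports afterwards.
func rttForMillis(ms int64) int64 {
	var r rttTracker
	sent := time.Unix(0, 0)
	r.record(sent, sent.Add(time.Duration(ms)*time.Millisecond))
	return r.pingTimeMicroSeconds()
}

func main() {
	fmt.Println("reported RTT in microseconds:", rttForMillis(5))
}
```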

Contributor

@coderabbitai coderabbitai Bot left a comment

Review Status

Actionable comments generated: 0

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 41ee461420144aabe3449b6d3b6cc2d1bc68efe1 and d41a91f98681304e2c665065b1633614ebd60b69.
Files selected for processing (1)
  • docs/release-notes/release-notes-0.18.0.md (1 hunks)
Additional comments: 1
docs/release-notes/release-notes-0.18.0.md (1)
  • 68-70: The addition to the release notes correctly reflects the changes made in PR #8385, which address the deadlock issue by making ping failures not deadlock the peer connection.

Collaborator

@ellemouton ellemouton left a comment

LGTM 🔥

main things to address: can squash 2 of the commits into 1 & need to move release notes to 0.17.4

Comment thread peer/brontide.go Outdated
Comment thread docs/release-notes/release-notes-0.18.0.md Outdated
Comment thread peer/ping_manager.go Outdated
Roasbeef and others added 3 commits January 22, 2024 18:20
In this commit, we make all calls to disconnect after a ping/pong
violation is detected in the `PingManager` async. We do this to avoid
circular waiting that may occur if the disconnect call back ends up
waiting on the peer goroutine to be torn down. If this happens, then the
peer goroutine will be blocked on the ping manager fully tearing down,
which is blocked on the peer disconnect succeeding.

This is a similar class of issue to ones we've dealt with recently as pertains to
the peer and the server: sync call back execution must not lead to
a circular waiting loop.

Fixes lightningnetwork#8379

The error was never used as the init couldn't return an error, so we do
away with that. We also modify the main event loop dispatch to more
closely match other areas of the codebase.
Contributor

@coderabbitai coderabbitai Bot left a comment

Review Status

Actionable comments generated: 0

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 41c167d and 359f271.
Files selected for processing (3)
  • docs/release-notes/release-notes-0.17.4.md (2 hunks)
  • peer/brontide.go (1 hunks)
  • peer/ping_manager.go (2 hunks)
Files skipped from review as they are similar to previous changes (1)
  • peer/brontide.go
Additional comments: 4
docs/release-notes/release-notes-0.17.4.md (1)
  • 32-34: The release notes mention the PR that prevents ping failures from deadlocking the peer connection. This aligns with the PR objectives and the AI-generated summaries.
peer/ping_manager.go (3)
  • 101-105: The Start method has been correctly updated to initialize the pingTicker, pingTimeout, and start the pingHandler goroutine within a sync.Once block to ensure it only runs once.
  • 111-195: The pingHandler method has been refactored and now contains the logic previously in the start method. It correctly handles the ping/pong protocol, including error handling and state resetting. The use of channels and timers appears to be correct and should prevent the deadlock issue as described in the PR objectives.
  • 13-18: > 📝 NOTE

This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [1-10]

The PingManagerConfig struct still contains the NewPingPayload function, which contradicts the previous comment from the bot that it was removed. This was already clarified in the conversation with ProofOfKeags, so no further action is needed here.
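
The Start behaviour noted in this review (set-up and goroutine launch guarded by sync.Once) follows a common Go start/stop idiom. Below is a minimal sketch under assumptions: the `manager` type and its fields are hypothetical, and ticker and timer initialization is omitted to keep the idiom visible.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// manager is a hypothetical sub-system with idempotent Start and Stop.
type manager struct {
	started  sync.Once
	stopped  sync.Once
	quit     chan struct{}
	wg       sync.WaitGroup
	launches atomic.Int32 // counts handler goroutines, for illustration
}

func newManager() *manager {
	return &manager{quit: make(chan struct{})}
}

// Start launches the handler goroutine exactly once, however many times
// it is called.
func (m *manager) Start() {
	m.started.Do(func() {
		m.wg.Add(1)
		go m.handler()
	})
}

// handler runs until the quit channel is closed.
func (m *manager) handler() {
	defer m.wg.Done()
	m.launches.Add(1)
	<-m.quit
}

// Stop closes quit once and waits for the handler to exit; repeated calls
// are safe because close is guarded by sync.Once.
func (m *manager) Stop() {
	m.stopped.Do(func() { close(m.quit) })
	m.wg.Wait()
}

// launchesAfterRepeatedStart calls Start three times and Stop twice, then
// returns how many handler goroutines actually ran.
func launchesAfterRepeatedStart() int32 {
	m := newManager()
	m.Start()
	m.Start()
	m.Start()
	m.Stop()
	m.Stop()
	return m.launches.Load()
}

func main() {
	fmt.Println("handler launches:", launchesAfterRepeatedStart())
}
```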

@Roasbeef
Member Author

@ProofOfKeags @ellemouton updated to squash change into first commit (ended up just modifying it as it was more pleasing to the Rebase Gods), and updated the release notes to point to v0.17.4.

@ellemouton ellemouton merged commit 51de320 into lightningnetwork:master Jan 23, 2024

Labels

bug fix, llm-review (add to a PR to have an LLM bot review it), p2p (Code related to the peer-to-peer behaviour)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug]: PingManager not shutting down correctly

4 participants