htlcswitch: fix hodlQueue deadlock by stopping htlcManager first#10719
htlcswitch: fix hodlQueue deadlock by stopping htlcManager first#10719ziggie1984 wants to merge 1 commit intolightningnetwork:masterfrom
Conversation
The channelLink.Stop() teardown had an inverted ordering that could cause a permanent deadlock of the invoice registry under concurrent peer disconnect. The previous order was: 1. HodlUnsubscribeAll -- removes subscriptions 2. hodlQueue.Stop() -- kills the queue's internal goroutine 3. cg.Quit() -- signals htlcManager to stop 4. cg.WgWait() -- waits for htlcManager to exit The race window between steps 2 and 4 left htlcManager alive. A RevokeAndAck arriving during that window could drive processRemoteAdds → processExitHop → NotifyExitHopHtlc, registering a new hodl subscription backed by a dead hodlQueue (ChanIn() has no reader). Any subsequent call to notifyHodlSubscribers (e.g. MPP auto-release timer, expiry watcher, or explicit settle/cancel) would then block indefinitely on the unbuffered ChanIn(), holding hodlSubscriptionsMux. Concurrent NotifyExitHopHtlc calls waiting for that lock, plus callers holding the invoice-level lock waiting for those, produce a full deadlock of the invoice registry with no recovery path short of a daemon restart. The fix is to stop htlcManager before touching the hodl subscription state. htlcManager is the sole caller of NotifyExitHopHtlc, so once cg.WgWait() returns no new subscriptions can be registered, making HodlUnsubscribeAll and hodlQueue.Stop() race-free. Fixes: lightningnetwork#10718
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical deadlock issue occurring during the shutdown of a channel link. By reordering the teardown sequence to ensure the htlcManager is fully stopped before modifying subscription states, the change prevents the registration of orphaned subscriptions that previously caused the invoice registry to freeze. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request reorders the shutdown sequence in channelLink.Stop to ensure the htlcManager goroutine is fully stopped before unsubscribing from HODL events and shutting down the hodlQueue. This change prevents a potential deadlock where a new subscription could be registered against a dead queue. A review comment suggests that the added documentation, while correctly explaining the 'why' behind the change, is overly verbose and should be made more concise to improve readability.
| // Stop the htlcManager goroutine first. This is critical: htlcManager | ||
| // is the sole caller of NotifyExitHopHtlc, which registers new hodl | ||
| // subscriptions. We must guarantee it has fully exited before we | ||
| // remove subscriptions and stop the hodlQueue. Without this ordering, | ||
| // a RevokeAndAck processed in the race window between hodlQueue.Stop() | ||
| // and cg.Quit() can register an orphaned subscription against a dead | ||
| // queue, causing notifyHodlSubscribers to block permanently and | ||
| // deadlock the entire invoice registry. |
There was a problem hiding this comment.
The comment block is quite long and verbose. While it explains the 'why' as required by the style guide, it could be more concise to improve readability while still maintaining the necessary context.
References
- Comments must not explain the code 1:1 but instead explain the why behind a certain block of code, in case it requires contextual knowledge.
Safety Analysis of the ReorderBefore merging, I did a thorough read of the dependency graph to confirm the What is under
|
Summary
This PR fixes a deadlock in the invoice registry caused by an inverted
teardown order in
channelLink.Stop().Closes #10718
Root Cause
The previous
Stop()sequence was:Steps ② and ③ create a race window where
htlcManageris still alivebut
hodlQueueis dead. ARevokeAndAckarriving in this windowdrives:
This registers a new hodl subscription backed by a dead
hodlQueue.ChanIn()(its internal goroutine has exited, so theunbuffered channel has no reader).
The orphaned subscription then causes a deadlock cascade:
notifyHodlSubscribers(MPP auto-release timer, expirywatcher, or explicit settle/cancel) blocks on the dead
ChanIn(),holding
hodlSubscriptionsMux.NotifyExitHopHtlccalls waiting for that lock stall.i.Lock()waiting on thosefreeze the entire registry.
There is no recovery path short of a daemon restart.
The risk is amplified when
channeldbis backed by bbolt (KVstore) because bbolt's global write lock increases the latency of
processRemoteRevokeAndAck, widening the race window and raising theprobability of the scheduler preempting between steps ② and ③.
Fix
Stop
htlcManagerfirst, before touching hodl subscription state:htlcManageris the sole caller ofNotifyExitHopHtlc. Oncecg.WgWait()returns, no new subscriptions can be registered and theremaining teardown is race-free.
There are no ordering dependencies that prevent this reorder:
ChainEvents.Cancel()has no dependency onhtlcManagerrunning.htlcManagerhas exited.htlcManagerreadshodlQueue.ChanOut()in aselectthat alsohandles
cg.Done(), so it exits cleanly when signalled.Testing
The existing
htlcswitchtests pass. A targeted regression test forthe race would require a concurrency harness that synchronises a
RevokeAndAckdelivery with theStop()window; this is left for afollow-up alongside the broader hodl-channel redesign tracked in the
companion issue.