
htlcswitch: fix hodlQueue deadlock by stopping htlcManager first #10719

Open
ziggie1984 wants to merge 1 commit into lightningnetwork:master from ziggie1984:hodlqueue-stop-order-fix

@ziggie1984
Collaborator

Summary

This PR fixes a deadlock in the invoice registry caused by an inverted
teardown order in channelLink.Stop().

Closes #10718

Root Cause

The previous Stop() sequence was:

① HodlUnsubscribeAll(hodlQueue.ChanIn())   -- removes subscriptions
② hodlQueue.Stop()                          -- kills queue goroutine
③ cg.Quit()                                 -- signals htlcManager
④ cg.WgWait()                               -- waits for htlcManager

Between step ② and the return of step ④, htlcManager is still alive
but hodlQueue is dead. A RevokeAndAck processed in this window
drives:

processRemoteAdds → processExitHop → NotifyExitHopHtlc

This registers a new hodl subscription backed by a dead
hodlQueue.ChanIn() (its internal goroutine has exited, so the
unbuffered channel has no reader).

The orphaned subscription then causes a deadlock cascade:

  • Any call to notifyHodlSubscribers (MPP auto-release timer, expiry
    watcher, or explicit settle/cancel) blocks on the dead ChanIn(),
    holding hodlSubscriptionsMux.
  • Concurrent NotifyExitHopHtlc calls waiting for that lock stall.
  • Callers holding the invoice-level lock i.Lock() waiting on those
    freeze the entire registry.

There is no recovery path short of a daemon restart.
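
The cascade above comes down to a send on a reader-less unbuffered channel made while a mutex is held. The following is a minimal sketch of that shape, not lnd code: the registry type, its methods, and the 100 ms timeout are all hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// registry is a hypothetical stand-in for the invoice registry. Only the
// locking shape matters here.
type registry struct {
	mu   sync.Mutex
	subs []chan<- int
}

// subscribe needs the mutex, like any other registry call.
func (r *registry) subscribe(s chan<- int) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.subs = append(r.subs, s)
}

// notify sends to every subscriber while holding the mutex. held is closed
// once the lock has been taken so the demo can sequence itself.
func (r *registry) notify(v int, held chan<- struct{}) {
	r.mu.Lock()
	defer r.mu.Unlock()
	close(held)
	for _, s := range r.subs {
		s <- v // blocks for good if the channel has no reader
	}
}

// stallsBehindDeadSubscriber reports whether a second registry call fails to
// finish within a short timeout once notify is stuck on a reader-less channel.
func stallsBehindDeadSubscriber() bool {
	r := &registry{}
	r.subscribe(make(chan int)) // unbuffered and never read: the orphan

	held := make(chan struct{})
	go r.notify(1, held)
	<-held // notify now owns the mutex and is about to block on the send

	done := make(chan struct{})
	go func() {
		r.subscribe(make(chan int))
		close(done)
	}()

	select {
	case <-done:
		return false
	case <-time.After(100 * time.Millisecond):
		return true
	}
}

func main() {
	fmt.Println("second caller stalled:", stallsBehindDeadSubscriber())
}
```

In this toy the second caller never gets the mutex, which mirrors how one orphaned subscription can stall every later registry call.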

The risk is amplified when channeldb is backed by bbolt (a KV
store): bbolt's global write lock increases the latency of
processRemoteRevokeAndAck, which widens the race window and raises the
probability that htlcManager is still processing a RevokeAndAck when
hodlQueue.Stop() (step ②) runs.

Fix

Stop htlcManager first, before touching hodl subscription state:

① cg.Quit() + cg.WgWait()                  -- htlcManager fully gone
② HodlUnsubscribeAll(hodlQueue.ChanIn())   -- safe: no new subscribers
③ hodlQueue.Stop()                          -- queue is idle

htlcManager is the sole caller of NotifyExitHopHtlc. Once
cg.WgWait() returns, no new subscriptions can be registered and the
remaining teardown is race-free.

There are no ordering dependencies that prevent this reorder:

  • ChainEvents.Cancel() has no dependency on htlcManager running.
  • The timer drain is harmless after htlcManager has exited.
  • htlcManager reads hodlQueue.ChanOut() in a select that also
    handles cg.Done(), so it exits cleanly when signalled.
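
The effect of the reorder can be sketched with a toy link. This is an illustration under stated assumptions, not the channelLink implementation: the link type, the revokes channel, and the subs counter are hypothetical, and the queue itself is left out so that only the ordering is visible.

```go
package main

import (
	"fmt"
	"sync"
)

// link is a hypothetical miniature of a channel link: one manager goroutine
// that registers a subscription for every revoke it processes.
type link struct {
	quit    chan struct{}
	revokes chan struct{}
	wg      sync.WaitGroup

	mu   sync.Mutex
	subs int // number of registered hodl subscriptions
}

func newLink() *link {
	l := &link{quit: make(chan struct{}), revokes: make(chan struct{})}
	l.wg.Add(1)
	go l.manager()
	return l
}

func (l *link) manager() {
	defer l.wg.Done()
	for {
		select {
		case <-l.revokes:
			l.mu.Lock()
			l.subs++
			l.mu.Unlock()
		case <-l.quit:
			return
		}
	}
}

func (l *link) unsubscribeAll() {
	l.mu.Lock()
	l.subs = 0
	l.mu.Unlock()
}

// leftover stops a fresh link while one revoke arrives during shutdown and
// returns how many subscriptions survive the teardown.
func leftover(oldOrder bool) int {
	l := newLink()
	if oldOrder {
		l.unsubscribeAll()      // subscriptions removed first
		l.revokes <- struct{}{} // late revoke: manager is still alive
		close(l.quit)
		l.wg.Wait()
	} else {
		l.revokes <- struct{}{} // revoke arrives just before shutdown
		close(l.quit)
		l.wg.Wait()        // manager fully gone
		l.unsubscribeAll() // safe: nobody can register any more
	}
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.subs
}

func main() {
	fmt.Println("old order leftover:", leftover(true))  // 1
	fmt.Println("new order leftover:", leftover(false)) // 0
}
```

With the old order the late registration outlives the cleanup. With the new order the cleanup runs only after the sole registrant has exited, so nothing can be left behind.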

Testing

The existing htlcswitch tests pass. A targeted regression test for
the race would require a concurrency harness that synchronises a
RevokeAndAck delivery with the Stop() window; this is left for a
follow-up alongside the broader hodl-channel redesign tracked in the
companion issue.

The channelLink.Stop() teardown had an inverted ordering that could
cause a permanent deadlock of the invoice registry under concurrent
peer disconnect.

The previous order was:
  1. HodlUnsubscribeAll  -- removes subscriptions
  2. hodlQueue.Stop()    -- kills the queue's internal goroutine
  3. cg.Quit()           -- signals htlcManager to stop
  4. cg.WgWait()         -- waits for htlcManager to exit

The race window between steps 2 and 4 left htlcManager alive. A
RevokeAndAck arriving during that window could drive processRemoteAdds
→ processExitHop → NotifyExitHopHtlc, registering a new hodl
subscription backed by a dead hodlQueue (ChanIn() has no reader).

Any subsequent call to notifyHodlSubscribers (e.g. MPP auto-release
timer, expiry watcher, or explicit settle/cancel) would then block
indefinitely on the unbuffered ChanIn(), holding hodlSubscriptionsMux.
Concurrent NotifyExitHopHtlc calls waiting for that lock, plus callers
holding the invoice-level lock waiting for those, produce a full
deadlock of the invoice registry with no recovery path short of a
daemon restart.

The fix is to stop htlcManager before touching the hodl subscription
state. htlcManager is the sole caller of NotifyExitHopHtlc, so once
cg.WgWait() returns no new subscriptions can be registered, making
HodlUnsubscribeAll and hodlQueue.Stop() race-free.

Fixes: lightningnetwork#10718
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical deadlock issue occurring during the shutdown of a channel link. By reordering the teardown sequence to ensure the htlcManager is fully stopped before modifying subscription states, the change prevents the registration of orphaned subscriptions that previously caused the invoice registry to freeze.

Highlights

  • Teardown Sequence Reordering: Modified the channelLink.Stop() method to shut down the htlcManager before unsubscribing from the hodlQueue.
  • Deadlock Prevention: Eliminated a race condition where a late RevokeAndAck could register an orphaned subscription against a stopped queue, leading to a permanent deadlock in the invoice registry.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request reorders the shutdown sequence in channelLink.Stop to ensure the htlcManager goroutine is fully stopped before unsubscribing from HODL events and shutting down the hodlQueue. This change prevents a potential deadlock where a new subscription could be registered against a dead queue. A review comment suggests that the added documentation, while correctly explaining the 'why' behind the change, is overly verbose and should be made more concise to improve readability.

Comment on lines +632 to +639
// Stop the htlcManager goroutine first. This is critical: htlcManager
// is the sole caller of NotifyExitHopHtlc, which registers new hodl
// subscriptions. We must guarantee it has fully exited before we
// remove subscriptions and stop the hodlQueue. Without this ordering,
// a RevokeAndAck processed in the race window between hodlQueue.Stop()
// and cg.Quit() can register an orphaned subscription against a dead
// queue, causing notifyHodlSubscribers to block permanently and
// deadlock the entire invoice registry.

medium

The comment block is quite long and verbose. While it explains the 'why' as required by the style guide, it could be more concise to improve readability while still maintaining the necessary context.

References
  1. Comments must not explain the code 1:1 but instead explain the why behind a certain block of code, in case it requires contextual knowledge.

@ziggie1984
Collaborator Author

Safety Analysis of the Reorder

Before merging, I did a thorough read of the dependency graph to confirm the
new teardown order has no side-effects. Summary below.

What is under cg?

Two goroutines are tracked by the ControlledGoroutine:

  1. htlcManager (started at Start(), line 614) — the main select loop.
  2. fwdPkgGarbager (started inside resumeLink, line 3979) — periodic GC.

Both already contain case <-l.cg.Done(): return as a select case. When
cg.Quit() fires they will exit on their next select iteration.


Is hodlQueue needed while cg.WgWait() blocks?

htlcManager reads from hodlQueue.ChanOut() in its select (line 1425).
With the new order, hodlQueue.Stop() is called after cg.WgWait()
returns, meaning the hodlQueue's internal goroutine stays alive for the entire
duration of the wait. This is actually better than the old ordering for
one reason:

notifyHodlSubscribers (registry side) sends directly on subscriber
(= hodlQueue.ChanIn(), which is an unbuffered channel). It blocks until
the hodlQueue goroutine reads from the other end:

select {
case subscriber <- htlcResolution:   // blocks if no reader
case <-i.quit:
    return
}

In the old ordering, if notifyHodlSubscribers was mid-send when
hodlQueue.Stop() was called, the hodlQueue goroutine exited and the send
blocked forever. cancelInvoiceImpl (the caller) holds i.Lock() while
blocked there. If htlcManager was concurrently trying to enter
NotifyExitHopHtlc, which also takes i.Lock(), then cg.WgWait() would
itself block indefinitely. The new ordering eliminates this: the hodlQueue
goroutine remains alive, every in-flight send on ChanIn() completes, and
cg.WgWait() is unblocked.
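
Why a live queue goroutine matters can be shown with a simplified queue. This is a sketch under assumptions, not the hodlQueue implementation: the queue type, its fields, and the 100 ms timeout are hypothetical. It only models the property described above, namely that a send on the unbuffered input channel completes exactly as long as the internal goroutine is running.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// queue is a hypothetical unbounded queue: an internal goroutine moves items
// from the unbuffered in channel to the out channel via a slice.
type queue struct {
	in, out chan int
	quit    chan struct{}
	wg      sync.WaitGroup
}

func newQueue() *queue {
	q := &queue{
		in:   make(chan int),
		out:  make(chan int),
		quit: make(chan struct{}),
	}
	q.wg.Add(1)
	go q.run()
	return q
}

func (q *queue) run() {
	defer q.wg.Done()
	var buf []int
	for {
		// A nil channel is never selected, so the send case is only
		// active while something is buffered.
		var out chan int
		var head int
		if len(buf) > 0 {
			out, head = q.out, buf[0]
		}
		select {
		case v := <-q.in:
			buf = append(buf, v)
		case out <- head:
			buf = buf[1:]
		case <-q.quit:
			return
		}
	}
}

func (q *queue) stop() {
	close(q.quit)
	q.wg.Wait()
}

// trySend reports whether a send on in completes within a short timeout.
func (q *queue) trySend(v int) bool {
	select {
	case q.in <- v:
		return true
	case <-time.After(100 * time.Millisecond):
		return false
	}
}

func sendWhileRunning() bool {
	q := newQueue()
	defer q.stop()
	return q.trySend(1) // nobody reads out, yet the send completes
}

func sendAfterStop() bool {
	q := newQueue()
	q.stop()
	return q.trySend(2) // no reader on in any more
}

func main() {
	fmt.Println("send while running:", sendWhileRunning()) // true
	fmt.Println("send after stop:", sendAfterStop())       // false
}
```

The real sender has no timeout on that path apart from the registry's quit channel, so the second case corresponds to the blocked send described above.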


Can htlcManager register a new subscription after cg.Quit()?

Yes, in theory. If htlcManager was already deep in
processRemoteRevokeAndAck → processRemoteAdds → processExitHop →
NotifyExitHopHtlc when cg.Quit() fired, it cannot see cg.Done() until
it returns to the top-level select. It may register a subscription before
returning. This is safe: HodlUnsubscribeAll runs after cg.WgWait()
returns (i.e., after htlcManager has fully exited), so it will remove any
subscription registered during that window.
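
The case above, a handler that is already running when the quit signal fires, can be sketched as follows. All names here are hypothetical. The release channel exists only to force the registration to happen after quit has been closed, which is the interleaving in question.

```go
package main

import (
	"fmt"
	"sync"
)

// lateSubscription runs a worker whose handler is still in flight when quit
// fires. It returns the subscription count seen right after the wait, and
// again after the unsubscribe-all step.
func lateSubscription() (afterWait, afterUnsubscribe int) {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		subs int
	)
	quit := make(chan struct{})
	work := make(chan struct{})
	entered := make(chan struct{}) // closed once the handler has started
	release := make(chan struct{}) // closed only after quit has fired

	wg.Add(1)
	go func() {
		defer wg.Done()
		for {
			select {
			case <-work:
				close(entered)
				<-release // deep in the handler: quit is not observed here
				mu.Lock()
				subs++ // registration happens after quit was signalled
				mu.Unlock()
			case <-quit:
				return
			}
		}
	}()

	work <- struct{}{}
	<-entered
	close(quit)    // quit fires mid-handler
	close(release) // handler continues and registers
	wg.Wait()      // returns only after the handler and the loop are done

	mu.Lock()
	defer mu.Unlock()
	afterWait = subs
	subs = 0 // unsubscribe-all, run after the wait
	afterUnsubscribe = subs
	return afterWait, afterUnsubscribe
}

func main() {
	a, b := lateSubscription()
	fmt.Println("after wait:", a, "after unsubscribe:", b) // 1 0
}
```

The wait is what makes the late registration visible to the cleanup step. Running the cleanup before the wait would miss it.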


ChainEvents.Cancel() after htlcManager exits

htlcManager selects on l.cfg.ChainEvents.RemoteUnilateralClosure
(line 1381). After cg.WgWait() returns, nobody reads that channel. We call
ChainEvents.Cancel() immediately afterward, so the upstream notifier is
cleaned up promptly. No goroutine leak; just a brief window with an
unread channel.


Timer drain

updateFeeTimer is only read inside htlcManager's select loop. After
cg.WgWait() returns, the goroutine is gone. The drain is harmless and
keeps the time.Timer in a clean state.
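
For reference, one common way to write such a drain in Go is a Stop followed by a non-blocking receive. This is a generic sketch, not necessarily the exact code in the link; stopAndDrain and pendingAfterDrain are hypothetical helper names.

```go
package main

import (
	"fmt"
	"time"
)

// stopAndDrain stops t and discards a tick that may already be sitting in
// t.C. The default case keeps it from blocking when there is nothing to drain,
// which is why it is safe once the only reader has exited.
func stopAndDrain(t *time.Timer) {
	if !t.Stop() {
		select {
		case <-t.C:
		default:
		}
	}
}

// pendingAfterDrain reports whether t.C still yields a value after draining
// a timer that fired with nobody reading it.
func pendingAfterDrain() bool {
	t := time.NewTimer(time.Millisecond)
	time.Sleep(20 * time.Millisecond) // let it fire unread
	stopAndDrain(t)

	select {
	case <-t.C:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println("tick pending after drain:", pendingAfterDrain()) // false
}
```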


mailBox.ResetPackets() and AddPreimages

Unchanged — both still execute after all teardown is complete. ✓


Verdict

The reordering is safe. The new sequence has no previously-satisfied
dependency that is now violated, and it is strictly safer in the case where
notifyHodlSubscribers races with link shutdown, because the hodlQueue
goroutine remains alive throughout cg.WgWait().

saubyk added this to v0.21 on Apr 7, 2026
saubyk moved this to In review in v0.21 on Apr 7, 2026
Member

@yyforyongyu yyforyongyu left a comment


LGTM🛡️


Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

htlcswitch: invoice registry deadlock via orphaned hodl subscription after link shutdown

3 participants