Summary
A race condition in channelLink.Stop() allows a hodl HTLC subscription to be
registered against a hodlQueue that has already been shut down. When
notifyHodlSubscribers is subsequently called for that circuit key — by the
MPP auto-release timer, the invoice expiry watcher, SettleHodlInvoice, or
CancelInvoice — it blocks indefinitely while holding hodlSubscriptionsMux.
Combined with a concurrent NotifyExitHopHtlc call holding the registry's main
lock, this freezes the entire InvoiceRegistry until lnd is restarted.
Background
The hodl subscription mechanism is the async bridge between the invoice registry
and channel links. When a hold-invoice HTLC arrives, the link passes its
hodlQueue.ChanIn() as a subscriber channel to NotifyExitHopHtlc. The
registry stores this in hodlSubscriptions[circuitKey] and returns nil
(held) to the link. The link parks the HTLC in its hodlMap and waits for a
resolution event. When the invoice is later settled or cancelled,
notifyHodlSubscribers sends a resolution on that channel, waking the link to
settle or fail the HTLC.
Root Cause
channelLink.Stop() tears down the hodl machinery in this order:
l.cfg.Registry.HodlUnsubscribeAll(l.hodlQueue.ChanIn()) // ① removes all subscriptions
l.hodlQueue.Stop() // ② kills queue goroutine
l.cg.Quit() // ③ signals htlcManager to stop
l.cg.WgWait() // ④ waits for htlcManager to exit
cg.Quit() (step ③) is called after hodlQueue.Stop() (step ②). Between
steps ① and ③ the htlcManager goroutine is still alive and continues
consuming peer messages from the mailbox. If a RevokeAndAck is pending at
shutdown time, the htlcManager processes it:
processRemoteRevokeAndAck
→ processRemoteAdds
→ processExitHop
→ NotifyExitHopHtlc
→ hodlSubscribe(hodlQueue.ChanIn(), circuitKey) ← NEW subscription
The partial guard inside processRemoteRevokeAndAck:
select {
case <-l.cg.Done():
return nil
default:
}
does not protect here because cg.Quit() has not yet been called — the
default branch always executes and processRemoteAdds proceeds.
After step ②, hodlQueue.ChanIn() is an unbuffered channel with no goroutine
reading it. The new subscription is permanently orphaned: present in
hodlSubscriptions but unable to receive a message.
This can only happen for subscriptions added after HodlUnsubscribeAll
completes. Subscriptions that existed before step ① are cleanly removed by it.
The bug is exclusively about the race window between step ① and step ③.
Deadlock Cascade
When notifyHodlSubscribers is later called for the orphaned circuit key (by
the expiry watcher, SettleHodlInvoice, CancelInvoice, or the MPP
auto-release timer):
notifyHodlSubscribers
→ hodlSubscriptionsMux.Lock() ← acquired
→ select {
case chanIn <- resolution: ← blocks: no reader, queue goroutine dead
case <-i.quit: ← only fires on full registry shutdown
}
A concurrent NotifyExitHopHtlc from any active channel holds i.Lock() and
waits for hodlSubscriptionsMux inside hodlSubscribe, producing the full
cascade:
goroutine A holds hodlSubscriptionsMux
blocked: chanIn <- resolution (dead queue)
goroutine B holds i.Lock()
blocked: hodlSubscriptionsMux.Lock() (inside hodlSubscribe)
goroutine C+ blocked: i.Lock() (NotifyExitHopHtlc, cancelInvoiceImpl,
AddInvoice, SettleHodlInvoice, ...)
The InvoiceRegistry is completely non-functional until lnd restarts.
Note: cancelInvoiceImpl acquires i.Lock() before calling
notifyHodlSubscribers, so when the expiry watcher is the trigger it holds
i.Lock() during the blocking send, causing the same cascade with no
concurrent NotifyExitHopHtlc required.
Conditions Required
Only two conditions must coincide:
- A RevokeAndAck is pending in the mailbox at shutdown time for a hold
invoice HTLC (single-part or MPP), causing processRemoteAdds →
NotifyExitHopHtlc → hodlSubscribe to execute in the race window after
HodlUnsubscribeAll has already run.
- notifyHodlSubscribers is subsequently called for that circuit key. This
happens automatically via the invoice expiry watcher (block-height based,
fires independently), the MPP 120-second auto-release timer, or any explicit
SettleHodlInvoice or CancelInvoice call.
No active MPP partial set is required. Any hold invoice HTLC that becomes
committed in the race window is sufficient. The expiry watcher alone is enough
to trigger the deadlock once the orphaned subscription exists.
Why This Is Reproducible Under Load
On nodes with high hold-invoice volume across multiple channels, condition 1 is
frequently true. With multiple peers disconnecting simultaneously (as seen in
goroutine dumps from affected nodes), each concurrent link teardown is an
independent opportunity to hit the race window. The probability that at least
one pending RevokeAndAck slips through on at least one channel approaches
certainty as peer churn increases.
The race window between hodlQueue.Stop() and cg.Quit() widens under CPU
pressure: the Go scheduler can preempt the goroutine running Stop() between
those two sequential calls, and the more goroutines compete for CPU time, the
longer it may stay descheduled.
Increased Vulnerability on Nodes Without Native SQL channeldb
Nodes whose channeldb is still backed by the KV/bbolt store are
disproportionately affected. The forwarding package operations in
processRemoteRevokeAndAck (channel.ReceiveRevocation, fwdPkg reads/writes)
go through bbolt, which uses a global write lock per database. Under high HTLC
load this introduces significant I/O latency in the htlcManager goroutine.
This has two compounding effects:
- Wider race window. The htlcManager spends more time inside
processRemoteRevokeAndAck on each RevokeAndAck. Slower KV I/O means it
is more likely to still be processing when Stop() is called, and more
likely to reach processRemoteAdds → hodlSubscribe before cg.Quit()
signals it to stop.
- More scheduler preemption. bbolt's file I/O paths generate syscalls,
which are natural preemption points for the Go scheduler. When the Stop()
goroutine is preempted at a syscall between hodlQueue.Stop() and
cg.Quit(), the scheduling gap widens. Under high load the Stop()
goroutine may be delayed significantly, giving the htlcManager more time to
process additional RevokeAndAck messages in the gap.
Once channeldb is migrated to native SQL, both effects are substantially
reduced, but the underlying race remains and must be fixed regardless of storage
backend.
Impact
Once triggered:
- No incoming payments can be processed (NotifyExitHopHtlc blocks on
i.Lock()).
- No hold invoices can be settled or cancelled.
- New invoices cannot be created.
- Invoice subscriptions stop delivering state change events.
The node appears online and connected but silently fails to process any payment.
The only recovery is a full lnd restart. The root cause recurs on the next
occurrence of the race.
Fix Direction
The fix is to invert the teardown order so the htlcManager is fully stopped
before the hodl subscription cleanup, ensuring no new subscriptions can be
created after HodlUnsubscribeAll:
// Stop htlcManager first — no more calls to NotifyExitHopHtlc are possible
l.cg.Quit()
l.cg.WgWait()
// Now safe to clean up: no goroutine can register a new subscription
l.cfg.Registry.HodlUnsubscribeAll(l.hodlQueue.ChanIn())
l.hodlQueue.Stop()
A complementary hardening is to make notifyHodlSubscribers not hold
hodlSubscriptionsMux across a potentially blocking channel send — for example
by releasing the lock before sending and re-acquiring it only for the map
cleanup. This would prevent a single blocked subscriber from freezing all other
hodl subscription operations even if the orphaned subscription somehow persists.