Pre-Submission Checklist
LND Version
v0.20.99-beta
LND Configuration
lnd.conf
Backend Version
Bitcoin Core v29.0.0
Backend Configuration
default configuration
OS/Distribution
Ubuntu 22.04
Bug Details & Steps to Reproduce
Body
When resolveContract hits any non-shutdown error from Resolve(), the goroutine just logs and returns. No retry, no re-queue. The resolver dies silently for the rest of that LND session.
```go
// channel_arbitrator.go:2648-2657
nextContract, err := currentContract.Resolve()
if err != nil {
	if err == errResolverShuttingDown {
		return
	}

	log.Errorf("ChannelArbitrator(%v): unable to "+
		"progress %T: %v",
		c.cfg.ChanPoint, currentContract, err)

	return // goroutine exits permanently, no retry
}
```
The problem is nobody re-spawns this goroutine. On each new block in StateWaitingFullResolution, handleBlockbeat only calls launchResolvers() which is a no-op for already-launched resolvers:
```go
// channel_arbitrator.go:3009-3020
if c.state.IsContractClosed() { // true for StateWaitingFullResolution
	c.launchResolvers() // calls Launch() only, not resolveContract()
	return nil
}
```
resolveContracts() (which actually spawns the goroutines with go c.resolveContract(contract)) is never called again after the initial transition into StateWaitingFullResolution.
What can trigger this
The resolvers call chain notifier APIs that can fail transiently:
- RegisterSpendNtfn (htlc_outgoing_contest_resolver.go:104)
- RegisterBlockEpochNtfn (htlc_outgoing_contest_resolver.go:130, htlc_incoming_contest_resolver.go:160)
- Various Checkpoint and DB calls inside resolvers
A bitcoind restart, a ZMQ socket hiccup, or a transient DB error is enough to kill the goroutine.
Scenario
- Channel force-closes with a pending outgoing HTLC (say 0.5 BTC, CLTV expiry in 144 blocks)
- resolveContract(htlcOutgoingContestResolver) goroutine starts normally
- Bitcoind restarts briefly; RegisterSpendNtfn returns an error
- Goroutine logs the error and returns (line 2657)
- HTLC output on-chain is now completely unwatched
- No new goroutine is ever spawned for this resolver
- 144 blocks pass, counterparty sweeps the timed-out HTLC output
- Funds gone
The only recovery is a full LND restart (which calls relaunchResolvers from scratch), but there is no indication to the operator that a resolver died. It is just a single log.Errorf line buried in the logs.
Expected Behavior
The goroutine should retry with backoff on transient errors instead of exiting permanently. Something like waiting a few seconds and calling Resolve() again, or at minimum propagating the failure so the state machine can handle it.
Debug Information
No response
Environment
No response