Fix race between `channel_ready` and link update by yyforyongyu · Pull Request #7518 · lightningnetwork/lnd

yyforyongyu · 2023-03-16T13:44:05Z

Depends on

This PR starts tracking pending open channels in peer.Brontide. If a message is received for a pending channel, it is cached and processed once the channel becomes active.

Fixes #7401

peer/brontide.go

funding/manager.go

peer/brontide.go

funding/manager.go

peer/brontide.go

morehouse · 2023-03-22T19:41:48Z

lnutils/sync_map.go

It looks like every use of ForEach in brontide.go could just as easily use Range. Perhaps it's not worth creating ForEach.

ForEach uses Range under the hood, so ofc you can just replace it. This is added to differentiate the two use cases.
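To make the ForEach/Range relationship concrete, here is a minimal sketch of a typed map wrapper where ForEach is a thin layer over Range. The `SyncMap` type, its method signatures, and the helper functions are assumptions for illustration only, not the actual contents of lnutils/sync_map.go.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// SyncMap is a hypothetical typed wrapper around sync.Map, used here only
// to illustrate the discussion.
type SyncMap[K comparable, V any] struct {
	m sync.Map
}

// Store saves a value under the given key.
func (s *SyncMap[K, V]) Store(key K, value V) {
	s.m.Store(key, value)
}

// Range visits entries until the visitor returns false.
func (s *SyncMap[K, V]) Range(visitor func(K, V) bool) {
	s.m.Range(func(k, v any) bool {
		return visitor(k.(K), v.(V))
	})
}

// ForEach visits entries until the visitor returns an error. It is built
// directly on top of Range.
func (s *SyncMap[K, V]) ForEach(visitor func(K, V) error) {
	s.Range(func(k K, v V) bool {
		return visitor(k, v) == nil
	})
}

// countAll visits every entry by always returning nil.
func countAll(s *SyncMap[string, int]) int {
	n := 0
	s.ForEach(func(string, int) error {
		n++
		return nil
	})
	return n
}

// countUntilError stops at the first entry by returning an error.
func countUntilError(s *SyncMap[string, int]) int {
	n := 0
	s.ForEach(func(string, int) error {
		n++
		return errors.New("stop")
	})
	return n
}

func main() {
	var s SyncMap[string, int]
	s.Store("a", 1)
	s.Store("b", 2)
	fmt.Println(countAll(&s), countUntilError(&s))
}
```

In this sketch the only difference between the two is the visitor's return type, which is the "differentiate the two use cases" point made above.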

morehouse · 2023-03-22T19:42:38Z

peer/brontide.go

What was the original reason for guarding both activeChannels and addedChannels with the same mutex? Why is it safe to separate them out?

I don't know the original motivation, it was probably that we have two maps, guard them with one mutex instead of two.

I checked and addedChannels is only used in filterChannelsToEnable which occurs in the same goroutine as <-newChannels. It doesn't seem like it's possible for any race condition to occur with the new addedChannels and activeChannels maps.

IIRC, the active channels map was originally only ever accessed in the main event loop. Once we added filterChannelsToEnable to handle channel enable/disable, then we created that other addedChannels map to track the new channels since the peer session was created.

These maps are also only ever modified right after each other in the main event loop channelManager.

If we wanted to, we could get rid of both these mutexes if we funnel all the state requests through the main channelManager goroutine (req+resp over channels).
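The "req+resp over channels" idea could look roughly like the sketch below: one goroutine owns the map, and every read or write goes through it, so no mutex is needed. The `manager` type and its methods are hypothetical and much simpler than Brontide's channelManager.

```go
package main

import "fmt"

// chanQuery asks the owning goroutine whether a channel ID is known.
type chanQuery struct {
	id   string
	resp chan bool
}

// manager is a hypothetical stand-in for a goroutine that owns channel
// state. Only run() ever touches the map.
type manager struct {
	add   chan string
	query chan chanQuery
	quit  chan struct{}
}

func newManager() *manager {
	m := &manager{
		add:   make(chan string),
		query: make(chan chanQuery),
		quit:  make(chan struct{}),
	}
	go m.run()
	return m
}

// run is the single event loop that serializes all state access.
func (m *manager) run() {
	active := make(map[string]struct{})
	for {
		select {
		case id := <-m.add:
			active[id] = struct{}{}
		case q := <-m.query:
			_, ok := active[q.id]
			q.resp <- ok
		case <-m.quit:
			return
		}
	}
}

// AddChannel hands a new channel ID to the event loop.
func (m *manager) AddChannel(id string) { m.add <- id }

// HasChannel sends a request and waits for the event loop's response.
func (m *manager) HasChannel(id string) bool {
	resp := make(chan bool, 1)
	m.query <- chanQuery{id: id, resp: resp}
	return <-resp
}

// Stop shuts down the event loop.
func (m *manager) Stop() { close(m.quit) }

func main() {
	m := newManager()
	defer m.Stop()

	m.AddChannel("chan-1")
	fmt.Println(m.HasChannel("chan-1"), m.HasChannel("chan-2"))
}
```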

peer/brontide.go

Roasbeef · 2023-04-11T00:46:42Z

peer/brontide.go

Why do we care about nil-ness here? We should just check the second arg to see if it exists or not.

I know we use the ok && nil trick in a few other places, but IMO we should shy away from that in favor of not storing nil values in a map to mean a certain state.

Alternatively, we can make a very simple wrapper struct around the channel itself to be able to thread through another state to fill this new perceived gap.

Agree we should avoid the implicit usage of nil in map as well. Added a commit before this one to remove it.
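For reference, the wrapper-struct alternative mentioned above might look like the following sketch: map presence is checked with the second return value, and the channel's state is carried explicitly rather than encoded as a nil pointer. All type and function names here are hypothetical.

```go
package main

import "fmt"

// channel is a placeholder for the real channel object.
type channel struct{ id string }

// chanState is a hypothetical explicit state, replacing "nil means pending".
type chanState int

const (
	statePending chanState = iota
	stateActive
)

// chanEntry wraps the channel together with its state.
type chanEntry struct {
	state chanState
	ch    *channel
}

// describe relies only on the comma-ok result and the explicit state,
// never on whether a stored pointer happens to be nil.
func describe(entries map[string]chanEntry, id string) string {
	entry, ok := entries[id]
	if !ok {
		return "unknown"
	}
	if entry.state == statePending {
		return "pending"
	}
	return "active"
}

func main() {
	entries := map[string]chanEntry{
		"a": {state: statePending, ch: &channel{id: "a"}},
		"b": {state: stateActive, ch: &channel{id: "b"}},
	}
	fmt.Println(describe(entries, "a"), describe(entries, "b"), describe(entries, "c"))
}
```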

Roasbeef

Nice work! This is pretty much what I had in mind when I left this comment: #7401 (comment). Will be great to finally fix this very long standing bug.

My main questions surround the new maps added, and if we can get by with not storing nil in maps with a pointer value (and rely on just ok or a wrapper struct to thread through some new state).

Roasbeef · 2023-04-11T00:48:12Z

peer/brontide.go

Ah yeah, this is already in place (nil values in the map) and is indeed the first location we started to use that trick...

Added a new map to handle this specific case.

Tried to implement here, a commit here in a different branch,

To remove the nil usage, we need three maps,

activeChannels to store active channels only

pendingChannels to store pending channels only

reestablishChannels to store pending channels loaded from disk

I think that's quite a few maps... Meanwhile, I noticed the channel store already has a method, lnChan.IsPending(), to decide whether it's pending or not. This leads me to think we should manage the channel state inside the funding manager only, and access it from brontide, instead of maintaining yet another copy of the channel state.

This means we should expand lnChan to have methods so the brontide can know its state, like whether it's marked open or channel ready sent. Then inside brontide, we can decide what to do based on the state of the channel saved in activeChannels, wdyt?

Can defer convo to a refactor PR, but I don't like using lnChan because it's in-memory. If another sub-system uses a different *OpenChannel to query the same channel's state, it will either:

get an incorrect value if trying to query whether channel reestablish was sent

get the correct value but this means we are persisting this new state to disk which has a cost

Can defer convo to a refactor PR, but I don't like using lnChan because it's in-memory.

Agree here. I think we wanna slowly move the subsystems to deal with the db directly, treating the disk records as the single source of truth. Atm we can already do this in brontide as it has access to the db instance, maybe in another PR.

peer/brontide.go

funding/manager.go

Crypt-iQ

verrrrry close 🦖

can you test this with a hacky test (add a sleep somewhere in handleFundingLocked and have a peer send UpdateFee) to confirm this works?

Crypt-iQ · 2023-04-11T15:00:00Z

peer/brontide.go

close(req.err) used to be before the revocation update code - better to keep the same ordering IMO.

Good catch, reverted.

The latest patch undid this

ok reverted it again

Crypt-iQ · 2023-04-11T15:03:37Z

peer/brontide.go

activeChannels is never removed from, but I prefer keeping the mutex in situations like this. That way we can be sure that the object in the map exists without having to check again.

not sure if I follow. I think using a mutex map will need to read the object first to make sure it exists?

peer/brontide.go

morehouse · 2023-04-26T22:17:36Z

funding/manager.go

To fully avoid race conditions for zeroconf, I think we actually need to call AddPendingChannel before we send funding_created.

Otherwise the following sequence of events is possible:

We send funding_created and then immediately this thread is descheduled (before calling AddPendingChannel)

Peer sends funding_signed

Peer immediately sends channel_ready and update_fee since this is a zeroconf channel

Our channel manager processes the update_fee and force closes

Our funding_created thread gets scheduled again and we call AddPendingChannel
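The sequence above can be modeled with a small deterministic sketch. It does not use real goroutines or lnd types; it simply replays the worst-case schedule, where the peer's update is processed immediately after funding_created is sent, and compares registering the pending channel before versus after the send. All names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

var errUnknownChannel = errors.New("update for unknown channel")

// peer is a hypothetical stand-in that only tracks pending channel IDs.
type peer struct {
	pending map[string]struct{}
}

// AddPendingChannel records that a funding flow is in progress.
func (p *peer) AddPendingChannel(id string) {
	p.pending[id] = struct{}{}
}

// handleUpdate models processing an early update_fee. An unknown channel
// is treated as an error, standing in for the force close described above.
func (p *peer) handleUpdate(id string) error {
	if _, ok := p.pending[id]; !ok {
		return errUnknownChannel
	}
	return nil
}

// openChannel replays the worst-case interleaving: the remote's update is
// handled right after funding_created goes out, before this "thread" runs
// again.
func openChannel(p *peer, id string, registerFirst bool) error {
	if registerFirst {
		p.AddPendingChannel(id)
	}

	// funding_created is sent here; the peer's update arrives next.
	err := p.handleUpdate(id)

	if !registerFirst {
		p.AddPendingChannel(id)
	}
	return err
}

func main() {
	late := &peer{pending: make(map[string]struct{})}
	early := &peer{pending: make(map[string]struct{})}

	fmt.Println("register after send: ", openChannel(late, "chan-1", false))
	fmt.Println("register before send:", openChannel(early, "chan-1", true))
}
```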

great catch

Nice catch!

This leads me to think, we'd need a new method, CancelPendingChannel on brontide to cancel the channel if the funding flow fails?

Yes, we should clean up pending state in brontide.

Which reminds me there's other places we need to clean up too... (#7228)

Crypt-iQ

super close, just two comments

Crypt-iQ · 2023-07-28T17:36:19Z

funding/manager.go

Maybe just have a bool in chanIdentifier instead? What if somebody decides to use an all-zero ChannelID for a zero-conf channel?

Good point. Does it mean we should call RemovePendingChannel even if the channel ID is all zeros? For zero-conf channels we also send them to brontide right?

We send to brontide for zero-conf channels. I think it might just be better to have a bool that each caller sets if we should call RemovePendingChannel since that seems a lot harder to mess up / no hidden gotchas

cool yeah i like the idea. Changed to be more explicit!
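A rough sketch of why an explicit bool is harder to misuse than an all-zero sentinel. The struct layout and method names below are assumptions for illustration and may not match what was merged.

```go
package main

import "fmt"

// channelID stands in for a 32-byte channel ID (hypothetical type).
type channelID [32]byte

// chanIdentifier is an illustrative struct holding both IDs plus an
// explicit flag recording whether the permanent ID was ever set.
type chanIdentifier struct {
	tempChanID channelID
	chanID     channelID
	chanIDSet  bool
}

// setChanID stores the permanent ID and marks it as set.
func (c *chanIdentifier) setChanID(id channelID) {
	c.chanID = id
	c.chanIDSet = true
}

// hasChanIDBySentinel infers "set" from a non-zero value, which breaks if
// the real ID happens to be all zeros.
func (c *chanIdentifier) hasChanIDBySentinel() bool {
	var zero channelID
	return c.chanID != zero
}

// hasChanID uses the explicit flag instead.
func (c *chanIdentifier) hasChanID() bool {
	return c.chanIDSet
}

func main() {
	var cid chanIdentifier
	cid.setChanID(channelID{}) // an all-zero ID, unusual but possible

	fmt.Println("sentinel check:", cid.hasChanIDBySentinel())
	fmt.Println("explicit flag: ", cid.hasChanID())
}
```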

funding/manager.go

morehouse

Looking pretty clean.

Tested before and after 818dbe0b0966dc48cb5ed8e521d81532b633ae2a:

$ git checkout pr/7518
$ git checkout 12a41f301b32b0d36181389e6ce0cb6f5cb3cfcf
$ make itest icase=update_pending_open_channels
...
--- FAIL: TestLightningNetworkDaemon (81.04s)
    --- FAIL: TestLightningNetworkDaemon/tranche00/135-of-135/btcd/update_pending_open_channels (76.17s)
        --- FAIL: TestLightningNetworkDaemon/tranche00/135-of-135/btcd/update_pending_open_channels/pending_on_funder_side (68.25s)
        --- FAIL: TestLightningNetworkDaemon/tranche00/135-of-135/btcd/update_pending_open_channels/pending_on_fundee_side (4.58s)

$ git checkout 818dbe0b0966dc48cb5ed8e521d81532b633ae2a
$ make itest icase=update_pending_open_channels
...
--- PASS: TestLightningNetworkDaemon (45.95s)
    --- PASS: TestLightningNetworkDaemon/tranche00/135-of-135/btcd/update_pending_open_channels (40.20s)
        --- PASS: TestLightningNetworkDaemon/tranche00/135-of-135/btcd/update_pending_open_channels/pending_on_funder_side (18.42s)
        --- PASS: TestLightningNetworkDaemon/tranche00/135-of-135/btcd/update_pending_open_channels/pending_on_fundee_side (18.45s)

morehouse · 2023-08-02T16:23:08Z

funding/manager.go

This is the only case where we pass true to failFundingFlow, and it's a rare case where the peer sent us funding_signed for a channel that we don't know about. Maybe it would be cleaner to handle this case specially, rather than add an extra parameter to failFundingFlow.

For example, we could keep the existing behavior by setting both cid.tempChanID and cid.chanID to msg.ChanID. Then document this strangeness with a comment explaining that since we have no pending channel ID, we use the permanent one for both.

Very clever, I like it!

morehouse · 2023-08-02T16:26:46Z

peer/brontide.go

Comment is outdated -- we now don't remove the active channel.

peer/brontide.go

peer/brontide_test.go

morehouse · 2023-08-02T17:34:19Z

itest/lnd_open_channel_test.go

Why is AssertNumPendingOpenChannels out of sync with what we'd expect?

Under the hood, this calls the rpc method PendingChannels, which relies on the field channel.IsPending to decide whether it's a pending channel or not. This field is set to true when calling MarkAsOpen.

Meanwhile, in our funding manager, this MarkAsOpen is called inside handleFundingConfirmation once the funding tx is confirmed, while I think it should be called in stateStep, either in case channelReadySent or case addedToRouterGraph, depending on the definition of being active.

So before more analysis and fixes, this rpc response will remain inaccurate.

morehouse

LGTM with a couple nits.

funding/manager.go

morehouse · 2023-08-08T14:55:59Z

funding/manager.go

The peer can send us a funding_signed message with any invalid channel ID, which makes it impossible to ensure the pending channel ID is always found...

Unless we filter such messages before sending them to the funding manager, which I think should be avoided.

message with any invalid channel ID

yeah that's true.

Unless we filter such messages before sending them to the funding manager, which I think should be avoided.

What do you mean by filtering?

What do you mean by filtering?

We could check the channel ID in peer/brontide.go and handle it there if there's no pending channel for that ID. Then we could guarantee that only real channel IDs end up in handleFundingSigned. But please don't do that -- funding manager is a better place for this.

I think we should replace the second part of the comment:

Suggested change

- // this rare case, which can be removed once we can make sure
- // the pending channel ID is always found here.
+ // this rare case since we don't have a valid pending channel ID.

yeah that's not what brontide is for. Updated the comments.

funding/manager.go

This commit adds a new channel `newPendingChannel` and its dedicated handler `handleNewPendingChannel` to keep track of pending open channels. This should not affect the original handling of new active channels, except `addedChannels` is now updated in `handleNewPendingChannel` such that this new pending channel won't be reestablished in link.

The funding manager has been updated to use `AddPendingChannel`. Note that we track the pending channel before it's confirmed, as the peer may have a block height in the future (from our view), and thus may start operating in this channel before we consider it fully open. The mocked peers have been updated to implement the new interface method.

This commit adds a new itest case to check the race condition found in issue lightningnetwork#7401. In order to control the funding manager's state, a new dev config for the funding manager is introduced to specify how long to wait before processing the remote node's channel_ready message. A new development config, `DevConfig`, is introduced in `lncfg` and only takes effect if built with the `integration` flag. This can also be extended for future integration tests if more dev-only flags are needed.

This commit now sends messages to `chanStream` for both pending and active channels. If the message is sent to a pending channel, it will be queued in `chanStream`. Once the channel link becomes active, the early messages will be processed.
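The queue-then-flush behavior described in this commit message can be sketched as follows. This is a single-threaded illustration with hypothetical names (`msgRouter`, `AddPending`, `Activate`, `Deliver`); it is not how `chanStream` is implemented and it omits all locking.

```go
package main

import "fmt"

// msgRouter is a hypothetical router that queues messages for pending
// channels and hands them to a handler once the channel becomes active.
type msgRouter struct {
	pending map[string][]string
	active  map[string]func(string)
}

func newMsgRouter() *msgRouter {
	return &msgRouter{
		pending: make(map[string][]string),
		active:  make(map[string]func(string)),
	}
}

// AddPending starts tracking a channel that is not yet active.
func (r *msgRouter) AddPending(id string) {
	r.pending[id] = nil
}

// Deliver processes the message if the channel is active, queues it if
// the channel is pending, and rejects it otherwise.
func (r *msgRouter) Deliver(id, msg string) error {
	if handle, ok := r.active[id]; ok {
		handle(msg)
		return nil
	}
	if _, ok := r.pending[id]; ok {
		r.pending[id] = append(r.pending[id], msg)
		return nil
	}
	return fmt.Errorf("unknown channel %s", id)
}

// Activate replays the queued messages in arrival order, then routes all
// later messages straight to the handler.
func (r *msgRouter) Activate(id string, handle func(string)) {
	for _, msg := range r.pending[id] {
		handle(msg)
	}
	delete(r.pending, id)
	r.active[id] = handle
}

func main() {
	r := newMsgRouter()
	r.AddPending("chan-1")

	// An early message arrives before the link is ready.
	_ = r.Deliver("chan-1", "update_fee")

	var processed []string
	r.Activate("chan-1", func(msg string) {
		processed = append(processed, msg)
	})
	_ = r.Deliver("chan-1", "update_add_htlc")

	fmt.Println(processed)
	fmt.Println(r.Deliver("chan-2", "update_fee"))
}
```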

morehouse

ACK f9d4212

yyforyongyu self-assigned this Mar 16, 2023

yyforyongyu added funding Related to the opening of new channels with funding transactions on the blockchain brontide labels Mar 16, 2023

yyforyongyu added this to the v0.16.1 milestone Mar 16, 2023

saubyk linked an issue Mar 16, 2023 that may be closed by this pull request

[bug]: race with funding_locked and update_fee/update_add_htlc #7401

Closed

morehouse reviewed Mar 16, 2023

yyforyongyu force-pushed the fix-channel-ready-race branch 4 times, most recently from f139d88 to bd1bf67 Compare March 20, 2023 08:46

yyforyongyu requested review from Crypt-iQ and Roasbeef March 21, 2023 04:35

morehouse reviewed Mar 22, 2023

yyforyongyu force-pushed the fix-channel-ready-race branch 2 times, most recently from 6410607 to 31b8f93 Compare March 30, 2023 17:34

Roasbeef reviewed Apr 11, 2023

Roasbeef requested changes Apr 11, 2023

yyforyongyu force-pushed the fix-channel-ready-race branch from 31b8f93 to a41e779 Compare April 17, 2023 13:59

yyforyongyu changed the base branch from 0-16-1-staging to master April 17, 2023 13:59

Roasbeef modified the milestones: v0.16.1, v0.16.2 Apr 18, 2023

Crypt-iQ reviewed Apr 19, 2023

saubyk modified the milestones: v0.16.2, v0.17.0 Apr 20, 2023

morehouse reviewed Apr 26, 2023

yyforyongyu force-pushed the fix-channel-ready-race branch from a41e779 to 5562574 Compare May 10, 2023 11:49

saubyk requested review from Crypt-iQ, Roasbeef and morehouse May 11, 2023 15:18

yyforyongyu force-pushed the fix-channel-ready-race branch from 5562574 to d22f1ad Compare May 25, 2023 22:10

yyforyongyu force-pushed the fix-channel-ready-race branch from cbdbdb8 to a107237 Compare July 27, 2023 18:19

yyforyongyu requested review from Crypt-iQ and morehouse July 27, 2023 20:14

Crypt-iQ reviewed Jul 28, 2023

yyforyongyu force-pushed the fix-channel-ready-race branch 4 times, most recently from 8f42b09 to 818dbe0 Compare July 31, 2023 17:12

morehouse approved these changes Aug 2, 2023

yyforyongyu force-pushed the fix-channel-ready-race branch from 818dbe0 to f5f5c5e Compare August 8, 2023 10:16

morehouse approved these changes Aug 8, 2023

Crypt-iQ approved these changes Aug 8, 2023

yyforyongyu added 4 commits August 9, 2023 00:17

peer: add method handleLinkUpdateMsg to handle channel update msgs

3eb7f54

docs: update release note for race fix

048d7d7

yyforyongyu force-pushed the fix-channel-ready-race branch from f5f5c5e to 17353ec Compare August 8, 2023 16:20

yyforyongyu added 7 commits August 9, 2023 01:29

funding: make failFundingFlow takes both channel IDs

3ed579d

This commit adds a new struct `chanIdentifier` which wraps the pending channel ID and active channel ID. This struct is then used in `failFundingFlow` so the channel ID can be accessed there.

multi: remove pending channel from Brontide when funding flow failed

9275725

This commit adds a new interface method, `RemovePendingChannel`, to be used when the funding flow fails after calling `AddPendingChannel`, so that Brontide has the most up-to-date view of the active channels.

peer: return an error from updateNextRevocation and patch unit tests

d28242c

This commit makes `updateNextRevocation` return an error and feeds it through the request's error chan so that it is handled by the caller.
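As an illustration of feeding an error back through a request's error chan, consider the sketch below. The request type, the simplified `updateNextRevocation` signature, and the `submit` helper are all assumptions; the real function operates on channel state rather than a plain map.

```go
package main

import (
	"errors"
	"fmt"
)

// newChanReq is a hypothetical request carrying a channel ID and an error
// chan on which the handler reports the outcome.
type newChanReq struct {
	id  string
	err chan error
}

// updateNextRevocation is a simplified stand-in that fails when the
// channel is not found.
func updateNextRevocation(active map[string]bool, id string) error {
	if !active[id] {
		return errors.New("channel not found: " + id)
	}
	return nil
}

// handleReq forwards the result, nil or not, to the caller instead of
// only logging it.
func handleReq(active map[string]bool, req newChanReq) {
	req.err <- updateNextRevocation(active, req.id)
}

// submit builds a request, hands it to the handler, and waits for the
// result. The chan is buffered so the handler never blocks on the send.
func submit(active map[string]bool, id string) error {
	req := newChanReq{id: id, err: make(chan error, 1)}
	handleReq(active, req)
	return <-req.err
}

func main() {
	active := map[string]bool{"chan-1": true}
	fmt.Println(submit(active, "chan-1"))
	fmt.Println(submit(active, "chan-2"))
}
```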

multi: patch unit tests for handling pending channels

6b41289

golangci: update linter settings for test files

a9da25b

yyforyongyu force-pushed the fix-channel-ready-race branch from 17353ec to f9d4212 Compare August 8, 2023 17:29

morehouse approved these changes Aug 8, 2023

Roasbeef merged commit 8f693fe into lightningnetwork:master Aug 9, 2023

yyforyongyu deleted the fix-channel-ready-race branch August 9, 2023 04:30

This was referenced Aug 11, 2023

funding: remove dead code and sanity check pending chan ID #7887

Merged

funding: fix flake in itest caused by persistent fee param changes #7648

Merged
