Fix race between channel_ready and link update #7518

Roasbeef merged 11 commits into lightningnetwork:master
Force-pushed f139d88 to bd1bf67
lnutils/sync_map.go
It looks like every use of ForEach in brontide.go could just as easily use Range. Perhaps it's not worth creating ForEach.
`ForEach` uses `Range` under the hood, so of course you can just replace it. It's added to differentiate the two use cases.
peer/brontide.go
What was the original reason for guarding both activeChannels and addedChannels with the same mutex? Why is it safe to separate them out?
I don't know the original motivation, it was probably that we have two maps, guard them with one mutex instead of two.
I checked and addedChannels is only used in filterChannelsToEnable which occurs in the same goroutine as <-newChannels. It doesn't seem like it's possible for any race condition to occur with the new addedChannels and activeChannels maps.
IIRC, the active channels map was originally only ever accessed in the main event loop. Once we added filterChannelsToEnable to handle channel enable/disable, then we created that other addedChannels map to track the new channels since the peer session was created.
These maps are also only ever modified right after each other in the main event loop channelManager.
If we wanted to, we could get rid of both these mutexes if we funnel all the state requests through the main channelManager goroutine (req+resp over channels).
Force-pushed 6410607 to 31b8f93
peer/brontide.go
Why do we care about nil-ness here? We should just check the second arg to see if it exists or not.
I know we use the ok && nil trick in a few other places, but IMO we should shy away from that in favor of not storing nil values in a map to mean a certain state.
Alternatively, we can make a very simple wrapper struct around the channel itself to be able to thread through another state to fill this new perceived gap.
Agree we should avoid the implicit usage of nil in map as well. Added a commit before this one to remove it.
Roasbeef left a comment
Nice work! This is pretty much what I had in mind when I left this comment: #7401 (comment). Will be great to finally fix this very long standing bug.
My main questions surround the new maps added, and if we can get by with not storing nil in maps with a pointer value (and rely on just ok or a wrapper struct to thread through some new state).
peer/brontide.go
Ah yeah, this is already in place (nil values in the map) and is indeed the first location we started to use that trick...
Added a new map to handle this specific case.
Tried to implement this in a commit on a different branch.
To remove the nil usage, we need three maps:
- `activeChannels` to store active channels only
- `pendingChannels` to store pending channels only
- `reestablishChannels` to store pending channels loaded from disk

I think that's quite a few maps... Meanwhile I noticed the channel already has the method `lnChan.IsPending()` to decide whether it's pending or not. This leads me to think we should manage the channel state inside the funding manager only, and access it from brontide, instead of maintaining yet another copy of the channel's state.
This means we should expand `lnChan` to have methods so the brontide can know its state, like whether it's marked open or `channel_ready` has been sent. Then inside brontide, we can decide what to do based on the state of the channel saved in `activeChannels`, wdyt?
Can defer convo to a refactor PR, but I don't like using lnChan because it's in-memory. If another sub-system uses a different *OpenChannel to query the same channel's state, it will either:
- get an incorrect value if trying to query whether channel reestablish was sent
- get the correct value but this means we are persisting this new state to disk which has a cost
Can defer convo to a refactor PR, but I don't like using lnChan because it's in-memory.
Agree here. I think we wanna slowly move the subsystems to deal with the db directly, treating the disk records as the single source of truth. Atm we can already do this in brontide as it has access to the db instance, maybe in another PR.
Force-pushed 31b8f93 to a41e779
peer/brontide.go
close(req.err) used to be before the revocation update code - better to keep the same ordering IMO.
Good catch, reverted.
The latest patch undid this
OK, reverted it again.
peer/brontide.go
activeChannels is never removed from, but I prefer keeping the mutex in situations like this. That way we can be sure that the object in the map exists without having to check again.
Not sure if I follow. I think a mutex-guarded map still needs to read the object first to make sure it exists?
funding/manager.go
To fully avoid race conditions for zeroconf, I think we actually need to call AddPendingChannel before we send funding_created.
Otherwise the following sequence of events is possible:
- We send `funding_created` and then immediately this thread is descheduled (before calling `AddPendingChannel`)
- Peer sends `funding_signed`
- Peer immediately sends `channel_ready` and `update_fee` since this is a zeroconf channel
- Our channel manager processes the `update_fee` and force closes
- Our `funding_created` thread gets scheduled again and we call `AddPendingChannel`
This leads me to think we'd need a new method, `CancelPendingChannel`, on brontide to cancel the channel if the funding flow fails?
Yes, we should clean up pending state in brontide.
Which reminds me there's other places we need to clean up too... (#7228)
Force-pushed a41e779 to 5562574
Force-pushed 5562574 to d22f1ad
Force-pushed cbdbdb8 to a107237
Crypt-iQ left a comment
super close, just two comments
funding/manager.go
Maybe just have a bool in chanIdentifier instead? What if somebody decides to use an all-zero ChannelID for a zero-conf channel?
Good point. Does it mean we should call RemovePendingChannel even if the channel ID is all zeros? For zero-conf channels we also send them to brontide right?
We send to brontide for zero-conf channels. I think it might just be better to have a bool that each caller sets if we should call RemovePendingChannel since that seems a lot harder to mess up / no hidden gotchas
Cool, yeah I like the idea. Changed to be more explicit!
Force-pushed 8f42b09 to 818dbe0
morehouse left a comment
Looking pretty clean.
Tested before and after 818dbe0b0966dc48cb5ed8e521d81532b633ae2a:
$ git checkout pr/7518
$ git checkout 12a41f301b32b0d36181389e6ce0cb6f5cb3cfcf
$ make itest icase=update_pending_open_channels
...
--- FAIL: TestLightningNetworkDaemon (81.04s)
--- FAIL: TestLightningNetworkDaemon/tranche00/135-of-135/btcd/update_pending_open_channels (76.17s)
--- FAIL: TestLightningNetworkDaemon/tranche00/135-of-135/btcd/update_pending_open_channels/pending_on_funder_side (68.25s)
--- FAIL: TestLightningNetworkDaemon/tranche00/135-of-135/btcd/update_pending_open_channels/pending_on_fundee_side (4.58s)
$ git checkout 818dbe0b0966dc48cb5ed8e521d81532b633ae2a
$ make itest icase=update_pending_open_channels
...
--- PASS: TestLightningNetworkDaemon (45.95s)
--- PASS: TestLightningNetworkDaemon/tranche00/135-of-135/btcd/update_pending_open_channels (40.20s)
--- PASS: TestLightningNetworkDaemon/tranche00/135-of-135/btcd/update_pending_open_channels/pending_on_funder_side (18.42s)
--- PASS: TestLightningNetworkDaemon/tranche00/135-of-135/btcd/update_pending_open_channels/pending_on_fundee_side (18.45s)
funding/manager.go
This is the only case where we pass true to failFundingFlow, and it's a rare case where the peer sent us funding_signed for a channel that we don't know about. Maybe it would be cleaner to handle this case specially, rather than add an extra parameter to failFundingFlow.
For example, we could keep the existing behavior by setting both cid.tempChanID and cid.chanID to msg.ChanID. Then document this strangeness with a comment explaining that since we have no pending channel ID, we use the permanent one for both.
Very clever, I like it!
peer/brontide.go
Comment is outdated -- we now don't remove the active channel.
itest/lnd_open_channel_test.go
Why is AssertNumPendingOpenChannels out of sync with what we'd expect?
Under the hood, this calls the rpc method PendingChannels, which relies on the field channel.IsPending to decide whether it's a pending channel or not. This field is set to true when calling MarkAsOpen.
Meanwhile, in our funding manager, this MarkAsOpen is called inside handleFundingConfirmation once the funding tx is confirmed, while I think it should be called in stateStep, either in case channelReadySent or case addedToRouterGraph, depending on the definition of being active.
So before more analysis and fixes, this rpc response will remain inaccurate.
Force-pushed 818dbe0 to f5f5c5e
morehouse left a comment
LGTM with a couple nits.
funding/manager.go
The peer can send us a funding_signed message with any invalid channel ID, which makes it impossible to ensure the pending channel ID is always found...
Unless we filter such messages before sending them to the funding manager, which I think should be avoided.
message with any invalid channel ID
yeah that's true.
Unless we filter such messages before sending them to the funding manager, which I think should be avoided.
What do you mean by filtering?
What do you mean by filtering?
We could check the channel ID in peer/brontide.go and handle it there if there's no pending channel for that ID. Then we could guarantee that only real channel IDs end up in handleFundingSigned. But please don't do that -- funding manager is a better place for this.
I think we should replace the second part of the comment:

- // this rare case, which can be removed once we can make sure
- // the pending channel ID is always found here.
+ // this rare case since we don't have a valid pending channel ID.
yeah that's not what brontide is for. Updated the comments.
This commit adds a new channel `newPendingChannel` and its dedicated handler `handleNewPendingChannel` to keep track of pending open channels. This should not affect the original handling of new active channels, except that `addedChannels` is now updated in `handleNewPendingChannel` so that this new pending channel won't be reestablished in the link.
The funding manager has been updated to use `AddPendingChannel`. Note that we track the pending channel before it's confirmed, as the peer may have a block height in the future (from our view), thus they may start operating on this channel before we consider it fully open. The mocked peers have been updated to implement the new interface method.
Force-pushed f5f5c5e to 17353ec
This commit adds a new struct `chanIdentifier` which wraps the pending channel ID and active channel ID. This struct is then used in `failFundingFlow` so the channel ID can be accessed there.
This commit adds a new interface method, `RemovePendingChannel`, to be used when the funding flow is failed after calling `AddPendingChannel` such that the Brontide has the most up-to-date view of the active channels.
This commit makes `updateNextRevocation` return an error and feeds it through the request's error channel so it's handled by the caller.
This commit adds a new itest case to check the race condition found in issue lightningnetwork#7401. In order to control the funding manager's state, a new dev config for the funding manager is introduced to specify a duration we should hold before processing the remote node's `channel_ready` message. A new development config, `DevConfig`, is introduced in `lncfg` and only takes effect if built with the `integration` flag. This can also be extended for future integration tests if more dev-only flags are needed.
This commit now sends messages to `chanStream` for both pending and active channels. If the message is sent to a pending channel, it will be queued in `chanStream`. Once the channel link becomes active, the early messages will be processed.
Force-pushed 17353ec to f9d4212
Depends on FundingLocked with ChannelReady #7517.

This PR starts tracking pending open channels in `peer.Brontide`. If a message is received for a pending channel, it will be cached and processed once the channel becomes active.

Fixes #7401