Skip to content

Fix race between channel_ready and link update#7518

Merged
Roasbeef merged 11 commits intolightningnetwork:masterfrom
yyforyongyu:fix-channel-ready-race
Aug 9, 2023
Merged

Fix race between channel_ready and link update#7518
Roasbeef merged 11 commits intolightningnetwork:masterfrom
yyforyongyu:fix-channel-ready-race

Conversation

@yyforyongyu
Copy link
Copy Markdown
Member

@yyforyongyu yyforyongyu commented Mar 16, 2023

Depends on

This PR starts tracking pending open channels in peer.Brontide. Suppose a message is received for this pending channel, it will be cached and processed once this channel becomes active.

Fixes #7401

@yyforyongyu yyforyongyu self-assigned this Mar 16, 2023
@yyforyongyu yyforyongyu added funding Related to the opening of new channels with funding transactions on the blockchain brontide labels Mar 16, 2023
@yyforyongyu yyforyongyu added this to the v0.16.1 milestone Mar 16, 2023
@saubyk saubyk linked an issue Mar 16, 2023 that may be closed by this pull request
@yyforyongyu yyforyongyu force-pushed the fix-channel-ready-race branch 4 times, most recently from f139d88 to bd1bf67 Compare March 20, 2023 08:46
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like every use of ForEach in brontide.go could just as easily use Range. Perhaps it's not worth creating ForEach.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ForEach uses Range under the hood, so ofc you can just replace it. This is added to differentiate the two use cases.

peer/brontide.go Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the original reason for guarding both activeChannels and addedChannels with the same mutex? Why is it safe to separate them out?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know the original motivation, it was probably that we have two maps, guard them with one mutex instead of two.

I checked and addedChannels is only used in filterChannelsToEnable which occurs in the same goroutine as <-newChannels. It doesn't seem like it's possible for any race condition to occur with the new addedChannels and activeChannels maps.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, the active channels map was originally only ever accessed in the main event loop. Once we added filterChannelsToEnable to handle channel enable/disable, then we created that other addedChannels map to track the new channels since the peer session was created.

These maps are also only ever modified right after each other in the main event loop channelManager.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we wanted to, we could get rid of both these mutexes if we funnel all the state requests through the main channelManager goroutine (req+resp over channels).

@yyforyongyu yyforyongyu force-pushed the fix-channel-ready-race branch 2 times, most recently from 6410607 to 31b8f93 Compare March 30, 2023 17:34
peer/brontide.go Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, the active channels map was originally only ever accessed in the main event loop. Once we added filterChannelsToEnable to handle channel enable/disable, then we created that other addedChannels map to track the new channels since the peer session was created.

These maps are also only ever modified right after each other in the main event loop channelManager.

peer/brontide.go Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we wanted to, we could get rid of both these mutexes if we funnel all the state requests through the main channelManager goroutine (req+resp over channels).

peer/brontide.go Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do we care about nil-ness here? We should just check the second arg to see if it exists or not.

I know we use the ok && nil trick in a few other places, but IMO we should shy away from that in favor of not storing nil values in a map to mean a certain state.

Alternatively, we can make a very simple wrapper struct around the channel itself to be able to thread through another state to fill this new perceived gap.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree we should avoid the implicit usage of nil in map as well. Added a commit before this one to remove it.

Copy link
Copy Markdown
Member

@Roasbeef Roasbeef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! This is pretty much what I had in mind when I left this comment: #7401 (comment). Will be great to finally fix this very long standing bug.

My main questions surround the new maps added, and if we can get by with not storing nil in maps with a pointer value (and rely on just ok or a wrapper struct to thread through some new state).

peer/brontide.go Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah, this is already in place (nil values in the map) and is indeed the first location we started to use that trick...

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a new map to handle this specific case.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried to implement here, a commit here in a different branch,

To remove the nil usage, we need three maps,

  • activeChannels to store active channels only
  • pendingChannels to store pending channels only
  • reestablishChannels to store pending channels loaded from disk

I think that's quite some maps...meanwhile I noticed the channel store already have method lnChan.IsPending() to decide whether it's pending or not. This leads me to think we should manage the channel state inside funding manager only, and access it from brontide, instead of maintaining yet another state of the channel.

This means we should expand lnChan to have methods so the brontide can know its state, like whether it's marked open or channel ready sent. Then inside brontide, we can decide what to do based on the state of the channel saved in activeChannels, wdyt?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can defer convo to a refactor PR, but I don't like using lnChan because it's in-memory. If another sub-system uses a different *OpenChannel to query the same channel's state, it will either:

  • get an incorrect value if trying to query whether channel reestablish was sent
  • get the correct value but this means we are persisting this new state to disk which has a cost

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can defer convo to a refactor PR, but I don't like using lnChan because it's in-memory.

Agree here. I think we wanna slowly move the subsystems to deal with the db directly, treating the disk records as the single source of truth. Atm we can already do this in brontide as it has access to the db instance, maybe in another PR.

@yyforyongyu yyforyongyu force-pushed the fix-channel-ready-race branch from 31b8f93 to a41e779 Compare April 17, 2023 13:59
@yyforyongyu yyforyongyu changed the base branch from 0-16-1-staging to master April 17, 2023 13:59
@Roasbeef Roasbeef modified the milestones: v0.16.1, v0.16.2 Apr 18, 2023
Copy link
Copy Markdown
Collaborator

@Crypt-iQ Crypt-iQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

verrrrry close 🦖

can you test this with a hacky test (add a sleep somewhere in handleFundingLocked and have a peer send UpdateFee) to confirm this works?

peer/brontide.go Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know the original motivation, it was probably that we have two maps, guard them with one mutex instead of two.

I checked and addedChannels is only used in filterChannelsToEnable which occurs in the same goroutine as <-newChannels. It doesn't seem like it's possible for any race condition to occur with the new addedChannels and activeChannels maps.

peer/brontide.go Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

close(req.err) used to be before the revocation update code - better to keep the same ordering IMO.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, reverted.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest patch undid this

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok reverted it again

peer/brontide.go Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

activeChannels is never removed from, but I prefer keeping the mutex in situations like this. That way we can be sure that the object in the map exists without having to check again.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if I follow. I think using a mutex map will need to read the object first to make sure it exists?

peer/brontide.go Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can defer convo to a refactor PR, but I don't like using lnChan because it's in-memory. If another sub-system uses a different *OpenChannel to query the same channel's state, it will either:

  • get an incorrect value if trying to query whether channel reestablish was sent
  • get the correct value but this means we are persisting this new state to disk which has a cost

@saubyk saubyk modified the milestones: v0.16.2, v0.17.0 Apr 20, 2023
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To fully avoid race conditions for zeroconf, I think we actually need to call AddPendingChannel before we send funding_created.

Otherwise the following sequence of events is possible:

  1. We send funding_created and then immediately this thread is descheduled (before calling AddPendingChannel)
  2. Peer sends funding_signed
  3. Peer immediately sends channel_ready and update_fee since this is a zeroconf channel
  4. Our channel manager processes the update_fee and force closes
  5. Our funding_created thread gets scheduled again and we call AddPendingChannel

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great catch

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This leads me to think, we'd need a new method, CancelPendingChannel on brontide to cancel the channel if the funding flow fails?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should clean up pending state in brontide.

Which reminds me there's other places we need to clean up too... (#7228)

@yyforyongyu yyforyongyu force-pushed the fix-channel-ready-race branch from a41e779 to 5562574 Compare May 10, 2023 11:49
@saubyk saubyk requested review from Crypt-iQ, Roasbeef and morehouse May 11, 2023 15:18
@yyforyongyu yyforyongyu force-pushed the fix-channel-ready-race branch from 5562574 to d22f1ad Compare May 25, 2023 22:10
@yyforyongyu yyforyongyu force-pushed the fix-channel-ready-race branch from cbdbdb8 to a107237 Compare July 27, 2023 18:19
Copy link
Copy Markdown
Collaborator

@Crypt-iQ Crypt-iQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super close, just two comments

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just have a bool in chanIdentifier instead? What if somebody decides to use an all-zero ChannelID for a zero-conf channel?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Does it mean we should call RemovePendingChannel even if the channel ID is all zeros? For zero-conf channels we also send them to brontide right?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We send to brontide for zero-conf channels. I think it might just be better to have a bool that each caller sets if we should call RemovePendingChannel since that seems a lot harder to mess up / no hidden gotchas

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool yeah i like the idea. Change to be more explicit!

@yyforyongyu yyforyongyu force-pushed the fix-channel-ready-race branch 4 times, most recently from 8f42b09 to 818dbe0 Compare July 31, 2023 17:12
Copy link
Copy Markdown
Collaborator

@morehouse morehouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking pretty clean.

Tested before and after 818dbe0b0966dc48cb5ed8e521d81532b633ae2a:

$ git checkout pr/7518
$ git checkout 12a41f301b32b0d36181389e6ce0cb6f5cb3cfcf
$ make itest icase=update_pending_open_channels
...
--- FAIL: TestLightningNetworkDaemon (81.04s)
    --- FAIL: TestLightningNetworkDaemon/tranche00/135-of-135/btcd/update_pending_open_channels (76.17s)
        --- FAIL: TestLightningNetworkDaemon/tranche00/135-of-135/btcd/update_pending_open_channels/pending_on_funder_side (68.25s)
        --- FAIL: TestLightningNetworkDaemon/tranche00/135-of-135/btcd/update_pending_open_channels/pending_on_fundee_side (4.58s)

$ git checkout 818dbe0b0966dc48cb5ed8e521d81532b633ae2a
$ make itest icase=update_pending_open_channels
...
--- PASS: TestLightningNetworkDaemon (45.95s)
    --- PASS: TestLightningNetworkDaemon/tranche00/135-of-135/btcd/update_pending_open_channels (40.20s)
        --- PASS: TestLightningNetworkDaemon/tranche00/135-of-135/btcd/update_pending_open_channels/pending_on_funder_side (18.42s)
        --- PASS: TestLightningNetworkDaemon/tranche00/135-of-135/btcd/update_pending_open_channels/pending_on_fundee_side (18.45s)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only case where we pass true to failFundingFlow, and it's a rare case where the peer sent us funding_signed for a channel that we don't know about. Maybe it would be cleaner to handle this case specially, rather than add an extra parameter to failFundingFlow.

For example, we could keep the existing behavior by setting both cid.tempChanID and cid.chanID to msg.ChanID. Then document this strangeness with a comment explaining that since we have no pending channel ID, we use the permanent one for both.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very clever, I like it!

peer/brontide.go Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment is outdated -- we now don't remove the active channel.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed!

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is AssertNumPendingOpenChannels out of sync with what we'd expect?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under the hood, this calls the rpc method PendingChannels, which relies on the field channel.IsPending to decide whether it's a pending channel or not. This field is set to true when calling MarkAsOpen.

Meanwhile, in our funding manager, this MarkAsOpen is called inside handleFundingConfirmation once the funding tx is confirmed, while I think it should be called in stateStep, either in case channelReadySent or case addedToRouterGraph, depending on the definition of being active.

So before more analysis and fixes, this rpc response will remain inaccurate.

@yyforyongyu yyforyongyu force-pushed the fix-channel-ready-race branch from 818dbe0 to f5f5c5e Compare August 8, 2023 10:16
Copy link
Copy Markdown
Collaborator

@morehouse morehouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a couple nits.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The peer can send us a funding_signed message with any invalid channel ID, which makes it impossible to ensure the pending channel ID is always found...

Unless we filter such messages before sending them to the funding manager, which I think should be avoided.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

message with any invalid channel ID

yeah that's true.

Unless we filter such messages before sending them to the funding manager, which I think should be avoided.

What do you mean by filtering?

Copy link
Copy Markdown
Collaborator

@morehouse morehouse Aug 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by filtering?

We could check the channel ID in peer/brontide.go and handle it there if there's no pending channel for that ID. Then we could guarantee that only real channel IDs end up in handleFundingSigned. But please don't do that -- funding manager is a better place for this.

I think we should replace the second part of the comment:

Suggested change
// this rare case, which can be removed once we can make sure
// the pending channel ID is always found here.
// this rare case since we don't have a valid pending channel ID.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that's not what brontide is for. Updated the comments.

This commit adds a new channel `newPendingChannel` and its dedicated
handler `handleNewPendingChannel` to keep track of pending open
channels. This should not affect the original handling of new active
channels, except `addedChannels` is now updated in
`handleNewPendingChannel` such that this new pending channel won't be
reestablished in link.
The funding manager has been updated to use `AddPendingChannel`. Note
that we track the pending channel before it's confirmed as the peer may
have a block height in the future(from our view), thus they may start
operating in this channel before we consider it as fully open.

The mocked peers have been updated to implement the new interface method.
@yyforyongyu yyforyongyu force-pushed the fix-channel-ready-race branch from f5f5c5e to 17353ec Compare August 8, 2023 16:20
This commit adds a new struct `chanIdentifier` which wraps the pending
channel ID and active channel ID. This struct is then used in
`failFundingFlow` so the channel ID can be access there.
This commit adds a new interface method, `RemovePendingChannel`, to be
used when the funding flow is failed after calling `AddPendingChannel`
such that the Brontide has the most up-to-date view of the active
channels.
This commit makes the `updateNextRevocation` to return an error and
further feed it through the request's error chan so it'd be handled by
the caller.
This commit adds a new itest case to check the race condition found in
issue lightningnetwork#7401. In order to control the funding manager's state, a new dev
config for the funding manager is introduced to specify a duration we
should hold before processing remote node's channel_ready message.

A new development config, `DevConfig` is introduced in `lncfg` and will
only have effect if built with flag `integration`. This can also be
extended for future integration tests if more dev-only flags are needed.
This commit now sends messages to `chanStream` for both pending and
active channels. If the message is sent to a pending channel, it will be
queued in `chanStream`. Once the channel link becomes active, the early
messages will be processed.
@yyforyongyu yyforyongyu force-pushed the fix-channel-ready-race branch from 17353ec to f9d4212 Compare August 8, 2023 17:29
Copy link
Copy Markdown
Collaborator

@morehouse morehouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK f9d4212

@Roasbeef Roasbeef merged commit 8f693fe into lightningnetwork:master Aug 9, 2023
@yyforyongyu yyforyongyu deleted the fix-channel-ready-race branch August 9, 2023 04:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

brontide funding Related to the opening of new channels with funding transactions on the blockchain

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug]: race with funding_locked and update_fee/update_add_htlc

6 participants