multi: prevent goroutine leak in brontide #10012
yyforyongyu merged 3 commits into lightningnetwork:master
Conversation
Force-pushed 65fb58f to 82a2d85
Force-pushed 82a2d85 to 34d53d3
Force-pushed abced28 to b870fb7
discovery/gossiper.go (Outdated)

        maxPrematureUpdates,
        lru.WithDeleteCallback(
            func(k uint64, cmsg *cachedNetworkMsg) {
                // for every network message which is
Hmm, I see we've changed approaches. Don't we still want to ensure that the goroutine created to wait for the response is cleaned up as soon as possible?
A premature channel update is an update for a channel we don't know about. It's common in the itests due to the lack of instant block propagation (one node gets the block first and sends the announcement before the other has seen the block). It can even be a zombie edge.
In the common case, we hear of the block, can then return an error, and the goroutine exits. However, if it's a zombie and stays that way for weeks, these goroutines will still pile up. And if it's a nearly valid but fake channel update, we'll never actually process it.
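The pile-up pattern being described can be sketched in a few lines. This is an illustration only: `waitForResult` and `spawnWaiters` are invented names, not lnd's actual code, but the shape is the same: one goroutine per premature update, blocked on an error channel that is never written to if the update is never processed.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// waitForResult blocks on a per-message error channel. If the premature
// update is never processed (e.g. a zombie edge), nobody ever writes to
// errChan and this goroutine leaks.
func waitForResult(errChan chan error) {
	<-errChan // blocks forever without a producer
}

// spawnWaiters starts n blocked waiters and reports how many goroutines
// were added, i.e. how many leaked.
func spawnWaiters(n int) int {
	before := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		go waitForResult(make(chan error))
	}
	time.Sleep(50 * time.Millisecond)
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("leaked goroutines:", spawnWaiters(100)) // prints: leaked goroutines: 100
}
```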
Thinking about it more, this does at least restrict the number of these goroutines waiting for a premature update to be processed to maxPrematureUpdates, which is currently 100.
Exactly, I think that's the cleanest solution.
Also added a safeguard to exit the goroutine after a timeout. I think this combination should be a good temporary fix until we have the actor model in place.
why do we need this if we already have remoteGossipMsgTimeout?
The remoteGossipMsgTimeout is just a safeguard against future leaks in case developers forget to write into the error channel. It is an extra safety net against goroutine leaks, because we cannot currently guarantee the error channel is always used. That's why I added the comment that it is only temporary until we can replace this design with the actor model proposed by roasbeef.
removed the goroutine
Running this on my node, and memory usage + goroutines are stable now, so I think that approach is good to go.
Force-pushed 6692785 to de88a18
yyforyongyu left a comment
Note that this bug was introduced in #9875, which added a goroutine to log the error. I think we should just remove the goroutine instead, as it only logs the error and does not process it. Making a dramatic change to the lru tool we use, plus another timeout mechanism, is overkill IMO.
discovery/gossiper.go (Outdated)

        maxPrematureUpdates,
        lru.WithDeleteCallback(
            func(k uint64, cmsg *cachedNetworkMsg) {
                // for every network message which is
why do we need this if we already have remoteGossipMsgTimeout?
peer/brontide.go (Outdated)

            "msg %T: %v", msg,
            err)
    }
Think we might as well remove this goroutine, as it does nothing but log the error?
As described below: yes, we can do that, but it would not solve the problem we have here, only shadow it; down the road, once we actually use the response, we would run into this issue again.
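The two shapes being debated can be contrasted directly. This is a hedged sketch with invented names (`leakyHandle`, `plainHandle`), not lnd's brontide code: the first spawns a goroutine per message that blocks solely to log, the second removes the goroutine and only logs a result that is already available.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// leakyHandle is the pre-fix shape: one goroutine per message, blocked
// on the error channel purely to log the outcome. If the channel is
// never written, the goroutine leaks.
func leakyHandle(errChan chan error) {
	go func() {
		if err := <-errChan; err != nil {
			fmt.Println("msg error:", err)
		}
	}()
}

// plainHandle is the post-fix shape: no goroutine. Log only if a result
// is already available; otherwise move on rather than block.
func plainHandle(errChan chan error) {
	select {
	case err := <-errChan:
		if err != nil {
			fmt.Println("msg error:", err)
		}
	default:
		// Result not ready; don't block and don't leak.
	}
}

// leakedBy reports how many goroutines n calls of handle leave behind.
func leakedBy(handle func(chan error), n int) int {
	before := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		handle(make(chan error))
	}
	time.Sleep(20 * time.Millisecond)
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("leaky:", leakedBy(leakyHandle, 10))
	fmt.Println("plain:", leakedBy(plainHandle, 10))
}
```

The trade-off raised in the thread remains: the plain shape silently drops results that arrive later, which is acceptable only because the response is not currently relied upon.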
Force-pushed de88a18 to 32af9d5
The idea behind this change is to fix the issue that we never write into the error channel even though we assume that should happen; see also the comment here: Lines 3137 to 3139 in 1d2e547. Moreover, it is just logging for now, but should be enhanced in the future; see also the TODO to potentially punish the peer. So the introduction of the goroutine did not really cause the issue by itself, but rather revealed a deeper problem that we are trying to fix with this PR. I leave it to the majority, but I would like to see this change merged into LND.
Force-pushed 32af9d5 to ce879b7
starius left a comment
Posted a proposal and a question to verify this code doesn't introduce a deadlock.
Force-pushed ce879b7 to b4fb92a
Definitely, but I think the root problem is that we are firing goroutines without controlling their lifecycle, which is anti-Go: we should never start a goroutine without knowing how it will stop. In addition, if we are just looking for a temporary solution here, I don't see why we can't simply remove it, since we want a more comprehensive fix via the actor model anyway.
Force-pushed b7c81d7 to fc0dcfb
Ok, I think you have a valid point. Removed the goroutine when processing network responses and also fixed a potential race condition.
discovery/gossiper.go (Outdated)

    // premature ChannelUpdates. These are pointers and we might in the
    // meantime receive new premature ChannelUpdate for this exact channel
    // which will read also the same premature ChannelUpdates currently in
    // the LRU cache.
Hmmm... I think the cache uses a sync map under the hood?
Ok, I think you are right: because we only read the *cachedNetworkMsg value, we are concurrency safe.
Force-pushed 7959ccf to 024ae3a
/gemini review
Force-pushed 024ae3a to 699c097
We cannot rely on a response currently, so we avoid spawning goroutines. This is just a temporary fix to avoid the goroutine leak.
Force-pushed 699c097 to e6aff21
This makes sure that goroutines do not pile up when premature channel updates are received that are never processed: such updates get deleted from the prematureUpdate LRU cache once the maximum limit of 100 premature updates is reached, so old channel update messages are evicted.
Depends on: lightninglabs/neutrino#322