netann: channel status manager #2411
Force-pushed from 91c8911 to 6b2de75
Force-pushed from 87828fd to 29f32c8
Force-pushed from 56edc99 to 0d5e1e4
To me this is the most crucial part, as we don't have much testing on the server or insight into the existing behavior apart from the integration tests. Apart from that, the primary difference is that the state machine only performs the passive enable -> disable transition, while the current one will perform the transition in both directions. My hunch is that this is the primary thing leading to the instability we see on the network, and would probably have strange interactions with #2080 if left unchanged. Another difference is that all state changes are serialized via the

With those things in mind, my goals are to reduce the state space, consolidate the logic into one unit, and provide a way of thoroughly testing it. It's also a good chance to work on our server bloat :)
Force-pushed from c5b1c6d to 5b5c09b
Force-pushed from 5b5c09b to 10a0050
Force-pushed from 245b180 to 405081e
Testing framework has been significantly reworked, and additional state machine tests for the
Force-pushed from 89b5452 to b76ba69
Force-pushed from b76ba69 to 2df928e
@cfromknecht looks like the latest version introduced a test failure:
@wpaulino indeed, still looking into it
Force-pushed from 2df928e to a3c2820
Roasbeef left a comment
Running latest version on the faucet now, will observe le logs to see if anything peculiar happens over the next day or two.
halseth left a comment
Mostly just nits on code and commit structure at this point, looks really good 👮
It just bloats the exposed config, and seems like all three could be derived from just setting the longest one, even during integration tests.
Force-pushed from 960c216 to d213d6d
@wpaulino @halseth @Roasbeef comments addressed, commits restructured, and additional unit tests added. See this force-push for the majority of the changes, ptal
wpaulino left a comment
Clean code with excellent test coverage. LGTM 💥
Exposes the three parameters that dictate the behavior of the channel status manager:
* --chan-enable-timeout
* --chan-disable-timeout
* --chan-status-sample-interval
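As a rough illustration of how these three options could be read, here is a minimal sketch using Go's standard `flag` package. It is not lnd's actual config code: the `statusConfig` type, the `parseStatusFlags` helper, and the default values are all hypothetical and chosen only for the example.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// statusConfig is a hypothetical container for the three options
// named in the commit message.
type statusConfig struct {
	EnableTimeout  time.Duration
	DisableTimeout time.Duration
	SampleInterval time.Duration
}

// parseStatusFlags parses the three flags from args. The defaults
// below are illustrative placeholders, not values taken from this PR.
func parseStatusFlags(args []string) (statusConfig, error) {
	var cfg statusConfig
	fs := flag.NewFlagSet("sketch", flag.ContinueOnError)
	fs.DurationVar(&cfg.EnableTimeout, "chan-enable-timeout",
		19*time.Minute, "wait before re-enabling a channel")
	fs.DurationVar(&cfg.DisableTimeout, "chan-disable-timeout",
		20*time.Minute, "wait before disabling an inactive channel")
	fs.DurationVar(&cfg.SampleInterval, "chan-status-sample-interval",
		time.Minute, "how often channel liveness is sampled")
	err := fs.Parse(args)
	return cfg, err
}

func main() {
	// Same flag form as the integration test quoted in this thread.
	cfg, err := parseStatusFlags([]string{"--chan-disable-timeout=2s"})
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```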
Found that this can sometimes cause a panic with a negative waitgroup counter.
This commit hooks up the new netann.ChanStatusManager, replacing the prior method which used the watchChannelStatus goroutine.
Force-pushed from d213d6d to de28217
Roasbeef left a comment
LGTM 🦄
I think we'll be saving the entire network a good bit of bandwidth once this lands in 0.6 ;)
    // in the map. If for some reason the channel isn't closed, the state
    // will be repopulated on subsequent calls to RequestEnable or
    // RequestDisable via a db lookup, or on startup.
    delete(m.chanStates, outpoint)
👍
I think I ran into this issue in a prior diff of this PR. It appears to be fixed now.
    for _, c := range allChannels {
        // We'll skip any private channels, as they aren't used for
        // routing within the network by other nodes.
        if c.ChannelFlags&lnwire.FFAnnounceChannel == 0 {
Not a blocker, but this could have been hidden behind this new interface. With that path, then the ChanStatusManager doesn't need to know how to interpret the bits of the channel flag, or what an advertised channel even is.
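One way to picture the suggestion above is a data source that returns only advertised channels, so the caller never inspects flag bits. The sketch below is hypothetical: `channel`, `ffAnnounceChannel`, `advertisedChannelSource`, and `memDB` are stand-ins invented for illustration and do not correspond to the interface actually added in this PR.

```go
package main

import "fmt"

// ffAnnounceChannel is a stand-in for lnwire.FFAnnounceChannel; the
// value here is assumed for the example.
const ffAnnounceChannel uint8 = 1

// channel is a simplified, hypothetical view of an open channel.
type channel struct {
	outpoint string
	flags    uint8
}

// advertisedChannelSource is a hypothetical interface that hides the
// flag interpretation from the status manager.
type advertisedChannelSource interface {
	FetchAdvertisedChannels() ([]channel, error)
}

// memDB is an in-memory implementation used only for this sketch.
type memDB struct {
	chans []channel
}

// FetchAdvertisedChannels filters out private channels, so callers
// only ever see channels that are announced to the network.
func (d memDB) FetchAdvertisedChannels() ([]channel, error) {
	var advertised []channel
	for _, c := range d.chans {
		if c.flags&ffAnnounceChannel == 0 {
			continue // private channel, skip
		}
		advertised = append(advertised, c)
	}
	return advertised, nil
}

func main() {
	var src advertisedChannelSource = memDB{chans: []channel{
		{outpoint: "a:0", flags: ffAnnounceChannel},
		{outpoint: "b:1", flags: 0},
	}}
	chans, err := src.FetchAdvertisedChannels()
	if err != nil {
		fmt.Println("fetch error:", err)
		return
	}
	for _, c := range chans {
		fmt.Println("advertised:", c.outpoint)
	}
}
```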
    // channel updates for channels going inactive.
    eve, err := net.NewNode("Eve", []string{"--chan-disable-timeout=2s"})
    eve, err := net.NewNode("Eve", []string{
        "--minbackoff=10s",
I have a feeling that we'll need to adjust these values due to Travis flakiness in the future...
    // wait for close tx conf.
    p.finalizeChanClosure(chanCloser)

    // The channel reannounce delay has elapsed, broadcast the
Nice, I think this'll probably make the biggest difference in our current handling, so we don't immediately enable and spam the network due to a flappy peer.
In this PR, we introduce a new subsystem called the netann.ChanStatusManager, responsible for managing the announcement of channel updates which toggle a channel's disabled bit. Most of the logic has been moved into a new netann package, which offers greater unit testability of the subsystem's behavior.

The netann.ChanStatusManager exposes two methods, RequestEnable and RequestDisable. Together, these allow other subsystems to request a toggle in a particular direction. The netann.ChanStatusManager then handles the task of dropping duplicate requests, and also ensuring that its network behavior follows a well-defined state machine.

State Machine Description

At any point, the netann.ChanStatusManager categorizes known channels into three distinct ChanStatus states: ChanStatusActive, ChanStatusPendingInactive, and ChanStatusInactive.

Channels will always start in either ChanStatusActive or ChanStatusInactive on startup or after we detect a new outpoint (perhaps for a newly created channel) by examining the last channel update we have on disk.

One key distinction from the existing design is that the netann.ChanStatusManager only uses long polling to detect inactive channels, meaning that channels can ONLY be reenabled via an explicit call to RequestEnable for an outpoint. This allows us to use a more accurate metric for enabling channels: connection duration.

Once a channel has been detected as inactive within the switch, presumably because of a disconnection, this kicks off a timed progression from ChanStatusActive -> ChanStatusPendingInactive -> ChanStatusInactive.

If RequestEnable is not received before the progression terminates, a new announcement setting the disable bit will be broadcast to the network.

Otherwise, any call to RequestEnable before the progression finishes will cancel the disable from being sent, and leave the channel in its still-enabled state on the network. This allows the netann.ChanStatusManager to tolerate short-lived reconnections, without causing the node to spam the network with channel updates.

Using the exposed configurations, users can tune this to a specific threshold, e.g. require ~95% uptime over a 20 minute interval, in order for the channel to remain enabled.
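The progression described above can be sketched as a small, single-threaded toy model. This is not the netann implementation: the `manager` type, its `markInactive`, `requestEnable`, and `tick` methods, the string outpoints, and the `simulate` driver are all hypothetical, and the 20 minute timeout and 1 minute sampling step are assumed values for the example. Only the three state names come from the description.

```go
package main

import (
	"fmt"
	"time"
)

// ChanStatus mirrors the three states named in the description.
type ChanStatus uint8

const (
	ChanStatusActive ChanStatus = iota
	ChanStatusPendingInactive
	ChanStatusInactive
)

type chanState struct {
	status    ChanStatus
	disableAt time.Time // only meaningful while pending inactive
}

type announcement struct {
	outpoint string
	disabled bool
}

// manager is a hypothetical stand-in for the status manager.
type manager struct {
	disableTimeout time.Duration
	states         map[string]*chanState
	sent           []announcement
}

// markInactive starts the timed progression for an active channel.
func (m *manager) markInactive(op string, now time.Time) {
	if s := m.states[op]; s != nil && s.status == ChanStatusActive {
		s.status = ChanStatusPendingInactive
		s.disableAt = now.Add(m.disableTimeout)
	}
}

// requestEnable cancels a pending disable silently, or re-enables an
// already disabled channel with a new announcement.
func (m *manager) requestEnable(op string) {
	s := m.states[op]
	if s == nil {
		return
	}
	if s.status == ChanStatusInactive {
		m.sent = append(m.sent, announcement{op, false})
	}
	s.status = ChanStatusActive
}

// tick disables every pending channel whose timeout has elapsed.
func (m *manager) tick(now time.Time) {
	for op, s := range m.states {
		if s.status == ChanStatusPendingInactive && !now.Before(s.disableAt) {
			s.status = ChanStatusInactive
			m.sent = append(m.sent, announcement{op, true})
		}
	}
}

// simulate marks one channel inactive at minute 0, samples once per
// minute for 30 minutes, and calls requestEnable at minute
// reconnectAt (a negative value means the peer never returns).
func simulate(reconnectAt int) []announcement {
	start := time.Unix(0, 0)
	m := &manager{
		disableTimeout: 20 * time.Minute,
		states:         map[string]*chanState{"chan-1": {}},
	}
	m.markInactive("chan-1", start)
	for i := 1; i <= 30; i++ {
		if i == reconnectAt {
			m.requestEnable("chan-1")
		}
		m.tick(start.Add(time.Duration(i) * time.Minute))
	}
	return m.sent
}

func main() {
	for _, at := range []int{5, -1, 25} {
		fmt.Printf("reconnect at %d: %d announcement(s)\n", at, len(simulate(at)))
	}
}
```

In this model a reconnection inside the timeout produces no network traffic at all, which is the property the description relies on to tolerate flappy peers. The delay a caller waits before invoking the enable request (the connection-duration check) is deliberately left outside the sketch.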
Supersedes #2080
Depends on: