routing: make routing retry behaviour consistent#1706
Conversation
acaa557 to
a384b01
Compare
62545db to
104a197
Compare
104a197 to
7a9a9fd
Compare
There was a problem hiding this comment.
Can the change be made to nextHopChannel instead?
There was a problem hiding this comment.
nextHopChannel does a map lookup. It is probably better to populate this and also the other maps nodeIndex, chanIndex and prevHopMap when the route is unmarshalled in SendToRoute. Will try it out and see what it looks like.
There was a problem hiding this comment.
Yes, I think it looks better now.
There was a problem hiding this comment.
can make source a parameter to the method instead. Will give us more flexibility for using it for arbitrary routes, and makes it clear that it is not using any other state of the ChannelRouter.
There was a problem hiding this comment.
Actually, is the Route struct fully specifying the path now, or must source be specified to give it meaning? If the latter is the case, I think it might be beneficial to add source to the Route struct.
There was a problem hiding this comment.
Yes, I've been asking this myself too. But adding it to Route will have wide spread consequences. It think it should then also be added to the RPC interface and also internally always be populated. The code should be prepared to received a route for which the local node is not the source. Checks and errors need to be added. Do you think it is worth it? Now routes implicitly originate from the local node and I don't see much problem with this.
There was a problem hiding this comment.
error should be printed here.
There was a problem hiding this comment.
feels like this is an unnecessary closure? Should instead let caller of r.applyChannelUpdate print the error as it sees fit.
There was a problem hiding this comment.
The goal was to remove code duplication. Four identical code blocks that called r.applyChannelUpdate and logged the error existed.
7a9a9fd to
dcb930b
Compare
b106b18 to
c1ccce3
Compare
There was a problem hiding this comment.
I like this addition, very clean to keep only the needed information around to create a route!
My main concern about this is that the number of different ways of representing a channel in the codebase is getting a bit out of hand... I think we Have Hop, ChannelHop, Edge, ChannelEdgePolicy and probably a few more. With the addition of ShortHop do you think there is a possibility to remove one of the existing ones, or reuse an existing one for this purpose?
There was a problem hiding this comment.
Yes, I agree. What I think is a problem at the moment, are those database-connected structs. This becomes especially visible in SendToRoute. Possibly a route is passed in that does not match what is in the database, but still there is no reason not to be able to properly perform the payment. So what we see happening for example is that we populate a Route structure and leave some fields empty that are not read anyway.
In my opinion, the dependency on a db link or not should be made more explicit by using either a datastructure that always has the field populated or a datastructure that doesn't have the field at all.
I can imagine that when we start looking at the code in this way and slowly working out these db links where not needed, the number of data structures will reduce too.
There was a problem hiding this comment.
I followed my own suggestion and removed the db connection from Route struct. Now instead of adding a new struct (ShortHop), I could remove one (ChannelHop).
Two things are lost though:
-
Sanity check for bandwidth in
newRoute. This used to have a purpose, becausefindPathdidn't accumulate fees, so it could be that a route fromfindPathwould be rejected innewRoute. Since we now have backwards search, this is no longer necessary. -
Reporting back channel capacity in
QueryRoutes. In my opinion, this wasn't very relevant data for a route anyway (we are also not reporting policy there for example), but need to look at possible compatibility problems.
Have a look at it. The first series of commits hasn't been squashed yet, so view them all together (everything up till "routing: make routing retry behaviour consistent")
There was a problem hiding this comment.
Possible to represent the route returned from path finding as ShortHops also, such that we can reuse this method?
There was a problem hiding this comment.
In my opinion, the idea of a path finding function is to return just the path and nothing else. Utilizing the path for a payment and calculating the required fees and time locks is a different responsibility. Of course in an intelligent path finding function, fees and locks are taken into account, but not necessarily. It could also be a "fixed path finder" (always the same path) and then it doesn't need to look at anything at all and just return a static route.
That responsibility of filling in the fees etc lies with newRoute currently. If findPath would return ShortHops, the AmountToForward fields would need to be filled already. This implies the fees are set already too and breaks the separation of concerns. Looking at it strictly, the only thing that findPath should return is a list of channel ids. That's enough for newRoute to build the complete route, although extra db lookups would be needed to re-fetch the edge policies.
There is some reuse possible by having NewRouteFromShortHops being used in newRoute. This removes the code duplication for setting up the prev/next maps and simplifies the function. I've committed this.
There was a problem hiding this comment.
this boolan was never set true?
There was a problem hiding this comment.
indeed, it was always false
There was a problem hiding this comment.
In the past when we got an unknown next peer we pruned the node after the node that sent the error: 93b04b3#diff-26935be4146394ebeabd03f27e80f0bfL1750
Later on, we changed to simply prune that edge, and not the peer as a whole: bd9f1b5
We do this as otherwise, a faulty link to a peer would cause us to blacklist that peer, rather than just that faulty link.
8bf14c6 to
8b0cc58
Compare
|
Needed to update |
13bb6c9 to
e1fb4af
Compare
da0c3f7 to
51b7ea0
Compare
Roasbeef
left a comment
There was a problem hiding this comment.
I really dig the code simplification here! It also removes a number of hacks which attempted to address the symptoms of a greater bug elsewhere in the codebase, rather than attack the root cause directly . I've left a few comments, some providing a bit of historical context of the prior iterations of the mission control portions of the code. Aside from that it's mostly styling nits.
I'll start to run this on a testnet node to get a feel for the changes, and if no issues, will get this merged!
There was a problem hiding this comment.
In the past when we got an unknown next peer we pruned the node after the node that sent the error: 93b04b3#diff-26935be4146394ebeabd03f27e80f0bfL1750
Later on, we changed to simply prune that edge, and not the peer as a whole: bd9f1b5
We do this as otherwise, a faulty link to a peer would cause us to blacklist that peer, rather than just that faulty link.
There was a problem hiding this comment.
Nice! Historically, this was a mistake to introduce, as it patched a symptom of the actual root cause, which were incorrect fee calculations during path finding, and also a bug in the link where we applied the incoming channel policy rather than the outgoing channel policy.
There was a problem hiding this comment.
But do you mean it isn't necessary anymore to have this 'second chance' logic? Even though the fee calc. is fixed now, it still prevents a node from keeping us busy with sending different errors all the time.
There was a problem hiding this comment.
I would say keep behavior for now (maybe add a TODO for eventual removal), as seems out of scope for this PR, and I think it will be something we should consider when moving to smarter mission control.
There was a problem hiding this comment.
Historically, it was a band-aide (erroneously) added due to issues with incorrect fees in routing, and also mishandling of fees in the switch.
There was a problem hiding this comment.
So instead of pruning the vertex, we can just prune the edge now? The node still cannot keep us in an endless loop, as at some point all his channels will be exhausted with (second) errors.
There was a problem hiding this comment.
IIRC, there was an issue at the time with c-lightning either sending invalid channel update (we properly validate them now) or a pointer corruption bug causing it to send back an incorrect short channel ID, so we'd keep trying to prune something that wasn't actually in attempted route. On the lnd side, IIRC, we would send back the wrong channel update, so: we'd try to route, lnd would fail but with a chan update not in the route, we then try again.
Either way the issues should be resolved now on mainnet at least, so after we merge this in and get some more real world testing, then we can remove this logic all together perhaps. So just prune the edge instead.
There was a problem hiding this comment.
Thinking back, I think that's the case. Will investigate the git history a bit.
There was a problem hiding this comment.
I think the final node should never report errors that have to do with the next channel/hop, as it doesn't exist. The logic here to prune the channel to the final node could make sense in that light, as the final hop is sending an error that is not logical. Or even prune the final node completely?
There was a problem hiding this comment.
It's not that they prune error with the next channel/hop, but that they send an error back for the final hop itself. In this case, we need to check the prior channel. Not a blocker, but I think this can be slightly re-written to check if the error source is the final destination, if so then we can return the final hop immediately.
There was a problem hiding this comment.
What I mean is this: A final hop can't return any failure. For example FailUnknownNextPeer cannot come from the final hop. In our current handling of failures, we don't need that failed channel id for any of the 'final hop' failures. If a final hop does return such an error (like FailUnknownNextPeer), isn't that reason for pruning the node?
There was a problem hiding this comment.
Of course it can, for example: FailUnknownPaymenthash
There was a problem hiding this comment.
Sorry, bad English with the word any. Not all failures are expected from a final hop. FailUnknownPaymenthash sure, but not FailUnknownNextPeer
51b7ea0 to
1efb0b8
Compare
|
@Roasbeef style comments processed. I replied to your comments on the logic. |
There was a problem hiding this comment.
I would say keep behavior for now (maybe add a TODO for eventual removal), as seems out of scope for this PR, and I think it will be something we should consider when moving to smarter mission control.
I think we want to keep this behaviour, because it prevents nodes from keeping us busy indefinitely processing errors on their channels. |
1efb0b8 to
89ef2be
Compare
|
@halseth comments processed |
halseth
left a comment
There was a problem hiding this comment.
Apart from the remaining TODOs, this looks good to me now 👍
There was a problem hiding this comment.
IIRC, there was an issue at the time with c-lightning either sending invalid channel update (we properly validate them now) or a pointer corruption bug causing it to send back an incorrect short channel ID, so we'd keep trying to prune something that wasn't actually in attempted route. On the lnd side, IIRC, we would send back the wrong channel update, so: we'd try to route, lnd would fail but with a chan update not in the route, we then try again.
Either way the issues should be resolved now on mainnet at least, so after we merge this in and get some more real world testing, then we can remove this logic all together perhaps. So just prune the edge instead.
There was a problem hiding this comment.
It's not that they prune error with the next channel/hop, but that they send an error back for the final hop itself. In this case, we need to check the prior channel. Not a blocker, but I think this can be slightly re-written to check if the error source is the final destination, if so then we can return the final hop immediately.
|
Once the next release is cut, we can land this so our network monitoring systems can also start to use this new consistent retry logic. |
89ef2be to
98806ed
Compare
To remove code duplicated at all call sites to check err and log.
Previously not all route fields were properly populated. Example: prev and next hop maps.
This is a small preparatory step towards moving mission control logic out of router and reusing the acquired routing result data.
98806ed to
2b0858a
Compare
|
Pushed small fixup that makes separation of concerns slightly better |
|
LGTM after squash! 👍 |
Fixes the following issues: - If the channel update of FailFeeInsufficient contains an invalid channel update, it is not possible to properly add to the failed channels set. - FailAmountBelowMinimum may apply a channel update, but does not retry. - FailIncorrectCltvExpiry immediately prunes the vertex without trying one more time. In this commit, the logic for all three policy related errors is aligned.
8581159 to
b6ce03e
Compare
|
squashed |
This PR builds on top of #1888 - review that one first.
Fixes the following issues:
id, it is not possible to properly add to the failed channels set.
trying one more time.
In this PR, the logic for all three policy related errors is
aligned.