routing: make routing retry behaviour consistent by joostjager · Pull Request #1706 · lightningnetwork/lnd

joostjager · 2018-08-09T12:07:26Z

This PR builds on top of #1888 - review that one first.

Fixes the following issues:

- If the channel update of FailFeeInsufficient contains an invalid channel
  id, it is not possible to properly add to the failed channels set.
- FailAmountBelowMinimum may apply a channel update, but does not retry.
- FailIncorrectCltvExpiry immediately prunes the vertex without
  trying one more time.

In this PR, the logic for all three policy related errors is
aligned.

halseth · 2018-08-10T08:12:12Z

Can the change be made to nextHopChannel instead?

nextHopChannel does a map lookup. It is probably better to populate this and also the other maps nodeIndex, chanIndex and prevHopMap when the route is unmarshalled in SendToRoute. Will try it out and see what it looks like.

Yes, I think it looks better now.
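
A minimal sketch of populating the lookup maps once, when the route is constructed, so that a later next-hop query is a plain map read. The `hop` and `route` types and `newRouteFromHops` are simplified, hypothetical stand-ins; only the map names nodeIndex, chanIndex and prevHopMap come from the discussion above.

```go
package main

import "fmt"

// hop is a simplified stand-in: the channel taken and the node it leads to.
type hop struct {
	ChannelID uint64
	PubKey    string
}

// route carries its hops plus the lookup maps derived from them.
type route struct {
	Hops       []hop
	nodeIndex  map[string]struct{} // nodes the route passes through
	chanIndex  map[uint64]struct{} // channels the route uses
	nextHopMap map[string]uint64   // node -> channel leaving that node
	prevHopMap map[string]uint64   // node -> channel arriving at that node
}

// newRouteFromHops builds all maps in one pass over the hops.
func newRouteFromHops(source string, hops []hop) *route {
	r := &route{
		Hops:       hops,
		nodeIndex:  make(map[string]struct{}),
		chanIndex:  make(map[uint64]struct{}),
		nextHopMap: make(map[string]uint64),
		prevHopMap: make(map[string]uint64),
	}
	prev := source
	for _, h := range hops {
		r.nodeIndex[h.PubKey] = struct{}{}
		r.chanIndex[h.ChannelID] = struct{}{}
		r.nextHopMap[prev] = h.ChannelID
		r.prevHopMap[h.PubKey] = h.ChannelID
		prev = h.PubKey
	}
	return r
}

// nextHopChannel returns the channel leaving the given node, if any.
func (r *route) nextHopChannel(node string) (uint64, bool) {
	id, ok := r.nextHopMap[node]
	return id, ok
}

func main() {
	r := newRouteFromHops("self", []hop{
		{ChannelID: 1, PubKey: "alice"},
		{ChannelID: 2, PubKey: "bob"},
	})
	id, ok := r.nextHopChannel("alice")
	fmt.Println(id, ok) // prints "2 true"
}
```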

halseth · 2018-08-10T08:14:07Z

can make source a parameter to the method instead. Will give us more flexibility for using it for arbitrary routes, and makes it clear that it is not using any other state of the ChannelRouter.

Actually, is the Route struct fully specifying the path now, or must source be specified to give it meaning? If the latter is the case, I think it might be beneficial to add source to the Route struct.

Yes, I've been asking this myself too. But adding it to Route will have widespread consequences. I think it should then also be added to the RPC interface and also internally always be populated. The code should be prepared to receive a route for which the local node is not the source. Checks and errors need to be added. Do you think it is worth it? Now routes implicitly originate from the local node and I don't see much problem with this.

halseth · 2018-08-10T08:18:05Z

error should be printed here.

halseth · 2018-08-10T08:18:35Z

feels like this is an unnecessary closure? Should instead let caller of r.applyChannelUpdate print the error as it sees fit.

The goal was to remove code duplication. Four identical code blocks that called r.applyChannelUpdate and logged the error existed.
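
One way to get the same deduplication without a closure is to log inside the helper and return only a success flag. The sketch below assumes a much simplified `channelUpdate` type and a hypothetical `validateAndStore` step; it is not the actual lnd implementation.

```go
package main

import (
	"errors"
	"log"
)

// channelUpdate is a simplified stand-in for a gossip channel update.
type channelUpdate struct {
	ShortChanID uint64
	SigValid    bool
}

// validateAndStore is a hypothetical placeholder for signature validation
// and the graph update.
func validateAndStore(u *channelUpdate) error {
	if !u.SigValid {
		return errors.New("invalid signature")
	}
	return nil
}

// applyChannelUpdate logs any error itself, so call sites only branch on
// the returned flag instead of repeating the check-and-log block.
func applyChannelUpdate(u *channelUpdate) bool {
	if err := validateAndStore(u); err != nil {
		log.Printf("unable to apply channel update for %d: %v",
			u.ShortChanID, err)
		return false
	}
	return true
}

func main() {
	ok := applyChannelUpdate(&channelUpdate{ShortChanID: 7, SigValid: false})
	log.Printf("update applied: %v", ok)
}
```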

halseth · 2018-09-05T11:22:36Z

I like this addition, very clean to keep only the needed information around to create a route!

My main concern about this is that the number of different ways of representing a channel in the codebase is getting a bit out of hand... I think we have Hop, ChannelHop, Edge, ChannelEdgePolicy and probably a few more. With the addition of ShortHop, do you think there is a possibility to remove one of the existing ones, or reuse an existing one for this purpose?

Yes, I agree. What I think is a problem at the moment, are those database-connected structs. This becomes especially visible in SendToRoute. Possibly a route is passed in that does not match what is in the database, but still there is no reason not to be able to properly perform the payment. So what we see happening for example is that we populate a Route structure and leave some fields empty that are not read anyway.

In my opinion, the dependency on a db link or not should be made more explicit by using either a data structure that always has the field populated or a data structure that doesn't have the field at all.

I can imagine that when we start looking at the code in this way and slowly working out these db links where not needed, the number of data structures will reduce too.

I followed my own suggestion and removed the db connection from Route struct. Now instead of adding a new struct (ShortHop), I could remove one (ChannelHop).

Two things are lost though:

- Sanity check for bandwidth in newRoute. This used to have a purpose, because findPath didn't accumulate fees, so it could be that a route from findPath would be rejected in newRoute. Since we now have backwards search, this is no longer necessary.
- Reporting back channel capacity in QueryRoutes. In my opinion, this wasn't very relevant data for a route anyway (we are also not reporting policy there for example), but need to look at possible compatibility problems.

Have a look at it. The first series of commits hasn't been squashed yet, so view them all together (everything up till "routing: make routing retry behaviour consistent")

halseth · 2018-09-05T11:23:58Z

Possible to represent the route returned from path finding as ShortHops also, such that we can reuse this method?

In my opinion, the idea of a path finding function is to return just the path and nothing else. Utilizing the path for a payment and calculating the required fees and time locks is a different responsibility. Of course in an intelligent path finding function, fees and locks are taken into account, but not necessarily. It could also be a "fixed path finder" (always the same path) and then it doesn't need to look at anything at all and just return a static route.

That responsibility of filling in the fees etc lies with newRoute currently. If findPath would return ShortHops, the AmountToForward fields would need to be filled already. This implies the fees are set already too and breaks the separation of concerns. Looking at it strictly, the only thing that findPath should return is a list of channel ids. That's enough for newRoute to build the complete route, although extra db lookups would be needed to re-fetch the edge policies.

There is some reuse possible by having NewRouteFromShortHops being used in newRoute. This removes the code duplication for setting up the prev/next maps and simplifies the function. I've committed this.
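
To make the separation concrete, here is a minimal sketch in which the path finder only hands over the ordered channel policies, and a separate step walks the path backwards to fill in the amounts. The `edgePolicy` and `hop` types and `buildHops` are hypothetical; the fee formula (base fee plus a proportional part in parts per million) is the standard Lightning forwarding fee, and time locks are left out for brevity.

```go
package main

import "fmt"

// edgePolicy is what a path finder would hand over: just the channel and
// the forwarding policy of the node sending over it.
type edgePolicy struct {
	ChannelID   uint64
	BaseFeeMsat int64
	FeeRatePPM  int64 // proportional fee in parts per million
}

// hop is a route element with the amount carried over its channel.
type hop struct {
	ChannelID uint64
	AmtMsat   int64
}

// buildHops walks the path from destination to source. The amount carried
// over channel i-1 must cover the amount over channel i plus the fee that
// the forwarding node charges under channel i's policy. The first channel
// belongs to the sender, so no fee is added for it.
func buildHops(path []edgePolicy, amtMsat int64) ([]hop, int64) {
	hops := make([]hop, len(path))
	running := amtMsat
	for i := len(path) - 1; i >= 0; i-- {
		hops[i] = hop{ChannelID: path[i].ChannelID, AmtMsat: running}
		if i > 0 {
			fee := path[i].BaseFeeMsat + running*path[i].FeeRatePPM/1000000
			running += fee
		}
	}
	return hops, running - amtMsat
}

func main() {
	path := []edgePolicy{
		{ChannelID: 1, BaseFeeMsat: 1000, FeeRatePPM: 100},
		{ChannelID: 2, BaseFeeMsat: 1000, FeeRatePPM: 1000},
	}
	hops, fees := buildHops(path, 1000000)
	fmt.Println(hops[0].AmtMsat, hops[1].AmtMsat, fees) // prints "1002000 1000000 2000"
}
```

With this split, a "fixed path finder" can return a static list of channels and still reuse the same route-building step.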

halseth · 2018-09-05T11:34:12Z

this boolean was never set true?

indeed, it was always false

In the past when we got an unknown next peer we pruned the node after the node that sent the error: 93b04b3#diff-26935be4146394ebeabd03f27e80f0bfL1750

Later on, we changed to simply prune that edge, and not the peer as a whole: bd9f1b5

We do this as otherwise, a faulty link to a peer would cause us to blacklist that peer, rather than just that faulty link.
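
The difference between the two pruning granularities can be sketched with a small, hypothetical `pruneView` type (not lnd's own bookkeeping): pruning one edge leaves other channels to the same peer usable, while pruning the vertex blocks all of them.

```go
package main

import "fmt"

// edge is a simplified directed channel between two nodes.
type edge struct {
	ChannelID uint64
	From, To  string
}

// pruneView tracks what has been excluded for the current payment.
type pruneView struct {
	edges    map[uint64]struct{}
	vertices map[string]struct{}
}

func newPruneView() *pruneView {
	return &pruneView{
		edges:    make(map[uint64]struct{}),
		vertices: make(map[string]struct{}),
	}
}

func (p *pruneView) pruneEdge(id uint64)  { p.edges[id] = struct{}{} }
func (p *pruneView) pruneVertex(v string) { p.vertices[v] = struct{}{} }

// usable reports whether path finding may still consider the edge.
func (p *pruneView) usable(e edge) bool {
	if _, ok := p.edges[e.ChannelID]; ok {
		return false
	}
	if _, ok := p.vertices[e.From]; ok {
		return false
	}
	_, ok := p.vertices[e.To]
	return !ok
}

func main() {
	faulty := edge{ChannelID: 1, From: "alice", To: "bob"}
	healthy := edge{ChannelID: 2, From: "carol", To: "bob"}

	byEdge := newPruneView()
	byEdge.pruneEdge(faulty.ChannelID)
	fmt.Println(byEdge.usable(faulty), byEdge.usable(healthy)) // prints "false true"

	byVertex := newPruneView()
	byVertex.pruneVertex("bob")
	fmt.Println(byVertex.usable(faulty), byVertex.usable(healthy)) // prints "false false"
}
```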

joostjager · 2018-09-08T14:13:46Z

Needed to update router_test, because node alias is not guaranteed to be available in a Route struct. Another reason why it would be better to be strict about what info is really required.

Roasbeef

I really dig the code simplification here! It also removes a number of hacks which attempted to address the symptoms of a greater bug elsewhere in the codebase, rather than attack the root cause directly. I've left a few comments, some providing a bit of historical context of the prior iterations of the mission control portions of the code. Aside from that it's mostly styling nits.

I'll start to run this on a testnet node to get a feel for the changes, and if no issues, will get this merged!

Roasbeef · 2018-11-01T23:54:06Z

In the past when we got an unknown next peer we pruned the node after the node that sent the error: 93b04b3#diff-26935be4146394ebeabd03f27e80f0bfL1750

Later on, we changed to simply prune that edge, and not the peer as a whole: bd9f1b5

We do this as otherwise, a faulty link to a peer would cause us to blacklist that peer, rather than just that faulty link.

Roasbeef · 2018-11-01T23:57:20Z

Nice! Historically, this was a mistake to introduce, as it patched a symptom of the actual root cause, which were incorrect fee calculations during path finding, and also a bug in the link where we applied the incoming channel policy rather than the outgoing channel policy.

But do you mean it isn't necessary anymore to have this 'second chance' logic? Even though the fee calc. is fixed now, it still prevents a node from keeping us busy with sending different errors all the time.

I would say keep behavior for now (maybe add a TODO for eventual removal), as seems out of scope for this PR, and I think it will be something we should consider when moving to smarter mission control.

Roasbeef · 2018-11-02T00:04:59Z

Historically, it was a band-aid (erroneously) added due to issues with incorrect fees in routing, and also mishandling of fees in the switch.

So instead of pruning the vertex, we can just prune the edge now? The node still cannot keep us in an endless loop, as at some point all his channels will be exhausted with (second) errors.

IIRC, there was an issue at the time with c-lightning either sending an invalid channel update (we properly validate them now) or a pointer corruption bug causing it to send back an incorrect short channel ID, so we'd keep trying to prune something that wasn't actually in the attempted route. On the lnd side, IIRC, we would send back the wrong channel update, so: we'd try to route, lnd would fail but with a chan update not in the route, we then try again.

Either way the issues should be resolved now on mainnet at least, so after we merge this in and get some more real world testing, then we can remove this logic altogether perhaps. So just prune the edge instead.

Roasbeef · 2018-11-02T00:19:21Z

Thinking back, I think that's the case. Will investigate the git history a bit.

I think the final node should never report errors that have to do with the next channel/hop, as it doesn't exist. The logic here to prune the channel to the final node could make sense in that light, as the final hop is sending an error that is not logical. Or even prune the final node completely?

It's not that they prune error with the next channel/hop, but that they send an error back for the final hop itself. In this case, we need to check the prior channel. Not a blocker, but I think this can be slightly re-written to check if the error source is the final destination, if so then we can return the final hop immediately.

What I mean is this: A final hop can't return any failure. For example FailUnknownNextPeer cannot come from the final hop. In our current handling of failures, we don't need that failed channel id for any of the 'final hop' failures. If a final hop does return such an error (like FailUnknownNextPeer), isn't that reason for pruning the node?

Of course it can, for example: FailUnknownPaymenthash

Sorry, bad English with the word any. Not all failures are expected from a final hop. FailUnknownPaymenthash sure, but not FailUnknownNextPeer
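
The suggested rewrite, checking first whether the error source is the final destination, might look roughly like the sketch below. The `hop` type and `failedChannel` function are hypothetical and only illustrate the control flow; how lnd actually resolves the failed channel may differ.

```go
package main

import "fmt"

// hop is a simplified stand-in: the channel taken and the node it leads to.
type hop struct {
	ChannelID uint64
	PubKey    string
}

// failedChannel picks the channel to blame for a failure reported by
// errSource. For the final destination there is no outgoing channel, so the
// final hop's channel is returned immediately. For any other node the
// channel leaving that node is returned.
func failedChannel(self, errSource string, hops []hop) (uint64, bool) {
	if len(hops) == 0 {
		return 0, false
	}
	last := hops[len(hops)-1]
	if errSource == last.PubKey {
		return last.ChannelID, true
	}
	if errSource == self {
		return hops[0].ChannelID, true
	}
	for i := 0; i < len(hops)-1; i++ {
		if hops[i].PubKey == errSource {
			return hops[i+1].ChannelID, true
		}
	}
	// The error source is not part of the route.
	return 0, false
}

func main() {
	hops := []hop{
		{ChannelID: 1, PubKey: "alice"},
		{ChannelID: 2, PubKey: "bob"},
	}
	fmt.Println(failedChannel("self", "bob", hops))   // prints "2 true"
	fmt.Println(failedChannel("self", "alice", hops)) // prints "2 true"
}
```

Both calls blame channel 2 here, for different reasons: for bob it is the incoming channel of the final hop, for alice it is her outgoing channel.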

joostjager · 2018-11-12T08:41:02Z

@Roasbeef style comments processed. I replied to your comments on the logic.

halseth · 2018-11-13T10:20:41Z

I would say keep behavior for now (maybe add a TODO for eventual removal), as seems out of scope for this PR, and I think it will be something we should consider when moving to smarter mission control.

joostjager · 2018-11-13T12:05:08Z

I would say keep behavior for now (maybe add a TODO for eventual removal), as seems out of scope for this PR, and I think it will be something we should consider when moving to smarter mission control.

I think we want to keep this behaviour, because it prevents nodes from keeping us busy indefinitely processing errors on their channels.

joostjager · 2018-11-13T12:20:18Z

@halseth comments processed

halseth

Apart from the remaining TODOs, this looks good to me now 👍

Roasbeef

LGTM 🚀

Roasbeef · 2018-11-27T03:57:15Z

IIRC, there was an issue at the time with c-lightning either sending an invalid channel update (we properly validate them now) or a pointer corruption bug causing it to send back an incorrect short channel ID, so we'd keep trying to prune something that wasn't actually in the attempted route. On the lnd side, IIRC, we would send back the wrong channel update, so: we'd try to route, lnd would fail but with a chan update not in the route, we then try again.

Either way the issues should be resolved now on mainnet at least, so after we merge this in and get some more real world testing, then we can remove this logic altogether perhaps. So just prune the edge instead.

Roasbeef · 2018-11-27T04:00:01Z

It's not that they prune error with the next channel/hop, but that they send an error back for the final hop itself. In this case, we need to check the prior channel. Not a blocker, but I think this can be slightly re-written to check if the error source is the final destination, if so then we can return the final hop immediately.

Roasbeef · 2018-11-27T04:04:10Z

Once the next release is cut, we can land this so our network monitoring systems can also start to use this new consistent retry logic.

joostjager · 2018-11-29T13:02:59Z

Pushed small fixup that makes separation of concerns slightly better

halseth · 2018-12-03T12:00:33Z

LGTM after squash! 👍

Fixes the following issues:

- If the channel update of FailFeeInsufficient contains an invalid channel update, it is not possible to properly add to the failed channels set.
- FailAmountBelowMinimum may apply a channel update, but does not retry.
- FailIncorrectCltvExpiry immediately prunes the vertex without trying one more time.

In this commit, the logic for all three policy related errors is aligned.

joostjager · 2018-12-03T12:24:30Z

squashed

Roasbeef

LGTM 🕹

joostjager force-pushed the errorprocessing branch 2 times, most recently from acaa557 to a384b01 Compare August 9, 2018 12:33

joostjager changed the title ~~routing: fix nil ptr exception and routing retry behaviour~~ routing: fix routing retry behaviour Aug 9, 2018

joostjager force-pushed the errorprocessing branch 2 times, most recently from 62545db to 104a197 Compare August 9, 2018 13:44

joostjager changed the title ~~routing: fix routing retry behaviour~~ routing: make routing retry behaviour consistent Aug 9, 2018

joostjager force-pushed the errorprocessing branch from 104a197 to 7a9a9fd Compare August 9, 2018 13:52

halseth reviewed Aug 10, 2018

joostjager force-pushed the errorprocessing branch from 7a9a9fd to dcb930b Compare August 13, 2018 07:59

joostjager mentioned this pull request Aug 13, 2018

htlcswitch: implement strict forwarding for locally dispatched payments #1527

Closed

joostjager force-pushed the errorprocessing branch 6 times, most recently from b106b18 to c1ccce3 Compare August 13, 2018 19:43

Roasbeef added the bug (Unintended code behaviour), enhancement (Improvements to existing features / behaviour), routing, P2 (should be fixed if one has time), and bug fix labels and removed the bug (Unintended code behaviour) label Aug 14, 2018

Roasbeef added this to the 0.5.1 milestone Aug 15, 2018

joostjager mentioned this pull request Aug 16, 2018

routing: prune based on channel sets instead of channels #1734

Merged

halseth reviewed Sep 5, 2018

joostjager force-pushed the errorprocessing branch 3 times, most recently from 8bf14c6 to 8b0cc58 Compare September 7, 2018 14:16

joostjager force-pushed the errorprocessing branch from 13bb6c9 to e1fb4af Compare September 8, 2018 18:48

joostjager force-pushed the errorprocessing branch 4 times, most recently from da0c3f7 to 51b7ea0 Compare October 24, 2018 19:20

Roasbeef reviewed Nov 2, 2018

joostjager force-pushed the errorprocessing branch from 51b7ea0 to 1efb0b8 Compare November 12, 2018 08:31

halseth reviewed Nov 13, 2018

joostjager force-pushed the errorprocessing branch from 1efb0b8 to 89ef2be Compare November 13, 2018 12:16

halseth reviewed Nov 20, 2018

Comment thread routing/missioncontrol.go Outdated

Roasbeef previously approved these changes Nov 27, 2018

joostjager dismissed Roasbeef’s stale review via 98806ed November 28, 2018 20:50

joostjager force-pushed the errorprocessing branch from 89ef2be to 98806ed Compare November 28, 2018 20:50

joostjager added 5 commits November 29, 2018 10:31

routing: move logging into applyChannelUpdate

dd7e2e9

To remove code duplicated at all call sites to check err and log.

routing: remove unused pruneVertexFailure parameters

aca136a

routing: remove pruneVertexFailure function

7103796

routing: use complete route in test

ac04729

Previously not all route fields were properly populated. Example: prev and next hop maps.

routing: move failed channels map into payment session

6ba1144

This is a small preparatory step towards moving mission control logic out of router and reusing the acquired routing result data.

joostjager force-pushed the errorprocessing branch from 98806ed to 2b0858a Compare November 29, 2018 12:38

joostjager mentioned this pull request Nov 29, 2018

routing: prune single direction #2243

Merged

joostjager force-pushed the errorprocessing branch from 8581159 to b6ce03e Compare December 3, 2018 12:23

Roasbeef approved these changes Dec 4, 2018

Roasbeef merged commit 5075394 into lightningnetwork:master Dec 4, 2018
