Skip to content

routing: make routing retry behaviour consistent#1706

Merged
Roasbeef merged 6 commits into
lightningnetwork:masterfrom
joostjager:errorprocessing
Dec 4, 2018
Merged

routing: make routing retry behaviour consistent#1706
Roasbeef merged 6 commits into
lightningnetwork:masterfrom
joostjager:errorprocessing

Conversation

@joostjager
Copy link
Copy Markdown
Contributor

@joostjager joostjager commented Aug 9, 2018

This PR builds on top of #1888 - review that one first.

Fixes the following issues:

  • If the channel update of FailFeeInsufficient contains an invalid channel
    id, it is not possible to properly add to the failed channels set.
  • FailAmountBelowMinimum may apply a channel update, but does not retry.
  • FailIncorrectCltvExpiry immediately prunes the vertex without
    trying one more time.

In this PR, the logic for all three policy related errors is
aligned.

@joostjager joostjager force-pushed the errorprocessing branch 2 times, most recently from acaa557 to a384b01 Compare August 9, 2018 12:33
@joostjager joostjager changed the title routing: fix nil ptr exception and routing retry behaviour routing: fix routing retry behaviour Aug 9, 2018
@joostjager joostjager force-pushed the errorprocessing branch 2 times, most recently from 62545db to 104a197 Compare August 9, 2018 13:44
@joostjager joostjager changed the title routing: fix routing retry behaviour routing: make routing retry behaviour consistent Aug 9, 2018
Comment thread routing/router.go Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the change be made to nextHopChannel instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nextHopChannel does a map lookup. It is probably better to populate this and also the other maps nodeIndex, chanIndex and prevHopMap when the route is unmarshalled in SendToRoute. Will try it out and see what it looks like.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it looks better now.

Comment thread routing/router.go Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can make source a parameter to the method instead. Will give us more flexibility for using it for arbitrary routes, and makes it clear that it is not using any other state of the ChannelRouter.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, is the Route struct fully specifying the path now, or must source be specified to give it meaning? If the latter is the case, I think it might be beneficial to add source to the Route struct.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I've been asking this myself too. But adding it to Route will have wide spread consequences. It think it should then also be added to the RPC interface and also internally always be populated. The code should be prepared to received a route for which the local node is not the source. Checks and errors need to be added. Do you think it is worth it? Now routes implicitly originate from the local node and I don't see much problem with this.

Comment thread routing/router.go Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error should be printed here.

Comment thread routing/router.go Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feels like this is an unnecessary closure? Should instead let caller of r.applyChannelUpdate print the error as it sees fit.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal was to remove code duplication. Four identical code blocks that called r.applyChannelUpdate and logged the error existed.

Comment thread routing/router.go Outdated
@joostjager joostjager force-pushed the errorprocessing branch 6 times, most recently from b106b18 to c1ccce3 Compare August 13, 2018 19:43
@Roasbeef Roasbeef added bug Unintended code behaviour enhancement Improvements to existing features / behaviour routing P2 should be fixed if one has time bug fix and removed bug Unintended code behaviour labels Aug 14, 2018
@Roasbeef Roasbeef added this to the 0.5.1 milestone Aug 15, 2018
Comment thread routing/pathfind.go Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this addition, very clean to keep only the needed information around to create a route!

My main concern about this is that the number of different ways of representing a channel in the codebase is getting a bit out of hand... I think we Have Hop, ChannelHop, Edge, ChannelEdgePolicy and probably a few more. With the addition of ShortHop do you think there is a possibility to remove one of the existing ones, or reuse an existing one for this purpose?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. What I think is a problem at the moment, are those database-connected structs. This becomes especially visible in SendToRoute. Possibly a route is passed in that does not match what is in the database, but still there is no reason not to be able to properly perform the payment. So what we see happening for example is that we populate a Route structure and leave some fields empty that are not read anyway.

In my opinion, the dependency on a db link or not should be made more explicit by using either a datastructure that always has the field populated or a datastructure that doesn't have the field at all.

I can imagine that when we start looking at the code in this way and slowly working out these db links where not needed, the number of data structures will reduce too.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I followed my own suggestion and removed the db connection from Route struct. Now instead of adding a new struct (ShortHop), I could remove one (ChannelHop).

Two things are lost though:

  • Sanity check for bandwidth in newRoute. This used to have a purpose, because findPath didn't accumulate fees, so it could be that a route from findPath would be rejected in newRoute. Since we now have backwards search, this is no longer necessary.

  • Reporting back channel capacity in QueryRoutes. In my opinion, this wasn't very relevant data for a route anyway (we are also not reporting policy there for example), but need to look at possible compatibility problems.

Have a look at it. The first series of commits hasn't been squashed yet, so view them all together (everything up till "routing: make routing retry behaviour consistent")

Comment thread routing/pathfind.go Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible to represent the route returned from path finding as ShortHops also, such that we can reuse this method?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, the idea of a path finding function is to return just the path and nothing else. Utilizing the path for a payment and calculating the required fees and time locks is a different responsibility. Of course in an intelligent path finding function, fees and locks are taken into account, but not necessarily. It could also be a "fixed path finder" (always the same path) and then it doesn't need to look at anything at all and just return a static route.

That responsibility of filling in the fees etc lies with newRoute currently. If findPath would return ShortHops, the AmountToForward fields would need to be filled already. This implies the fees are set already too and breaks the separation of concerns. Looking at it strictly, the only thing that findPath should return is a list of channel ids. That's enough for newRoute to build the complete route, although extra db lookups would be needed to re-fetch the edge policies.

There is some reuse possible by having NewRouteFromShortHops being used in newRoute. This removes the code duplication for setting up the prev/next maps and simplifies the function. I've committed this.

Comment thread routing/router.go Outdated
Comment thread routing/router.go Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this boolan was never set true?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed, it was always false

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the past when we got an unknown next peer we pruned the node after the node that sent the error: 93b04b3#diff-26935be4146394ebeabd03f27e80f0bfL1750

Later on, we changed to simply prune that edge, and not the peer as a whole: bd9f1b5

We do this as otherwise, a faulty link to a peer would cause us to blacklist that peer, rather than just that faulty link.

@joostjager joostjager force-pushed the errorprocessing branch 3 times, most recently from 8bf14c6 to 8b0cc58 Compare September 7, 2018 14:16
@joostjager
Copy link
Copy Markdown
Contributor Author

Needed to update router_test, because node alias is not guaranteed to be available in a Route struct. Another reason why it would be better to be strict about what info is really required.

@joostjager joostjager force-pushed the errorprocessing branch 4 times, most recently from da0c3f7 to 51b7ea0 Compare October 24, 2018 19:20
Copy link
Copy Markdown
Member

@Roasbeef Roasbeef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really dig the code simplification here! It also removes a number of hacks which attempted to address the symptoms of a greater bug elsewhere in the codebase, rather than attack the root cause directly . I've left a few comments, some providing a bit of historical context of the prior iterations of the mission control portions of the code. Aside from that it's mostly styling nits.

I'll start to run this on a testnet node to get a feel for the changes, and if no issues, will get this merged!

Comment thread routing/router.go Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the past when we got an unknown next peer we pruned the node after the node that sent the error: 93b04b3#diff-26935be4146394ebeabd03f27e80f0bfL1750

Later on, we changed to simply prune that edge, and not the peer as a whole: bd9f1b5

We do this as otherwise, a faulty link to a peer would cause us to blacklist that peer, rather than just that faulty link.

Comment thread routing/router_test.go Outdated
Comment thread routing/router.go Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Historically, this was a mistake to introduce, as it patched a symptom of the actual root cause, which were incorrect fee calculations during path finding, and also a bug in the link where we applied the incoming channel policy rather than the outgoing channel policy.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But do you mean it isn't necessary anymore to have this 'second chance' logic? Even though the fee calc. is fixed now, it still prevents a node from keeping us busy with sending different errors all the time.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say keep behavior for now (maybe add a TODO for eventual removal), as seems out of scope for this PR, and I think it will be something we should consider when moving to smarter mission control.

Comment thread routing/missioncontrol.go Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Historically, it was a band-aide (erroneously) added due to issues with incorrect fees in routing, and also mishandling of fees in the switch.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So instead of pruning the vertex, we can just prune the edge now? The node still cannot keep us in an endless loop, as at some point all his channels will be exhausted with (second) errors.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, there was an issue at the time with c-lightning either sending invalid channel update (we properly validate them now) or a pointer corruption bug causing it to send back an incorrect short channel ID, so we'd keep trying to prune something that wasn't actually in attempted route. On the lnd side, IIRC, we would send back the wrong channel update, so: we'd try to route, lnd would fail but with a chan update not in the route, we then try again.

Either way the issues should be resolved now on mainnet at least, so after we merge this in and get some more real world testing, then we can remove this logic all together perhaps. So just prune the edge instead.

Comment thread routing/router.go Outdated
Comment thread routing/router.go Outdated
Comment thread routing/router.go Outdated
Comment thread routing/router.go Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking back, I think that's the case. Will investigate the git history a bit.

Copy link
Copy Markdown
Contributor Author

@joostjager joostjager Nov 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the final node should never report errors that have to do with the next channel/hop, as it doesn't exist. The logic here to prune the channel to the final node could make sense in that light, as the final hop is sending an error that is not logical. Or even prune the final node completely?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not that they prune error with the next channel/hop, but that they send an error back for the final hop itself. In this case, we need to check the prior channel. Not a blocker, but I think this can be slightly re-written to check if the error source is the final destination, if so then we can return the final hop immediately.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean is this: A final hop can't return any failure. For example FailUnknownNextPeer cannot come from the final hop. In our current handling of failures, we don't need that failed channel id for any of the 'final hop' failures. If a final hop does return such an error (like FailUnknownNextPeer), isn't that reason for pruning the node?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course it can, for example: FailUnknownPaymenthash

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, bad English with the word any. Not all failures are expected from a final hop. FailUnknownPaymenthash sure, but not FailUnknownNextPeer

@joostjager
Copy link
Copy Markdown
Contributor Author

@Roasbeef style comments processed. I replied to your comments on the logic.

Comment thread routing/router.go Outdated
Comment thread routing/router.go Outdated
Comment thread routing/router.go Outdated
Comment thread routing/router.go Outdated
Comment thread routing/router_test.go Outdated
Comment thread routing/router.go Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say keep behavior for now (maybe add a TODO for eventual removal), as seems out of scope for this PR, and I think it will be something we should consider when moving to smarter mission control.

Comment thread routing/router.go Outdated
Comment thread routing/missioncontrol.go Outdated
Comment thread routing/router.go Outdated
Comment thread routing/router.go Outdated
@joostjager
Copy link
Copy Markdown
Contributor Author

I would say keep behavior for now (maybe add a TODO for eventual removal), as seems out of scope for this PR, and I think it will be something we should consider when moving to smarter mission control.

I think we want to keep this behaviour, because it prevents nodes from keeping us busy indefinitely processing errors on their channels.

@joostjager
Copy link
Copy Markdown
Contributor Author

@halseth comments processed

Copy link
Copy Markdown
Contributor

@halseth halseth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the remaining TODOs, this looks good to me now 👍

Comment thread routing/missioncontrol.go Outdated
Roasbeef
Roasbeef previously approved these changes Nov 27, 2018
Copy link
Copy Markdown
Member

@Roasbeef Roasbeef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚀

Comment thread routing/router.go Outdated
Comment thread routing/missioncontrol.go Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, there was an issue at the time with c-lightning either sending invalid channel update (we properly validate them now) or a pointer corruption bug causing it to send back an incorrect short channel ID, so we'd keep trying to prune something that wasn't actually in attempted route. On the lnd side, IIRC, we would send back the wrong channel update, so: we'd try to route, lnd would fail but with a chan update not in the route, we then try again.

Either way the issues should be resolved now on mainnet at least, so after we merge this in and get some more real world testing, then we can remove this logic all together perhaps. So just prune the edge instead.

Comment thread routing/router.go Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not that they prune error with the next channel/hop, but that they send an error back for the final hop itself. In this case, we need to check the prior channel. Not a blocker, but I think this can be slightly re-written to check if the error source is the final destination, if so then we can return the final hop immediately.

@Roasbeef
Copy link
Copy Markdown
Member

Once the next release is cut, we can land this so our network monitoring systems can also start to use this new consistent retry logic.

To remove code duplicated at all call sites to check err and log.
Previously not all route fields were properly populated. Example: prev
and next hop maps.
This is a small preparatory step towards moving mission control logic
out of router and reusing the acquired routing result data.
@joostjager
Copy link
Copy Markdown
Contributor Author

Pushed small fixup that makes separation of concerns slightly better

@halseth
Copy link
Copy Markdown
Contributor

halseth commented Dec 3, 2018

LGTM after squash! 👍

Fixes the following issues:
- If the channel update of FailFeeInsufficient contains an invalid channel
  update, it is not possible to properly add to the failed channels set.
- FailAmountBelowMinimum may apply a channel update, but does not retry.
- FailIncorrectCltvExpiry immediately prunes the vertex without
  trying one more time.

In this commit, the logic for all three policy related errors is
aligned.
@joostjager
Copy link
Copy Markdown
Contributor Author

squashed

Copy link
Copy Markdown
Member

@Roasbeef Roasbeef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🕹

@Roasbeef Roasbeef merged commit 5075394 into lightningnetwork:master Dec 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug fix enhancement Improvements to existing features / behaviour P2 should be fixed if one has time routing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants