Reconnect to peer when switching networks by roeierez · Pull Request #2058 · lightningnetwork/lnd

roeierez · 2018-10-16T20:41:28Z

This PR solves the following use case:

1. Connect to another node from a node running on a mobile device and open a channel.
2. Switch networks from Wi-Fi to cellular.
3. It takes more than 1 minute for the mobile node to reconnect to the remote node.

My investigation shows that when switching networks the connection closes immediately on the mobile node but stays open on the remote node until it times out.
During this time the mobile node makes further attempts to connect, but the remote node rejects them, preferring its existing local connection, using the following logic: https://github.com/lightningnetwork/lnd/blob/master/server.go#L2047
There seems to be an implicit assumption at this point that the existing local peer connection is outbound, which is not the case I am describing (in my case both the existing and the new connection are inbound), and this results in the subsequent rejections.
This PR solves the issue by allowing a subsequent connection of the same type (inbound or outbound) to replace the old connection, enabling both nodes to reconnect immediately.
It also maintains the logic that deterministically chooses a connection by comparing public keys when both nodes try to connect simultaneously.
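
For readers unfamiliar with the tiebreak, here is a minimal sketch of how a public key comparison can settle a simultaneous dial deterministically. It is not the code from server.go: `dropOwnDial` is a hypothetical name, the keys are shortened stand-ins for serialized public keys, and the choice that the "smaller" key's connection survives is an assumption made for illustration.

```go
package main

import (
	"bytes"
	"fmt"
)

// dropOwnDial is a hypothetical helper for this sketch. It reports whether
// the local node should give up the connection it initiated when both nodes
// dialed each other at the same time. Assumption: the connection initiated by
// the node with the lexicographically smaller serialized key is the one kept.
func dropOwnDial(localPub, remotePub []byte) bool {
	return bytes.Compare(localPub, remotePub) > 0
}

func main() {
	// Shortened stand-ins for serialized public keys.
	alice := []byte{0x02, 0x11, 0x22}
	bob := []byte{0x03, 0x44, 0x55}

	// Each node evaluates the same comparison from its own point of view.
	fmt.Println("alice drops her own dial:", dropOwnDial(alice, bob))
	fmt.Println("bob drops his own dial:", dropOwnDial(bob, alice))
}
```

Because both sides compare the same two keys, exactly one of them gives up its own dial for any pair of distinct keys, so the nodes agree on the surviving connection without extra messages.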

cfromknecht · 2019-01-11T05:21:44Z

Hi @roeierez! Thanks for the detailed description of the issue; indeed, I think this does solve the issue you're having. If I understand the changes, your proposal is to only apply the connection dropping if we already have a connected peer that is of the opposite direction from the fresh connection?

One side-effect of this is that a peer could continue to make inbound requests to me, and cause me to spin up and tear down peers endlessly. This isn't necessarily a problem on its own, though it will lead to higher resource usage than is otherwise possible with the current logic that always favors the connection that wins the tiebreaker.

At worst however, it would seem this is no worse than a peer that continually connects and instantly disconnects. We do have plans to extend lnd's internal DOS engine to be able to monitor and account for such behavior, though I think we can address that when that time comes.

In the meantime, would you mind giving this a rebase to have the tests run on the current master?

roeierez · 2019-01-11T09:57:53Z

Rebased.

> your proposal is to only apply the connection dropping if we already have a connected peer that is of the opposite direction from the fresh connection?

Yes, that's right.

cfromknecht · 2019-01-12T03:17:18Z

Thanks @roeierez, I will run this on my node to test out the behavior.

mandelmonkey · 2019-01-31T07:29:32Z

I tested this out by doing the following:

1. Set up a node using this PR.
2. Connected mobile lnd to the yalls testnet node and the node set up in step 1.
3. Connected to Wi-Fi and opened channels to the two nodes above.
4. Paid invoices successfully on yalls and on the step 1 node.
5. Switched to a 4G network; unable to pay an invoice on either.
6. Waited 1 min; the yalls channel is offline, the step 1 node is online, and I can pay an invoice on the step 1 node but not on yalls.
7. Waited 5 mins; both channels are online.

So it seems the channel to the node running this PR comes back online faster than the channel to a node not running it.

junderw · 2019-01-31T08:22:53Z

This is a must have for mobile neutrino wallets.

halseth

LGTM 💯

cfromknecht

LGTM 🔥 Thanks for the fix @roeierez! Seems the general consensus is that this helps a bunch w/ switching networks, so I'm in favor of giving this a shot.

roeierez · 2019-02-13T21:24:28Z

> Thanks for the fix

@cfromknecht Sure, happy to help. I also totally understand and appreciate the extra caution here. I have been running and testing this PR myself for a long time and haven't experienced any problems. Please LMK if anything arises as a result of this fix; I will do my best to help.

cfromknecht · 2019-02-13T22:56:58Z

Awesome thanks! Yeah there be dragons in this area. Server/peer changes always require the most real world testing, as sometimes the interactions compound in unexpected ways

weaklysubjective · 2019-02-14T22:24:30Z

@cfromknecht speaking of real-world testing, I happen to be doing that for my app. I'm using the 0.5.1-beta commit with gRPC, applied this patch, and changed defaultBackoff to 'Seconds'. I used an iPhone 5S on LTE (Node B) to connect to an iPhone 6S Plus (Node A), also on LTE. The 5S initiated the connection; the 6S Plus is the remote.

1. I had the 6S profiled in Xcode and could see the 6S (the remote) learning of the timeout/disconnect first and setting its peer count accordingly.
2. It takes a minute more (it always seems to be a minute, which might indicate the read timeout) for the initiating node, the 5S, to recognize the severed connection; it then resets its peer count accordingly.
3. There are no channels between the nodes.
4. LND is listening on both nodes at this time; it's just the network-level connectivity that is gone.
5. During this interval Node B can send coins to Node A (mainly because sending coins is not a network-level operation?).
6. Attempts by Node B to make payments (because it still thinks the peer is connected) go as far as starting the funding workflow and then fail, obviously due to the lack of network-level connectivity. Unless the app catches this, the user is clueless, because they would think the channel is being opened. An application can fairly easily catch this by correlating the timestamp when the funding workflow started with the time an unconfirmed tx was inserted (if ever). If no tx was inserted, the app would know that channel opening failed (for whatever reason).
7. It doesn't look like Node B, which originally initiated the connection, is reconnecting (the peer count doesn't go up).
8. The connection does not time out or break if the node is a 6S Plus on LTE connected to a Wi-Fi node on an iPad. The connection is stable as long as the peers are in the foreground. If one peer is backgrounded, the LND peer stays connected for about 5 minutes (which may be a timeout of sorts).
9. Config #8 is not true for the LTE Node B (iPhone 5S), which disconnects even if the peer is on Wi-Fi.
10. I am unable to comment on the effect, or lack thereof, of this PR, because in all my testing the 5S (Node B) consistently drops the connection and never reconnects.
11. It is possible that a lot of these issues show up on mobile (especially iPhones) because Apple is probably using Multipath TCP. We then lose connectivity not so much when connections switch from Wi-Fi to LTE or 4G, but rather when they switch from one radio to another, for efficient energy use or a plethora of other reasons.
12. Why this multipath behavior is an issue on some iPhones and not on others is still a mystery to me as I continue to investigate.

I appreciate all your hard work, and I hope this helps.
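
The app-side check described in the list above (correlating the funding start time with the appearance of an unconfirmed transaction) could look roughly like the following sketch. Everything here is hypothetical: the function name, the use of Unix-second timestamps, and the grace period are assumptions for illustration, and how the app obtains the timestamps is left out.

```go
package main

import "fmt"

// fundingLikelyFailed is a hypothetical app-side check. It reports true when
// no unconfirmed transaction has appeared at or after the moment the funding
// workflow started and the grace period has already elapsed.
// All timestamps are Unix seconds.
func fundingLikelyFailed(startedUnix int64, unconfirmedTxUnix []int64, nowUnix, graceSeconds int64) bool {
	for _, ts := range unconfirmedTxUnix {
		if ts >= startedUnix {
			// A transaction showed up after the workflow began, so the
			// open is at least in progress.
			return false
		}
	}
	return nowUnix-startedUnix >= graceSeconds
}

func main() {
	started := int64(1550182000)

	// No transaction two minutes in, with a one-minute grace period.
	fmt.Println(fundingLikelyFailed(started, nil, started+120, 60))
	// A transaction appeared five seconds after the start.
	fmt.Println(fundingLikelyFailed(started, []int64{started + 5}, started+120, 60))
	// Still inside the grace period, so too early to call it failed.
	fmt.Println(fundingLikelyFailed(started, nil, started+10, 60))
}
```

The grace period is a judgment call for the app; too short a value would flag slow but healthy channel opens as failures.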
Long time member of the community on slack as Paul.

Roasbeef

Commits not compliant with the contributor guidelines: https://github.com/lightningnetwork/lnd/blob/master/docs/code_contribution_guidelines.md#ModelGitCommitMessages (next section too)

We might also want to consider modifying the write timeout; it's at 50s atm.

Roasbeef · 2019-02-14T23:50:51Z

Comments here and below no longer match the rationale behind the logic.

I changed the comments to be more specific and to contain a description of the additional logic.

Roasbeef · 2019-02-14T23:52:03Z

already has -> already have

weaklysubjective · 2019-02-14T23:57:06Z

@Roasbeef - even though I wrote above (point #2) that it takes a minute, I must admit I wasn't sure of this. TBH it actually took 50 secs, so I can second your comment.

Roasbeef · 2019-02-15T00:02:04Z

Yeah, so that controls how long the remote node will wait before it gives up on writing to the user socket. The current value was set arbitrarily, and looking at it afresh it seems rather high. Only if a network has excessive packet loss, or if we're attempting to write a very, very large amount through a constrained network, would we hit that value IMO.
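
To illustrate what a write timeout does in general, here is a small sketch using Go's standard `net.Conn.SetWriteDeadline`. It is not lnd's write path: `stalledWriteTimesOut` is a hypothetical helper, and the in-memory `net.Pipe` with no reader stands in for a peer whose network has silently gone away. The sketch uses a short timeout so it finishes quickly, whereas the thread discusses a value of 50s.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// stalledWriteTimesOut is a hypothetical helper. It writes to an in-memory
// connection that nobody reads from and reports whether the write gave up
// with a timeout error once the deadline passed.
func stalledWriteTimesOut(timeout time.Duration) bool {
	local, remote := net.Pipe()
	defer local.Close()
	defer remote.Close()

	// Bound how long the next Write may block.
	if err := local.SetWriteDeadline(time.Now().Add(timeout)); err != nil {
		return false
	}
	_, err := local.Write([]byte("ping"))

	// Deadline errors from the net package report Timeout() == true.
	var netErr net.Error
	return errors.As(err, &netErr) && netErr.Timeout()
}

func main() {
	start := time.Now()
	timedOut := stalledWriteTimesOut(100 * time.Millisecond)
	fmt.Printf("timed out: %v after roughly %v\n",
		timedOut, time.Since(start).Round(10*time.Millisecond))
}
```

The writer only learns that the peer is gone when the deadline expires, which is why a lower timeout shortens detection of a dead connection but does not by itself cause a reconnect.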


weaklysubjective · 2019-02-15T00:11:09Z

Will try lowering the value; however, that will only help the node pick up the connection loss sooner. It still doesn't reconnect.

This commit modifies the condition for deciding whether to drop an existing connection to a peer when a new connection to that peer is established. The previous algorithm based this decision on a public key comparison, which means that between any two nodes only one of them will ever drop the connection in such cases. The problematic case is when a node disconnects and reconnects within a short interval, which is typical of mobile devices. In that case it can take as long as the configured "timeout" value for the remote node to detect the "disconnection" (and try to reconnect if the connection is persistent). If that node is also the one with the "smaller" public key, the reconnect attempts of the other node will be rejected, making a fast reconnect impossible.

The solution is to only drop the connection if we already have a connected peer that is of the opposite direction from the new connection. By doing so the "initiator" is able to replace the connection and reconnect immediately.
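
The difference between the previous rule and the new one, as the commit message describes them, can be sketched as follows from the point of view of the node receiving an inbound connection from a peer it believes is already connected. This is a paraphrase, not the server.go code: the function names are hypothetical, and the assumption that the node with the "smaller" key keeps its existing connection follows the wording above.

```go
package main

import (
	"bytes"
	"fmt"
)

// keepExistingByKey is the key-based tiebreak (assumption: the node with the
// "smaller" serialized public key keeps its existing connection).
func keepExistingByKey(localPub, remotePub []byte) bool {
	return bytes.Compare(localPub, remotePub) < 0
}

// rejectInboundOld paraphrases the previous rule: a new inbound connection is
// rejected whenever the key comparison favors the existing one, regardless of
// the existing connection's direction.
func rejectInboundOld(localPub, remotePub []byte) bool {
	return keepExistingByKey(localPub, remotePub)
}

// rejectInboundNew paraphrases the new rule: the tiebreak only applies when
// the existing connection has the opposite direction (outbound). An existing
// inbound connection is treated as stale and replaced.
func rejectInboundNew(existingInbound bool, localPub, remotePub []byte) bool {
	if existingInbound {
		return false
	}
	return keepExistingByKey(localPub, remotePub)
}

func main() {
	remoteNode := []byte{0x02, 0x01} // always-on node with the "smaller" key
	mobile := []byte{0x03, 0x01}     // mobile node that just switched networks

	// The remote node still holds a stale inbound connection from the mobile
	// node, and the mobile node dials in again.
	fmt.Println("old rule rejects the reconnect:", rejectInboundOld(remoteNode, mobile))
	fmt.Println("new rule rejects the reconnect:", rejectInboundNew(true, remoteNode, mobile))

	// Simultaneous dial: the existing connection is outbound, so the
	// tiebreak still decides.
	fmt.Println("new rule rejects on simultaneous dial:", rejectInboundNew(false, remoteNode, mobile))
}
```

In this sketch the two rules differ only in the stale-inbound case, which is the mobile reconnect scenario; the simultaneous-dial case is decided by the key comparison under both.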

roeierez · 2019-02-18T17:13:00Z

> Commits not compliant with the contributor guidelines:

Sorry about that. I read these two sections and did my best to follow them:

- The commit message now contains the required header and prefix.
- The commit message is wrapped at 72 characters, with two paragraphs.
- Squashed to one commit.

Please LMK if anything else is needed.

Roasbeef

LGTM ✨

Roasbeef · 2019-03-26T23:13:05Z

Have been running into this lately on mainnet running some patches that increase the stability of peer connections. Gave this a spin and it resolved the issues I was seeing!

halseth added the p2p (Code related to the peer-to-peer behaviour), networking, and mobile labels on Oct 29, 2018

Roasbeef requested a review from cfromknecht December 4, 2018 04:07

roeierez force-pushed the fix_reconnect branch from b243a7d to 6148335 on January 11, 2019 09:03

halseth reviewed Jan 16, 2019

halseth approved these changes Jan 31, 2019

cfromknecht approved these changes Feb 13, 2019

Roasbeef requested changes Feb 14, 2019

Roasbeef reviewed Feb 14, 2019

roeierez force-pushed the fix_reconnect branch from 3a72921 to 6c256ae on February 18, 2019 17:08

Roasbeef approved these changes Mar 26, 2019

Roasbeef merged commit 1a8e4b0 into lightningnetwork:master Mar 26, 2019

7 participants
