Skip to content

Reconnect to peer when switching networks#2058

Merged
Roasbeef merged 1 commit into
lightningnetwork:masterfrom
roeierez:fix_reconnect
Mar 26, 2019
Merged

Reconnect to peer when switching networks#2058
Roasbeef merged 1 commit into
lightningnetwork:masterfrom
roeierez:fix_reconnect

Conversation

@roeierez
Copy link
Copy Markdown
Contributor

This PR solves the following use case:


  1. connect to other node from a node running on a mobile device and open a channel.
  2. switch networks from wifi to cellular
  3. it takes more than 1 minute for the mobile node to re-connect to the remote node.

My Investigation shows that when switching networks the connection immediately closes on the mobile but stays open on the remote node until timed out.

During this time further attempts to connect are done from the mobile but the remote node rejects them preferring the local connection using the following logic: https://github.com/lightningnetwork/lnd/blob/master/server.go#L2047
It looks like there is an implicit assumption at this point that the existing peer local connection is outbound, which is not the case I am describing (in my case both existing and new connection are inbound), therefore resulting in subsequent rejects.
This PR solves this issue by allowing a subsequent connection of the same type (inbound, outbound) to replace the old connection and enable both nodes to connect immediately.
It also maintain the logic to deterministic choose a connection using public keys comparison when both nodes try to connect simultaneously.

@halseth halseth added p2p Code related to the peer-to-peer behaviour networking mobile labels Oct 29, 2018
@Roasbeef Roasbeef requested a review from cfromknecht December 4, 2018 04:07
@cfromknecht
Copy link
Copy Markdown
Contributor

Hi @roeierez! Thanks for the detailed description of the issue, indeed I think does solve the issue you're having. If i understand the changes, your proposal is to only apply the connection dropping if we already have a connected peer that is of the opposite direction from the fresh connection?

One side-effect of this is that a peer could continue to make inbound requests to me, and cause me to spin up and tear down peers endlessly. This isn't necessarily a problem on its own, though it will lead to higher resource usage than is otherwise possible with the current logic that always favors the connection that wins the tiebreaker.

At worst however, it would seem this is no worse than a peer that continually connects and instantly disconnects. We do have plans to extend lnd's internal DOS engine to be able to monitor and account for such behavior, though I think we can address that when that time comes.

In the meantime, would mind giving this a rebase to have the tests run on the current master?

@roeierez
Copy link
Copy Markdown
Contributor Author

Rebased.

your proposal is to only apply the connection dropping if we already have a connected peer that is of the opposite direction from the fresh connection?

Yes, that's right.

@cfromknecht
Copy link
Copy Markdown
Contributor

Thanks @roeierez will run this on my node to test out the behavior

Comment thread server.go Outdated
@mandelmonkey
Copy link
Copy Markdown

I tested this out doing the following

  1. set up a node using this PR
  2. connect mobile lnd to yalls testnet node and the node setup in 1.
    3.connect to wifi, opened channels to the two nodes above
  3. paid invoices succesfully on yalls and step 1.node
    5.switched to 4g network, un able to pay invoice on either
    6.waited 1 min, yalls channel is offline, step 1. node is online and can pay invoice on step 1 node but not yalls
  4. waited 5 mins both channels online

so seems the node with the PR channel comes online faster than a node not running it

@junderw
Copy link
Copy Markdown
Contributor

junderw commented Jan 31, 2019

This is a must have for mobile neutrino wallets.

Copy link
Copy Markdown
Contributor

@halseth halseth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 💯

Copy link
Copy Markdown
Contributor

@cfromknecht cfromknecht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🔥Thanks for the fix @roeierez! Seems the general consensus is that this helps a bunch w/ switching networks, so I'm in favor of giving this a shot

@roeierez
Copy link
Copy Markdown
Contributor Author

Thanks for the fix

@cfromknecht Sure, happy to help. I also totally understand/appreciate the extra caution here. I, myself running/testing this PR for a long time and haven't experienced any problem. Please LMK if anything arises as a result of this fix, I will do my best to help.

@cfromknecht
Copy link
Copy Markdown
Contributor

Awesome thanks! Yeah there be dragons in this area. Server/peer changes always require the most real world testing, as sometimes the interactions compound in unexpected ways

@weaklysubjective
Copy link
Copy Markdown

@cfromknecht speaking of real world testing i happen to do that for my app. I'm using 0.5.1-beta commit, gRPC and applied this patch, as well as changed defaultBackoff to 'Seconds'. I used a iphone 5s on LTE (Node B) to connect to another iphone 6S Plus (Node A) also on LTE. 5S initiated the connection, 6SPlus is remote.

  1. I had 6S profiled on Xcode and i could see 6S the remote learning of the timeout/disconnect first, and sets its peer count accordingly.
  2. It takes a minute more (always seems to be a minute which might indicate the timeout read) for the initiating node 5S to recognize the severed connection and then it resets its peer count accordingly.
  3. There are no channels between the nodes
  4. LND is listening on both nodes at this time, its just the networking level connectivity is gone.
  5. During this interval Node B can send coins to Node A (mainly because sending coins is not a network level operation ?)
  6. Attempts to make payments by Node B (because it still thinks the peer is connected), goes as far as starting the funding workflow, and then fails - obviously due to the lack of networking level connectivity. Unless the app catches this, the user is clueless because they would think the channel is being opened. An application can reasonably easily catch this by correlating the timestamp when the funding workflow started and an unconfirmed tx was inserted (if ever). If no tx was inserted then an app would know channel opening failed (due to whatever reason)
  7. It doesn't look like Node B that originally initiated the connection is reconnecting (as the peer count doesn't go up)
  8. Connection is not timed out or broken if the node is a 6s plus also on LTE connected to a WIFI node on iPad. Connection is stable as long as the peers are in the foreground. If one peer is backgrounded, LND peer stays connected for about 5 minutes (which may be a timeout of sorts).
  9. Config #8 is not true for the LTE node B (iphone 5s) ..which disconnects even if peer is on a wifi.
  10. I am unable to comment on the effect or lack thereof of this PR, because in all my testing the 5S (node B) consistently drops connection and never reconnects.
  11. It is possible a lot of these issues show up in mobile (especially iPhones) because Apple is probably using Multi-path TCP. When connections are switched not so much from WIFI to LTE or 4G, rather from one radio to another, for efficient energy use or a plethora of other reasons, we then lose connectivity.
  12. As to why this multi-path is an issue in some iphones and not in others, is still a mystery to me as i continue to investigate.

Appreciate all you guys hard work and i hope this helps.
Long time member of the community on slack as Paul.

Copy link
Copy Markdown
Member

@Roasbeef Roasbeef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commits not compliant with the contributor guidelines: https://github.com/lightningnetwork/lnd/blob/master/docs/code_contribution_guidelines.md#ModelGitCommitMessages (next section too)

Also we might want to consider modifying the write timeout also, it's at 50s atm

Comment thread server.go Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments here and below no longer match the rationale behind the logic.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the comments to be more specific and to contain description for the additional logic.

Comment thread server.go Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

already has -> already have

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@weaklysubjective
Copy link
Copy Markdown

@Roasbeef - even though i wrote above (point #2 that it takes a minute), i must admit i wasn't sure of this. TBH it actually took 50 secs, i can second your comment.

@Roasbeef
Copy link
Copy Markdown
Member

Roasbeef commented Feb 15, 2019 via email

@weaklysubjective
Copy link
Copy Markdown

will try lowering the value, however that will only help for the node to pick up connection loss sooner than later. It still doesn't reconnect.

This commit modified the condition to whether drop an existing
connection to a peer when a new connection to this peer is
established.
The previous algorithm used public keys comparison for this decision
which determines that between every two nodes only one of them will
ever drop the connection in such cases.
The problematic case is when a node disconnects and reconnects in a
short interval which is the case of mobile devices.
In such case it takes as much as the "timeout" configured value for
the remote node to detect the "disconnection" (and try to reconnect
if this connection is persistent). In the case this node is also the
one that has the "smaller" public key the reconnect attempts of the
other node will be rejected causing it impossible to fast reconnect.

The solution is to only drop the connection if if we already have a
connected peer that is of the opposite direction from the this new
connection. By doing so the "initiator" will be enabled to replace
the connection and recconnect immediately.
@roeierez
Copy link
Copy Markdown
Contributor Author

roeierez commented Feb 18, 2019

Commits not compliant with the contributor guidelines:

Sorry for that, I read these two sections and did my best to follow them:

  1. Commit message now contains the header and prefix needed.
  2. Commit message is wrapped by 72 characters with two paragraphs.
  3. Squashed to one commit.
    Please LMK if anything else is needed.

Copy link
Copy Markdown
Member

@Roasbeef Roasbeef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM ✨

@Roasbeef
Copy link
Copy Markdown
Member

Have been running into this lately on mainnet running some patches that increase the stability of peer connections. Gave this a spin and it resolved the issues I was seeing!

@Roasbeef Roasbeef merged commit 1a8e4b0 into lightningnetwork:master Mar 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

mobile networking p2p Code related to the peer-to-peer behaviour

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants