Reconnect to peer when switching networks #2058
|
Hi @roeierez! Thanks for the detailed description of the issue; indeed, I think this does solve the issue you're having. If I understand the changes, your proposal is to only apply the connection dropping if we already have a connected peer that is of the opposite direction from the fresh connection? One side effect of this is that a peer could continue to make inbound requests to me and cause me to spin up and tear down peers endlessly. This isn't necessarily a problem on its own, though it will lead to higher resource usage than is possible with the current logic, which always favors the connection that wins the tiebreaker. At worst, however, it would seem this is no worse than a peer that continually connects and instantly disconnects. We do have plans to extend lnd's internal DoS engine to monitor and account for such behavior, though I think we can address that when the time comes. In the meantime, would you mind giving this a rebase to have the tests run on the current master? |
b243a7d to
6148335
|
Rebased.
Yes, that's right. |
|
Thanks @roeierez will run this on my node to test out the behavior |
|
I tested this out doing the following
so seems the node with the PR channel comes online faster than a node not running it |
|
This is a must have for mobile neutrino wallets. |
cfromknecht
left a comment
LGTM 🔥 Thanks for the fix @roeierez! Seems the general consensus is that this helps a bunch w/ switching networks, so I'm in favor of giving this a shot
@cfromknecht Sure, happy to help. I also totally understand/appreciate the extra caution here. I have been running/testing this PR myself for a long time and haven't experienced any problems. Please LMK if anything arises as a result of this fix; I will do my best to help. |
|
Awesome thanks! Yeah there be dragons in this area. Server/peer changes always require the most real world testing, as sometimes the interactions compound in unexpected ways |
|
@cfromknecht speaking of real-world testing, I happen to do that for my app. I'm using the 0.5.1-beta commit with gRPC, applied this patch, and changed defaultBackoff to 'Seconds'. I used an iPhone 5s on LTE (Node B) to connect to an iPhone 6S Plus (Node A), also on LTE. The 5s initiated the connection; the 6S Plus is the remote.
I appreciate all your hard work, and I hope this helps. |
Commits not compliant with the contributor guidelines: https://github.com/lightningnetwork/lnd/blob/master/docs/code_contribution_guidelines.md#ModelGitCommitMessages (next section too)
Also we might want to consider modifying the write timeout also, it's at 50s atm
Comments here and below no longer match the rationale behind the logic.
I changed the comments to be more specific and to describe the additional logic.
|
Yeah so that controls how long the remote node will wait before it gives up
on writing to the user socket. The current value was set arbitrarily, and
looking at it afresh it seems rather high. Only if a network has excessive
packet loss, or if we're attempting to write a very very large amount
through a constrained network would we hit that value IMO.
…On Thu, Feb 14, 2019, 3:57 PM weaklysubjective ***@***.*** wrote:
@Roasbeef - even though I wrote above (point #2 that it takes
a minute), I must admit I wasn't sure of this. TBH it actually took 50
secs, so I can second your comment.
|
|
Will try lowering the value; however, that will only help the node pick up the connection loss sooner. It still doesn't reconnect. |
This commit modifies the condition for dropping an existing connection to a peer when a new connection to that peer is established. The previous algorithm made this decision by comparing public keys, which means that between any two nodes only one of them will ever drop the connection in such cases. The problematic case is when a node disconnects and reconnects within a short interval, as mobile devices do. In that case it can take as long as the configured "timeout" value for the remote node to detect the "disconnection" (and try to reconnect if the connection is persistent). If this remote node is also the one with the "smaller" public key, the reconnect attempts of the other node will be rejected, making a fast reconnect impossible. The solution is to only drop the connection if we already have a connected peer that is of the opposite direction from this new connection. By doing so, the "initiator" is able to replace the connection and reconnect immediately.
3a72921 to
6c256ae
Sorry for that, I read these two sections and did my best to follow them:
|
|
Have been running into this lately on mainnet running some patches that increase the stability of peer connections. Gave this a spin and it resolved the issues I was seeing! |
This PR solves the following use case:
My investigation shows that when switching networks the connection closes immediately on the mobile device but stays open on the remote node until it times out.
During this time the mobile device makes further attempts to connect, but the remote node rejects them, preferring the local connection, using the following logic: https://github.com/lightningnetwork/lnd/blob/master/server.go#L2047
It looks like there is an implicit assumption at this point that the existing local peer connection is outbound, which is not the case I am describing (in my case both the existing and the new connection are inbound), therefore resulting in subsequent rejects.
This PR solves this issue by allowing a subsequent connection of the same type (inbound, outbound) to replace the old connection, enabling both nodes to connect immediately.
It also maintains the logic to deterministically choose a connection using public key comparison when both nodes try to connect simultaneously.
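The decision rule described above can be sketched roughly as follows. This is an illustrative model, not the code in server.go: the `Direction` type and `replaceExisting` function are hypothetical names, keys are plain byte slices, and the tiebreak convention (the connection dialed by the node with the "smaller" key wins) is an assumption made for the example.

```go
package main

import (
	"bytes"
	"fmt"
)

// Direction records which side dialed a connection, seen from the local node.
type Direction int

const (
	Inbound  Direction = iota // the remote node dialed us
	Outbound                  // we dialed the remote node
)

// replaceExisting reports whether a new connection to an already connected
// peer should replace the existing one (hypothetical helper).
func replaceExisting(existing, incoming Direction, localPub, remotePub []byte) bool {
	// Same direction: the initiator is reconnecting (for example after a
	// network switch), so the stale connection is replaced right away.
	if existing == incoming {
		return true
	}

	// Opposite directions: both nodes dialed each other at about the same
	// time. Assumed convention: keep the connection dialed by the node
	// with the smaller public key, so both sides pick the same one.
	localIsSmaller := bytes.Compare(localPub, remotePub) < 0
	if localIsSmaller {
		return incoming == Outbound
	}
	return incoming == Inbound
}

func main() {
	local, remote := []byte{0x02, 0xaa}, []byte{0x03, 0xbb}

	// A mobile peer reconnects inbound while its old inbound connection
	// has not timed out yet.
	fmt.Println(replaceExisting(Inbound, Inbound, local, remote))

	// Simultaneous dial: existing outbound, new inbound.
	fmt.Println(replaceExisting(Outbound, Inbound, local, remote))
}
```

Because the same-direction check runs before any key comparison, the outcome of a reconnect no longer depends on which node has the smaller key, while the simultaneous-connect case stays deterministic.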