Maintain permanent connections for active channels by rustyrussell · Pull Request #1322 · ElementsProject/lightning

rustyrussell · 2018-04-04T03:35:21Z

ZmnSCPxj

Mostly OK, but this uses the same mechanism that connect command does.

Should we worry about the case:

An "important" peer with a channel disappears forever.
gossipd goes on a loop attempting to retry that peer forever.
User (or automated pilot) attempts a connect command.

It seems that 3 above will never return, as we never send a gossip_peer_connection_failed to trigger the connect command to return with failure.

An alternative is for connect JSON-RPC to return with a timeout separately from the timeout in gossipd.

This is worrisome as naive implementation of pilot program might just connect-fundchannel without bothering to check listpeers.

ZmnSCPxj · 2018-04-04T06:08:05Z


 static void peer_closing_complete(struct channel *channel, const u8 *msg)
 {
+	u8 *msg;


msg is parameter of this function and is also a local variable.

cdecker · 2018-04-04T12:36:41Z

Other than the msg param being shadowed, this is good to go :-)

Indeed clang seems to be a bit more pedantic in this case:

lightningd/closing_control.c:98:6: error: redefinition of 'msg' with a different type: 'u8 *' (aka 'unsigned char *') vs 'const u8 *'
      (aka 'const unsigned char *')
        u8 *msg;
            ^
lightningd/closing_control.c:96:70: note: previous definition is here
static void peer_closing_complete(struct channel *channel, const u8 *msg)
                                                                     ^
lightningd/closing_control.c:113:8: error: implicit declaration of function 'towire_gossipctl_peer_important' is invalid in C99
      [-Werror,-Wimplicit-function-declaration]
        msg = towire_gossipctl_peer_important(NULL, &channel->peer->id, false);
              ^
lightningd/closing_control.c:113:6: error: incompatible integer to pointer conversion assigning to 'const u8 *'
      (aka 'const unsigned char *') from 'int' [-Werror,-Wint-conversion]
        msg = towire_gossipctl_peer_important(NULL, &channel->peer->id, false);
            ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3 errors generated.

rustyrussell · 2018-04-05T04:22:16Z

Should we worry about the case:

An "important" peer with a channel disappears forever.

gossipd goes on a loop attempting to retry that peer forever.

User (or automated pilot) attempts a connect command.

User will get an error:

if (find_reaching(daemon, id)) {
	/* FIXME: Perhaps kick timer in this case? */
	status_trace("try_reach_peer: already trying to reach %s",
		     type_to_string(tmpctx, struct pubkey, id));
	return false;
}

Perhaps we should probably stop the multiple connect attempts now for manual connects, which would allow us to fail faster.

But generally, I think trying forever is correct in this case: if we have an HTLC we'll eventually force close, and when we implement the json_close timeout logic that would do the same thing.

rustyrussell · 2018-04-05T04:22:50Z

Fixed shadowed variable, missing #include.

ZmnSCPxj · 2018-04-05T06:08:03Z

The code quoted is called by reach_peer, which only sends a message to masterd if the quoted code returns true. If it returns false, nothing is done and the connect command stalls.

ZmnSCPxj · 2018-04-05T12:22:24Z

Note that kicking the timer in the case "peer is important, peer is unavailable, user issued connect command to peer" is probably a Good Thing. Consider the case where the peer becomes unavailable for a while, thus triggering the exponential backoff until the 5-minute limit. Then the user learns that the peer is now available via some out-of-band method (phone call, IRC message, smoke signals, telepathy, psychohistorical trends of society). It would be better if the user at least has the ability to trigger connection with that peer immediately, than wait for the next 5-minute cycle.

So I propose that we split the reaching system into two parts:

A subsystem that makes a single attempt to try, and then trigger everyone interested in connecting to that peer, informing them of success or failure. It sends gossip_peer_connected gossip_peer_already_connected and gossip_peer_connection_failed to the master.
Another sub-system that handles the reconnect to "important" peers with exponential backoff limited to 5 minutes. After a timeout (struct oneshot instance) is triggered, it simply frees the timeout object and triggers the single-connect-try subsystem for the important peer if it is not connected. On any failure connection to an "important" peer it clears any timeout it has for that peer and creates a new timeout object.

A connect command would then simply trigger the single-connect-try subsystem, possibly doing its own retries independently of the auto-reconnect subsystem.

rustyrussell · 2018-04-06T00:59:53Z

Good point. I agree, let's make connect a one-shot, and have a separate system for important peers.

ZmnSCPxj · 2018-04-06T07:47:14Z

+		/* FIXME: Exponential backoff! */
+		status_trace("...will try again in %u seconds", 5);
+		/* If important_id freed, this will be removed too */
+		new_reltimer(&reach->daemon->timers, imp,


Should we not keep the returned struct oneshot in the struct important_peerid? And before making a new timer, we should free any existing timers? Otherwise connect commands may cause multiple overlapping timers to exist (I think?).

/* Kick the timer if present, or start a new one if not present. */ tal_free(imp->timer); imp->timer = new_reltimer(&reach->daemon->timers, imp, time_from_sec(5), retry_important, imp);

The retry_important function also needs to clear this timer:

imp->timer = tal_free(imp->timer);

Good point, we will end up with one timer per connect for every important peer. Will fix, though not here.

rustyrussell · 2018-04-06T21:24:13Z

OK, fixed python whitespace and invalid assertion.

And added a commit which prevents multiple timers running for multiple connect commands to the same (important) peer; thanks @ZmnSCPxj !

ZmnSCPxj · 2018-04-07T00:55:14Z

+	/* If this is an important peer, free any outstanding timer */
+	imp = important_peerid_map_get(&daemon->important_peerids, &id);
+	if (imp)
+		imp->reconnect_timer = tal_free(imp->reconnect_timer);


At first glance imp->reconnect_timer seems not initialized when new important_peerid is created. However important_peerid is created only one place, and, it calls retry_important immediately, which sets imp->reconnect_timer to NULL, so no issue here.

ZmnSCPxj · 2018-04-07T00:57:50Z

ACK f9f38b1

ZmnSCPxj · 2018-04-07T01:08:00Z

It is failing in CI, but I am uncertain why; the CI error pertains to openingd, not gossipd...? It seems the funder openingd died because the fundee openingd disconnected, while the fundee can see its own openingd disconnected from itself but did not die. Some edge case of connecting? Perhaps a reconnect try is still spuriously triggering?

rustyrussell · 2018-04-09T07:24:55Z

Fixing that turned out to be more surgery than I anticipated! Added WIP since I'm pushing an update to check CI...

rustyrussell · 2018-04-10T10:49:52Z

The fast reconnect seems to be triggering a heap of fun bugs. Will spend more time diagnosing tomorrow.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

These don't have a maximum number of reconnect attempts, and ensure that we try to reconnect when the peer dies. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

…place. And on channel_fail_permanent and closing (the two places we drop to chain), we tell gossipd it's no longer important. Fixes: ElementsProject#1316 Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

Rather than using a flag in reaching/peer; we make it self-contained as the next patch puts it straight into a timer callback. Also remove unused 'succeeded' field from struct peer. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

ZmnSCPxj · 2018-04-26T02:22:54Z

-	} else {
-		command_fail(c->cmd, "%s", err);
+	/* We can have multiple connect commands: complete them all */
+	while ((c = find_connect(ld, &id)) != NULL) {


find_connect loops over the connect commands, and this loop also loops over find_connect. More efficient to just use list_safe_for_each and resolve each matching command. Acceptable for now though, we really need this merged.

Agreed, it's N^2 but as you say, N small and we don't really care, so simplicity wins.

rustyrussell · 2018-04-26T04:20:27Z

Travis seems confused, didn't start? Will rebase to fix conflict...

rustyrussell · 2018-04-26T04:48:39Z

This time, fixed conflict properly...

…d don't retry 1. Lifetime of 'struct reaching' now only while we're actively doing connect. 2. Always free after a single attempt: if it's an important peer, retry on a timer. 3. Have a single response message to master, rather than relying on peer_connected on success and other msgs on failure. 4. If we are actively connecting and we get another command for the same id, just increment the counter The result is much simpler in the master daemon, and much nicer for reconnection: if they say to connect they get an immediate response, rather than waiting for 10 retries. Even if it's an important peer, it fires off another reconnect attempt, unless it's actively connecting now. This removes exponential backoff: that's restored in next patch. It also doesn't handle multiple addresses for a single peer. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

…tion. It only lasts until the next block, but it's weird. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

(This was sitting in my gossip-enchancement patch queue, but it simplifies this set too, so I moved it here). In 9471196 we added an explicit gossip_index so when gossipd gets peers back from other daemons, it knows what gossip it has sent (since gossipd can send gossip after the other daemon is already complete). This solution is insufficient for the more general case where gossipd wants to send other messages reliably, so replace it with the other solution: have gossipd drain the "gossip fd" which the daemon returns. This turns out to be quite simple, and is probably how I should have done it originally :( Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

This means that openingd and closingd now forward our gossip. But the real reason we want to do this is that it gives an easy way for gossipd to kill any active daemon, by closing its fd: previously closingd and openingd didn't read the fd, so tended not to notice. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

This comes in useful for the next patch. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

Currently we intuit it from the fd being closed, but that may happen out of order with when the master thinks it's dead. So now if the gossip fd closes we just ignore it, and we'll get a notification from the master when the peer is disconnected. The notification is slightly ugly in that we have to disable it for a channel when we manually hand the channel back to gossipd. Note: as stands, this is racy with reconnects. See the next patch. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

When we get a reconnection, kill the current remote peer, and wait for the master to tell us it's dead. Then we hand it the new peer. Previously, we would end up with gossipd holding multiple peers, and the logging was really hard to interpret; I'm not completely convinced that we did the right thing when one terminated, either. Note that this now means we can have peers with neither ->local nor ->remote populated, so we check that more carefully. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

We start at 1 second, back off to 5 minutes. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

If a channel is active (ie. not onchaind) and has an owner, this should be equivalent. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

At least say whether we failed to connect at all, or failed cryptographic handshake, or failed reading/writing init messages. The errno can be "Operation now in progress" if the other end closes the socket on us: this happens when we handshake with the wrong key and it hangs up on us. Fixing this would require work on ccan/io though. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

@ZmnSCPxj

When a connect fails, if it's an important peer, we set a timer. If we have a manual connect command, this means we do this again, leading to another timer. For a manual command, free any existing timer; the normal fail logic will start another if necessary. Reported-by: @ZmnSCPxj Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

If channeld dies for some reason (eg, reconnect) and we didn't yet announce the channel, we can miss doing so. This is unusual, because if lightningd restarts it rearms the callback which gives us funding_locked, so it only happens if just channel dies before sending the announcement message. This problem applies to both temporary announcement (for gossipd) and the real one. For the temporary one, simply re-send on startup, and remote the error msg gossipd gives if it sees a second one. For the real one, we need a flag to tell us the depth is sufficient; the peer will ignore re-sends anyway. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

@ZmnSCPxj

…o_chain. Reported-by: @ZmnSCPxj Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

@ZmnSCPxj

Old gossip is rarely interesting. Reported-by: @ZmnSCPxj Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

…tningd. Christian points out that this is the pattern used elsewhere, for example. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

rustyrussell · 2018-04-26T05:49:05Z

Holiday with kids means less time to code :)

Or, put another way: parenting is my chosen career, which demanded a full week on-site. Thus my coding hobby suffered a small setback.

Thanks for your patience, everyone!

rustyrussell requested review from ZmnSCPxj and cdecker April 4, 2018 03:35

ZmnSCPxj reviewed Apr 4, 2018

View reviewed changes

ZmnSCPxj mentioned this pull request Apr 4, 2018

connect: Add tries and tryrate arguments. #1320

Closed

ZmnSCPxj suggested changes Apr 4, 2018

View reviewed changes

rustyrussell force-pushed the guilt/permconnect branch from 917b59f to 3078658 Compare April 5, 2018 04:22

ZmnSCPxj reviewed Apr 6, 2018

View reviewed changes

ZmnSCPxj mentioned this pull request Apr 6, 2018

Add support for local socket connections #1323

Closed

rustyrussell force-pushed the guilt/permconnect branch from 19a1b01 to f9f38b1 Compare April 6, 2018 21:23

ZmnSCPxj approved these changes Apr 7, 2018

View reviewed changes

rustyrussell changed the title ~~Maintain permanent connections for active channels~~ WIP: Maintain permanent connections for active channels Apr 9, 2018

rustyrussell force-pushed the guilt/permconnect branch 4 times, most recently from d7df56d to 0d8dffd Compare April 10, 2018 07:17

rustyrussell force-pushed the guilt/permconnect branch 4 times, most recently from f4460ca to 6eccc3f Compare April 12, 2018 01:37

rustyrussell added 5 commits April 25, 2018 22:05

gossipd: exponential backoff for reconnect (5 minute ceiling).

3524a32

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

gossipd: explicitly track which peers are important.

f000f52

These don't have a maximum number of reconnect attempts, and ensure that we try to reconnect when the peer dies. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

openingd: tell gossipd that the peer is important once funding tx in …

832fbf3

…place. And on channel_fail_permanent and closing (the two places we drop to chain), we tell gossipd it's no longer important. Fixes: ElementsProject#1316 Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

lightningd: tell gossipd that peers we load from db are important.

e0686ba

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

ZmnSCPxj approved these changes Apr 26, 2018

View reviewed changes

rustyrussell force-pushed the guilt/permconnect branch 2 times, most recently from 76ea288 to c172384 Compare April 26, 2018 04:48

rustyrussell added 15 commits April 26, 2018 14:20

closing: don't go into temporary failure because we completed negotia…

39574c0

…tion. It only lasts until the next block, but it's weird. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

subd: record which ones connect to a peer.

174e578

This comes in useful for the next patch. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

gossipd: use exponential backoff on reconnect for important peers.

60e9d66

We start at 1 second, back off to 5 minutes. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

json_listpeers: use channel connected flag for JSON.

c34ac8b

If a channel is active (ie. not onchaind) and has an owner, this should be equivalent. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

lightningd: move "tell gossipd peer is no longer important" to drop_t…

69fa1cc

…o_chain. Reported-by: @ZmnSCPxj Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

common: typo fix.

e9da2e1

Old gossip is rarely interesting. Reported-by: @ZmnSCPxj Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

gossipd: don't handle multiple connect requests, combine them in ligh…

a553b44

…tningd. Christian points out that this is the pattern used elsewhere, for example. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

rustyrussell force-pushed the guilt/permconnect branch from c172384 to a553b44 Compare April 26, 2018 04:51

rustyrussell merged commit 83e8475 into ElementsProject:master Apr 26, 2018

This was referenced Apr 26, 2018

gossip: Use gossiped node_announcement to locate nodes. #1418

Merged

test_lightningd.py: Make test_penalty_inhtlc not fail due to reconnect. #1422

Closed

Conversation

rustyrussell commented Apr 4, 2018 • edited by ZmnSCPxj Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ZmnSCPxj left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ZmnSCPxj Apr 4, 2018

Choose a reason for hiding this comment

Uh oh!

cdecker commented Apr 4, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rustyrussell commented Apr 5, 2018

Uh oh!

rustyrussell commented Apr 5, 2018

Uh oh!

ZmnSCPxj commented Apr 5, 2018

Uh oh!

ZmnSCPxj commented Apr 5, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rustyrussell commented Apr 6, 2018

Uh oh!

ZmnSCPxj Apr 6, 2018

Choose a reason for hiding this comment

Uh oh!

rustyrussell Apr 6, 2018

Choose a reason for hiding this comment

Uh oh!

rustyrussell commented Apr 6, 2018

Uh oh!

ZmnSCPxj Apr 7, 2018

Choose a reason for hiding this comment

Uh oh!

ZmnSCPxj commented Apr 7, 2018

Uh oh!

ZmnSCPxj commented Apr 7, 2018

Uh oh!

rustyrussell commented Apr 9, 2018

Uh oh!

rustyrussell commented Apr 10, 2018

Uh oh!

ZmnSCPxj Apr 26, 2018

Choose a reason for hiding this comment

Uh oh!

rustyrussell Apr 26, 2018

Choose a reason for hiding this comment

Uh oh!

rustyrussell commented Apr 26, 2018

Uh oh!

rustyrussell commented Apr 26, 2018

Uh oh!

rustyrussell commented Apr 26, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

rustyrussell commented Apr 4, 2018 •

edited by ZmnSCPxj

Loading

ZmnSCPxj left a comment •

edited

Loading

cdecker commented Apr 4, 2018 •

edited

Loading

ZmnSCPxj commented Apr 5, 2018 •

edited

Loading