Maintain permanent connections for active channels#1322
Conversation
There was a problem hiding this comment.
Mostly OK, but this uses the same mechanism that connect command does.
Should we worry about the case:
- An "important" peer with a channel disappears forever.
- gossipd goes on a loop attempting to retry that peer forever.
- User (or automated pilot) attempts a
connectcommand.
It seems that 3 above will never return, as we never send a gossip_peer_connection_failed to trigger the connect command to return with failure.
An alternative is for connect JSON-RPC to return with a timeout separately from the timeout in gossipd.
This is worrisome as naive implementation of pilot program might just connect-fundchannel without bothering to check listpeers.
|
|
||
| static void peer_closing_complete(struct channel *channel, const u8 *msg) | ||
| { | ||
| u8 *msg; |
There was a problem hiding this comment.
msg is parameter of this function and is also a local variable.
|
Other than the Indeed |
User will get an error: Perhaps we should probably stop the multiple connect attempts now for manual connects, which would allow us to fail faster. But generally, I think trying forever is correct in this case: if we have an HTLC we'll eventually force close, and when we implement the json_close timeout logic that would do the same thing. |
917b59f to
3078658
Compare
|
Fixed shadowed variable, missing |
|
The code quoted is called by |
|
Note that kicking the timer in the case "peer is important, peer is unavailable, user issued connect command to peer" is probably a Good Thing. Consider the case where the peer becomes unavailable for a while, thus triggering the exponential backoff until the 5-minute limit. Then the user learns that the peer is now available via some out-of-band method (phone call, IRC message, smoke signals, telepathy, psychohistorical trends of society). It would be better if the user at least has the ability to trigger connection with that peer immediately, than wait for the next 5-minute cycle. So I propose that we split the reaching system into two parts:
A |
|
Good point. I agree, let's make connect a one-shot, and have a separate system for important peers. |
| /* FIXME: Exponential backoff! */ | ||
| status_trace("...will try again in %u seconds", 5); | ||
| /* If important_id freed, this will be removed too */ | ||
| new_reltimer(&reach->daemon->timers, imp, |
There was a problem hiding this comment.
Should we not keep the returned struct oneshot in the struct important_peerid? And before making a new timer, we should free any existing timers? Otherwise connect commands may cause multiple overlapping timers to exist (I think?).
/* Kick the timer if present, or start a new one if not present. */
tal_free(imp->timer);
imp->timer = new_reltimer(&reach->daemon->timers, imp, time_from_sec(5), retry_important, imp);
The retry_important function also needs to clear this timer:
imp->timer = tal_free(imp->timer);
There was a problem hiding this comment.
Good point, we will end up with one timer per connect for every important peer. Will fix, though not here.
19a1b01 to
f9f38b1
Compare
|
OK, fixed python whitespace and invalid assertion. And added a commit which prevents multiple timers running for multiple connect commands to the same (important) peer; thanks @ZmnSCPxj ! |
| /* If this is an important peer, free any outstanding timer */ | ||
| imp = important_peerid_map_get(&daemon->important_peerids, &id); | ||
| if (imp) | ||
| imp->reconnect_timer = tal_free(imp->reconnect_timer); |
There was a problem hiding this comment.
At first glance imp->reconnect_timer seems not initialized when new important_peerid is created. However important_peerid is created only one place, and, it calls retry_important immediately, which sets imp->reconnect_timer to NULL, so no issue here.
|
ACK f9f38b1 |
|
It is failing in CI, but I am uncertain why; the CI error pertains to openingd, not gossipd...? It seems the funder openingd died because the fundee openingd disconnected, while the fundee can see its own openingd disconnected from itself but did not die. Some edge case of connecting? Perhaps a reconnect try is still spuriously triggering? |
|
Fixing that turned out to be more surgery than I anticipated! Added WIP since I'm pushing an update to check CI... |
d7df56d to
0d8dffd
Compare
|
The fast reconnect seems to be triggering a heap of fun bugs. Will spend more time diagnosing tomorrow. |
f4460ca to
6eccc3f
Compare
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
These don't have a maximum number of reconnect attempts, and ensure that we try to reconnect when the peer dies. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…place. And on channel_fail_permanent and closing (the two places we drop to chain), we tell gossipd it's no longer important. Fixes: ElementsProject#1316 Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rather than using a flag in reaching/peer; we make it self-contained as the next patch puts it straight into a timer callback. Also remove unused 'succeeded' field from struct peer. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
| } else { | ||
| command_fail(c->cmd, "%s", err); | ||
| /* We can have multiple connect commands: complete them all */ | ||
| while ((c = find_connect(ld, &id)) != NULL) { |
There was a problem hiding this comment.
find_connect loops over the connect commands, and this loop also loops over find_connect. More efficient to just use list_safe_for_each and resolve each matching command. Acceptable for now though, we really need this merged.
There was a problem hiding this comment.
Agreed, it's N^2 but as you say, N small and we don't really care, so simplicity wins.
|
Travis seems confused, didn't start? Will rebase to fix conflict... |
76ea288 to
c172384
Compare
|
This time, fixed conflict properly... |
…d don't retry 1. Lifetime of 'struct reaching' now only while we're actively doing connect. 2. Always free after a single attempt: if it's an important peer, retry on a timer. 3. Have a single response message to master, rather than relying on peer_connected on success and other msgs on failure. 4. If we are actively connecting and we get another command for the same id, just increment the counter The result is much simpler in the master daemon, and much nicer for reconnection: if they say to connect they get an immediate response, rather than waiting for 10 retries. Even if it's an important peer, it fires off another reconnect attempt, unless it's actively connecting now. This removes exponential backoff: that's restored in next patch. It also doesn't handle multiple addresses for a single peer. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…tion. It only lasts until the next block, but it's weird. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This was sitting in my gossip-enchancement patch queue, but it simplifies this set too, so I moved it here). In 9471196 we added an explicit gossip_index so when gossipd gets peers back from other daemons, it knows what gossip it has sent (since gossipd can send gossip after the other daemon is already complete). This solution is insufficient for the more general case where gossipd wants to send other messages reliably, so replace it with the other solution: have gossipd drain the "gossip fd" which the daemon returns. This turns out to be quite simple, and is probably how I should have done it originally :( Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This means that openingd and closingd now forward our gossip. But the real reason we want to do this is that it gives an easy way for gossipd to kill any active daemon, by closing its fd: previously closingd and openingd didn't read the fd, so tended not to notice. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This comes in useful for the next patch. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Currently we intuit it from the fd being closed, but that may happen out of order with when the master thinks it's dead. So now if the gossip fd closes we just ignore it, and we'll get a notification from the master when the peer is disconnected. The notification is slightly ugly in that we have to disable it for a channel when we manually hand the channel back to gossipd. Note: as stands, this is racy with reconnects. See the next patch. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
When we get a reconnection, kill the current remote peer, and wait for the master to tell us it's dead. Then we hand it the new peer. Previously, we would end up with gossipd holding multiple peers, and the logging was really hard to interpret; I'm not completely convinced that we did the right thing when one terminated, either. Note that this now means we can have peers with neither ->local nor ->remote populated, so we check that more carefully. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We start at 1 second, back off to 5 minutes. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If a channel is active (ie. not onchaind) and has an owner, this should be equivalent. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
At least say whether we failed to connect at all, or failed cryptographic handshake, or failed reading/writing init messages. The errno can be "Operation now in progress" if the other end closes the socket on us: this happens when we handshake with the wrong key and it hangs up on us. Fixing this would require work on ccan/io though. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
When a connect fails, if it's an important peer, we set a timer. If we have a manual connect command, this means we do this again, leading to another timer. For a manual command, free any existing timer; the normal fail logic will start another if necessary. Reported-by: @ZmnSCPxj Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If channeld dies for some reason (eg, reconnect) and we didn't yet announce the channel, we can miss doing so. This is unusual, because if lightningd restarts it rearms the callback which gives us funding_locked, so it only happens if just channel dies before sending the announcement message. This problem applies to both temporary announcement (for gossipd) and the real one. For the temporary one, simply re-send on startup, and remote the error msg gossipd gives if it sees a second one. For the real one, we need a flag to tell us the depth is sufficient; the peer will ignore re-sends anyway. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…o_chain. Reported-by: @ZmnSCPxj Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Old gossip is rarely interesting. Reported-by: @ZmnSCPxj Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…tningd. Christian points out that this is the pattern used elsewhere, for example. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
c172384 to
a553b44
Compare
|
Holiday with kids means less time to code :) Or, put another way: parenting is my chosen career, which demanded a full week on-site. Thus my coding hobby suffered a small setback. Thanks for your patience, everyone! |
Fixes: #1281