Skip to content

Maintain permanent connections for active channels#1322

Merged
rustyrussell merged 21 commits into
ElementsProject:masterfrom
rustyrussell:guilt/permconnect
Apr 26, 2018
Merged

Maintain permanent connections for active channels#1322
rustyrussell merged 21 commits into
ElementsProject:masterfrom
rustyrussell:guilt/permconnect

Conversation

@rustyrussell
Copy link
Copy Markdown
Contributor

@rustyrussell rustyrussell commented Apr 4, 2018

Fixes: #1281

@rustyrussell rustyrussell requested review from ZmnSCPxj and cdecker April 4, 2018 03:35
Copy link
Copy Markdown
Contributor

@ZmnSCPxj ZmnSCPxj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly OK, but this uses the same mechanism that connect command does.

Should we worry about the case:

  1. An "important" peer with a channel disappears forever.
  2. gossipd goes on a loop attempting to retry that peer forever.
  3. User (or automated pilot) attempts a connect command.

It seems that 3 above will never return, as we never send a gossip_peer_connection_failed to trigger the connect command to return with failure.

An alternative is for connect JSON-RPC to return with a timeout separately from the timeout in gossipd.

This is worrisome as naive implementation of pilot program might just connect-fundchannel without bothering to check listpeers.

Comment thread lightningd/closing_control.c Outdated

static void peer_closing_complete(struct channel *channel, const u8 *msg)
{
u8 *msg;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

msg is parameter of this function and is also a local variable.

@cdecker
Copy link
Copy Markdown
Member

cdecker commented Apr 4, 2018

Other than the msg param being shadowed, this is good to go :-)

Indeed clang seems to be a bit more pedantic in this case:

lightningd/closing_control.c:98:6: error: redefinition of 'msg' with a different type: 'u8 *' (aka 'unsigned char *') vs 'const u8 *'
      (aka 'const unsigned char *')
        u8 *msg;
            ^
lightningd/closing_control.c:96:70: note: previous definition is here
static void peer_closing_complete(struct channel *channel, const u8 *msg)
                                                                     ^
lightningd/closing_control.c:113:8: error: implicit declaration of function 'towire_gossipctl_peer_important' is invalid in C99
      [-Werror,-Wimplicit-function-declaration]
        msg = towire_gossipctl_peer_important(NULL, &channel->peer->id, false);
              ^
lightningd/closing_control.c:113:6: error: incompatible integer to pointer conversion assigning to 'const u8 *'
      (aka 'const unsigned char *') from 'int' [-Werror,-Wint-conversion]
        msg = towire_gossipctl_peer_important(NULL, &channel->peer->id, false);
            ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3 errors generated.

@rustyrussell
Copy link
Copy Markdown
Contributor Author

Should we worry about the case:

  1. An "important" peer with a channel disappears forever.
  2. gossipd goes on a loop attempting to retry that peer forever.
  3. User (or automated pilot) attempts a connect command.

User will get an error:

if (find_reaching(daemon, id)) {
	/* FIXME: Perhaps kick timer in this case? */
	status_trace("try_reach_peer: already trying to reach %s",
		     type_to_string(tmpctx, struct pubkey, id));
	return false;
}

Perhaps we should probably stop the multiple connect attempts now for manual connects, which would allow us to fail faster.

But generally, I think trying forever is correct in this case: if we have an HTLC we'll eventually force close, and when we implement the json_close timeout logic that would do the same thing.

@rustyrussell
Copy link
Copy Markdown
Contributor Author

Fixed shadowed variable, missing #include.

@ZmnSCPxj
Copy link
Copy Markdown
Contributor

ZmnSCPxj commented Apr 5, 2018

The code quoted is called by reach_peer, which only sends a message to masterd if the quoted code returns true. If it returns false, nothing is done and the connect command stalls.

@ZmnSCPxj
Copy link
Copy Markdown
Contributor

ZmnSCPxj commented Apr 5, 2018

Note that kicking the timer in the case "peer is important, peer is unavailable, user issued connect command to peer" is probably a Good Thing. Consider the case where the peer becomes unavailable for a while, thus triggering the exponential backoff until the 5-minute limit. Then the user learns that the peer is now available via some out-of-band method (phone call, IRC message, smoke signals, telepathy, psychohistorical trends of society). It would be better if the user at least has the ability to trigger connection with that peer immediately, than wait for the next 5-minute cycle.

So I propose that we split the reaching system into two parts:

  1. A subsystem that makes a single attempt to try, and then trigger everyone interested in connecting to that peer, informing them of success or failure. It sends gossip_peer_connected gossip_peer_already_connected and gossip_peer_connection_failed to the master.
  2. Another sub-system that handles the reconnect to "important" peers with exponential backoff limited to 5 minutes. After a timeout (struct oneshot instance) is triggered, it simply frees the timeout object and triggers the single-connect-try subsystem for the important peer if it is not connected. On any failure connection to an "important" peer it clears any timeout it has for that peer and creates a new timeout object.

A connect command would then simply trigger the single-connect-try subsystem, possibly doing its own retries independently of the auto-reconnect subsystem.

@rustyrussell
Copy link
Copy Markdown
Contributor Author

Good point. I agree, let's make connect a one-shot, and have a separate system for important peers.

Comment thread gossipd/gossip.c Outdated
/* FIXME: Exponential backoff! */
status_trace("...will try again in %u seconds", 5);
/* If important_id freed, this will be removed too */
new_reltimer(&reach->daemon->timers, imp,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we not keep the returned struct oneshot in the struct important_peerid? And before making a new timer, we should free any existing timers? Otherwise connect commands may cause multiple overlapping timers to exist (I think?).

/* Kick the timer if present, or start a new one if not present. */
tal_free(imp->timer);
imp->timer = new_reltimer(&reach->daemon->timers, imp, time_from_sec(5), retry_important, imp);

The retry_important function also needs to clear this timer:

imp->timer = tal_free(imp->timer);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, we will end up with one timer per connect for every important peer. Will fix, though not here.

@rustyrussell
Copy link
Copy Markdown
Contributor Author

OK, fixed python whitespace and invalid assertion.

And added a commit which prevents multiple timers running for multiple connect commands to the same (important) peer; thanks @ZmnSCPxj !

Comment thread gossipd/gossip.c
/* If this is an important peer, free any outstanding timer */
imp = important_peerid_map_get(&daemon->important_peerids, &id);
if (imp)
imp->reconnect_timer = tal_free(imp->reconnect_timer);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first glance imp->reconnect_timer seems not initialized when new important_peerid is created. However important_peerid is created only one place, and, it calls retry_important immediately, which sets imp->reconnect_timer to NULL, so no issue here.

@ZmnSCPxj
Copy link
Copy Markdown
Contributor

ZmnSCPxj commented Apr 7, 2018

ACK f9f38b1

@ZmnSCPxj
Copy link
Copy Markdown
Contributor

ZmnSCPxj commented Apr 7, 2018

It is failing in CI, but I am uncertain why; the CI error pertains to openingd, not gossipd...? It seems the funder openingd died because the fundee openingd disconnected, while the fundee can see its own openingd disconnected from itself but did not die. Some edge case of connecting? Perhaps a reconnect try is still spuriously triggering?

@rustyrussell rustyrussell changed the title Maintain permanent connections for active channels WIP: Maintain permanent connections for active channels Apr 9, 2018
@rustyrussell
Copy link
Copy Markdown
Contributor Author

Fixing that turned out to be more surgery than I anticipated! Added WIP since I'm pushing an update to check CI...

@rustyrussell rustyrussell force-pushed the guilt/permconnect branch 4 times, most recently from d7df56d to 0d8dffd Compare April 10, 2018 07:17
@rustyrussell
Copy link
Copy Markdown
Contributor Author

The fast reconnect seems to be triggering a heap of fun bugs. Will spend more time diagnosing tomorrow.

@rustyrussell rustyrussell force-pushed the guilt/permconnect branch 4 times, most recently from f4460ca to 6eccc3f Compare April 12, 2018 01:37
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
These don't have a maximum number of reconnect attempts, and ensure
that we try to reconnect when the peer dies.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…place.

And on channel_fail_permanent and closing (the two places we drop to
chain), we tell gossipd it's no longer important.

Fixes: ElementsProject#1316
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rather than using a flag in reaching/peer; we make it self-contained
as the next patch puts it straight into a timer callback.

Also remove unused 'succeeded' field from struct peer.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
} else {
command_fail(c->cmd, "%s", err);
/* We can have multiple connect commands: complete them all */
while ((c = find_connect(ld, &id)) != NULL) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

find_connect loops over the connect commands, and this loop also loops over find_connect. More efficient to just use list_safe_for_each and resolve each matching command. Acceptable for now though, we really need this merged.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, it's N^2 but as you say, N small and we don't really care, so simplicity wins.

@rustyrussell
Copy link
Copy Markdown
Contributor Author

Travis seems confused, didn't start? Will rebase to fix conflict...

@rustyrussell rustyrussell force-pushed the guilt/permconnect branch 2 times, most recently from 76ea288 to c172384 Compare April 26, 2018 04:48
@rustyrussell
Copy link
Copy Markdown
Contributor Author

This time, fixed conflict properly...

…d don't retry

1. Lifetime of 'struct reaching' now only while we're actively doing connect.
2. Always free after a single attempt: if it's an important peer, retry
   on a timer.
3. Have a single response message to master, rather than relying on
   peer_connected on success and other msgs on failure.
4. If we are actively connecting and we get another command for the same
   id, just increment the counter

The result is much simpler in the master daemon, and much nicer for
reconnection: if they say to connect they get an immediate response,
rather than waiting for 10 retries.  Even if it's an important peer,
it fires off another reconnect attempt, unless it's actively
connecting now.

This removes exponential backoff: that's restored in next patch.  It
also doesn't handle multiple addresses for a single peer.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…tion.

It only lasts until the next block, but it's weird.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This was sitting in my gossip-enchancement patch queue, but it simplifies
this set too, so I moved it here).

In 9471196 we added an explicit gossip_index so when gossipd gets
peers back from other daemons, it knows what gossip it has sent (since
gossipd can send gossip after the other daemon is already complete).

This solution is insufficient for the more general case where gossipd
wants to send other messages reliably, so replace it with the other
solution: have gossipd drain the "gossip fd" which the daemon returns.

This turns out to be quite simple, and is probably how I should have
done it originally :(

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This means that openingd and closingd now forward our gossip.  But the real
reason we want to do this is that it gives an easy way for gossipd to kill
any active daemon, by closing its fd: previously closingd and openingd didn't
read the fd, so tended not to notice.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This comes in useful for the next patch.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Currently we intuit it from the fd being closed, but that may happen out
of order with when the master thinks it's dead.

So now if the gossip fd closes we just ignore it, and we'll get a
notification from the master when the peer is disconnected.

The notification is slightly ugly in that we have to disable it for
a channel when we manually hand the channel back to gossipd.

Note: as stands, this is racy with reconnects.  See the next patch.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
When we get a reconnection, kill the current remote peer, and wait for the
master to tell us it's dead.  Then we hand it the new peer.

Previously, we would end up with gossipd holding multiple peers, and
the logging was really hard to interpret; I'm not completely convinced
that we did the right thing when one terminated, either.

Note that this now means we can have peers with neither ->local nor ->remote
populated, so we check that more carefully.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We start at 1 second, back off to 5 minutes.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If a channel is active (ie. not onchaind) and has an owner, this should
be equivalent.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
At least say whether we failed to connect at all, or failed cryptographic
handshake, or failed reading/writing init messages.

The errno can be "Operation now in progress" if the other end closes the
socket on us: this happens when we handshake with the wrong key and it
hangs up on us.  Fixing this would require work on ccan/io though.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
When a connect fails, if it's an important peer, we set a timer.  If
we have a manual connect command, this means we do this again, leading
to another timer.

For a manual command, free any existing timer; the normal fail logic
will start another if necessary.

Reported-by: @ZmnSCPxj
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If channeld dies for some reason (eg, reconnect) and we didn't yet announce
the channel, we can miss doing so.  This is unusual, because if lightningd
restarts it rearms the callback which gives us funding_locked, so it only
happens if just channel dies before sending the announcement message.

This problem applies to both temporary announcement (for gossipd) and
the real one.  For the temporary one, simply re-send on startup, and
remote the error msg gossipd gives if it sees a second one.  For the
real one, we need a flag to tell us the depth is sufficient; the peer
will ignore re-sends anyway.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…o_chain.

Reported-by: @ZmnSCPxj
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Old gossip is rarely interesting.

Reported-by: @ZmnSCPxj
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…tningd.

Christian points out that this is the pattern used elsewhere, for example.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
@rustyrussell rustyrussell merged commit 83e8475 into ElementsProject:master Apr 26, 2018
@rustyrussell
Copy link
Copy Markdown
Contributor Author

Holiday with kids means less time to code :)

Or, put another way: parenting is my chosen career, which demanded a full week on-site. Thus my coding hobby suffered a small setback.

Thanks for your patience, everyone!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants