BOLT #2: Add message retransmission sub-system #156
Conversation
c18e4ec to 480beb6
480beb6 to bddc034
As the keys are stored on disk in big-endian order, the sorting isn't necessary here. When one performs an in-order scan, the items will be retrieved in chronological order by the sequence number.
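To illustrate the point, here's a minimal sketch assuming the github.com/boltdb/bolt package; the package name, bucket argument, and helper are hypothetical. A forward cursor scan over big-endian sequence-number keys already yields the records in chronological order, so no extra sort is needed.

package msgstore // hypothetical package, for illustration only

import "github.com/boltdb/bolt"

// fetchInOrder shows why no sort is needed: keys are big-endian sequence
// numbers, so a forward cursor scan returns records in write order.
func fetchInOrder(db *bolt.DB, bucketName []byte) ([][]byte, error) {
	var records [][]byte
	err := db.View(func(tx *bolt.Tx) error {
		bucket := tx.Bucket(bucketName)
		if bucket == nil {
			return nil
		}
		c := bucket.Cursor()
		for k, v := c.First(); k != nil; k, v = c.Next() {
			// Nested buckets (such as an index sub-bucket) show up
			// with a nil value; skip them. Copy real records out,
			// since bolt memory is only valid for the life of the
			// transaction.
			if v == nil {
				continue
			}
			records = append(records, append([]byte(nil), v...))
		}
		return nil
	})
	return records, err
}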
See the comment above about re-working the schema to eliminate this sorting step.
Keeping with the theme of only storing generic LN data with channeldb (the graph, invoices, etc.), I think this entire file should instead be moved to reside in the root lnd directory.
Hmm, what's implemented here currently isn't quite what we discussed offline. As it is now, a sorting step is inserted before retrieving all the messages for a peer, as the messages are stored in distinct buckets.
Instead, what I originally described was this:
- A single top-level bucket that maps: index -> code || msg. Concatenating the message code to the stored data allows the db logic to properly parse the wire message without trial and error or needing an additional index.
- Within this top-level bucket, another bucket would be stored which acts as an index into the top-level bucket. This bucket will be used to locate which messages can be deleted from the log in response to a retrieved ACK message. The mapping for items in this bucket would be: messageCode -> {index_1, index_2, index_3, etc.}. So when receiving a new message, you check for the existence of the message code in this index bucket, then delete all the indexes from the top-level bucket that are returned.
Similarly, when adding a new message to the top-level bucket, another compile-time constant set of mappings needs to be consulted to determine which message ACKs the message being stored. So in addition to storing it in the top-level bucket, you'd also append to the record for the messageCode mappings.
Switching to the schema above eliminates the unnecessary sorting logic and also still retains the message order required to properly perform retransmissions.
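A minimal sketch of that schema, assuming the github.com/boltdb/bolt package; the bucket names, key widths, and helper functions are hypothetical illustrations of the index -> code || msg layout plus the ACK index sub-bucket, not the eventual channeldb implementation.

package msgstore // hypothetical package, for illustration only

import (
	"encoding/binary"

	"github.com/boltdb/bolt"
)

var (
	messageBucket  = []byte("retransmit-messages") // index -> code || msg
	ackIndexBucket = []byte("retransmit-ack-index") // messageCode -> {index_1, index_2, ...}
)

// addMessage appends a wire message to the log under a big-endian sequence
// number, and records that index under the code of the message expected to
// eventually ACK it.
func addMessage(db *bolt.DB, ackCode, code uint16, payload []byte) error {
	return db.Update(func(tx *bolt.Tx) error {
		msgs, err := tx.CreateBucketIfNotExists(messageBucket)
		if err != nil {
			return err
		}
		ackIdx, err := msgs.CreateBucketIfNotExists(ackIndexBucket)
		if err != nil {
			return err
		}

		// The bucket's sequence number keeps writes in chronological
		// order when scanned with a cursor.
		seq, err := msgs.NextSequence()
		if err != nil {
			return err
		}
		var key [8]byte
		binary.BigEndian.PutUint64(key[:], seq)

		// Store code || msg so the record can be parsed without trial
		// and error.
		record := make([]byte, 2+len(payload))
		binary.BigEndian.PutUint16(record[:2], code)
		copy(record[2:], payload)
		if err := msgs.Put(key[:], record); err != nil {
			return err
		}

		// Append this index to the list kept under the ACK'ing code.
		var ackKey [2]byte
		binary.BigEndian.PutUint16(ackKey[:], ackCode)
		indexes := append([]byte(nil), ackIdx.Get(ackKey[:])...)
		indexes = append(indexes, key[:]...)
		return ackIdx.Put(ackKey[:], indexes)
	})
}

// ackMessages performs the unconditional delete: every logged message whose
// index is recorded under the received ACK code is removed from the log.
func ackMessages(db *bolt.DB, ackCode uint16) error {
	return db.Update(func(tx *bolt.Tx) error {
		msgs := tx.Bucket(messageBucket)
		if msgs == nil {
			return nil
		}
		ackIdx := msgs.Bucket(ackIndexBucket)
		if ackIdx == nil {
			return nil
		}

		var ackKey [2]byte
		binary.BigEndian.PutUint16(ackKey[:], ackCode)
		indexes := ackIdx.Get(ackKey[:])
		for i := 0; i+8 <= len(indexes); i += 8 {
			if err := msgs.Delete(indexes[i : i+8]); err != nil {
				return err
			}
		}
		return ackIdx.Delete(ackKey[:])
	})
}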
The current set of tests seems to pass relatively reliably without these added sleeps. Rather than adding additional sleeps, with the topology notification code merged in, we can add hooks into the integration testing framework to properly wait for messages to propagate before attempting to dispatch payments through newly opened channels.
Naming suggestion: MessageCode.
What's the purpose of this method? It doesn't look to be used anywhere within the PR currently.
I am using it during debugging and I thought it might be useful to others, but essentially it just copies the lnd logs from the temp directory.
Naming suggestion: GetUnackedMessages.
Missing spaces between the commas at the end of this sentence.
With the modification to the schema I suggested, I think this method would be changed to something along the lines of an Ack method and instead take a single lnwire.MessageCode.
With the comment above, this method would be simplified a good bit as it would attempt to perform a single unconditional delete from the database.
The mapping here (what gets deleted on receipt of a message) would be moved into the storage layer, as it would need to be consulted each time a message is written.
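A hedged sketch of what that compile-time mapping could look like. The MessageCode type, the constant names, and the ackedUnder map are hypothetical stand-ins (the real codes live in lnwire), and the pairs only loosely mirror the spec excerpt quoted later in this thread.

package msgstore // hypothetical package, for illustration only

// MessageCode is a stand-in for the proposed lnwire.MessageCode type; the
// constants below are illustrative placeholders, not real lnwire values.
type MessageCode uint16

const (
	codeFundingLocked MessageCode = iota
	codeShutdown
	codeRevokeAndAck
	codeClosingSigned
)

// ackedUnder is the compile-time mapping consulted each time a message is
// written: it answers "under which ACK code should this message be indexed?".
// It is not exhaustive; in practice a message may be acknowledged by several
// different codes.
var ackedUnder = map[MessageCode]MessageCode{
	codeFundingLocked: codeRevokeAndAck,
	codeShutdown:      codeClosingSigned,
}

At write time the storage layer would look up ackedUnder for the outgoing message's code and pass the result as the ackCode to something like the addMessage helper sketched above; the Ack method then reduces to a single unconditional delete keyed by the received code (ackMessages above).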
08df3e9 to 31cb691
Roasbeef left a comment
Nice work on the latest iteration!
This PR is getting pretty close; I'm going to move on to some local testing of the functionality while the latest comments are being addressed.
have been found -> has been found
id is a unique slice of bytes identifying a peer. This value is typically a peer's identity public key serialized in compressed format
I rethought the meaning of this field a bit due to the recent changes in the discovery PR. I think it would be better to keep id as something not coupled with the peer at all, but mention that it is usually a compressed pub key.
What changes in the discovery PR? Peers within the network are identified globally by their public keys.
In any case, the comment should be replaced with the first sentence of my suggestion:
id is a unique slice of bytes identifying a peer.
Minor nit: there's an extra new line here.
The test should also assert the deep equality of the message read from disk vs the original message.
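As a sketch of what that assertion could look like, building on the hypothetical addMessage/fetchInOrder helpers sketched earlier in this thread (this is not the PR's actual test):

package msgstore // hypothetical; same package as the sketches above

import (
	"io/ioutil"
	"os"
	"path/filepath"
	"reflect"
	"testing"

	"github.com/boltdb/bolt"
)

// TestMessageRoundTrip sketches the requested assertion: whatever comes back
// from disk must deep-equal the original message.
func TestMessageRoundTrip(t *testing.T) {
	dir, err := ioutil.TempDir("", "msgstore")
	if err != nil {
		t.Fatalf("unable to create temp dir: %v", err)
	}
	defer os.RemoveAll(dir)

	db, err := bolt.Open(filepath.Join(dir, "test.db"), 0600, nil)
	if err != nil {
		t.Fatalf("unable to open db: %v", err)
	}
	defer db.Close()

	// A stand-in for a serialized lnwire message.
	orig := []byte{0xde, 0xad, 0xbe, 0xef}
	if err := addMessage(db, 1, 2, orig); err != nil {
		t.Fatalf("unable to add message: %v", err)
	}

	records, err := fetchInOrder(db, messageBucket)
	if err != nil {
		t.Fatalf("unable to fetch messages: %v", err)
	}
	if len(records) != 1 {
		t.Fatalf("expected 1 record, got %d", len(records))
	}

	// Strip the 2-byte code prefix before comparing against the original.
	if !reflect.DeepEqual(records[0][2:], orig) {
		t.Fatalf("message read from disk doesn't match the original: "+
			"expected %x, got %x", orig, records[0][2:])
	}
}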
If I'm reading the current spec draft correctly, CloseRequest and FundingLocked should be omitted. They're not ACK'd by a RevokeAndAck message.
Hmm, maybe I am missing something, but from the spec:
funding_locked: acknowledged by update_ messages, commitment_signed, revoke_and_ack or shutdown messages.
shutdown: acknowledged by closing_signed or revoke_and_ack
Why was the disconnect removed?
If we're unable to create the peer, then it must be removed from the connmgr's set of pending persistent connections, hence the use of Disconnect here.
Honestly, we need to revisit the current connmgr integration for coherency as it was put together rather quickly in order to get the functionality out the door.
In this case p is nil, which causes a panic if an error occurs at this stage. Maybe we should always return an instance of the peer from the newPeer function? In that case we could restore the previous logic.
Ahh, nice find! In the future, I'd prefer for fixes like this to either be included in the PR as a distinct commit, or entirely within its own PR. Otherwise, it's easy to miss amidst all the other changes within the PR.
Sure, I will create an additional PR for that.
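For reference, a toy, self-contained illustration of the nil-pointer hazard discussed in this exchange; the types and signatures are deliberately simplified stand-ins, not lnd's real newPeer or peer.Disconnect.

package main

import (
	"errors"
	"fmt"
)

// peer and newPeer are toy stand-ins, only to show the hazard.
type peer struct{}

func (p *peer) Disconnect() { fmt.Println("peer disconnected") }

func newPeer(fail bool) (*peer, error) {
	if fail {
		return nil, errors.New("unable to create peer")
	}
	return &peer{}, nil
}

func main() {
	p, err := newPeer(true)
	if err != nil {
		// On this path p is nil, so calling p.Disconnect() would panic
		// with a nil pointer dereference. Either perform the cleanup
		// without touching p, or have newPeer return a usable peer
		// even on error, as suggested above.
		fmt.Println("error:", err)
		return
	}
	p.Disconnect()
}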
Similar comment here about reverting this line diff.
Similar comment here about reverting this line diff.
Nice catch! I'd missed this during my initial review.
c37d1ba to 19239c7
Coverage increased (+0.09%) to 67.801% when pulling 19239c7541bdc368fe75652bd5b75b3b6dd7199c on AndrewSamokhvalov:retransmission_subsystem into d723aad on lightningnetwork:master.
In this commit, lnwire message header encode/decode tests were added. Without them, a newcomer programmer might change the type inside the message header and spend hours debugging the integration tests trying to understand why their node can't start and interact properly.
Issue: lightningnetwork#137 In this commit, the retransmission subsystem and boltdb message storage were added. The retransmission subsystem is described in detail in the BOLT #2 (Message Retransmission) section. This subsystem keeps records of all messages that were sent to the other peer, waits for an ACK message to be received from the other side, and then removes all acked messages from the storage.
Issue: lightningnetwork#137 In this commit, the retransmission subsystem was integrated into lnd; now, upon peer reconnection, we fetch all messages from the message storage that were not acked and send them again to the remote side.
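The flow those commit messages describe, reduced to a minimal in-memory sketch. The PR itself persists this log to boltdb and uses lnwire types; the names below are hypothetical and this is not the PR's retransmitter type.

package msgstore // hypothetical package, for illustration only

import "sync"

// wireMsg stands in for lnwire.Message in this sketch.
type wireMsg interface {
	Command() uint16
}

// loggedMsg pairs a sent message with the code expected to acknowledge it.
type loggedMsg struct {
	ackCode uint16
	msg     wireMsg
}

// resendLog records every message sent to the peer, drops entries once
// acked, and replays whatever is left after a reconnection.
type resendLog struct {
	mu  sync.Mutex
	log []loggedMsg // append-only so resend order matches send order
}

// MessageSent records an outgoing message under the code expected to ack it.
func (l *resendLog) MessageSent(ackCode uint16, msg wireMsg) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.log = append(l.log, loggedMsg{ackCode: ackCode, msg: msg})
}

// Ack removes every logged message acknowledged by the received code.
func (l *resendLog) Ack(code uint16) {
	l.mu.Lock()
	defer l.mu.Unlock()
	kept := l.log[:0]
	for _, entry := range l.log {
		if entry.ackCode != code {
			kept = append(kept, entry)
		}
	}
	l.log = kept
}

// Unacked returns, in original send order, the messages to resend after a
// reconnection.
func (l *resendLog) Unacked() []wireMsg {
	l.mu.Lock()
	defer l.mu.Unlock()
	msgs := make([]wireMsg, 0, len(l.log))
	for _, entry := range l.log {
		msgs = append(msgs, entry.msg)
	}
	return msgs
}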
19239c7 to f8b2624
I have added the
I've started to test this PR locally and noticed that the way it currently goes about implementing the retransmission is missing a key feature. The description in the original issue stated that the retransmission sub-system should actually sit between the
Such behavior would allow sub-systems like the
As an example, let's say we're nearing the completion of a funding workflow. The ultimate block finalizing the channel arrives, the
This PR is 80% of the way there, functionality wise. To get that last 20%, the following behavior needs to be implemented:
Here's an alternative to what's described above:
Decided that what I've described w.r.t the
| "to the peer(%v)", len(messages), p) | ||
|
|
||
| for _, message := range messages { | ||
| // Sending over sendToPeer will cause block because of |
Can you insert a logging message here that just logs the MessageCode itself? Thanks!
func (rt *retransmitter) Ack(msg lnwire.Message) error {
	switch msg.Command() {

	case lnwire.CmdSingleFundingResponse:
For now, all funding messages other than FundingLocked should be omitted from retransmission. While testing locally I just hit a bug that causes the funding manager to deadlock if lnd is restarted mid-way through a channel opening that requires more than one confirmation.
Atm, the spec is incorrect. No funding messages should be retransmitted at all until the point at which either side is committed to a funding transaction.
		)
	case lnwire.CmdCloseComplete:
		return rt.remove(
			lnwire.CmdCloseRequest,
Atm CloseComplete is never sent within the daemon. Therefore, this entry should be removed. Otherwise, the node will keep sending the same CloseRequest message upon each restart, and the responding node will simply ignore the message as the channel has already been closed.
	lnwire.CmdCloseRequest:
		return rt.remove(
			lnwire.CmdFundingLocked,
			lnwire.CmdRevokeAndAck,
For now, all instances of RevokeAndAck should be omitted from retransmission. As it stands, because we still use an "initial revocation window" of 1, peer restarts will cause lnd to send the initial RevokeAndAck twice with the same revocation values. This'll cause the channel to fail down the line, as a state transition will re-use the same preimage rather than going to the next leaf node in the tree.
20:52:08 2017-03-16 [INF] PEER: retransmission subsystem resends 1 messages to the peer(020dbf0df13b994e562c9ac52098b86afd2b2099463370fc78124ab3c88ef87a6b@127.0.0.1:10019)
20:52:08 2017-03-16 [INF] CRTR: Synchronizing channel graph with 020dbf0df13b994e562c9ac52098b86afd2b2099463370fc78124ab3c88ef87a6b
20:52:08 2017-03-16 [TRC] PEER: writeMessage to 020dbf0df13b994e562c9ac52098b86afd2b2099463370fc78124ab3c88ef87a6b@127.0.0.1:10019: (*lnwire.RevokeAndAck)(0xc42053f2d0)({
ChannelPoint: (wire.OutPoint) 397d9ba617b1b2d81a8249e3de3749f4dd2efd792ce34b758ed97326b68bf0b9:0,
Revocation: ([32]uint8) (len=32 cap=32) {
00000000 6a 00 62 86 31 55 b1 4d 8f 20 e6 53 f2 8c f7 78 |j.b.1U.M. .S...x|
00000010 1c b2 72 d3 07 86 57 2d 5d bc 55 4f b4 a4 c8 a1 |..r...W-].UO....|
},
NextRevocationKey: (*btcec.PublicKey)(0xc420318160)({
Curve: (elliptic.Curve) <nil>,
X: (*big.Int)(0xc420318180)(105223291483128089908537415774962877536378315872169081183677829390620736225739),
Y: (*big.Int)(0xc4203181a0)(5542066621571236556856056711647061449395836182811543325992215950193357130663)
}),
NextRevocationHash: ([32]uint8) (len=32 cap=32) {
00000000 46 b5 6c 1c 0e 0d 50 d4 a1 3c 97 c6 8c 8e 5d 6e |F.l...P..<....]n|
00000010 15 5b 62 f1 de 12 ec af 4a 11 a2 21 b2 4e a1 89 |.[b.....J..!.N..|
}
})
......
20:52:08 2017-03-16 [TRC] PEER: writeMessage to 020dbf0df13b994e562c9ac52098b86afd2b2099463370fc78124ab3c88ef87a6b@127.0.0.1:10019: (*lnwire.RevokeAndAck)(0xc420559f80)({
ChannelPoint: (wire.OutPoint) 397d9ba617b1b2d81a8249e3de3749f4dd2efd792ce34b758ed97326b68bf0b9:0,
Revocation: ([32]uint8) (len=32 cap=32) {
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
},
NextRevocationKey: (*btcec.PublicKey)(0xc42053bec0)({
Curve: (elliptic.Curve) <nil>,
X: (*big.Int)(0xc42053be40)(105223291483128089908537415774962877536378315872169081183677829390620736225739),
Y: (*big.Int)(0xc42053be60)(5542066621571236556856056711647061449395836182811543325992215950193357130663)
}),
NextRevocationHash: ([32]uint8) (len=32 cap=32) {
00000000 46 b5 6c 1c 0e 0d 50 d4 a1 3c 97 c6 8c 8e 5d 6e |F.l...P..<....]n|
00000010 15 5b 62 f1 de 12 ec af 4a 11 a2 21 b2 4e a1 89 |.[b.....J..!.N..|
}
})
In this state, the state machines of both channels will actually enter a negative feedback cycle, continually failing as the wrong revocation message is being sent over and over again. As a result, the channels are no longer usable after a single restart.
// and may need to be re-established from time to time and reconnection
// introduces doubt as to what has been received such logic is needed to be sure
// that peers are in consistent state in terms of message communication.
type retransmitter struct {
just noticed the filename has a typo, should probably be retransmission.go
Closing this as it has been replaced by #231. We might possibly integrate some sections of this into the project at a later point though.
Issue: #137
I know that we discussed that funding manager messages shouldn't be included in retransmission, but after some thinking about this I decided to include them anyway, because the specification is not about us, but rather about convergence with other lightning network clients. If the funding messages are included in the specification, it means that other clients might be built with this logic in mind.
Instead, I believe that we should tolerantly ignore funding messages that have already been processed, or that have expired.