htlcswitch: reliable HTLC tracking#2265
Conversation
cfromknecht
left a comment
There was a problem hiding this comment.
@halseth did an initial pass, great commits!
There was a problem hiding this comment.
use ErrRouterShuttingDown?
There was a problem hiding this comment.
do we need to hold the lock around ClearForTakeoff? this limits it to single writer, and we lose out on all benefits of using Batch
There was a problem hiding this comment.
The reason I did this was to handle the case where ClearForTakeoff is called concurrently with a HTLC for the same payment has gets settled. Without the lock we risk getting ErrPaymentInFlight returned, but before we add the pending payment to the map the HTLC is settled.
I do see the lost benefit of batch... Time to bring out the multimutex? 😁
There was a problem hiding this comment.
Used a multimutex, was the only way to really ensure the CircuitMap and ControlTower would always stay in sync.
There was a problem hiding this comment.
what if this overwrites an existing entry? seems the prior caller would never receive anything on their channels. do we need to support multiple subscribers?
There was a problem hiding this comment.
The thought here was only to handle the restart case, and otherwise make the caller responsible for not sending the HTLC more than once. But I definitely see the advantages of just handling duplicate entries, will see if can be easily handled!
There was a problem hiding this comment.
and otherwise make the caller responsible for not sending the HTLC more than once
Isn't that hard to do if they have to call SendHTLC to see if they've already sent/re-register for ntfns?
Can you describe the scenario we're trying to solve?
There was a problem hiding this comment.
Went back to return results to all callers, so no overwriting would happen!
4b35216 to
8fa632c
Compare
There was a problem hiding this comment.
Am i understanding this right, in that we may reuse paymentID if one was found in ClearForTakeoff?
There was a problem hiding this comment.
Nvm we'd use the new one only if the payment was grounded which I think is okay
There was a problem hiding this comment.
the side-effects of this transaction are not idempotent, which means we may see invalid reads from any of these subcommands. specifically, pid and preimage should be set to zero values each time the function is invoked
There was a problem hiding this comment.
Shouldn't matter, since pid is only used in case of ErrPaymentInFlight || nil and preimage only in case of ErrAlreadyPaid. But probably doesn't hurt to do what you suggest.
| return ErrAlreadyPaid | ||
|
|
||
| default: | ||
| takeoffErr = ErrUnknownPaymentStatus |
There was a problem hiding this comment.
This commit should be reverted, and leave the batch error handling as is. If the function returns an error, then each will be run in isolation (again). This is the intent behind capturing the errors, but allowing the transaction to return nil
https://github.com/etcd-io/bbolt/blob/master/db.go#L787
There was a problem hiding this comment.
Ah, I see TIL! Will revert and maybe add a comment explaining this 👍
There was a problem hiding this comment.
I was kinda puzzled by why it was done this way tbh, probably should have asked before changing it :p
|
Discussed a bit offline with @cfromknecht, have some ideas on how to simplify this quite a bit, and will push an update in a few days. Stay tuned. |
31127d1 to
4aa084d
Compare
If we already know the preimage, do we really want to pay to it? If we allow that, we risk any intermediate node also knowing it, being able to claim the payment before it reaches the receiver. This PR originally allowed this, by keeping a map from |
This is also true, but the new scenario joost is referring to is that we send the payment, then get used as an intermediary, restart, then report the payment succeeding even though we haven't heard back on the exact status of the payment.
Would be nice to find a way to avoid the redundant storage, while also retaining the strictness of the original proposal. |
| @@ -1,4 +1,4 @@ | |||
| package htlcswitch | |||
| package routing | |||
There was a problem hiding this comment.
IMO this code should really stay in the htlcswitch as it's properties are very intertwined with the operation of the circuit map. It is a helper object of the switch, which should be used by w/e packages need to interact with SendHTLC
There was a problem hiding this comment.
Even if it's not used by the switch directly, it is exposed so that users of the htlcswitch package properly adhere to the semantics required by SendHTLC.
|
It seems that what we want is the ability to distinguish between a previous attempt that completed and that we have the preimage in our cache for some other reason. This could be solved as in the original proposal with a persistent map We could reuse the To send a payment, we would get a new unique Sending the corresponding HTLC to the switch, it would commit the circuit, using the unique pid as key. Here we could add a check to the new " When the result comes back, we would tear down the circuit and store the preimage (in case of success) in the same db transaction. The outgoing payment would therefore be considered settled, since the preimage now is found in the bucket. In case the payment fails we could consider deleting the This approach is essentially what I did in an earlier branch (master...halseth:reliable-payments-reused-buckets), except for storing the preimages in their own bucket. That approach was awkward since the switch needed to be aware of the whole lmk what you think. tl;dr: We use the |
9f4c767 to
38a8610
Compare
There was a problem hiding this comment.
I think the commits lookup preimage in cache, ignore ErrDuplicateAdd and use multimutex should be squashed in a single commit.
The first two introduce functionality that doesn't really work reliable until the last is in place too. Checking out the intermediate commits doesn't give useful behaviour.
They are all small too and (at least to me) it would make review easier. To see a set of changes that together does something meaningful.
There was a problem hiding this comment.
Consider moving the control tower up to the router level already in this pr. To finalize the required switch changes for reliable payments.
There was a problem hiding this comment.
Done. Note that there still might be some switch changes left (most notably making SendHTLC async), but I figured this PR was big enough as is now.
There was a problem hiding this comment.
Can the multi mutex be avoided by looking up the preimage after commiting the circuit? If it is present then, the circuit could be teared down again. Or left in half open state, where it will be cleaned up later.
There was a problem hiding this comment.
Very good idea, don!
|
Pasting my own notes used during review here for reference: |
This commit moves the responsibility of generating a unique payment ID from the switch to the router. This will make it easier for the router to keep track of which HTLCs were successfully forwarded onto the network, as it can replay the same HTLCs as long as the paymentIDs are kept. The router is expected to maintain a map from paymentID->HTLC, such that they can be replayed on restart. This also lets the router check the status of a sent payment after a restart, simply by resending it.
This commit introduces a new method waitForPaymentResult, which handles the result of a payment sent on the network. This method encompasses a part of the logic that previously was located in handleLocalDispatch, such as error parsing. While doing this, we make the handleLocalDisaptch method write the result of the payment to the pending payment map, even if no pending payment is lingering, to handle the case where we after a restart would retrieve a payment result, before the pending payment was added to the map.
This commit extracts the logic from the forward-method into SendHTLC itself. This is done such that we later can inspect whether we already have a preimage before routing the packet onto the network.
As a step towards idempotent sends, we ignore any duplicate add message
when forwarding the HTLC. After a restart the caller can resend the
HTLC, and be sure that
1) if the payment is already in flight, we will go straight into
waiting for a result comes back.
2) if this is the first time this HTLC is sent it will be routed
onto the network as before.
NOTE: The ControlTower currently doesn't allow us to re-send the HTLC,
but this will later be allowed.
Give the switch access to the preimage cache, which we'll use to ensure we don't resend HTLCs that have already been settled.
As a step towards reliable tracking of preimages, we lookup the payment hash in the PreimageCache before attempting to route the HTLC onto the network. This lays the foundation for re-sends of the same HTLC, as the caller doesn't have to know about the result of the previous send. If the previous send is still in flight, we will get a DuplicateAdd error and go straight into waiting for the result. If the payment is not in flight and the preimage is not known the HTLC will be routed onto the network. If the payment did already succeed, the preimage will be in the cache, and we'll return early, avoiding routing the payment again. This is important to not forward a payment that already succeeded, since it might lead to loss of funds. NOTE: This assumes that the preimage is ALWAYS added to the PreimageCache before the circuit is torn down. NOTE: The ControlTower currently doesn't allow reapeated sends to the same hash, but this will later be allowed.
Since the only thing needed by the control tower to determine whether an HTLC is in flight is the payment hash, we take that instead of the whole HTLC.
This commit moves the check for in-flight payments to a given payment hash from the Switch to the Router. This is done as a preparation to let the router replay payments in flight after a restart, but currently the behavior is unchanged (since we don't persist active payments across restarts yet).
TestSwitchSendPaymentKnownPreimage checks that the switch immediately returns when we attempt to send a payment where the preimage is already found in the preimage cache. TestSwtichSendDuplicatePayment checks that if a payment is resent after a restart, it will still progress as if it was the first time it was sent. TestSwitchSendDuplicateSettledPayment makes sure the switch handle the case where it attempts to resend an HTLC after a restart, but concurrently a settle for this HTLC comes back. TestSwichSendDuplicateFailedPayment ensures that if we resend a payment that failed, the failure will be pending in case of a re-send.
…UniquePaymentID Documents the now new behaviour and expectations and SendHTLC and adds a test that exercises this behaviour. TestSwitchSendHTLCUniquePaymentID tests that the switch happily forwards the same HTLC as long as the paymentID used is unique.
38a8610 to
aa846b2
Compare
|
After discussion with @joostjager we came to a design that should be both simpler and easier to review. This is done by splitting up the |
|
|
||
| // Set the result packet and signal that to the callers it is ready. | ||
| payment.result = pkt | ||
| close(payment.ready) |
There was a problem hiding this comment.
Should this happen before teardown? To prevent this:
Teardown circuit
CommitCircuit (succeeds)
get or create pending payment in memory (create)
get or create pending payment in memory + signal result
async handoff to link if commit succeeded (pay twice)
wait for pending payment result
6ffe4eb to
e07dc19
Compare
We wait for the circuit to be torn down before deleting the payment from the pending payment map. We do this to avoid a concurrent re-send of this pid being dropped after deleting the payment, causing it to never receive the result.
|
Blocked by #2501 |
|
Chatted with Johan offline, and I think we went down this rabbit hole due to one of my earlier comments that was a bit vague. We discussed a simpler version that hopefully should be easier to implement and also reason about during review as it involves much less shuffling around of the existing control flow and doesn't introduce any new dependancies. |
|
Closing in favor of #2762. |
This PR is a primary step towards fixing #2183
We handle a few problems regarding atomicity of payments within the switch. Since the writing of a payment hash to the
ControlTowerand switchCircuitMapis not atomic, we risked these getting out of sync, and potentially leaving preimages in limbo. This PR attempts to fix a few of these issues by giving thehtlcswitchthe responsibility for storing the preimage a payment succeeds, and immediately returning them if we notice a replay of a payment. We do this by using theOutgoingPaymentbucket within the DB also for in-fligh payments (not only for completed payments as before).We ensure all outgoing payments are given a unique ID, since this is what is used to determine if a payment is already committed to the switch circuit.
To ensure the payment DB and
CircuitMapstay in sync, we acquire a mutex tied to the particularpaymentIDbefore using them.This is also a step towards idempotent payments, since we'll now return a
pendingPaymentas normal even if the HTLC is already sent.TODO