Reapply #8644 #9242
Conversation
|
6065746 to
27144ba
|
Waiting to push fix commit until CI completes for the reapplication. |
7a40c4a to
1e7b192
|
Looks like there are still a couple of itests failing. Will keep working on this next week. |
|
The error message |
|
This looks relevant, re some of the errors I see in the latest CI run: https://stackoverflow.com/a/42303225 |
|
Perhaps part of the issue is with the
Based on the SO link above, we might also be lacking some needed indexes. |
|
With closing the channel and a couple of other tests, I'm seeing logs similar to: when I reproduce locally, as well as in the CI logs. I'm going to pull on that thread first... On the test config side, also seeing these: I think the first issue above is with the code, the second is a config issue, and the other config issue in my comment above are the three major failures still happening. I think the |
This looks like a case where we |
Yep, looking into why that isn't caught by the panic/recover mechanism. |
|
It was actually a lack of error checking in |
1e67a84 to
899ae59
|
Looks better as far as the errors on closing channels. Will keep working tomorrow to eliminate the other errors. |
Hmm, so we don't have great visibility into how much memory these CI machines have. Perhaps we need to modify the connection settings to reduce the number of active connections, and also tune params like @djkazic has been working on a postgres+lnd tuning/perf guide, that I think we can eventually check directly into lnd. |
|
This is also very funky (lnd/kvdb/sqlbase/readwrite_bucket.go, lines 336 to 363 in e3cc4d7): we do two queries just to delete, a select to see if it exists and then the delete, instead of just trying to delete. Stepping back a minute: perhaps the issue is with this flawed KV abstraction we have. Perhaps we should just re-create a better hierarchical KV table from scratch. We use |
|
Here's another instance of duplicated work (lnd/kvdb/sqlbase/readwrite_bucket.go, lines 149 to 187 in e3cc4d7): we select to see if it exists, then potentially do the insert again. Instead, we can just do an |
|
I think the way the sequence is implemented may also be problematic: we have the sequence field directly in the table, which means table locks may need to be held. The sequence gets incremented a lot for stuff like payments or invoices. We may be able to instead split that out into another table that can be updated independently of the main table (lnd/kvdb/sqlbase/readwrite_bucket.go, lines 412 to 437 in e3cc4d7). |
|
I've been able to reduce (but not fully eliminate) the I've also tried treating these errors and In addition, I've found one more place where we get the I pushed these changes above for discussion. My next step is to try to reduce the number of conflicts based on @Roasbeef's suggestions above. I'm going on vacation for the rest of the week until next Tuesday, so will keep working on this then. |
|
I think treating the OOM errors as serialization errors ended up being a mistake. Going to take that out and push when this run is done. In addition, I'm trying doubling the |
|
Thanks! I'll fix that today and rebase again. |
8e7a3b3 to
a732c1c
|
I've added that to the list of cases where we can retry. This happens when we try to commit after a transaction is already aborted due to a serialization error and we weren't able to catch it. |
a732c1c to
5b473b9
This reverts commit 67419a7.
To make this itest work reliably with multiple parallel SQL transactions, we need to count both the settle and final HTLC events. Otherwise, sometimes the final events from earlier forwards are counted before the forward events from later forwards, causing a miscount of the settle events. If we expect both the settle and final event for each forward, we don't miscount.
5b473b9 to
0c66389
|
Rebased and added logging for |
yyforyongyu
left a comment
LGTM👍 Thanks for all the investigations and fixes!
|
|
||
| var ( | ||
| mu sync.Mutex | ||
| called int |
nit: could use atomic.Uint32 instead
I used a mutex because the underlying TimeScheduler already provides a facility to do so, but I can pass a nil in its place and use an atomic if that's what people prefer? I don't have a strong opinion.
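For comparison, the atomic variant under discussion might look something like the sketch below. It is a standalone illustration with a hypothetical `countCalls` helper, not the test in this PR, and it does not model the TimeScheduler's optional locker. It assumes Go 1.19 or later for the typed `atomic.Uint32`.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countCalls starts n goroutines that each record one call using an atomic
// counter instead of a mutex-guarded int.
func countCalls(n int) uint32 {
	var (
		called atomic.Uint32
		wg     sync.WaitGroup
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			called.Add(1)
		}()
	}
	wg.Wait()
	return called.Load()
}

func main() {
	fmt.Println(countCalls(100)) // prints 100
}
```

Both approaches are correct for a simple counter; the atomic only drops the explicit lock and unlock around the increment.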
Roasbeef
left a comment
LGTM 🦆
Well done re your diligence and tenacity with this PR saga @aakselrod! I think we can leave some of those other SQL statement/schema level optimizations for another time.
Only remaining question is what should we do about merge order here? We have 2 modules that were updated, and I think 3 replaces in this PR. Should we just merge as is, then make another PR to remove the replaces after a tag? Or do incremental merges and tags, with this as the final merge?
| replace github.com/gogo/protobuf => github.com/gogo/protobuf v1.3.2 | ||
|
|
||
| // Use local kvdb package until new version is tagged. | ||
| replace github.com/lightningnetwork/lnd/kvdb => ./kvdb |
Should we merge the changes to kvdb, then tag, then merge this PR?
| replace github.com/ulikunitz/xz => github.com/ulikunitz/xz v0.5.11 | ||
|
|
||
| // Use local sqldb package until new version is tagged. | ||
| replace github.com/lightningnetwork/lnd/sqldb => ../sqldb |
Leaving comment so we remember the replace is here.
|
@Roasbeef Thanks! I think the merge order is up to you and I'm happy to do it either way. My input is that I prefer to merge the whole PR and then tag/remove the replaces, because I don't like merging code changes that are hidden until they're activated by tag. This way, the entire code change goes in together, its total functionality is obvious, and the tagging/replace removal after the fact won't change the underlying code at all. This also helps avoid making any mistakes in refactoring. But if you prefer, I'm still happy to take out the replaces and refactor to 2-3 PRs, where the code changes for the first 1-2 are hidden until tagged. I see the advantage of not messing around with the |
0c66389 to
c9d217b
|
Made one quick change to release notes (added my name) in case you decide to merge this PR as-is. Sorry! |
|
Think we can merge this and tag, then I will clean up #9367 to remove the replaces. |
Change Description
Fix #9229 by reapplying #8644 and
- … batch package
- … batch requests into their own transactions for postgres db backend to reduce serialization errors
- … channeldb package
- treating "current transaction is aborted" errors as serialization errors in case we hit a serialization error and ignore it, and get this error in a subsequent call to postgres
- … db-instance postgres flags in Makefile per @djkazic's recommendations
- setting the maxconnections parameter for postgres DBs to 20 instead of 50 by default
Steps to Test
See the failing itests prior to the fix, and the passing itests after the fix.