channeldb: acquire mutexes in the same order throughout#8400
Conversation
|
Important Auto Review SkippedAuto reviews are limited to the following labels: llm-review. Please add one of these labels to enable auto reviews. Please check the settings in the CodeRabbit UI or the To trigger a single review, invoke the WalkthroughThe update introduces critical changes to the Changes
Assessment against linked issues
PoemIn the graph where channels flow, Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configration File (
|
ziggie1984
left a comment
There was a problem hiding this comment.
Thank you for this quick fix 🙏
I wonder if these kind of conditions can be detected by the unit-race test of the golang suite ? Are we missing a test or so ?
3f94763 to
c7f1d61
Compare
|
Thanks for the idea @ziggie1984 ! Let me know what you think of the new stress test I have added |
|
oops - race detected in new test. Will fix & comment here once it is fixed |
03cf965 to
1ec1c58
Compare
|
ok - all good now @ziggie1984 |
|
@coderabbitai review |
There was a problem hiding this comment.
Review Status
Actionable comments generated: 2
Configuration used: CodeRabbit UI
Commits
Files that changed from the base of the PR and between ba4021c and 1ec1c58457a574b4409db0fc51c1eedd5bb90c8a.Files selected for processing (3)
- channeldb/graph.go (9 hunks)
- channeldb/graph_test.go (1 hunks)
- docs/release-notes/release-notes-0.18.0.md (1 hunks)
Additional comments: 8
channeldb/graph.go (8)
- 179-181: The comment on lines 179-181 explains the importance of the lock order between
cacheMuand the database lock to prevent deadlocks. This is a critical piece of information for future maintainers.- 1337-1338: The
delChannelEdgeUnsafemethod is called with the correct parameters. Ensure that all calls to this method are made with thecacheMulock already held, as per the method's contract.- 1592-1593: Similar to the previous comment, verify that the
cacheMulock is acquired before callingdelChannelEdgeUnsafe.- 1736-1737: Again, ensure that the
cacheMulock is acquired before this call todelChannelEdgeUnsafe.- 2093-2095: The
cacheMulock is correctly acquired and deferred to be unlocked. This is in line with the new locking strategy to prevent deadlocks.- 2148-2148: The
markEdgeLiveUnsafemethod is called within a transaction. Verify that this method is always called with thecacheMulock held.- 2483-2484: The comment on lines 2483-2484 clearly states the precondition for calling
delChannelEdgeUnsafe. This is good for maintainability and preventing misuse of the method.- 3635-3636: The comment on lines 3635-3636 serves as a clear warning to maintain the correct lock acquisition order when calling
markEdgeLiveUnsafe. This is crucial for avoiding deadlocks.
1ec1c58 to
ae978bc
Compare
|
Basic question: |
The caller must hold the mutex before they call them. An alternative, would be like |
Roasbeef
left a comment
There was a problem hiding this comment.
Looksl ike things are still dead-locking w/ the postgres backend:
panic: test timed out after 3h0m0s
running tests:
TestAllowMoreAttempts (2h59m50s)
TestDecidePaymentStatus (2h59m52s)
TestHTLCsExtraData (2h59m53s)
TestNeedWaitAttempts (2h59m50s)
TestPaymentSetState (2h59m50s)
TestPaymentStatusActions (2h59m52s)
TestRegistrable (2h59m49s)
TestSerializeAndDeserializeRevLog (2h59m50s)
TestStressTestChannelGraphAPI (2h59m55s)
TestStressTestChannelGraphAPI/0 (2h59m49s)
TestStressTestChannelGraphAPI/5 (2h59m49s)
TestStressTestChannelGraphAPI/8 (2h59m49s)
TestStressTestChannelGraphAPI/9 (2h59m49s)
goroutine 149545 [running]:
testing.(*M).startAlarm.func1()
/opt/hostedtoolcache/go/1.21.0/x64/src/testing/testing.go:2259 +0x3b9
Trying to repro locally now... Unlike the other backends, the postgres backend has an additional client-side mutex. Will see if I can get a goroutine dump. |
|
Relevant goroutine dump: https://paste.sh/qavACXtH#5SsXOJ_rcoU0gzqHYmJNHeik |
b8ea46a to
e265e64
Compare
|
Added a final commit to fix this last deadlock @Roasbeef With postgres, since we require an exclusive lock, we cant call |
124b332 to
3a06075
Compare
This commit adds a test that calls many of the ChannelGraph methods concurrently and in a random order. This test demonstrates that a deadlock currently exists in the ChannelGraph since the test does not complete. This is fixed in the next commit.
In this commit, we ensure that the channeldb cacheMu mutex is only ever aquired _before_ acquiring the main DB lock in the cases where the two locks need to be held simultaneously. With this commit, the deadlock demonstrated in the previous commit is now fixed.
In order to emphasise the fact that the ChannelGraph's cacheMu should be acquired before calling the `delChannelEdge` method, we add the `Unsafe` postfix and add a comment to alert readers.
In this commit, the FetchChanInfos ChannelGraph method is updated to take in an optional read transaction for the case where it is called from within another transaction.
3a06075 to
63a3882
Compare
In this commit, we ensure that the channeldb cacheMu mutex is only ever aquired before acquiring the main DB lock in the cases where the two locks need to be held simultaneously.
Fixes #8398
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Documentation