KAFKA-17142: Fix deadlock caused by LogManagerTest#testLogRecoveryMetrics #16614
Merged
This approach introduces a new correctness issue. With this change, it's possible for older epoch entries to overwrite newer epoch entries in the leader epoch file. Consider the following sequence: we take a snapshot of the epoch entries here; a new epoch entry is added and flushed to disk; the scheduler then writes the (now stale) snapshot to disk. This can leave the leader epoch file without all the entries up to the recovery point.

Since the issue is only in the test, I am wondering if we could fix the test directly. For example, perhaps we could introduce a NoOpScheduler and use it in the test, since the test doesn't depend on the leader epoch entries actually being flushed to disk.
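A minimal sketch of the NoOpScheduler idea, assuming a simplified scheduler interface (the real org.apache.kafka.server.util.Scheduler has a larger surface, e.g. startup/shutdown and periodic scheduling; the interface below is illustrative only):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;

// Illustrative stand-in for Kafka's Scheduler interface; the real interface
// has more methods, but this is the only shape the sketch needs.
interface SchedulerSketch {
    Future<?> scheduleOnce(String name, Runnable task);
}

// A scheduler that accepts tasks but never runs them. With this injected into
// the test, no background flush task ever races with (or deadlocks against)
// the locks taken by the test itself.
final class NoOpScheduler implements SchedulerSketch {
    @Override
    public Future<?> scheduleOnce(String name, Runnable task) {
        // Intentionally drop the task: the test doesn't need the leader epoch
        // entries to actually reach disk.
        return CompletableFuture.completedFuture(null);
    }
}
```

The test would then construct its log manager with this scheduler, so testLogRecoveryMetrics exercises recovery without any asynchronous checkpoint work running underneath it.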
This is another good approach.

Sorry for causing a possible correctness issue. @FrankYang0529 and I discussed the approach offline when I noticed the deadlock, and I suggested changing the production code directly. It seems to me this PR does NOT change the execution order, because writeToFileForTruncation does not hold a single lock across both the "snapshot" and the "flush".

Hence, the issue you mentioned can happen even if we revert this PR. For example:

1. writeToFileForTruncation (run by the scheduler) takes a snapshot of the epoch entries in phase 1 (see the comment in the code above).
2. A new epoch entry is added and flushed to disk.
3. writeToFileForTruncation (run by the scheduler) then writes the stale snapshot to disk in phase 2 (see the comment in the code above).

In summary, there are two follow-ups:

1. Rewrite testLogRecoveryMetrics with a NoOpScheduler.
2. Add writeToFileForTruncation back, but without the separate "snapshot" phase, i.e. flush while still holding the lock; see the sketch below.

@junrao WDYT?
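A hedged sketch of what follow-up 2 could look like. The names (EpochEntrySketch, CheckpointSketch, LeaderEpochCacheSketch) are placeholders rather than the real LeaderEpochFileCache types; the point is only that the snapshot and the flush both happen under the lock, so no newer entry can be flushed in between and then be overwritten by a stale snapshot:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Placeholder types standing in for the real storage-module classes.
record EpochEntrySketch(int epoch, long startOffset) {}

interface CheckpointSketch {
    void writeAndFsync(List<EpochEntrySketch> entries);
}

final class LeaderEpochCacheSketch {
    private final TreeMap<Integer, EpochEntrySketch> epochs = new TreeMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final CheckpointSketch checkpoint;

    LeaderEpochCacheSketch(CheckpointSketch checkpoint) {
        this.checkpoint = checkpoint;
    }

    void assign(EpochEntrySketch entry) {
        lock.writeLock().lock();
        try {
            epochs.put(entry.epoch(), entry);
        } finally {
            lock.writeLock().unlock();
        }
    }

    void writeToFileForTruncation() {
        lock.readLock().lock();
        try {
            // Phase 1: snapshot the entries...
            List<EpochEntrySketch> snapshot = new ArrayList<>(epochs.values());
            // Phase 2: ...and flush them before releasing the lock. This is
            // correct (writers are excluded until we finish), but every caller
            // now pays the fsync latency while the lock is held.
            checkpoint.writeAndFsync(snapshot);
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

This is exactly the tradeoff discussed below: correctness is restored, but the fsync moves onto the caller's critical path.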
The suggestion makes sense to me.
@chia7712 Hi, thank you for pointing out that the potential race exists even in the current code.

The follow-ups look good to me.

For follow-up 2, which moves the checkpoint flush inside the lock, one concern is that request-handler/replica-fetcher threads may block on the fsync latency (i.e. threads calling truncateFromStart/EndAsyncFlush would be blocked in the meantime).

However, it might not be a critical performance issue in practice.

Let me consider whether some optimization is possible here, as another follow-up.
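One possible shape for such an optimization, purely as an assumption rather than anything proposed in this thread: version each snapshot under the lock, flush outside the lock, and discard a snapshot that is older than what has already reached disk. Callers then never wait on fsync, yet a stale snapshot cannot overwrite a newer checkpoint:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: a versioned checkpoint gate. The cache bumps `version` under
// its write lock on every mutation; the flush task reads the version together
// with its snapshot under the read lock, then performs the actual I/O outside
// the lock through flushIfNewest().
final class VersionedFlushSketch {
    private final AtomicLong version = new AtomicLong();
    private long lastFlushedVersion = -1; // guarded by this object's monitor

    // Called by mutators while holding the cache's write lock.
    void onMutation() {
        version.incrementAndGet();
    }

    // Called while holding the read lock, at the same time the snapshot is taken.
    long snapshotVersion() {
        return version.get();
    }

    // Called outside the cache lock; writeAndFsync performs the checkpoint I/O.
    synchronized void flushIfNewest(long snapshotVersion, Runnable writeAndFsync) {
        if (snapshotVersion <= lastFlushedVersion) {
            return; // a newer (or identical) snapshot already reached disk
        }
        writeAndFsync.run();
        lastFlushedVersion = snapshotVersion;
    }
}
```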
@chia7712 : Yes, you are right that the overwriting issue was already introduced in #15993. Moving the flush call inside the read lock fixes this issue, but it defeats the original performance optimization in #14242. @ocadaruma : What's your opinion on this?
Hi all, thanks for raising the correctness issue. IMO, we can fix data correctness first, and then improve performance as long as that doesn't break correctness.

I will rewrite testLogRecoveryMetrics with NoOpScheduler first, and then see whether we need to improve LeaderEpochFileCache performance with its own scheduler. Thank you.
@ocadaruma : Thanks for the explanation. Yes, I agree that the async flush still gives us some perf benefits. As for the fix, the two follow-ups suggested by @chia7712 sound reasonable to me. They should probably be done in the same PR?
I assumed that the changes to the production code need more discussion, for example around @ocadaruma's fsync-latency concern above. The two follow-ups are orthogonal, and hence I prefer to fix them separately to avoid unnecessary blocking.

BTW, please feel free to leave more comments on https://issues.apache.org/jira/browse/KAFKA-17167 about the fix.
Hmm, I thought the simple fix you suggested was to do the following. This will bring back the deadlock issue in the test, right?
Yes, it does. However, my point was: if @ocadaruma's comment ("Yeah, could be an issue in some cases (e.g. deleteRecords is called frequently, and/or kafka-schedulers are busy) though.") needs more discussion, we can improve the test before adding writeToFileForTruncation back to production.

At any rate, it seems we all agree on the simple fix for now, so I have filed KAFKA-17166 and KAFKA-17167.