MINOR: Ensure LocalLog.flush() is immune to recoveryPoint change by different thread#11814
Merged
junrao merged 1 commit intoapache:trunkfrom Mar 1, 2022
Conversation
Contributor
Author
|
cc @junrao @lbradstreet for review |
junrao
approved these changes
Feb 28, 2022
Contributor
junrao
left a comment
There was a problem hiding this comment.
@kowshik : Thanks for the PR. LGTM. Are the test failures related?
This issue could occur even with flushing on rolled segments. In general, we could have a pool of background threads for flushing rolled log segments. It's possible for the same partition's log to be rolled quickly and flushed by different threads in parallel.
Contributor
Author
|
@junrao Thanks for the review! I checked the test failures, and they look unrelated to this PR. I agree, your suggestion is a good way to simplify the code and it will be a lot more maintainable too. I have opened KAFKA-13701 to track the improvement. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Issue:
Imagine a scenario where two threads T1 and T2 are inside
UnifiedLog.flush()concurrently:KafkaSchedulerthread T1 -> The periodic work callsLogManager.flushDirtyLogs()which in turn callsUnifiedLog.flush(). For example, this can happen due tolog.flush.scheduler.interval.mshere.KafkaSchedulerthread T2 -> AUnifiedLog.flush()call is triggered asynchronously during segment roll here.Supposing if thread T1 advances the recovery point beyond the flush offset of thread T2, then this could trip the check within
LogSegments.values()here for thread T2, when it is called fromLocalLog.flush()here. The exception causes theKafkaSchedulerthread to die, which is not desirable.Fix:
We fix this by ensuring that
LocalLog.flush()is immune to the case where the recoveryPoint advances beyond the flush offset.Tests:
I was able to test this manually by introducing barriers in the code to help simulate the race condition. As such, this is a hard case to write an automated unit test for, so I haven't added a new test case in this PR. So I'm mostly just relying on code review and also ensure there are no regressions in existing tests.