KAFKA-10166: always write checkpoint before closing an (initialized) task #8926
This was another "sort-of bug": if we hit an exception in handleRevocation, we wouldn't finish committing the active tasks, so commitNeeded could still be true. But of course, if we had hit an exception earlier, we would have thrown it up to ConsumerCoordinator, which only saves the first exception, so this didn't really do anything.
We can actually simplify the standby task shutdown a LOT
postCommit before closing a task
guozhangwang
left a comment
I made a pass over the code, overall it LGTM.
While working on another PR I realized that the stateMgr.flush actually does not need to be in prepareCommit, and postCommit is sufficient; and in that case we can just optionally call postCommit upon each commit. Anyways, just a quick FYI for something that's out of this scope.
If we are not committing these tasks, should we call their postCommit?
@ableegoldman please lmk when you want to trigger jenkins builds on this PR.
retest this
test this
Test this please
Ok to test
Retest this please
Oh, yeah... the magic touch ;)
Seems like all the Topology testDriver tests failed, but I got a green build running locally. Do they not run with
They are included in streams:test. Maybe try to rebase the branch and see if there are any missing commits?
424f7e1 to 4125b36
Test this please
Test this please
Java 8 and 14 builds passed, Java 11 build failed with...zero failures?
…task (#8926)

This should address at least some of the excessive TaskCorruptedExceptions we've been seeing lately. Basically, at the moment we only commit tasks if commitNeeded is true -- this seems obvious by definition. But the problem is we do some essential cleanup in postCommit that should always be done before a task is closed:
* clear the PartitionGroup
* write the checkpoint

The second is actually fine to skip when commitNeeded = false with ALOS, as we will have already written a checkpoint during the last commit. But for EOS, we only write the checkpoint before a close -- so even if there is no new pending data since the last commit, we have to write the current offsets. If we don't, the task will be assumed dirty and we will run into our friend the TaskCorruptedException during (re)initialization.

To fix this, we should just always call prepareCommit and postCommit at the TaskManager level. Within the task, it can decide whether or not to actually do something in those methods based on commitNeeded.

One subtle issue is that we still need to avoid checkpointing a task that was still in CREATED, to avoid potentially overwriting an existing checkpoint with uninitialized empty offsets. Unfortunately we always suspend a task before closing and committing, so we lose the information about whether the task was in CREATED or RUNNING/RESTORING by the time we get to the checkpoint. For this we introduce a special flag to keep track of whether a suspended task should actually be checkpointed or not.

Reviewers: Guozhang Wang <wangguoz@gmail.com>
Cherry-picked to 2.6 since it is a blocker, cc @rhauch
* 'trunk' of github.com:apache/kafka:
  KAFKA-10180: Fix security_config caching in system tests (apache#8917)
  KAFKA-10173: Fix suppress changelog binary schema compatibility (apache#8905)
  KAFKA-10166: always write checkpoint before closing an (initialized) task (apache#8926)
  MINOR: Rename SslTransportLayer.State."NOT_INITALIZED" enum value to "NOT_INITIALIZED"
  MINOR: Update Scala to 2.13.3 (apache#8931)
  KAFKA-9076: support consumer sync across clusters in MM 2.0 (apache#7577)
  MINOR: Remove Diamond and code code Alignment (apache#8107)
  KAFKA-10198: guard against recycling dirty state (apache#8924)
This should address at least some of the excessive TaskCorruptedExceptions we've been seeing lately. Basically, at the moment we only commit tasks if commitNeeded is true -- this seems obvious by definition. But the problem is we do some essential cleanup in postCommit that should always be done before a task is closed:
1. clear the PartitionGroup
2. write the checkpoint

2 is actually fine to skip when commitNeeded = false with ALOS, as we will have already written a checkpoint during the last commit. But for EOS, we only write the checkpoint before a close -- so even if there is no new pending data since the last commit, we have to write the current offsets. If we don't, the task will be assumed dirty and we will run into our friend the TaskCorruptedException during (re)initialization.

To fix this, we should just always call prepareCommit and postCommit at the TaskManager level. Within the task, it can decide whether or not to actually do something in those methods based on commitNeeded.

One subtle issue is that we still need to avoid checkpointing a task that was still in CREATED, to avoid potentially overwriting an existing checkpoint with uninitialized empty offsets. Unfortunately we always suspend a task before closing and committing, so we lose the information about whether the task was in CREATED or RUNNING/RESTORING by the time we get to the checkpoint. For this we introduce a special flag to keep track of whether a suspended task should actually be checkpointed or not.
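
For readers skimming the thread, here is a rough, self-contained sketch of the close path described above. It is not the actual Kafka Streams code: `CheckpointSketch`, `SketchTask`, `checkpointNeeded`, `checkpointFile`, and `closeClean` are hypothetical names, the checkpoint file is modeled as an in-memory map, and the task is assumed to follow the EOS behavior from the description (a checkpoint is written only on the way to a close).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class CheckpointSketch {

    enum State { CREATED, RUNNING, SUSPENDED, CLOSED }

    /** Hypothetical stand-in for a task; not the real StreamTask. */
    static class SketchTask {
        State state = State.CREATED;
        boolean commitNeeded = false;
        // Set on suspend: true only if the task had been initialized.
        boolean checkpointNeeded = false;
        final Map<String, Long> offsets = new HashMap<>();
        // Stands in for the on-disk checkpoint file.
        Map<String, Long> checkpointFile;

        SketchTask(Map<String, Long> existingCheckpoint) {
            this.checkpointFile = new HashMap<>(existingCheckpoint);
        }

        void initialize() { state = State.RUNNING; }

        void process(String partition, long offset) {
            offsets.put(partition, offset);
            commitNeeded = true;
        }

        void suspend() {
            // Record the pre-suspend state before it is lost.
            checkpointNeeded = (state != State.CREATED);
            state = State.SUSPENDED;
        }

        // Reports whether there is anything new to commit.
        boolean prepareCommit() { return commitNeeded; }

        void postCommit() {
            commitNeeded = false;
            // Checkpoint only a suspended task that was initialized.
            if (state == State.SUSPENDED && checkpointNeeded) {
                checkpointFile = new HashMap<>(offsets);
                checkpointNeeded = false;
            }
        }

        void close() { state = State.CLOSED; }
    }

    // Always runs prepareCommit/postCommit, regardless of commitNeeded.
    static void closeClean(SketchTask task) {
        task.suspend();
        task.prepareCommit();
        task.postCommit();
        task.close();
    }

    public static void main(String[] args) {
        SketchTask running = new SketchTask(Collections.<String, Long>emptyMap());
        running.initialize();
        running.process("topic-0", 10L);
        running.prepareCommit();
        running.postCommit(); // mid-run commit: clears commitNeeded, no checkpoint
        closeClean(running);
        System.out.println(running.checkpointFile); // prints {topic-0=10}

        SketchTask created = new SketchTask(Collections.singletonMap("topic-0", 42L));
        closeClean(created); // never initialized: existing checkpoint is kept
        System.out.println(created.checkpointFile); // prints {topic-0=42}
    }
}
```

The first case is the one the description calls out: nothing new to commit at close time, yet the current offsets still get written. The second shows why the flag is needed, since a task that never left CREATED would otherwise overwrite an existing checkpoint with empty offsets.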