KAFKA-12523: handle TaskCorruption/TimeoutException during handleCorruption and handleRevocation by ableegoldman · Pull Request #10407 · apache/kafka

ableegoldman · 2021-03-26T02:10:44Z

Clean up handling of TaskCorruptedException in

handleRevocation: if we try to commit and get a TaskCorrupted, we should just immediately clean up the affected tasks instead of bubbling the TaskCorruptedException up through poll and trying to deal with any corrupted tasks which have since been revoked

handleCorrupted: if we get a TaskCorrupted when trying to commit the clean tasks before closing and reviving the corrupted ones, we should just include these tasks in the subsequent closeAndRevive

Left some things as followup work to keep the changes minimal and low-risk for the 2.8 release. If it looks good I'll file tickets for any TODOs and add the ticket # in the TODO before merging

Should be cherrypicked to 2.8 @vvcephei

…eRevocation

ableegoldman · 2021-03-26T02:11:42Z

Ready for review @guozhangwang @vvcephei @mjsax @cadonna

mjsax · 2021-03-28T02:29:20Z

+        corruptedActive.setChangelogOffsets(singletonMap(t1p0, 0L));
+        taskManager.handleCorruption(singleton(taskId00));
+
+        assertThat(corruptedActive.commitPrepared, is(true));


The corrupted tasks should be revived? Those this case, should this flag be reset?

Seems like we don't reset the commitPrepared during revive, ~~good point. I guess we should reset all of those in the StateMachineTask#revive~~
edit: actually I think for commitPrepared at least we should not reset it, since we just use this to verify that we did, indeed, prepare a commit. But commitNeeded should probably be cleared in StateMachineTask#revive (and ultimately in StateMachineTask#close but I don't want to mess with this in this PR since it's used very heavily in these tests, see below)

Hm...seems like we actually may not even clear commitNeeded (or the other commit-related flags) in the actual StreamTask's close or revive. We need to be clearing those in revive or closeDirty (closeClean would have cleared during postCommit) This has probably been a long lurking bug, although a minor one

Side note: seems weird that StateMachineTask has its own commitNeeded field rather than making the one in StreamTask protected and use it for active tasks (edit: there's actually a valid-ish reason for this, the StateMachineTask just mocks the behavior of most methods and rarely calls the super's method, so even if we used the same commitNeeded flag across the field we'd still have to remember to manually set/clear it in the same way in StateMachineTask any time we do so in Abstract/StreamTask.).

Looks like we use commitNeeded in kind of a risky way in the tests, eg to indirectly indicate that it was closed clean, or infer that we successfully committed, etc Cleaner/safer to not reuse this variable to mean so many different things and just introduce a closedClean, commitSuccessful, etc wherever needed...
But I don't want to mess with it in this PR so I'll just file a ticket to clean this up later if that makes sense.

mjsax · 2021-03-28T02:31:55Z

    }

+    @Test
+    public void shouldCloseAndReviveUncorruptedTasksWhenTimeoutExceptionThrownFromCommit() {


The test says: should close and revive

How do we exactly verify this? Maybe we do, but it's not clear to me from the code. Can you elaborate?

We verify the revive by asserting that it went from RUNNING back to CREATED

mjsax · 2021-03-28T02:33:16Z

+    }
+
+    @Test
+    public void shouldCloseAndReviveUncorruptedTasksWhenTimeoutExceptionThrownFromCommitDuringRevocation() {


Same question as above. (Or is the fact that we don't crash good enough as criteria?)

ditto the above (back to CREATED is the key verification, but also it should not crash)

ableegoldman · 2021-03-28T23:12:18Z

+    /**
+     * @param consumedOffsetsAndMetadataPerTask an empty map that will be filled in with the prepared offsets
+     */
+    private int commitAndFillInConsumedOffsetsAndMetadataPerTaskMap(final Collection<Task> tasksToCommit,


Just pulled all the actual contents, excluding the TimeoutException + maybeInitTaskTimeoutOrThrow handling, so we could use it in handleCorruption without that stuff

mjsax

LGTM. Feel free to merge after Jenkins passed.

guozhangwang

Had a quick question about the closeDirtyAndRevive in catch block, otherwise lgtm. Thanks for the added test coverage!

BTW, the more I review the code now, the more I feel like removing eos-alpha and having the timeout handling simpler and assume it would always affect all tasks that the thread owns :)

guozhangwang · 2021-03-29T04:43:58Z

            final Task task = tasks.task(taskId);
            if (task.isActive()) {
-                corruptedActiveTasks.put(task, task.changelogPartitions());
+                corruptedActiveTasks.add(task);


Thanks for the cleanup! I think in the past we may only mark some subset of changelog partitions as corrupted, but later we would always just mark all of them as corrupted. Just following that thought, maybe in task.markChangelogAsCorrupted we do not need to pass in parameters either but just mark all changelog partitions as corrupted?

I didn't want to go all the way with consolidating this logic since eventually we may want to have it so that only the subset of partitions/stores which are actually corrupted will need to be wiped out. So I'd prefer to leave this as-is for now and keep the places in which we infer the changelogs from the task restricted to just the TaskManager for now

guozhangwang · 2021-03-29T04:45:53Z

+            final Collection<Task> uncorruptedTasks = new HashSet<>(tasks.activeTasks());
+            uncorruptedTasks.removeAll(corruptedActiveTasks);
+            // Those tasks which just timed out can just be closed dirty without marking changelogs as corrupted
+            closeDirtyAndRevive(uncorruptedTasks, false);


If closeDirtyAndRevive throws here, then the next closeDirtyAndRevive would not be triggered. Is that okay, or do we guarantee that closeDirtyAndRevive would not throw at all now?

It seems we guarantee that closeDirtyAndRevive does not throw -- this isn't a new assumption, since prior to this it was possible for closeDirtyAndRevive to throw for standby tasks which means we would not invoke it for active tasks. We're just doing the same thing here. (Even if we did throw I think it would be ok under both ALOS or EOS, as for EOS this would cause an unclean shutdown which would mean wiping the store anyway, and for ALOS we would just be closing dirty which again is what we were about to do anyway)

ableegoldman · 2021-03-29T20:56:36Z

Only one test failed during the last build, shouldCloseAndReviveUncorruptedTasksWhenTimeoutExceptionThrownFromCommitDuringHandleCorruptedWithEOS. This was due to a strict mock and interleaved ordering between task 0_0 and 0_1 -- after I un-strictified the mock it passed 200 iterations so I think it's safe to assume the problem has been resolved. Since this was the only change in the last commit, and this PR is currently blocking the 2.8 release, I'm going to move forward with merging this.

ableegoldman · 2021-03-29T21:07:58Z

Merged to trunk and cherrypicking to 2.8 once tests pass @vvcephei

…Corruption and handleRevocation (#10407) Need to handle TaskCorruptedException and TimeoutException that can be thrown from offset commit during handleRevocation or handleCorruption Reviewers: Matthias J. Sax <mjsax@confluent.org>, Guozhang Wang <guozhang@confluent.io>

ableegoldman · 2021-03-29T21:23:55Z

Done!

ableegoldman · 2021-03-29T22:08:58Z

@guozhangwang FYI I filed https://issues.apache.org/jira/browse/KAFKA-12574 to deprecate eos-alpha, and hopefully we can remove it soon-ish

guozhangwang · 2021-03-30T17:08:17Z

Thanks @ableegoldman !

…rrupted (#10444) Minor followup to #10407 -- we need to extract the rebalanceInProgress check down into the commitAndFillInConsumedOffsetsAndMetadataPerTaskMap method which is invoked during handleCorrupted, otherwise we may attempt to commit during a a rebalance which will fail Reviewers: Matthias J. Sax <mjsax@confluent.io>

…Corruption and handleRevocation (apache#10407) Need to handle TaskCorruptedException and TimeoutException that can be thrown from offset commit during handleRevocation or handleCorruption Reviewers: Matthias J. Sax <mjsax@confluent.org>, Guozhang Wang <guozhang@confluent.io>

…rrupted (apache#10444) Minor followup to apache#10407 -- we need to extract the rebalanceInProgress check down into the commitAndFillInConsumedOffsetsAndMetadataPerTaskMap method which is invoked during handleCorrupted, otherwise we may attempt to commit during a a rebalance which will fail Reviewers: Matthias J. Sax <mjsax@confluent.io>

remove commit from handleCorruption and handle TaskCorrupted in handl…

8dc0f50

…eRevocation

ableegoldman requested review from guozhangwang, mjsax and vvcephei March 26, 2021 02:10