KAFKA-9742: Fix broken StandbyTaskEOSIntegrationTest by vvcephei · Pull Request #8330 · apache/kafka

vvcephei · 2020-03-22T23:21:49Z

Relax the requirement that tasks' reported offsetSum is less than the endOffsetSum for those
tasks. This was surfaced by a test for corrupted tasks, but it can happen with real corrupted
tasks. Rather than throw an exception on the leader, we now de-prioritize the corrupted task.
Ideally, that instance will not get assigned the task and the stateDirCleaner will make
the problem "go away". If it does get assigned the task, then it will detect the corruption and
delete the task directory before recovering the entire changelog. Thus, the estimate we provide
accurately reflects the amount of lag such a corrupted task would have to recover (the whole log).

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

vvcephei

Hey @abbccdda @ableegoldman ,

Can you take a look at this fix, which was caused by the (incredibly coincidental) simultaneous testing and merging of your PRs:
#8252
#8307

cc @mjsax

vvcephei · 2020-03-22T23:23:46Z

+        try (
+            final KafkaStreams streamInstanceOne = buildStreamWithDirtyStateDir(appId, stateDirPath + "/" + appId + "-1/", instanceLatch);
+            final KafkaStreams streamInstanceTwo = buildStreamWithDirtyStateDir(appId, stateDirPath + "/" + appId + "-2/", instanceLatch);
+        ) {


I moved this into a try-with-resources so that the instances would get closed even if the assertions fail.

vvcephei · 2020-03-22T23:25:12Z

+            streamInstanceOne.close(Duration.ZERO);
+            streamInstanceTwo.close(Duration.ZERO);


Close them both asynchronously so we can tear them down in parallel. The try-with-resources will block until they are really closed.

vvcephei · 2020-03-22T23:26:08Z

-        final String stateDirPathOne = stateDirPath + "/" + appId + "-1/";
-        final KafkaStreams streamInstanceOne =
-            buildStreamWithDirtyStateDir(appId, stateDirPathOne, instanceLatch);
+        try (


This test isn't logically changed, but I noticed a couple of problems with it that I fixed on the side.

vvcephei · 2020-03-22T23:29:25Z

+                             " smaller than offsetSum=" + offsetSum + ". This probably means the task is corrupted," +
+                             " which in turn indicates that it will need to restore from scratch, so we pin the lag" +
+                             " to the end offset of the log.");
+                taskLagTotals.put(task, endOffsetSum);


This is the fix, to return an accurate estimate instead of throwing an exception. I felt a warning was appropriate, given that this does indicate task corruption, or some other unexpected situation.

Should we report the lag as the whole log in this case? Even if the log is truncated it is not guaranteed to throw the invalid offset exception and hence task-corruption logic would not necessarily triggered.

Do we know the exact reason, why the offset sum on the client is larger than the end offset sum on the broker? Before we change this invariant we should make sure we understand why it is not satisfied. Is it our code that breaks it or some external influences that we cannot control?

@guozhangwang if the end offset is less than the checkpointed offset, how is it possible to not throw a TaskCorruptedException? I thought that was thrown after checking this exact condition?
edit: what I mean is, do we think this is a possible state? If so, we should explicitly check for it and throw TaskCorrupted if detected. (If not, it's an illegal state and thus the check here is appropriate)

Also, @vvcephei , if a single task is corrupted and we do hit the TaskCorruptedException don't we close, wipe, and recreate stores of all tasks in the thread? Seems we would need to keep track of the thread ownership, and if we detect this condition then set the lag for all tasks owned by that thread to the endOffset
edit: nevermind, I checked the TaskCorruptedExcepion PR, looks like we ended up just setting the invalid offsets to 0 instead of wiping everything. So, disregard this comment 🙂

Whoops, edited my previous comment before refreshing & seeing your response. Makes sense, although I still believe we should be checking for this and throwing TaskCorrupted if we detect this case (in the owning StreamThread, not in the leader during assignment)

I think it's could easily happen that any past, present, or future versioned member reports a task sum that's beyond the end-offset sum of what the leader thinks the task contains.

If the corruption comes from our code, we should throw. The leader fetches the end offset shortly before that code, so they are most probably up-to-date. If the corruption is caused by failure we should handle it gracefully.

Let's think about cases where this could possibly go wrong. We know it must reach or already be in standby/restoring phase. I will use logEndOffset for the actual endOffset of the partition, and endOffset for the offset a standby/restoring task is allowed to restore up to.

Newly-assigned: we read the invalid offset from the checkpoint, and determine the endOffset as either logEndOffset (normal) or min(logEndOffset, LCO) (optimized-source-changelog). In both cases, the endOffset <= logEndOffset so the invalid offset should also satisfy offset > endOffset which we can detect and throw as TaskCorrupted.
Already-assigned: in this case, we must have been the ones who wrote the invalid offset and/or have it in our checkpointableOffsets map. But we still need to fetch the endOffset to know when to stop, so the above applies here as well and we can detect the corruption.

In either case, I'd agree the leader should not throw and should just treat that task's offset sum as 0. But, we should make sure that on the other end we actually do detect this case and handle it the way the assignor expects.

@ableegoldman If the log is truncated and then immediately more records are appended to go beyond the original log-end-offset, in old versions we would not throw the exception.

After some thoughts I think it makes sense to report the lag as the whole log.

Thanks for the conversation, all. I think where it nets out for me is that we don't really expect this to happen for non-bugged code, but there are a few edge cases where it could. Thus, logging a warning actually seems reasonable.

Since the graceful handling option is still safe in any case, and since there do exist edge cases where this condition doesn't indicate a bug in Streams, we'll just keep the proposed graceful handling of reporting the lag as the whole log. Either the member in question won't get assigned the task (because it's de-prioritized wrt to other members who report valid positions), and it winds up cleaning up its state dir on its own, or it gets assigned the task, detects the corruption on its side, and recovers appropriately.

vvcephei · 2020-03-22T23:30:04Z

+    public void shouldReturnEndOffsetSumIfOffsetSumIsGreaterThanEndOffsetSum() {
        final Map<TaskId, Long> taskOffsetSums = Collections.singletonMap(taskId01, 5L);
        final Map<TaskId, Long> allTaskEndOffsetSums = Collections.singletonMap(taskId01, 1L);
        client.addPreviousTasksAndOffsetSums(taskOffsetSums);
-        assertThrows(IllegalStateException.class, () -> client.computeTaskLags(allTaskEndOffsetSums));
+        client.computeTaskLags(allTaskEndOffsetSums);
+        assertThat(client.lagFor(taskId01), equalTo(1L));


The test for the affected code of course has to change.

chia7712 · 2020-03-23T07:37:52Z

+            waitForCondition(() -> streamInstanceOne.state().equals(KafkaStreams.State.RUNNING),
+                             "Stream instance one should be up and running by now");
+            waitForCondition(() -> streamInstanceTwo.state().equals(KafkaStreams.State.RUNNING),
+                             "Stream instance one should be up and running by now");


Stream instance "one" -> "two"

cadonna · 2020-03-23T10:14:51Z

+                LOG.warn("Task " + task + " had endOffsetSum=" + endOffsetSum +
+                             " smaller than offsetSum=" + offsetSum + ". This probably means the task is corrupted," +
+                             " which in turn indicates that it will need to restore from scratch, so we pin the lag" +
+                             " to the end offset of the log.");
+                taskLagTotals.put(task, endOffsetSum);


prop: Could you explain a bit better what the warning is about? If somebody does not know the code, it is hard to understand what is going on.

abbccdda

LGTM after addressing all existing comments, thanks for the fix! @vvcephei

abbccdda · 2020-03-23T20:56:16Z

@guozhangwang @vvcephei could we merge this one?

guozhangwang · 2020-03-23T21:34:22Z

test this please

guozhangwang · 2020-03-23T21:35:46Z

LGTM. @vvcephei please feel free to merge.

vvcephei · 2020-03-23T22:14:28Z

Thanks, all. I'll make a trivial change to fix some strings in this PR and then merge.

vvcephei · 2020-03-23T22:38:08Z

Just got a successful build with clean :streams:test :streams:test-util:test locally.

KAFKA-9742: Fix broken StandbyTaskEOSIntegrationTest

269bcf6

vvcephei requested a review from mjsax March 22, 2020 23:22

vvcephei added the streams label Mar 22, 2020

vvcephei commented Mar 22, 2020

View reviewed changes

chia7712 mentioned this pull request Mar 23, 2020

KAFKA-9742: Fix StandbyTaskEOSIntegrationTest End offset #8329

Closed

3 tasks

chia7712 reviewed Mar 23, 2020

View reviewed changes

cadonna reviewed Mar 23, 2020

View reviewed changes

abbccdda reviewed Mar 23, 2020

View reviewed changes

chia7712 mentioned this pull request Mar 23, 2020

MINOR: Add method to Herder to control logging of connector configs during validation #8263

Merged

3 tasks

CR feedback

06c1735

vvcephei merged commit f538caf into apache:trunk Mar 23, 2020

vvcephei deleted the kafka-9742-fix-test branch March 23, 2020 22:58

kkonstantine mentioned this pull request Mar 24, 2020

KAFKA-9407: Return Immutable List from SchemaSourceTask #7939

Merged

3 tasks

mjsax added the tests Test fixes (including flaky tests) label Jun 12, 2020

		streamInstanceOne.close(Duration.ZERO);
		streamInstanceTwo.close(Duration.ZERO);

Conversation

vvcephei commented Mar 22, 2020

Committer Checklist (excluded from commit message)

Uh oh!

vvcephei left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cadonna Mar 23, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ableegoldman Mar 23, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ableegoldman Mar 23, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

abbccdda left a comment

Choose a reason for hiding this comment

Uh oh!

abbccdda commented Mar 23, 2020

Uh oh!

guozhangwang commented Mar 23, 2020

Uh oh!

guozhangwang commented Mar 23, 2020

Uh oh!

vvcephei commented Mar 23, 2020

Uh oh!

vvcephei commented Mar 23, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

vvcephei left a comment •

edited

Loading

cadonna Mar 23, 2020 •

edited

Loading

ableegoldman Mar 23, 2020 •

edited

Loading

ableegoldman Mar 23, 2020 •

edited

Loading