
KAFKA-9481: Graceful handling TaskMigrated and TaskCorrupted #8058

Merged
guozhangwang merged 29 commits into apache:trunk from guozhangwang:KMinor-invalid-offset-changelog-reader
Feb 21, 2020
Conversation

@guozhangwang
Contributor

@guozhangwang guozhangwang commented Feb 7, 2020

  1. Removed task field from TaskMigrated; the only caller that encodes a task id, from StreamTask, actually does not throw, so we only log it. To handle it on StreamThread we just always enforce a rebalance (and we would call onPartitionsLost to remove all tasks as dirty).

  2. Added TaskCorruptedException with a set of task-ids. The first scenario is restoreConsumer.poll throwing InvalidOffset, indicating that the logs have been truncated / compacted. To handle it on StreamThread we first close the corresponding tasks as dirty (if EOS is enabled we also wipe out the state stores), and then revive them into the CREATED state.

  3. Also fixed a bug while investigating KAFKA-9572: when suspending / closing a restoring task we should not commit the new offsets but only update the checkpoint file.

  4. Re-enabled the unit test.
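The TaskCorrupted handling in point 2 can be sketched as a miniature state machine. Everything below (MiniTask, handleCorruption, the State enum) is illustrative, not Kafka's actual Task API:

```java
import java.util.*;

// Minimal sketch of the corrupted-task recovery flow described in point 2;
// all names here are hypothetical stand-ins for the Streams internals.
public class CorruptedTaskRecovery {
    enum State { CREATED, RESTORING, RUNNING, CLOSED }

    static class MiniTask {
        final String id;
        State state = State.RESTORING;
        boolean storesWiped = false;
        MiniTask(final String id) { this.id = id; }

        void closeDirty(final boolean eosEnabled) {
            // dirty close: no offsets are committed; under EOS the local state
            // can no longer be trusted, so wipe it and rebuild from the changelog
            if (eosEnabled) {
                storesWiped = true;
            }
            state = State.CLOSED;
        }

        void revive() {
            // CLOSED -> CREATED; kept deliberately cheap so the thread can
            // still call consumer.poll in time and stay in the group
            state = State.CREATED;
        }
    }

    static void handleCorruption(final Collection<MiniTask> corrupted, final boolean eosEnabled) {
        for (final MiniTask task : corrupted) {
            task.closeDirty(eosEnabled);
            task.revive();
        }
    }

    public static void main(final String[] args) {
        final MiniTask task = new MiniTask("0_1");
        handleCorruption(Collections.singletonList(task), true);
        System.out.println(task.state + " wiped=" + task.storesWiped); // prints "CREATED wiped=true"
    }
}
```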

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@guozhangwang
Contributor Author

test this please

Contributor Author

@guozhangwang guozhangwang left a comment

}
}

stores.clear();
Contributor Author

We need to clear the stores map now since we may re-initialize the state stores upon reviving a task.

Member

Do we also need to clear storeToChangelogTopic, etc?

Contributor Author

storeToChangelogTopic and sourcePartitions are passed in at construction time and final, so we cannot clear them (since they would only be initialized once).

/**
* Indicates a specific task is corrupted and need to be re-initialized. It can be thrown when
*
* 1) Under EOS, if the checkpoint file does not contain offsets for corresponding store's changelogs, meaning
Contributor Author

This case 1) would be done in another PR, I just added the java-doc here to complete the scope.

if (!assignment.containsAll(partitions)) {
throw new IllegalStateException("The current assignment " + assignment + " " +
"does not contain some of the partitions " + partitions + " for removing.");
if (assignment.removeAll(partitions)) {
Contributor Author

Here I made the remove call idempotent.
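A minimal sketch of the idempotent variant (the helper class below is illustrative, not the actual changelog-reader code):

```java
import java.util.*;

public class IdempotentRemoval {
    // Removing partitions that were never assigned (or were already removed)
    // is now a no-op rather than an IllegalStateException, so calling remove
    // twice for the same partitions is safe.
    static boolean remove(final Set<String> assignment, final Collection<String> partitions) {
        return assignment.removeAll(partitions);
    }

    public static void main(final String[] args) {
        final Set<String> assignment = new HashSet<>(Arrays.asList("changelog-0", "changelog-1"));
        System.out.println(remove(assignment, Collections.singletonList("changelog-0"))); // true: removed
        System.out.println(remove(assignment, Collections.singletonList("changelog-0"))); // false: already gone, no exception
    }
}
```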

/**
* Revive a closed task to a created one; should never throw an exception
*/
void revive();
Contributor Author

Okay this might be a bit controversial: I tried to make the re-initialization logic as cheap as possible, since otherwise we may be kicked out of the group for not calling consumer.poll in time. After looking at the source code I found the only necessary part is the closure of the task-level sensors, which I moved out of the task.close functions. So from CLOSED -> CREATED there's nothing left to do.

final Task task = tasks.get(taskId);

// this call is idempotent so even if the task is only CREATED we can still call it
changelogReader.remove(task.changelogPartitions());
Contributor Author

We have to call changelog.remove() before task.closeDirty since now we clear the stores map in closeDirty, and after that task.changelogPartitions() would return nothing.

standbyTasksToCreate.remove(task.id());
} else /* we previously owned this task, and we don't have it anymore, or it has changed active/standby state */ {
final Set<TopicPartition> inputPartitions = task.inputPartitions();
cleanupTask(task);
Contributor Author

Consolidated a couple of the same pattern into this private function.

changelogReader.remove(task.changelogPartitions());
}

for (final TopicPartition inputPartition : inputPartitions) {
Contributor Author

Not sure why we do not remove the input partitions previously.. is that intentional @vvcephei ? If not I'd move the removal into the block (I already did just to clarify :P)

Contributor

The intent was to remove the input partitions from the map any time we remove a task from tasks. It looks like your code maintains this (in a clearer and cleaner way).

Contributor Author

Got it. In the previous code we would call partitionToTask.remove(inputPartition) and remove the task no matter whether the task was closed or not, which is a bit weird --- for standby tasks we do not close them, but we still remove them from the iterator and from the materialized partitionToTask. My modification is to ONLY do this logic if we are closing the task.

As long as you agree this is correct I'm relieved.

Contributor

Ah, then it was my mistake before! Good catch.

}

private void cleanupTask(final Task task) {
// 1. remove the changelog partitions from changelog reader;
Contributor Author

The order here cannot be changed so I left this comment.

@guozhangwang
Contributor Author

The unit test suite passed locally (Java 8, Scala 2.12). Currently I cannot trigger a jenkins build out of it (not sure why).

Member

@ableegoldman ableegoldman left a comment

Still need to look through TaskManager, but left a few initial comments on the rest

if (state == CLOSED) {
transitionTo(CREATED);
} else {
throw new IllegalStateException("Illegal state " + state() + " while committing standby task " + id);
Member

Remove "standby" from error message (unless this only applies to standbys?)

Contributor Author

Ack.


// if we cannot get the position of the consumer within timeout, just return false
return false;
} catch (final KafkaException e) {
// this also includes InvalidOffsetException, which should not happen under normal
Member

Just wondering, why is it ok to get InvalidOffsetException during restore/poll but not here? When might this get thrown from #position? Ditto for prepareChangelogs down below.

Contributor Author

consumer.poll throwing InvalidOffsetException should be handled as TaskCorrupted; consumer.position throwing InvalidOffsetException should not happen under normal scenarios, when it happens it indicates a bug and hence we do not need to special handle it.
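That distinction can be sketched as follows; the two exception classes are stand-ins for the real Kafka ones, and the handler names are hypothetical:

```java
import java.util.*;

public class RestoreErrorHandling {
    // Illustrative stand-ins for InvalidOffsetException / TaskCorruptedException.
    static class InvalidOffsetException extends RuntimeException {
        final Set<String> partitions;
        InvalidOffsetException(final Set<String> partitions) { this.partitions = partitions; }
    }
    static class TaskCorruptedException extends RuntimeException {
        final Set<String> corruptedPartitions;
        TaskCorruptedException(final Set<String> partitions) { this.corruptedPartitions = partitions; }
    }

    // poll path: the broker truncated / compacted past our position; this is
    // recoverable, so translate into TaskCorrupted and let the thread close
    // the tasks dirty and revive them.
    static void onPollFailure(final InvalidOffsetException e) {
        throw new TaskCorruptedException(e.partitions);
    }

    // position path: should never happen in normal operation, so it indicates
    // a bug and the raw exception propagates as fatal.
    static void onPositionFailure(final InvalidOffsetException e) {
        throw e;
    }

    public static void main(final String[] args) {
        try {
            onPollFailure(new InvalidOffsetException(Collections.singleton("changelog-0")));
        } catch (final TaskCorruptedException e) {
            System.out.println("corrupted: " + e.corruptedPartitions); // prints "corrupted: [changelog-0]"
        }
    }
}
```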

// a task is only closing / closed when 1) task manager is closing, 2) a rebalance is undergoing;
// in either case we can just log it and move on without notifying the thread since the consumer
// would soon be updated to not return any records for this task anymore.
log.info("Stream task {} is already in {} state, skip adding records to it.", id(), state());
Member

Hm. I think we should actually be concerned if we ever get here -- I'm not sure the TaskMigratedException made sense either. Trying to add records to a closed task would imply that we closed the task due to shutting down or because we no longer own it, both cases of which should also involve trimming its topic(s) from the consumer's assignment, yet we were still returned records for said topic(s).

Unless, the consumer may still return already-fetched records from partitions no longer in its assignment during poll? I thought we would trim those records out and only return from the actual assignment

Contributor Author

Here's my rationale:

If the task is closed due to a rebalance (i.e. we #handleAssignment or #handleLostAll), there might still be some buffered records from the consumer being returned (since we update the consumer's subscription afterwards). In that case the subscription would be updated in the next iteration and no records would be returned, so it is okay to just skip this once.

If the task is closed due to closing the thread, then there's no need to throw an exception either.

} catch (final TaskMigratedException e) {
log.warn("Detected that the thread is being fenced. " +
"This implies that this thread missed a rebalance and dropped out of the consumer group. " +
"Will migrate out all assigned tasks and rejoin the consumer group.");
Member

Suggested change
"Will migrate out all assigned tasks and rejoin the consumer group.");
"Will close out all assigned tasks and rejoin the consumer group.");

Contributor Author

Ack.

@mjsax mjsax added the streams label Feb 7, 2020
Contributor

@vvcephei vvcephei left a comment

Thanks for this, @guozhangwang ! Comments below.

*
* 1) Under EOS, if the checkpoint file does not contain offsets for corresponding store's changelogs, meaning
* previously it was not closed cleanly;
* 2) Out-of-range exception thrown during restoration, meaning that the changelog has been modified and we re-bootstrap
Contributor

Just now having this thought... Supposing this happens, is it guaranteed to apply to all the stores in the task? I.e., do we really need to re-bootstrap all the stores, or just the one(s) for which our offset is out of range?

Contributor Author

Yes, we can re-bootstrap only the affected state stores -- this is what we did in the past, but that was a lot messier (remember we had to use the optional in a fixed-order map? :P). My thought is that for non-EOS the checkpoint file would likely exist, so even re-bootstrapping the whole task would be okay; for EOS it is safer to re-bootstrap the whole task.

Contributor

It's certainly safer, but the performance hit seems concerning... restoration i/o is already one of the things people complain about most, and this choice could amplify it multiple times over.

Maybe we can handle it more cleanly by closing all the stores nicely, writing a checkpoint file with the out-of-range stores' checkpoints at 0, and then re-bootstrapping the task, so it only has to restore the broken stores?

Contributor Author

That sounds like a good idea. Let me try it out.

throw new TaskMigratedException("Restore consumer get fenced by instance-id polling records.", e);
} catch (final InvalidOffsetException e) {
log.warn("Encountered {} fetching records from restore consumer for partitions {}, " +
"marking the corresponding tasks as corrupted.", e.getClass().getName(), e.partitions());
Contributor

I guess it wouldn't hurt to explain what the exception means (our position is too old and has been deleted or compacted by the broker) and what we hope to accomplish by marking the task as corrupted (to re-bootstrap the stores from the changelog and return to normal processing).

Contributor Author

Ack.


@guozhangwang
Contributor Author

retest this please

Contributor Author

@guozhangwang guozhangwang left a comment

@vvcephei I made the change and it's a bit more complicated than I thought; the current proposal is the smallest change I could make (it still adds new APIs to the Task interface). LMK WDYT.

public void markChangelogAsCorrupted(final Set<TopicPartition> partitions) {
stateMgr.markChangelogAsCorrupted(partitions);

// only write a new checkpoint (excluding the corrupted partitions) if eos is disabled
Contributor Author

We can checkpoint (excluding the corrupted partitions) for 1) standby tasks and 2) non-eos active tasks; for eos active tasks we should not write the checkpoint, since for eos we HAVE TO re-bootstrap every store from scratch to maintain consistency.
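This rule condenses to a tiny predicate (a hypothetical helper, not the actual Streams code):

```java
public class CheckpointRule {
    // Standby tasks and non-EOS active tasks may write a checkpoint that
    // excludes the corrupted partitions; an EOS active task must not, since
    // every store will be wiped and re-bootstrapped from scratch.
    static boolean shouldWriteCheckpoint(final boolean isActive, final boolean eosEnabled) {
        return !(isActive && eosEnabled);
    }

    public static void main(final String[] args) {
        System.out.println(shouldWriteCheckpoint(false, true));  // standby under EOS: true
        System.out.println(shouldWriteCheckpoint(true, false));  // non-EOS active: true
        System.out.println(shouldWriteCheckpoint(true, true));   // EOS active: false
    }
}
```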

Contributor

@vvcephei vvcephei left a comment

Thanks @guozhangwang , I had a few nits, but the main concern is about the skipping-adding-records thing.

Otherwise, this change looks great to me!

Comment on lines -709 to -727
if (state() == State.CLOSED || state() == State.CLOSING) {
log.info("Stream task {} is already closed, probably because it got unexpectedly migrated to another thread already. " +
"Notifying the thread to trigger a new rebalance immediately.", id());
throw new TaskMigratedException(id());
}

Contributor

Should we still skip adding records? It looks like that was the intent, but I think what actually would happen is that we'd still add the records, but skip processing them.

Contributor Author

This is intentional: as in PR #8091, where we fixed offset committing, we look into the buffered records so that we can get the correct "next" offset to commit; if we skipped adding records here while the task is closing, we could potentially return incorrect results.

guozhangwang and others added 7 commits February 18, 2020 16:00
…nals/StreamTask.java

Co-Authored-By: John Roesler <vvcephei@users.noreply.github.com>
…nals/StreamThread.java

Co-Authored-By: John Roesler <vvcephei@users.noreply.github.com>
…nals/StreamThread.java

Co-Authored-By: John Roesler <vvcephei@users.noreply.github.com>
Contributor Author

@guozhangwang guozhangwang left a comment

Adding the fix for KAFKA-9572

if (state() == State.CREATED || state() == State.CLOSING || state() == State.SUSPENDED) {
// do nothing
log.trace("Skip suspending since state is {}", state());
} else if (state() == State.RUNNING) {
Contributor Author

Here is the attempted fix of https://issues.apache.org/jira/browse/KAFKA-9572: if we are closing / suspending a restoring task, we should only update the checkpoint file but should NOT commit offsets, since the committed offsets indicate the "restore end" and should not be updated, cc @cadonna who filed the JIRA.
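The intended behavior per state can be sketched like this (the states and action strings are a simplification of the actual StreamTask logic, not its real API):

```java
public class SuspendBehavior {
    enum State { CREATED, RESTORING, RUNNING, SUSPENDED, CLOSING }

    // A RESTORING task only flushes and updates the checkpoint file; committing
    // would move the consumer's committed offsets, which mark the "restore end"
    // and must not change mid-restore (the KAFKA-9572 bug).
    static String onSuspend(final State state) {
        switch (state) {
            case RUNNING:
                return "commit offsets + write checkpoint";
            case RESTORING:
                return "flush + write checkpoint only";
            default:
                return "no-op";
        }
    }

    public static void main(final String[] args) {
        System.out.println(onSuspend(State.RESTORING)); // prints "flush + write checkpoint only"
        System.out.println(onSuspend(State.RUNNING));   // prints "commit offsets + write checkpoint"
    }
}
```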

}

transitionTo(State.CLOSING);
} else if (state() == State.RESTORING) {
Contributor Author

This is part of the fix as well: only flushing / checkpointing, but not committing.

Contributor

@vvcephei vvcephei left a comment

Thanks @guozhangwang !

@guozhangwang
Contributor Author

The local run in JDK8 / Scala 2.12 passed. Will merge to trunk now.

@guozhangwang guozhangwang merged commit 3b6573c into apache:trunk Feb 21, 2020
ijuma added a commit to confluentinc/kafka that referenced this pull request Feb 24, 2020
* apache-github/trunk: (23 commits)
  KAFKA-9530; Fix flaky test `testDescribeGroupWithShortInitializationTimeout` (apache#8154)
  HOTFIX: fix NPE in Kafka Streams IQ (apache#8158)
  MINOR: set scala version automatically based on gradle.properties
  KAFKA-9577; SaslClientAuthenticator incorrectly negotiates SASL_HANDSHAKE version (apache#8142)
  KAFKA-9441: Add internal TransactionManager (apache#8105)
  MINOR: Document endpoints for connector topic tracking (KIP-558)
  MINOR: Standby task commit needed when offsets updated (apache#8146)
  KAFKA-9206; Throw KafkaException on CORRUPT_MESSAGE error in Fetch response (apache#8111)
  MINOR: Remove unwanted regexReplace on tests/kafkatest/__init__.py
  KAFKA-9586: Fix errored json filename in ops documentation
  KAFKA-9575: Mention ZooKeeper 3.5.7 upgrade
  KAFKA-9481: Graceful handling TaskMigrated and TaskCorrupted (apache#8058)
  HOTFIX: don't try to remove uninitialized changelogs from assignment & don't prematurely mark task closed (apache#8140)
  MINOR: Fix javadoc at org.apache.kafka.clients.producer.KafkaProducer.InterceptorCallback#onCompletion (apache#7337)
  MINOR: Improve EOS example exception handling (apache#8052)
  MINOR: Fix a number of warnings in clients test (apache#8073)
  MINOR: Update shell scripts to support z/OS system (apache#7913)
  MINOR: Wording fix in Streams DSL docs (apache#5692)
  MINOR: Add missing @test annotation to MetadataTest#testMetadataMerge (apache#8141)
  KAFKA-9533: ValueTransform forwards `null` values (apache#8108)
  ...
ijuma added a commit to confluentinc/kafka that referenced this pull request Feb 24, 2020
…etrics-common

* confluent/master: (76 commits)
  KAFKA-9530; Fix flaky test `testDescribeGroupWithShortInitializationTimeout` (apache#8154)
  HOTFIX: fix NPE in Kafka Streams IQ (apache#8158)
  MINOR: set scala version automatically based on gradle.properties
  KAFKA-9577; SaslClientAuthenticator incorrectly negotiates SASL_HANDSHAKE version (apache#8142)
  KAFKA-9441: Add internal TransactionManager (apache#8105)
  MINOR: Document endpoints for connector topic tracking (KIP-558)
  MINOR: Standby task commit needed when offsets updated (apache#8146)
  Changes to migrate to Artifactory (#263)
  KAFKA-9206; Throw KafkaException on CORRUPT_MESSAGE error in Fetch response (apache#8111)
  MINOR: Remove unwanted regexReplace on tests/kafkatest/__init__.py
  KAFKA-9586: Fix errored json filename in ops documentation
  KAFKA-9575: Mention ZooKeeper 3.5.7 upgrade
  KAFKA-9481: Graceful handling TaskMigrated and TaskCorrupted (apache#8058)
  HOTFIX: don't try to remove uninitialized changelogs from assignment & don't prematurely mark task closed (apache#8140)
  MINOR: Fix javadoc at org.apache.kafka.clients.producer.KafkaProducer.InterceptorCallback#onCompletion (apache#7337)
  MINOR: Improve EOS example exception handling (apache#8052)
  MINOR: Fix a number of warnings in clients test (apache#8073)
  MINOR: Update shell scripts to support z/OS system (apache#7913)
  MINOR: Wording fix in Streams DSL docs (apache#5692)
  MINOR: Add missing @test annotation to MetadataTest#testMetadataMerge (apache#8141)
  ...
@guozhangwang guozhangwang deleted the KMinor-invalid-offset-changelog-reader branch April 24, 2020 23:52