KAFKA-9727: cleanup the state store for standby task dirty close and check null for changelogs by abbccdda · Pull Request #8307 · apache/kafka

abbccdda · 2020-03-17T04:20:08Z

This PR fixes three things:

1. The state should also be closed when a standby task is restoring.
2. An EOS standby task should also wipe out its state on a dirty close.
3. The changelog reader should check for null as well.
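
To illustrate the second item, here is a minimal sketch of wiping a task directory on a dirty close when EOS is enabled. It is not the code from this PR: `ToyDirtyClose`, `closeDirty`, and the `eosEnabled` flag as a plain boolean are hypothetical names, and the assumption is that local state left behind by an unclean close cannot be trusted under EOS and is therefore deleted.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Toy helper, not a Kafka Streams class.
final class ToyDirtyClose {

    // Deletes the task directory only when EOS is enabled.
    // Returns true if the directory was wiped, false if it was left alone.
    static boolean closeDirty(final Path taskDir, final boolean eosEnabled) {
        if (!eosEnabled || !Files.exists(taskDir)) {
            return false;
        }
        try (Stream<Path> paths = Files.walk(taskDir)) {
            // Reverse order so that children are deleted before their parents.
            paths.sorted(Comparator.reverseOrder()).forEach(path -> {
                try {
                    Files.delete(path);
                } catch (final IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
        return true;
    }
}
```

With EOS disabled the directory is kept, so the same close path can serve both processing guarantees.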

The sequence to reproduce the system test failure:

1. The Streams job closes uncleanly, leaving active task 0_0 with no committed offset.
2. Task 0_0 switches from active to standby, which never writes anything to the checkpoint under EOS.
3. Task 0_0 hits an illegal state because no checkpoint is found, and throws a task corrupted exception.
4. The exception is caught and the task is closed; however, the state store was already registered and is not released.
5. On the next iteration we hit a lock-not-available error, since the lock never gets released.
6. We also hit an NPE in the changelog removal, since the changelog was already removed the first time the standby task's corruption was handled.
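
The NPE at the end of the sequence above can be sketched with a toy changelog registry. This is not the Kafka Streams changelog reader: `ToyChangelogReader`, `Metadata`, and the string partition names are hypothetical. It only shows why removing the same changelog twice needs a null check.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for a changelog reader; not the Kafka Streams class.
final class ToyChangelogReader {

    // Hypothetical per-changelog bookkeeping.
    static final class Metadata {
        boolean cleared = false;

        void clear() {
            cleared = true;
        }
    }

    private final Map<String, Metadata> changelogs = new HashMap<>();

    void register(final String partition) {
        changelogs.put(partition, new Metadata());
    }

    // Returns how many changelogs were actually removed.
    int remove(final List<String> partitions) {
        int removed = 0;
        for (final String partition : partitions) {
            final Metadata metadata = changelogs.remove(partition);
            // A second removal of the same partition yields null;
            // skip it instead of dereferencing it.
            if (metadata != null) {
                metadata.clear();
                removed++;
            }
        }
        return removed;
    }
}
```

Without the `metadata != null` guard, the second call to `remove` for the same partition would throw a `NullPointerException` on `metadata.clear()`.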

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

guozhangwang

LGTM! I agree that 2)/3) are bugs, not sure about 1) -- we can discuss more about this.

Could you add some tests as well?

guozhangwang · 2020-03-17T21:27:32Z

A standby task should never be in RESTORING, since we always transition from CREATED -> RUNNING -> RESTORING in one call. Did you observe otherwise in the failed system tests? Even in the unclean close case you described, I do not see how it could be possible.

guozhangwang · 2020-03-17T21:28:06Z

I think this is what the below TODO (191) was added for, Thanks :) Please feel free to remove that TODO marker then.

guozhangwang · 2020-03-17T21:34:12Z

Nice catch, this is indeed possible.

guozhangwang · 2020-03-17T21:35:39Z

test this please

abbccdda · 2020-03-18T01:01:53Z

Discussed offline with @guozhangwang: fix 1 was not correct, and the true issue was the state transition. We call registerStateStores before transitioning from CREATED to RUNNING, and if a corrupted exception is thrown there, the task does not reach the closeStateManager call during close() in the current trunk logic. The proper fix is to trigger closeStateManager for the CREATED state as well.
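
A rough sketch of that transition issue, using a toy task rather than the real Streams classes. `ToyStandbyTask`, `ToyTaskCorruptedException`, `LOCKED_TASK_DIRS`, and the `checkpointMissing` flag are all hypothetical; only the names `registerStateStores` and `closeStateManager` come from the discussion, and their bodies here are stand-ins.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the task lifecycle; not the Kafka Streams StandbyTask.
final class ToyStandbyTask {

    enum State { CREATED, RUNNING, CLOSED }

    // Stands in for the task corrupted exception.
    static final class ToyTaskCorruptedException extends RuntimeException {
        ToyTaskCorruptedException(final String message) {
            super(message);
        }
    }

    // Stands in for the state directory locks, shared across task instances.
    static final Set<String> LOCKED_TASK_DIRS = new HashSet<>();

    private final String taskId;
    private final boolean checkpointMissing;
    private State state = State.CREATED;

    ToyStandbyTask(final String taskId, final boolean checkpointMissing) {
        this.taskId = taskId;
        this.checkpointMissing = checkpointMissing;
    }

    void initialize() {
        // Registration happens while the task is still CREATED.
        registerStateStores();
        state = State.RUNNING;
    }

    private void registerStateStores() {
        if (!LOCKED_TASK_DIRS.add(taskId)) {
            throw new IllegalStateException("lock not available for task " + taskId);
        }
        if (checkpointMissing) {
            // The lock is already held when this is thrown.
            throw new ToyTaskCorruptedException("no checkpoint for task " + taskId);
        }
    }

    void closeDirty() {
        // Release state for CREATED as well as RUNNING; handling only
        // RUNNING would leak the lock taken in registerStateStores.
        if (state == State.CREATED || state == State.RUNNING) {
            closeStateManager();
        }
        state = State.CLOSED;
    }

    private void closeStateManager() {
        LOCKED_TASK_DIRS.remove(taskId);
    }

    State state() {
        return state;
    }
}
```

In this model, a task that fails during registration stays in CREATED, so a close path that only handles RUNNING would leave `0_0` locked and the next attempt would fail with the lock-not-available error.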

guozhangwang · 2020-03-20T18:44:02Z

test this please

vvcephei · 2020-03-20T18:49:06Z

test this please

guozhangwang · 2020-03-20T18:56:23Z

    public void closeClean(final Map<TopicPartition, Long> checkpoint) {
        Objects.requireNonNull(checkpoint);
-        close(true, checkpoint);
+        close(true);


Not for this PR: we can clean up the task-manager code to not pass in the checkpoint at all.

guozhangwang · 2020-03-20T18:57:54Z

        this.time = time;
        this.recordCollector = recordCollector;
-        eosDisabled = !StreamsConfig.EXACTLY_ONCE.equals(config.getString(StreamsConfig.PROCESSING_GUARANTEE_CONFIG));
+        eosEnabled = StreamsConfig.EXACTLY_ONCE.equals(config.getString(StreamsConfig.PROCESSING_GUARANTEE_CONFIG));


This part will have some conflicts with @mjsax's PR, just a note.

Yea, one of us probably needs to rebase

guozhangwang · 2020-03-20T19:02:00Z

+        waitForCondition(() -> streamInstanceTwo.state().equals(KafkaStreams.State.RUNNING),
+            "Stream instance one should be up and running by now");
+
+        streamInstanceOne.close(Duration.ofSeconds(30));


For my own education: before the fix, would this integration test fail when instance-2 is started?

Yes, actually either instance-1 or instance-2 would fail, depending on which one gets the standby assignment. An IllegalStateException followed by an NPE would be thrown.

abbccdda force-pushed the KAFKA-9727 branch from e337492 to 7b8a7ca Compare March 17, 2020 04:33

abbccdda commented Mar 17, 2020

guozhangwang reviewed Mar 17, 2020

abbccdda mentioned this pull request Mar 19, 2020

KAFKA-9441: Unify committing within TaskManager #8218

Merged

abbccdda force-pushed the KAFKA-9727 branch from 7b8a7ca to 5362b74 Compare March 19, 2020 22:44

add fixes

e06dfb1

abbccdda force-pushed the KAFKA-9727 branch from 5362b74 to e06dfb1 Compare March 19, 2020 22:44

standby task fix

65bba12

abbccdda force-pushed the KAFKA-9727 branch from 0ee6c98 to 2652b0f Compare March 20, 2020 01:38

unclean close test

e4b254c

abbccdda force-pushed the KAFKA-9727 branch from 2652b0f to e4b254c Compare March 20, 2020 01:55

integration test?

3245b2d

abbccdda force-pushed the KAFKA-9727 branch from 12acf46 to 000bf9e Compare March 20, 2020 18:22

add standby tasks integration

3f4cfd7

abbccdda force-pushed the KAFKA-9727 branch from 000bf9e to 3f4cfd7 Compare March 20, 2020 18:30

guozhangwang reviewed Mar 20, 2020

guozhangwang merged commit c249ea8 into apache:trunk Mar 20, 2020

abbccdda deleted the KAFKA-9727 branch March 20, 2020 22:43
