Skip to content

KAFKA-9727: cleanup the state store for standby task dirty close and check null for changelogs#8307

Merged
guozhangwang merged 5 commits intoapache:trunkfrom
abbccdda:KAFKA-9727
Mar 20, 2020
Merged

KAFKA-9727: cleanup the state store for standby task dirty close and check null for changelogs#8307
guozhangwang merged 5 commits intoapache:trunkfrom
abbccdda:KAFKA-9727

Conversation

@abbccdda
Copy link
Copy Markdown

@abbccdda abbccdda commented Mar 17, 2020

This PR fixes three things:

  1. the state should be closed when standby task is restoring as well
  2. the EOS standby task should also wipe out state under dirty close
  3. the changelog reader should check for null as well

The sequence to reproduce the system test failure:

  1. Stream job close uncleanly, leaving active task 0_0 no committed offset
  2. The task 0_0 switch from active to standby task, which never logs anything in checkpoint under EOS
  3. Task 0_0 gets illegal state for not finding checkpoints, throwing task corrupted exception
  4. Exception were caught and the task was closed, however the state store was already registered, and not released.
  5. Next iteration we shall hit lock not available as it never gets released.
  6. We shall also hit a NPE in the changelog removal as well since it gets removed in the first time handling corruption of the standby task.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix 3

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, this is indeed possible.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix 1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Standby task should never be in RESTORING since we always transit from CREATED -> RUNNING -> RESTORING in one call. Did you observe this was not the case from failed system tests? Even in unclean close case you described I did not see why it could be possible..

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix 2

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is what the below TODO (191) was added for, Thanks :) Please feel free to remove that TODO marker then.

Copy link
Copy Markdown
Contributor

@guozhangwang guozhangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I agree that 2)/3) are bugs, not sure about 1) -- we can discuss more about this.

Could you add some tests as well?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Standby task should never be in RESTORING since we always transit from CREATED -> RUNNING -> RESTORING in one call. Did you observe this was not the case from failed system tests? Even in unclean close case you described I did not see why it could be possible..

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is what the below TODO (191) was added for, Thanks :) Please feel free to remove that TODO marker then.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, this is indeed possible.

@guozhangwang
Copy link
Copy Markdown
Contributor

test this please

@abbccdda
Copy link
Copy Markdown
Author

Discussed offline with @guozhangwang , the fix 1 was not correct and the true issue was due to the state transition. We call registerStateStores before transiting from CREATED to RUNNING, and if we throw corrupted exceptions there, the task shall not go over the closeStateManager call during close() in the current trunk logic. The proper fix is to trigger closeStateManager for CREATED state as well.

@guozhangwang
Copy link
Copy Markdown
Contributor

test this please

1 similar comment
@vvcephei
Copy link
Copy Markdown
Contributor

test this please

public void closeClean(final Map<TopicPartition, Long> checkpoint) {
Objects.requireNonNull(checkpoint);
close(true, checkpoint);
close(true);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR: we can clean up the task-manager code to not pass in the checkpoint at all.

this.time = time;
this.recordCollector = recordCollector;
eosDisabled = !StreamsConfig.EXACTLY_ONCE.equals(config.getString(StreamsConfig.PROCESSING_GUARANTEE_CONFIG));
eosEnabled = StreamsConfig.EXACTLY_ONCE.equals(config.getString(StreamsConfig.PROCESSING_GUARANTEE_CONFIG));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part will have some conflicts with @mjsax 's PR, just a note.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, one of us probably needs to rebase

waitForCondition(() -> streamInstanceTwo.state().equals(KafkaStreams.State.RUNNING),
"Stream instance one should be up and running by now");

streamInstanceOne.close(Duration.ofSeconds(30));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my own education: before the fix, this integration test will fail when instance-2 is started?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, actually either instance-1 or instance-2 would fail, depending on which box gets standby assignment. There would be a IllegalState + NPE exception sequence happening.

@guozhangwang guozhangwang merged commit c249ea8 into apache:trunk Mar 20, 2020
@abbccdda abbccdda deleted the KAFKA-9727 branch March 20, 2020 22:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants