Skip to content

KAFKA-7672: Restoring tasks need to be closed upon task suspension#6113

Merged
guozhangwang merged 7 commits intoapache:trunkfrom
guozhangwang:K7672-task-resumption
Feb 22, 2019
Merged

KAFKA-7672: Restoring tasks need to be closed upon task suspension#6113
guozhangwang merged 7 commits intoapache:trunkfrom
guozhangwang:K7672-task-resumption

Conversation

@guozhangwang
Copy link
Copy Markdown
Contributor

@guozhangwang guozhangwang commented Jan 10, 2019

  1. In activeTasks.suspend, we should also close all restoring tasks as well. Closing restoring tasks would not require task.close as in closeNonRunningTasks , since the topology is not initialized yet, instead only state stores are initialized. So we only need to call task.closeStateManager.

  2. Also add @linyli001 's fix.

  3. Unit tests updated accordingly.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@guozhangwang guozhangwang changed the title [WIP] Restoring tasks need to be closed KAFKA-7672: Restoring tasks need to be closed Jan 10, 2019
@guozhangwang guozhangwang changed the title KAFKA-7672: Restoring tasks need to be closed KAFKA-7672: Restoring tasks need to be closed upon task suspension Jan 10, 2019
needsRestoring.clear();
endOffsets.clear();
needsInitializing.clear();
completedRestorers.clear();
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is @linyli001 's fix.

Copy link
Copy Markdown

@suiyuan2009 suiyuan2009 Feb 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added some hack code to listeners and transformer init func before next version release..

@guozhangwang
Copy link
Copy Markdown
Contributor Author

return restoringByPartition.get(partition);
}

@Override
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following three functions do not have logical changes, just re-grouping all overridden functions on top of AssignedTasks here.

throw new IllegalStateException(logPrefix + "consumer has not been initialized while adding stream tasks. This should not happen.");
}

changelogReader.reset();
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not reset changelogReader since we still need its restored offset when closing the restoring tasks.

@guozhangwang guozhangwang requested a review from mjsax January 11, 2019 00:02
@mjsax mjsax added the streams label Jan 14, 2019
// close all restoring tasks as well and then reset changelog reader;
// for those restoring and still assigned tasks, they will be re-created
// in addStreamTasks.
firstException.compareAndSet(null, active.closeAllRestoringTasks());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we close those tasks instead of just suspending them? Could we close them only if they are not re-assigned?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The life time of a task is:

created -> [initializeStateStores] -> restoring (writes to the initialized state stores) -> [initializeTopology] -> running -> [closeTopology] -> suspended -> [closeStateManager] -> dead

I.e. the restoring tasks do not have topology initialized at all, whereas suspend call is just trying to closeTopology.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, but why can't we keep restoring task open than, and hope they get reassigned so we can continue restoring them?

Copy link
Copy Markdown
Contributor Author

@guozhangwang guozhangwang Feb 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we can, the issue is that today we clear all the store-restorers hence there's an issue.

We can, of course do some optimizations like do not close restoring tasks, and also do not clear their corresponding restorers as well, but this is out of the scope of this PR and I want to address it separately.

cc @vvcephei

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@guozhangwang Can you create a Jira to track this cleanup?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@guozhangwang
Copy link
Copy Markdown
Contributor Author

guozhangwang commented Jan 18, 2019

retest this please

final Iterator<StreamTask> restoringTaskIterator = restoring.values().iterator();
while (restoringTaskIterator.hasNext()) {
final StreamTask task = restoringTaskIterator.next();
log.debug("Closing restoring and not re-assigned task {}", task.id());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for my denseness... Why are these "not re-assigned"? They're part of a data structure called "assigned tasks", which seems to imply that they are assigned.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Terminology storm coming :)

The assigned tasks contains the following non-overlapping sets (see my other comment as well for details): created, restoring, running, suspended. Normally created should be empty since once a task is created it should move on transit to either restoring or running immediately. This function is called below for suspendTasksAndState. In which case we should:

  1. created and suspended: no action, since it should be empty.
  2. running: transit to suspended.
  3. restoring: close immediately.

The comment itself is indeed misleading, I will change to Closing restoring task

Copy link
Copy Markdown
Contributor

@vvcephei vvcephei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, @guozhangwang

I did ask one question about a comment, and I share @mjsax 's question about why we need to fully close stores that may be reassigned.

@guozhangwang
Copy link
Copy Markdown
Contributor Author

Failed tests are not relevant. Local run passed.

Copy link
Copy Markdown
Member

@mjsax mjsax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@guozhangwang
Copy link
Copy Markdown
Contributor Author

retest this please

@guozhangwang guozhangwang merged commit 0d461e4 into apache:trunk Feb 22, 2019
guozhangwang added a commit that referenced this pull request Feb 22, 2019
…6113)

* In activeTasks.suspend, we should also close all restoring tasks as well. Closing restoring tasks would not require `task.close` as in `closeNonRunningTasks `, since the topology is not initialized yet, instead only state stores are initialized. So we only need to call `task.closeStateManager`.
* Also add @linyli001 's fix.
* Unit tests updated accordingly.

Reviewers: Matthias J. Sax <mjsax@apache.org>,  John Roesler <john@confluent.io>
@guozhangwang
Copy link
Copy Markdown
Contributor Author

Cherry-picked to 2.2 as well.

jarekr pushed a commit to confluentinc/kafka that referenced this pull request Apr 18, 2019
* AK/trunk: (36 commits)
  KAFKA-7962: Avoid NPE for StickyAssignor (apache#6308)
  Address flakiness of CustomQuotaCallbackTest#testCustomQuotaCallback (apache#6330)
  KAFKA-7918: Inline generic parameters Pt. II: RocksDB Bytes Store and Memory LRU Caches (apache#6327)
  MINOR: fix parameter naming (apache#6316)
  KAFKA-7956 In ShutdownableThread, immediately complete the shutdown if the thread has not been started (apache#6218)
  MINOR: Refactor replica log dir fetching for improved logging (apache#6313)
  [TRIVIAL] Remove unused StreamsGraphNode#repartitionRequired (apache#6227)
  MINOR: Increase produce timeout to 120 seconds (apache#6326)
  KAFKA-7918: Inline generic parameters Pt. I: in-memory key-value store (apache#6293)
  MINOR: Fix line break issue in upgrade notes (apache#6320)
  KAFKA-7972: Use automatic RPC generation in SaslHandshake
  MINOR: Enable capture of full stack trace in StreamTask#process (apache#6310)
  KAFKA-7938: Fix test flakiness in DeleteConsumerGroupsTest (apache#6312)
  KAFKA-7937: Fix Flaky Test ResetConsumerGroupOffsetTest.testResetOffsetsNotExistingGroup (apache#6311)
  MINOR: Update docs to say 2.2 (apache#6315)
  KAFKA-7672 : force write checkpoint during StreamTask #suspend (apache#6115)
  KAFKA-7961; Ignore assignment for un-subscribed partitions (apache#6304)
  KAFKA-7672: Restoring tasks need to be closed upon task suspension (apache#6113)
  KAFKA-7864; validate partitions are 0-based (apache#6246)
  KAFKA-7492 : Updated javadocs for aggregate and reduce methods returning null behavior. (apache#6285)
  ...
pengxiaolong pushed a commit to pengxiaolong/kafka that referenced this pull request Jun 14, 2019
…pache#6113)

* In activeTasks.suspend, we should also close all restoring tasks as well. Closing restoring tasks would not require `task.close` as in `closeNonRunningTasks `, since the topology is not initialized yet, instead only state stores are initialized. So we only need to call `task.closeStateManager`.
* Also add @linyli001 's fix.
* Unit tests updated accordingly.

Reviewers: Matthias J. Sax <mjsax@apache.org>,  John Roesler <john@confluent.io>
@guozhangwang guozhangwang deleted the K7672-task-resumption branch April 24, 2020 23:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants