
Task queue unblock #12099

Merged
gianm merged 8 commits into apache:master from jasonk000:task-queue-unblock
May 14, 2022

Conversation

jasonk000 (Contributor) commented Dec 27, 2021

Description

Improves the stability of the Overlord, and of all tasks in a cluster, when task counts are large (1000+), by reducing contention between the management thread and the handling of status updates arriving from the cluster.

Introduce GuardedBy to TaskQueue

...and fix some spots that were previously missed.
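
For readers unfamiliar with the annotation, here is a minimal sketch of the @GuardedBy pattern (this uses the Error Prone annotation; the class and fields are illustrative, not the actual TaskQueue members):

```java
import com.google.errorprone.annotations.concurrent.GuardedBy;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

class GuardedBySketch
{
  private final ReentrantLock giant = new ReentrantLock(true);

  // Static analysis can now flag any access to this field made
  // without holding "giant".
  @GuardedBy("giant")
  private final List<String> tasks = new ArrayList<>();

  void add(String taskId)
  {
    giant.lock();
    try {
      tasks.add(taskId); // OK: lock is held
    }
    finally {
      giant.unlock();
    }
  }
}
```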

Introduce TaskQueueScaleTest

This tests the scalability of starting and stopping 1000 tasks (with a 60-second timeout). The test currently fails, and is fixed by the next commit.

Reduce TaskQueue contention

Reduce how long the giant lock is held, which improves responsiveness:

  • Break the TaskQueue manage loop apart into a critical (locked) section and a section that can run concurrently with notifications (i.e., sending any necessary shutdown requests); see the sketch after this list.
    • This is the most important part of the change: in the existing code, the blocking shutdown requests are performed inside the locked loop. Moving the blocking calls outside the lock makes it possible for status notifications to be processed promptly.
  • Minimise how long notifyStatus calls hold the giant lock.
  • Move logging and other work outside the critical section where possible.
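
For illustration, here is a minimal sketch of the resulting loop structure, assuming a lock named giant and a wakeup queue; findTasksNeedingShutdown and sendShutdownRequest are hypothetical stand-ins for the real TaskQueue internals, not the actual Druid code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class ManageLoopSketch
{
  private final ReentrantLock giant = new ReentrantLock(true);
  private final BlockingQueue<Object> wakeups = new LinkedBlockingQueue<>(1);
  private volatile boolean active = true;

  void manage() throws InterruptedException
  {
    while (active) {
      final List<String> tasksToShutDown = new ArrayList<>();

      giant.lock();
      try {
        // Critical section: only *decide* which tasks need a shutdown
        // request; make no blocking calls while holding the lock.
        tasksToShutDown.addAll(findTasksNeedingShutdown());
      }
      finally {
        giant.unlock();
      }

      // Outside the lock: issue the potentially slow, blocking shutdown
      // requests. Status notifications can now acquire the lock promptly.
      for (String taskId : tasksToShutDown) {
        sendShutdownRequest(taskId);
      }

      // Sleep until a wakeup arrives, or a timeout elapses, then loop.
      wakeups.poll(60, TimeUnit.SECONDS);
    }
  }

  private List<String> findTasksNeedingShutdown() { return new ArrayList<>(); }

  private void sendShutdownRequest(String taskId) { /* blocking call to the task runner */ }
}
```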

Design decisions

I chose a BlockingQueue implementation because the submission / poll / offer ordering is easy to reason about. Other options, such as a Semaphore, would also work.
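
As a rough illustration of that reasoning, a capacity-1 queue makes a natural wakeup channel: offer() never blocks, and redundant signals coalesce. The class and field names below are hypothetical, not the actual ones:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

class WakeupChannelSketch
{
  private final BlockingQueue<Object> managementMayBeNecessary = new ArrayBlockingQueue<>(1);

  // Called by notifyStatus and friends: request a management pass.
  // If a signal is already pending, offer() simply returns false and
  // the extra signal is dropped, which is the behaviour we want.
  void requestManagement()
  {
    managementMayBeNecessary.offer(new Object());
  }

  // Called by the manage loop between passes: wait for a signal, but
  // time out so the loop still runs periodically regardless.
  void awaitManagementRequest() throws InterruptedException
  {
    managementMayBeNecessary.poll(60, TimeUnit.SECONDS);
  }
}
```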

There is potential future work:

  • Under the current behaviour, a slow task shutdown() call still delays submission of tasks across the whole loop; this PR does not improve that.
  • It could be mitigated by introducing an Executor, or by requiring that shutdown() implementations be non-blocking; one possible shape of this is sketched after this list.
  • If recommended, I'd suggest we do this as a separate PR.
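
Purely as a sketch of that possible follow-up (the executor and method names are invented for illustration, and this is not part of this PR):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ShutdownOffloadSketch
{
  // A dedicated pool bounds how much one slow shutdown() can hurt:
  // the manage loop only pays the cost of a submit() call.
  private final ExecutorService shutdownExec = Executors.newFixedThreadPool(4);

  void shutdownAsync(String taskId)
  {
    shutdownExec.submit(() -> sendShutdownRequest(taskId));
  }

  private void sendShutdownRequest(String taskId) { /* potentially slow, blocking call */ }
}
```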

This follows the mailing list discussions here:
https://lists.apache.org/thread/9jgdwrodwsfcg98so6kzfhdmn95gzyrj

h/t @gianm for the rebase + test case.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster (as part of another block of changes).

perf: Introduce TaskQueueScaleTest to test performance of TaskQueue with large task counts

This introduces a test case to confirm how long it will take to launch and manage (i.e., shut down)
a large number of tasks in the TaskQueue.

h/t to @gianm for main implementation.
lgtm-com (Bot) commented Dec 27, 2021

This pull request fixes 1 alert when merging ef94f4f into 476d0bf - view on LGTM.com

fixed alerts:

  • 1 for Useless comparison test

lgtm-com (Bot) commented Jan 6, 2022

This pull request fixes 1 alert when merging 22f633b into 6846622 - view on LGTM.com

fixed alerts:

  • 1 for Useless comparison test

(Outdated comment thread on indexing-service/src/main/java/org/apache/druid/indexing/overlord/TaskQueue.java)
lgtm-com (Bot) commented Jan 6, 2022

This pull request fixes 1 alert when merging 1ee151d into c28b283 - view on LGTM.com

fixed alerts:

  • 1 for Useless comparison test

gianm (Contributor) left a comment

@jasonk000, apologies for the delay in review. This patch looks good to me; I just had one question about the size of the queue.

There's also a conflict that arose with master. Merging master and then applying this patch will fix it: gianm@8ec5418

jasonk000 (Contributor, Author) commented May 1, 2022

@gianm thanks, I've merged + pulled this in and the test passes:

[INFO] Running org.apache.druid.indexing.overlord.TaskQueueScaleTest
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 14.864 s - in org.apache.druid.indexing.overlord.TaskQueueScaleTest

The build overall has failed, but I think it is unrelated; could you share your thoughts? The failure is in the processing project, which is upstream of this change, and this change only modifies indexing-service.

[ERROR] org.apache.druid.query.groupby.epinephelinae.BufferHashGrouperTest.testGrowingOverflowingInteger
[ERROR]   Run 1: BufferHashGrouperTest.testGrowingOverflowingInteger:132->makeGrouper:178 » OutOfMemory Direct buffer memory
[ERROR]   Run 2: BufferHashGrouperTest.testGrowingOverflowingInteger:132->makeGrouper:178 » OutOfMemory
[ERROR]   Run 3: BufferHashGrouperTest.testGrowingOverflowingInteger:132->makeGrouper:178 » OutOfMemory
[ERROR]   Run 4: BufferHashGrouperTest.testGrowingOverflowingInteger:132->makeGrouper:178 » OutOfMemory Direct buffer memory

lgtm-com (Bot) commented May 1, 2022

This pull request fixes 1 alert when merging 17808ff into dd8781f - view on LGTM.com

fixed alerts:

  • 1 for Useless comparison test

gianm (Contributor) commented May 14, 2022

> The build overall has failed, but I think it is unrelated; could you share your thoughts? The failure is in the processing project, which is upstream of this change, and this change only modifies indexing-service.

Yeah, that was unrelated. The patch looks good to me now, & I'm about to merge it. Thank you for the contribution!

gianm merged commit bb1a6de into apache:master May 14, 2022
gianm added a commit to gianm/druid that referenced this pull request Aug 14, 2022
It was possible for manageInternal to relaunch a task while it was
being cleaned up, due to a race that happens when notifyStatus is
called to clean up a successful task:

1) In a critical section, notifyStatus removes the task from "tasks".
2) Outside a critical section, notifyStatus calls taskRunner.shutdown
   to let the task runner know it can clear out its data structures.
3) In a critical section, syncFromStorage adds the task back to "tasks",
   because it is still present in metadata storage.
4) In a critical section, manageInternalCritical notices that the task
   is in "tasks" and is not running in the taskRunner, so it launches
   it again.
5) In a (different) critical section, notifyStatus updates the metadata
   store to set the task status to SUCCESS.
6) The task continues running even though it should not be.

The possibility for this race was introduced in apache#12099, which shrunk
the critical section in notifyStatus. Prior to that patch, a single
critical section encompassed (1), (2), and (5), so the ordering above
was not possible.

This patch does the following:

1) Fixes the race by adding a recentlyCompletedTasks set that prevents
   the main management loop from doing anything with tasks that are
   currently being cleaned up.
2) Switches the order of the critical sections in notifyStatus, so
   metadata store updates happen first. This is useful in case of
   server failures: it ensures that if the Overlord fails in the midst
   of notifyStatus, then completed-task statuses are still available in
   ZK or on MMs for the next Overlord. (Those are cleaned up by
   taskRunner.shutdown, which formerly ran first.) This isn't related
   to the race described above, but is fixed opportunistically as part
   of the same patch.
3) Changes the "tasks" list to a map. Many operations require retrieval
   or removal of individual tasks; those are now O(1) instead of O(N)
   in the number of running tasks.
4) Changes various log messages to use task ID instead of full task
   payload, to make the logs more readable.
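
A minimal sketch of the guard described in (1) and the reordering in (2), with simplified, hypothetical types rather than the actual Druid code:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

class NotifyStatusSketch
{
  private final ReentrantLock giant = new ReentrantLock(true);

  // Per (3): a map, so individual lookups and removals are O(1).
  private final Map<String, Object> tasks = new HashMap<>();

  // Per (1): tasks in this set are mid-cleanup; the manage loop and
  // syncFromStorage must leave them alone.
  private final Set<String> recentlyCompletedTasks = new HashSet<>();

  void notifyStatus(String taskId)
  {
    giant.lock();
    try {
      recentlyCompletedTasks.add(taskId);
    }
    finally {
      giant.unlock();
    }

    // Per (2): persist the completed status first, then tell the task
    // runner; both happen outside the lock.
    updateMetadataStore(taskId);
    taskRunnerShutdown(taskId);

    giant.lock();
    try {
      tasks.remove(taskId);
      recentlyCompletedTasks.remove(taskId);
    }
    finally {
      giant.unlock();
    }
  }

  private void updateMetadataStore(String taskId) { /* e.g. mark SUCCESS */ }

  private void taskRunnerShutdown(String taskId) { /* blocking cleanup */ }
}
```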
vogievetsky pushed a commit that referenced this pull request Aug 15, 2022
* Fix race in TaskQueue.notifyStatus.

* Fix format string.

* Update comment.
abhishekagarwal87 added this to the 24.0.0 milestone Aug 26, 2022
anishanagarajan pushed a commit to twitter-forks/druid that referenced this pull request Sep 23, 2022
* concurrency: introduce GuardedBy to TaskQueue

* perf: Introduce TaskQueueScaleTest to test performance of TaskQueue with large task counts

This introduces a test case to confirm how long it will take to launch and manage (i.e., shut down)
a large number of tasks in the TaskQueue.

h/t to @gianm for main implementation.

* perf: improve scalability of TaskQueue with large task counts

* linter fixes, expand test coverage

* pr feedback suggestion; swap to different linter

* swap to use SuppressWarnings

* Fix TaskQueueScaleTest.

Co-authored-by: Gian Merlino <gian@imply.io>