
Task queue unblock #12099

Merged
gianm merged 8 commits into apache:master from jasonk000:task-queue-unblock
May 14, 2022

Conversation

jasonk000 (Contributor) commented Dec 27, 2021

Description

Improves the stability of the Overlord, and of all tasks in a cluster, when task counts are large (1000+), by reducing contention between the management thread and the handling of status updates arriving from the cluster.

Introduce GuardedBy to TaskQueue

...and fix some spots that were previously missed.
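
For readers unfamiliar with the annotation, here is a minimal sketch of the @GuardedBy pattern (this uses the Error Prone annotation; the class and fields are illustrative, not the actual TaskQueue members):

```java
import com.google.errorprone.annotations.concurrent.GuardedBy;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

class GuardedBySketch
{
  private final ReentrantLock giant = new ReentrantLock(true);

  // Static analysis can now flag any access to this field made
  // without holding "giant".
  @GuardedBy("giant")
  private final List<String> tasks = new ArrayList<>();

  void add(String taskId)
  {
    giant.lock();
    try {
      tasks.add(taskId); // OK: lock is held
    }
    finally {
      giant.unlock();
    }
  }
}
```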

Introduce TaskQueueScaleTest

This tests the scalability of starting and stopping 1000 tasks (with a 60-second timeout). The test currently fails, and is fixed by the next commit.

Reduce TaskQueue contention

Reduce how long the giant lock is held, which improves responsiveness:

  • Break the TaskQueue manage loop apart into a critical (locked) section and a section that can run concurrently with notifications (i.e., sending any necessary shutdown requests); see the sketch after this list.
    • This is the most important part of the change: in the existing code, the blocking shutdown requests are performed inside the locked loop. Moving the blocking calls outside the lock makes it possible for status notifications to be processed promptly.
  • Minimise how long notifyStatus calls hold the giant lock.
  • Move logging and other work outside the critical section where possible.
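
For illustration, here is a minimal sketch of the resulting loop structure, assuming a lock named giant and a wakeup queue; findTasksNeedingShutdown and sendShutdownRequest are hypothetical stand-ins for the real TaskQueue internals, not the actual Druid code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class ManageLoopSketch
{
  private final ReentrantLock giant = new ReentrantLock(true);
  private final BlockingQueue<Object> wakeups = new LinkedBlockingQueue<>(1);
  private volatile boolean active = true;

  void manage() throws InterruptedException
  {
    while (active) {
      final List<String> tasksToShutDown = new ArrayList<>();

      giant.lock();
      try {
        // Critical section: only *decide* which tasks need a shutdown
        // request; make no blocking calls while holding the lock.
        tasksToShutDown.addAll(findTasksNeedingShutdown());
      }
      finally {
        giant.unlock();
      }

      // Outside the lock: issue the potentially slow, blocking shutdown
      // requests. Status notifications can now acquire the lock promptly.
      for (String taskId : tasksToShutDown) {
        sendShutdownRequest(taskId);
      }

      // Sleep until a wakeup arrives, or a timeout elapses, then loop.
      wakeups.poll(60, TimeUnit.SECONDS);
    }
  }

  private List<String> findTasksNeedingShutdown() { return new ArrayList<>(); }

  private void sendShutdownRequest(String taskId) { /* blocking call to the task runner */ }
}
```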

Design decisions

I chose a BlockingQueue implementation because the submission / poll / offer ordering is easy to reason about. Other options, such as a Semaphore, would also work.
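
As a rough illustration of that reasoning, a capacity-1 queue makes a natural wakeup channel: offer() never blocks, and redundant signals coalesce. The class and field names below are hypothetical, not the actual ones:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

class WakeupChannelSketch
{
  private final BlockingQueue<Object> managementMayBeNecessary = new ArrayBlockingQueue<>(1);

  // Called by notifyStatus and friends: request a management pass.
  // If a signal is already pending, offer() simply returns false and
  // the extra signal is dropped, which is the behaviour we want.
  void requestManagement()
  {
    managementMayBeNecessary.offer(new Object());
  }

  // Called by the manage loop between passes: wait for a signal, but
  // time out so the loop still runs periodically regardless.
  void awaitManagementRequest() throws InterruptedException
  {
    managementMayBeNecessary.poll(60, TimeUnit.SECONDS);
  }
}
```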

There is potential future work:

  • Under the current behaviour, a slow task shutdown() call still delays submission of tasks across the whole loop; this PR does not improve that.
  • It could be mitigated by introducing an Executor, or by requiring that shutdown() implementations be non-blocking; one possible shape of this is sketched after this list.
  • If recommended, I'd suggest we do this as a separate PR.
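
Purely as a sketch of that possible follow-up (the executor and method names are invented for illustration, and this is not part of this PR):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ShutdownOffloadSketch
{
  // A dedicated pool bounds how much one slow shutdown() can hurt:
  // the manage loop only pays the cost of a submit() call.
  private final ExecutorService shutdownExec = Executors.newFixedThreadPool(4);

  void shutdownAsync(String taskId)
  {
    shutdownExec.submit(() -> sendShutdownRequest(taskId));
  }

  private void sendShutdownRequest(String taskId) { /* potentially slow, blocking call */ }
}
```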

This follows the mailing list discussions here:
https://lists.apache.org/thread/9jgdwrodwsfcg98so6kzfhdmn95gzyrj

h/t @gianm for the rebase + test case.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster (as part of another block of changes).

perf: Introduce TaskQueueScaleTest to test performance of TaskQueue with large task counts

This introduces a test case to confirm how long it will take to launch and manage (i.e., shut down)
a large number of tasks in the TaskQueue.

h/t to @gianm for main implementation.
lgtm-com (Bot) commented Dec 27, 2021

This pull request fixes 1 alert when merging ef94f4f into 476d0bf - view on LGTM.com

fixed alerts:

  • 1 for Useless comparison test

lgtm-com (Bot) commented Jan 6, 2022

This pull request fixes 1 alert when merging 22f633b into 6846622 - view on LGTM.com

fixed alerts:

  • 1 for Useless comparison test

(Outdated comment thread on indexing-service/src/main/java/org/apache/druid/indexing/overlord/TaskQueue.java)
lgtm-com (Bot) commented Jan 6, 2022

This pull request fixes 1 alert when merging 1ee151d into c28b283 - view on LGTM.com

fixed alerts:

  • 1 for Useless comparison test

gianm (Contributor) left a comment

@jasonk000, apologies for the delay in review. This patch looks good to me; I just had one question about the size of the queue.

There's also a conflict that arose with master. Merging master and then applying this patch will fix it: gianm@8ec5418

jasonk000 (Contributor, Author) commented May 1, 2022

@gianm thanks, I've merged + pulled this in and the test passes:

[INFO] Running org.apache.druid.indexing.overlord.TaskQueueScaleTest
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 14.864 s - in org.apache.druid.indexing.overlord.TaskQueueScaleTest

The build overall has failed, but I think it is unrelated; could you share your thoughts? The failure is in the processing project, which is upstream of this change, and this change only modifies indexing-service.

[ERROR] org.apache.druid.query.groupby.epinephelinae.BufferHashGrouperTest.testGrowingOverflowingInteger
[ERROR]   Run 1: BufferHashGrouperTest.testGrowingOverflowingInteger:132->makeGrouper:178 » OutOfMemory Direct buffer memory
[ERROR]   Run 2: BufferHashGrouperTest.testGrowingOverflowingInteger:132->makeGrouper:178 » OutOfMemory
[ERROR]   Run 3: BufferHashGrouperTest.testGrowingOverflowingInteger:132->makeGrouper:178 » OutOfMemory
[ERROR]   Run 4: BufferHashGrouperTest.testGrowingOverflowingInteger:132->makeGrouper:178 » OutOfMemory Direct buffer memory

lgtm-com (Bot) commented May 1, 2022

This pull request fixes 1 alert when merging 17808ff into dd8781f - view on LGTM.com

fixed alerts:

  • 1 for Useless comparison test

gianm (Contributor) commented May 14, 2022

> The build overall has failed, but I think it is unrelated; could you share your thoughts? The failure is in the processing project, which is upstream of this change, and this change only modifies indexing-service.

Yeah, that was unrelated. The patch looks good to me now, & I'm about to merge it. Thank you for the contribution!

gianm merged commit bb1a6de into apache:master May 14, 2022
gianm added a commit to gianm/druid that referenced this pull request Aug 14, 2022
It was possible for manageInternal to relaunch a task while it was
being cleaned up, due to a race that happens when notifyStatus is
called to clean up a successful task:

1) In a critical section, notifyStatus removes the task from "tasks".
2) Outside a critical section, notifyStatus calls taskRunner.shutdown
   to let the task runner know it can clear out its data structures.
3) In a critical section, syncFromStorage adds the task back to "tasks",
   because it is still present in metadata storage.
4) In a critical section, manageInternalCritical notices that the task
   is in "tasks" and is not running in the taskRunner, so it launches
   it again.
5) In a (different) critical section, notifyStatus updates the metadata
   store to set the task status to SUCCESS.
6) The task continues running even though it should not be.

The possibility for this race was introduced in apache#12099, which shrunk
the critical section in notifyStatus. Prior to that patch, a single
critical section encompassed (1), (2), and (5), so the ordering above
was not possible.

This patch does the following:

1) Fixes the race by adding a recentlyCompletedTasks set that prevents
   the main management loop from doing anything with tasks that are
   currently being cleaned up.
2) Switches the order of the critical sections in notifyStatus, so
   metadata store updates happen first. This is useful in case of
   server failures: it ensures that if the Overlord fails in the midst
   of notifyStatus, then completed-task statuses are still available in
   ZK or on MMs for the next Overlord. (Those are cleaned up by
   taskRunner.shutdown, which formerly ran first.) This isn't related
   to the race described above, but is fixed opportunistically as part
   of the same patch.
3) Changes the "tasks" list to a map. Many operations require retrieval
   or removal of individual tasks; those are now O(1) instead of O(N)
   in the number of running tasks.
4) Changes various log messages to use task ID instead of full task
   payload, to make the logs more readable.
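
A minimal sketch of the guard described in (1) and the reordering in (2), with simplified, hypothetical types rather than the actual Druid code:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

class NotifyStatusSketch
{
  private final ReentrantLock giant = new ReentrantLock(true);

  // Per (3): a map, so individual lookups and removals are O(1).
  private final Map<String, Object> tasks = new HashMap<>();

  // Per (1): tasks in this set are mid-cleanup; the manage loop and
  // syncFromStorage must leave them alone.
  private final Set<String> recentlyCompletedTasks = new HashSet<>();

  void notifyStatus(String taskId)
  {
    giant.lock();
    try {
      recentlyCompletedTasks.add(taskId);
    }
    finally {
      giant.unlock();
    }

    // Per (2): persist the completed status first, then tell the task
    // runner; both happen outside the lock.
    updateMetadataStore(taskId);
    taskRunnerShutdown(taskId);

    giant.lock();
    try {
      tasks.remove(taskId);
      recentlyCompletedTasks.remove(taskId);
    }
    finally {
      giant.unlock();
    }
  }

  private void updateMetadataStore(String taskId) { /* e.g. mark SUCCESS */ }

  private void taskRunnerShutdown(String taskId) { /* blocking cleanup */ }
}
```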
vogievetsky pushed a commit that referenced this pull request Aug 15, 2022
* Fix race in TaskQueue.notifyStatus.

* Fix format string.

* Update comment.
abhishekagarwal87 added this to the 24.0.0 milestone Aug 26, 2022
anishanagarajan pushed a commit to twitter-forks/druid that referenced this pull request Sep 23, 2022
* concurrency: introduce GuardedBy to TaskQueue

* perf: Introduce TaskQueueScaleTest to test performance of TaskQueue with large task counts

This introduces a test case to confirm how long it will take to launch and manage (i.e., shut down)
a large number of tasks in the TaskQueue.

h/t to @gianm for main implementation.

* perf: improve scalability of TaskQueue with large task counts

* linter fixes, expand test coverage

* pr feedback suggestion; swap to different linter

* swap to use SuppressWarnings

* Fix TaskQueueScaleTest.

Co-authored-by: Gian Merlino <gian@imply.io>