
Fix race in TaskQueue.notifyStatus. #12901

Merged
vogievetsky merged 3 commits into apache:master from gianm:fix-task-queue-sneaky-relaunch on Aug 15, 2022

Conversation

@gianm (Contributor) commented Aug 14, 2022

It was possible for manageInternal to relaunch a task while it was
being cleaned up, due to a race that happens when notifyStatus is
called to clean up a successful task:

  1. In a critical section, notifyStatus removes the task from tasks.
  2. Outside a critical section, notifyStatus calls taskRunner.shutdown
    to let the task runner know it can clear out its data structures.
  3. In a critical section, syncFromStorage adds the task back to tasks,
    because it is still present in metadata storage.
  4. In a critical section, manageInternalCritical notices that the task
    is in tasks and is not running in the taskRunner, so it launches
    it again.
  5. In a (different) critical section, notifyStatus updates the metadata
    store to set the task status to SUCCESS.
  6. The task continues running even though it should not be.

The possibility for this race was introduced in #12099, which shrunk
the critical section in notifyStatus. Prior to that patch, a single
critical section encompassed (1), (2), and (5), so the ordering above
was not possible.
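The interleaving is easier to see in code. The following is a minimal Java sketch of the pre-patch flow, written for this description: the lock, fields, and helpers (giant, tasks, shutdownRunner, writeStatusToMetadataStore) are simplified stand-ins, not the real TaskQueue members or signatures.

  import java.util.HashMap;
  import java.util.Map;
  import java.util.concurrent.locks.ReentrantLock;

  // Minimal sketch of the racy pre-patch flow; all names are simplified
  // stand-ins for the real TaskQueue fields and methods.
  class RacyTaskQueueSketch
  {
    private final ReentrantLock giant = new ReentrantLock(true);
    private final Map<String, String> tasks = new HashMap<>(); // task ID -> payload

    void notifyStatus(String taskId, String finalStatus)
    {
      giant.lock();
      try {
        tasks.remove(taskId); // (1) task leaves the in-memory view
      }
      finally {
        giant.unlock();
      }

      // (2) runs outside the lock. In this window the storage-sync thread can
      // re-add the task ((3)) and the management thread can relaunch it ((4)),
      // because metadata storage still lists the task as active.
      shutdownRunner(taskId);

      giant.lock();
      try {
        writeStatusToMetadataStore(taskId, finalStatus); // (5) persisted too late to prevent (3)/(4)
      }
      finally {
        giant.unlock();
      }
    }

    private void shutdownRunner(String taskId) { /* let the task runner clean up */ }
    private void writeStatusToMetadataStore(String taskId, String status) { /* persist final status */ }
  }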

This patch does the following:

  1. Fixes the race by adding a recentlyCompletedTasks set that prevents
    the main management loop from doing anything with tasks that are
    currently being cleaned up (see the sketch after this list).
  2. Switches the order of the critical sections in notifyStatus, so
    metadata store updates happen first. This is useful in case of
    server failures: it ensures that if the Overlord fails in the midst
    of notifyStatus, then completed-task statuses are still available in
    ZK or on MMs for the next Overlord. (Those are cleaned up by
    taskRunner.shutdown, which formerly ran first.) This isn't related
    to the race described above, but is fixed opportunistically as part
    of the same patch.
  3. Changes the tasks list to a map. Many operations require retrieval
    or removal of individual tasks; those are now O(1) instead of O(N)
    in the number of running tasks.
  4. Changes various log messages to use task ID instead of full task
    payload, to make the logs more readable.
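
Extending the sketch above, the following shows roughly how fixes (1), (2), and (3) fit together; it is an illustrative rewrite under the same simplified names, not the actual patch.

  import java.util.HashMap;
  import java.util.HashSet;
  import java.util.Map;
  import java.util.Set;
  import java.util.concurrent.locks.ReentrantLock;

  // Illustrative rewrite of the earlier sketch with the patch applied;
  // not the actual TaskQueue code.
  class PatchedTaskQueueSketch
  {
    private final ReentrantLock giant = new ReentrantLock(true);
    private final Map<String, String> tasks = new HashMap<>();          // fix (3): map keyed by task ID
    private final Set<String> recentlyCompletedTasks = new HashSet<>(); // fix (1): cleanup guard

    void notifyStatus(String taskId, String finalStatus)
    {
      giant.lock();
      try {
        recentlyCompletedTasks.add(taskId);              // management loop must now skip this task
        writeStatusToMetadataStore(taskId, finalStatus); // fix (2): metadata update happens first
      }
      finally {
        giant.unlock();
      }

      shutdownRunner(taskId); // still outside the lock, but the guard prevents a relaunch

      giant.lock();
      try {
        tasks.remove(taskId);
        recentlyCompletedTasks.remove(taskId); // cleanup finished; the task can be forgotten
      }
      finally {
        giant.unlock();
      }
    }

    // The management loop and syncFromStorage consult this before acting on a task.
    boolean isBeingCleanedUp(String taskId)
    {
      return recentlyCompletedTasks.contains(taskId);
    }

    private void shutdownRunner(String taskId) { /* let the task runner clean up */ }
    private void writeStatusToMetadataStore(String taskId, String status) { /* persist final status */ }
  }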

@gianm (Contributor, Author) commented Aug 14, 2022

Log messages look like this when the race happens:

  2022-08-13T08:17:02,342 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.TaskQueue - Asking taskRunner to run: <task-id>
  2022-08-13T08:17:02,342 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Adding pending task[<task-id>].
(1) and (2) 👇
  2022-08-13T08:17:28,365 INFO [HttpRemoteTaskRunner-worker-sync-3] org.apache.druid.indexing.overlord.TaskQueue - Received SUCCESS status for task: <task-id>
(3) 👇
  2022-08-13T08:17:28,476 INFO [TaskQueue-StorageSync] org.apache.druid.indexing.overlord.TaskLockbox - Adding task[<task-id>] to activeTasks
(4) 👇
  2022-08-13T08:17:28,476 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.TaskQueue - Asking taskRunner to run: <task-id>
  2022-08-13T08:17:28,476 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Adding pending task[<task-id>].
  2022-08-13T08:17:28,482 INFO [HttpRemoteTaskRunner-worker-sync-3] org.apache.druid.indexing.overlord.TaskQueue - Task done: <task-to-string>
(5) 👇
  2022-08-13T08:17:28,482 INFO [HttpRemoteTaskRunner-worker-sync-3] org.apache.druid.indexing.overlord.TaskQueue - Task SUCCESS: <task-to-string> (26012 run duration)
  2022-08-13T08:17:37,586 INFO [HttpRemoteTaskRunner-worker-sync-4] org.apache.druid.indexing.overlord.TaskQueue - Received FAILED status for task: <task-id>
  2022-08-13T08:17:37,593 ERROR [HttpRemoteTaskRunner-worker-sync-4] org.apache.druid.indexing.overlord.TaskQueue - Ignoring notification for already-complete task: {class=org.apache.druid.indexing.overlord.TaskQueue, task=<task-id>}
  2022-08-13T08:17:37,594 INFO [HttpRemoteTaskRunner-worker-sync-4] org.apache.druid.indexing.overlord.TaskQueue - Task FAILED: <task-to-string> (9100 run duration)

@gianm (Contributor, Author) commented Aug 14, 2022

FYI @jasonk000

I have run the patch in this PR locally through some smoke tests, but haven't run it in production yet. If you have a chance to, then please let us know how it goes.

@lgtm-com (Bot) commented Aug 14, 2022

This pull request introduces 1 alert when merging 2274c28 into 4d65c08 - view on LGTM.com

new alerts:

  • 1 for Missing format argument

@gianm (Contributor, Author) commented Aug 14, 2022

> new alerts:
>
> * 1 for Missing format argument

Fixed in the latest commit.
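
For context, this class of alert flags a format string whose placeholders do not match the supplied arguments. An illustrative example in the style of Druid's String.format-like logger (not the actual flagged line):

  // Illustrative only; not the line LGTM flagged in this PR.
  log.warn("Unknown task completed: %s");               // alert: %s has no matching argument
  log.warn("Unknown task completed: %s", task.getId()); // fixed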

@rohangarg (Member) left a comment

Great find and explanation with logs! 👍
LGTM mostly, left a couple of small comments.

private static final long MANAGEMENT_WAIT_TIMEOUT_NANOS = TimeUnit.SECONDS.toNanos(60);
private static final long MIN_WAIT_TIME_MS = 100;

// Task ID -> Task that is currently running
@rohangarg (Member) commented Aug 14, 2022

I think this map can contain tasks in a pending state too. Also, it could be out of sync with task storage, as per the documentation (upon manual changes in task storage). Maybe this could be called an in-memory view of known tasks?

@gianm (Contributor, Author) replied

I changed the comment to hopefully be clearer. It now reads:

// Task ID -> Task, for tasks that are active in some way (submitted, running, or finished and to-be-cleaned-up).
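
For illustration, the field could then look something like this; the declaration is hypothetical, and the actual one in TaskQueue.java may differ (e.g. in map type or annotations):

  // Task ID -> Task, for tasks that are active in some way
  // (submitted, running, or finished and to-be-cleaned-up).
  // A map keyed by task ID gives O(1) lookup and removal, and a
  // LinkedHashMap would preserve the old list's insertion order.
  private final Map<String, Task> tasks = new LinkedHashMap<>();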


if (removed == 0) {
  log.warn("Unknown task completed: %s", task.getId());
// Save status to metadata store first, so if we crash while doing the rest of the shutdown, our successor
@rohangarg (Member) commented

A doubt: does this order change also solve the race condition problem by itself? As per my current understanding, re-launching can happen only if syncFromStorage thinks there is an active task in metadata storage that the in-memory view doesn't know about. That leads to creation of an in-memory task, which the management thread then launches.

With this change, the syncFromStorage can have three views :

  1. tasks map contains the task and task-storage has it as active task
  2. tasks map contains the task and task-storage doesn't have it as active task in which case both syncFromStorage and notifyStatus will clean it up
  3. tasks map doesn't contain the task and task-storage doesn't have it too

I think all 3 cases should be OK, but I may be missing something here.

@gianm (Contributor, Author) replied

The order switch solves the original race but creates a new one, which you have listed as case (2). It would create a situation where two threads are trying to clean up the same task at the same time. This may be fine, but I don't think it's prudent to rely on it being fine. Cleaner to ensure that only one thread tries to clean up the task.
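
Concretely, the guard gives syncFromStorage a way to stay out of an in-flight cleanup. A sketch, reusing the hypothetical names from the sketches above (this method would live in PatchedTaskQueueSketch and needs java.util.ArrayList):

  // Hypothetical illustration of how the guard resolves case (2): the sync
  // loop skips any task whose cleanup notifyStatus currently owns.
  void syncFromStorage(Set<String> activeTaskIdsInStorage)
  {
    giant.lock();
    try {
      for (String taskId : new ArrayList<>(tasks.keySet())) {
        if (recentlyCompletedTasks.contains(taskId)) {
          continue; // notifyStatus is cleaning this task up; leave it alone
        }
        if (!activeTaskIdsInStorage.contains(taskId)) {
          tasks.remove(taskId); // safe: exactly one thread removes it
        }
      }
    }
    finally {
      giant.unlock();
    }
  }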

@vogievetsky vogievetsky merged commit 28836df into apache:master Aug 15, 2022
@gianm gianm deleted the fix-task-queue-sneaky-relaunch branch August 15, 2022 15:44
@abhishekagarwal87 abhishekagarwal87 added this to the 24.0.0 milestone Aug 26, 2022
@jasonk000 (Contributor) commented

> I have run the patch in this PR locally through some smoke tests, but haven't run it in production yet. If you have a chance to, then please let us know how it goes.

Thanks @gianm, great catch + fix. We are running 0.20.x still on most of our stack, so are not in a good spot to test it out directly. Hoping we can get to something newer soon 🤞 .
