
Fix race in TaskQueue.notifyStatus. #12901

Merged
vogievetsky merged 3 commits into apache:master from gianm:fix-task-queue-sneaky-relaunch on Aug 15, 2022

Conversation

@gianm (Contributor) commented Aug 14, 2022

It was possible for manageInternal to relaunch a task while it was
being cleaned up, due to a race that happens when notifyStatus is
called to clean up a successful task:

  1. In a critical section, notifyStatus removes the task from tasks.
  2. Outside a critical section, notifyStatus calls taskRunner.shutdown
    to let the task runner know it can clear out its data structures.
  3. In a critical section, syncFromStorage adds the task back to tasks,
    because it is still present in metadata storage.
  4. In a critical section, manageInternalCritical notices that the task
    is in tasks and is not running in the taskRunner, so it launches
    it again.
  5. In a (different) critical section, notifyStatus updates the metadata
    store to set the task status to SUCCESS.
  6. The task continues running even though it should not be.

The possibility for this race was introduced in #12099, which shrunk
the critical section in notifyStatus. Prior to that patch, a single
critical section encompassed (1), (2), and (5), so the ordering above
was not possible.
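The interleaving is easier to see in code. The following is a minimal Java sketch of the pre-patch flow, written for this description: the lock, fields, and helpers (giant, tasks, shutdownRunner, writeStatusToMetadataStore) are simplified stand-ins, not the real TaskQueue members or signatures.

  import java.util.HashMap;
  import java.util.Map;
  import java.util.concurrent.locks.ReentrantLock;

  // Minimal sketch of the racy pre-patch flow; all names are simplified
  // stand-ins for the real TaskQueue fields and methods.
  class RacyTaskQueueSketch
  {
    private final ReentrantLock giant = new ReentrantLock(true);
    private final Map<String, String> tasks = new HashMap<>(); // task ID -> payload

    void notifyStatus(String taskId, String finalStatus)
    {
      giant.lock();
      try {
        tasks.remove(taskId); // (1) task leaves the in-memory view
      }
      finally {
        giant.unlock();
      }

      // (2) runs outside the lock. In this window the storage-sync thread can
      // re-add the task ((3)) and the management thread can relaunch it ((4)),
      // because metadata storage still lists the task as active.
      shutdownRunner(taskId);

      giant.lock();
      try {
        writeStatusToMetadataStore(taskId, finalStatus); // (5) persisted too late to prevent (3)/(4)
      }
      finally {
        giant.unlock();
      }
    }

    private void shutdownRunner(String taskId) { /* let the task runner clean up */ }
    private void writeStatusToMetadataStore(String taskId, String status) { /* persist final status */ }
  }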

This patch does the following:

  1. Fixes the race by adding a recentlyCompletedTasks set that prevents
    the main management loop from doing anything with tasks that are
    currently being cleaned up (see the sketch after this list).
  2. Switches the order of the critical sections in notifyStatus, so
    metadata store updates happen first. This is useful in case of
    server failures: it ensures that if the Overlord fails in the midst
    of notifyStatus, then completed-task statuses are still available in
    ZK or on MMs for the next Overlord. (Those are cleaned up by
    taskRunner.shutdown, which formerly ran first.) This isn't related
    to the race described above, but is fixed opportunistically as part
    of the same patch.
  3. Changes the tasks list to a map. Many operations require retrieval
    or removal of individual tasks; those are now O(1) instead of O(N)
    in the number of running tasks.
  4. Changes various log messages to use task ID instead of full task
    payload, to make the logs more readable.
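
Extending the sketch above, the following shows roughly how fixes (1), (2), and (3) fit together; it is an illustrative rewrite under the same simplified names, not the actual patch.

  import java.util.HashMap;
  import java.util.HashSet;
  import java.util.Map;
  import java.util.Set;
  import java.util.concurrent.locks.ReentrantLock;

  // Illustrative rewrite of the earlier sketch with the patch applied;
  // not the actual TaskQueue code.
  class PatchedTaskQueueSketch
  {
    private final ReentrantLock giant = new ReentrantLock(true);
    private final Map<String, String> tasks = new HashMap<>();          // fix (3): map keyed by task ID
    private final Set<String> recentlyCompletedTasks = new HashSet<>(); // fix (1): cleanup guard

    void notifyStatus(String taskId, String finalStatus)
    {
      giant.lock();
      try {
        recentlyCompletedTasks.add(taskId);              // management loop must now skip this task
        writeStatusToMetadataStore(taskId, finalStatus); // fix (2): metadata update happens first
      }
      finally {
        giant.unlock();
      }

      shutdownRunner(taskId); // still outside the lock, but the guard prevents a relaunch

      giant.lock();
      try {
        tasks.remove(taskId);
        recentlyCompletedTasks.remove(taskId); // cleanup finished; the task can be forgotten
      }
      finally {
        giant.unlock();
      }
    }

    // The management loop and syncFromStorage consult this before acting on a task.
    boolean isBeingCleanedUp(String taskId)
    {
      return recentlyCompletedTasks.contains(taskId);
    }

    private void shutdownRunner(String taskId) { /* let the task runner clean up */ }
    private void writeStatusToMetadataStore(String taskId, String status) { /* persist final status */ }
  }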

@gianm (Contributor, Author) commented Aug 14, 2022

Log messages look like this when the race happens:

  2022-08-13T08:17:02,342 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.TaskQueue - Asking taskRunner to run: <task-id>
  2022-08-13T08:17:02,342 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Adding pending task[<task-id>].
(1) and (2) 👇
  2022-08-13T08:17:28,365 INFO [HttpRemoteTaskRunner-worker-sync-3] org.apache.druid.indexing.overlord.TaskQueue - Received SUCCESS status for task: <task-id>
(3) 👇
  2022-08-13T08:17:28,476 INFO [TaskQueue-StorageSync] org.apache.druid.indexing.overlord.TaskLockbox - Adding task[<task-id>] to activeTasks
(4) 👇
  2022-08-13T08:17:28,476 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.TaskQueue - Asking taskRunner to run: <task-id>
  2022-08-13T08:17:28,476 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Adding pending task[<task-id>].
  2022-08-13T08:17:28,482 INFO [HttpRemoteTaskRunner-worker-sync-3] org.apache.druid.indexing.overlord.TaskQueue - Task done: <task-to-string>
(5) 👇
  2022-08-13T08:17:28,482 INFO [HttpRemoteTaskRunner-worker-sync-3] org.apache.druid.indexing.overlord.TaskQueue - Task SUCCESS: <task-to-string> (26012 run duration)
  2022-08-13T08:17:37,586 INFO [HttpRemoteTaskRunner-worker-sync-4] org.apache.druid.indexing.overlord.TaskQueue - Received FAILED status for task: <task-id>
  2022-08-13T08:17:37,593 ERROR [HttpRemoteTaskRunner-worker-sync-4] org.apache.druid.indexing.overlord.TaskQueue - Ignoring notification for already-complete task: {class=org.apache.druid.indexing.overlord.TaskQueue, task=<task-id>}
  2022-08-13T08:17:37,594 INFO [HttpRemoteTaskRunner-worker-sync-4] org.apache.druid.indexing.overlord.TaskQueue - Task FAILED: <task-to-string> (9100 run duration)

@gianm (Contributor, Author) commented Aug 14, 2022

FYI @jasonk000

I have run the patch in this PR locally through some smoke tests, but haven't run it in production yet. If you have a chance to, then please let us know how it goes.

@lgtm-com (Bot) commented Aug 14, 2022

This pull request introduces 1 alert when merging 2274c28 into 4d65c08 - view on LGTM.com

new alerts:

  • 1 for Missing format argument

@gianm (Contributor, Author) commented Aug 14, 2022

> new alerts:
>
> * 1 for Missing format argument

Fixed in the latest commit.
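
For context, this class of alert flags a format string whose placeholders do not match the supplied arguments. An illustrative example in the style of Druid's String.format-like logger (not the actual flagged line):

  // Illustrative only; not the line LGTM flagged in this PR.
  log.warn("Unknown task completed: %s");               // alert: %s has no matching argument
  log.warn("Unknown task completed: %s", task.getId()); // fixed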

@rohangarg (Member) left a comment

Great find and explanation with logs! 👍
LGTM mostly, left a couple of small comments.

private static final long MANAGEMENT_WAIT_TIMEOUT_NANOS = TimeUnit.SECONDS.toNanos(60);
private static final long MIN_WAIT_TIME_MS = 100;

// Task ID -> Task that is currently running
@rohangarg (Member) commented Aug 14, 2022

I think this map can contain tasks in a pending state too. Also, it could be out of sync with task storage, as per the documentation (upon manual changes in task storage). Maybe this could be called an in-memory view of known tasks?

@gianm (Contributor, Author) replied

I changed the comment to hopefully be clearer. It now reads:

// Task ID -> Task, for tasks that are active in some way (submitted, running, or finished and to-be-cleaned-up).
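
For illustration, the field could then look something like this; the declaration is hypothetical, and the actual one in TaskQueue.java may differ (e.g. in map type or annotations):

  // Task ID -> Task, for tasks that are active in some way
  // (submitted, running, or finished and to-be-cleaned-up).
  // A map keyed by task ID gives O(1) lookup and removal, and a
  // LinkedHashMap would preserve the old list's insertion order.
  private final Map<String, Task> tasks = new LinkedHashMap<>();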


if (removed == 0) {
  log.warn("Unknown task completed: %s", task.getId());
// Save status to metadata store first, so if we crash while doing the rest of the shutdown, our successor
@rohangarg (Member) commented

A doubt: does this order change also solve the race condition problem by itself? As per my current understanding, re-launching can happen only if syncFromStorage thinks there is an active task in metadata storage that the in-memory view doesn't know about. That leads to creation of an in-memory task, which the management thread then launches.

With this change, the syncFromStorage can have three views :

  1. tasks map contains the task and task-storage has it as active task
  2. tasks map contains the task and task-storage doesn't have it as active task in which case both syncFromStorage and notifyStatus will clean it up
  3. tasks map doesn't contain the task and task-storage doesn't have it too

I think all 3 cases should be OK, but I may be missing something here.

@gianm (Contributor, Author) replied

The order switch solves the original race but creates a new one, which you have listed as case (2). It would create a situation where two threads are trying to clean up the same task at the same time. This may be fine, but I don't think it's prudent to rely on it being fine. Cleaner to ensure that only one thread tries to clean up the task.
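
Concretely, the guard gives syncFromStorage a way to stay out of an in-flight cleanup. A sketch, reusing the hypothetical names from the sketches above (this method would live in PatchedTaskQueueSketch and needs java.util.ArrayList):

  // Hypothetical illustration of how the guard resolves case (2): the sync
  // loop skips any task whose cleanup notifyStatus currently owns.
  void syncFromStorage(Set<String> activeTaskIdsInStorage)
  {
    giant.lock();
    try {
      for (String taskId : new ArrayList<>(tasks.keySet())) {
        if (recentlyCompletedTasks.contains(taskId)) {
          continue; // notifyStatus is cleaning this task up; leave it alone
        }
        if (!activeTaskIdsInStorage.contains(taskId)) {
          tasks.remove(taskId); // safe: exactly one thread removes it
        }
      }
    }
    finally {
      giant.unlock();
    }
  }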

@vogievetsky vogievetsky merged commit 28836df into apache:master Aug 15, 2022
@gianm gianm deleted the fix-task-queue-sneaky-relaunch branch August 15, 2022 15:44
@abhishekagarwal87 abhishekagarwal87 added this to the 24.0.0 milestone Aug 26, 2022
@jasonk000 (Contributor) commented

> I have run the patch in this PR locally through some smoke tests, but haven't run it in production yet. If you have a chance to, then please let us know how it goes.

Thanks @gianm, great catch + fix. We are running 0.20.x still on most of our stack, so are not in a good spot to test it out directly. Hoping we can get to something newer soon 🤞 .
