Fix race in TaskQueue.notifyStatus. #12901
Conversation
It was possible for `manageInternal` to relaunch a task while it was being cleaned up, due to a race that happens when `notifyStatus` is called to clean up a successful task:

1. In a critical section, `notifyStatus` removes the task from `tasks`.
2. Outside a critical section, `notifyStatus` calls `taskRunner.shutdown` to let the task runner know it can clear out its data structures.
3. In a critical section, `syncFromStorage` adds the task back to `tasks`, because it is still present in metadata storage.
4. In a critical section, `manageInternalCritical` notices that the task is in `tasks` and is not running in the `taskRunner`, so it launches it again.
5. In a (different) critical section, `notifyStatus` updates the metadata store to set the task status to SUCCESS.
6. The task continues running even though it should not be.

The possibility for this race was introduced in apache#12099, which shrunk the critical section in `notifyStatus`. Prior to that patch, a single critical section encompassed (1), (2), and (5), so the ordering above was not possible.

This patch does the following:

1. Fixes the race by adding a `recentlyCompletedTasks` set that prevents the main management loop from doing anything with tasks that are currently being cleaned up (a simplified sketch follows below).
2. Switches the order of the critical sections in `notifyStatus`, so metadata store updates happen first. This is useful in case of server failures: it ensures that if the Overlord fails in the midst of `notifyStatus`, then completed-task statuses are still available in ZK or on MMs for the next Overlord. (Those are cleaned up by `taskRunner.shutdown`, which formerly ran first.) This isn't related to the race described above, but is fixed opportunistically as part of the same patch.
3. Changes the `tasks` list to a map. Many operations require retrieval or removal of individual tasks; those are now O(1) instead of O(N) in the number of running tasks.
4. Changes various log messages to use task ID instead of full task payload, to make the logs more readable.
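A minimal sketch of the idea behind items (1) and (2) above. This is not the actual patch: the class and method names (`TaskQueueCleanupSketch`, `persistStatusToMetadataStore`, `shutdownOnTaskRunner`) are stand-ins, task payloads are simplified to strings, and the real `TaskQueue` does this work under locks/critical sections that are omitted here.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class TaskQueueCleanupSketch
{
  // Task ID -> Task, for tasks that are active in some way (payloads simplified to String here).
  private final Map<String, String> tasks = new ConcurrentHashMap<>();

  // Task IDs currently being cleaned up; the management loop must leave these alone.
  private final Set<String> recentlyCompletedTasks = ConcurrentHashMap.newKeySet();

  void notifyStatus(String taskId, String status)
  {
    // Mark the task as being cleaned up so the management loop cannot relaunch it
    // while the remaining steps run outside a critical section.
    recentlyCompletedTasks.add(taskId);

    // Persist the terminal status first: if the Overlord dies here, the status is
    // still visible to its successor.
    persistStatusToMetadataStore(taskId, status);

    // Let the task runner clear out its data structures.
    shutdownOnTaskRunner(taskId);

    // Finally drop the task from the in-memory view and clear the guard.
    tasks.remove(taskId);
    recentlyCompletedTasks.remove(taskId);
  }

  void manageInternal()
  {
    for (String taskId : tasks.keySet()) {
      if (recentlyCompletedTasks.contains(taskId)) {
        continue; // mid-cleanup; do not (re)launch
      }
      // ... launch the task on the runner if it is not already running ...
    }
  }

  private void persistStatusToMetadataStore(String taskId, String status)
  {
    // stand-in for the metadata store update
  }

  private void shutdownOnTaskRunner(String taskId)
  {
    // stand-in for taskRunner.shutdown(...)
  }
}
```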
Log messages look like this when the race happens:
FYI @jasonk000 I have run the patch in this PR locally through some smoke tests, but haven't run it in production yet. If you have a chance to, then please let us know how it goes.
This pull request introduces 1 alert when merging 2274c28 into 4d65c08 - view on LGTM.com

new alerts:
Fixed in the latest commit. |
rohangarg
left a comment
Great find and explanation with logs! 👍
LGTM mostly, left a couple of small comments.
private static final long MANAGEMENT_WAIT_TIMEOUT_NANOS = TimeUnit.SECONDS.toNanos(60);
private static final long MIN_WAIT_TIME_MS = 100;

// Task ID -> Task that is currently running
I think this map can contain tasks in pending state too. Also, this could be out-of-sync with task-storage as per the documentation (upon manual changes in task-storage). Maybe this could be called an in-memory view of known tasks?
I changed the comment to hopefully be clearer. It now reads:
// Task ID -> Task, for tasks that are active in some way (submitted, running, or finished and to-be-cleaned-up).
if (removed == 0) {
  log.warn("Unknown task completed: %s", task.getId());

// Save status to metadata store first, so if we crash while doing the rest of the shutdown, our successor
Doubt: does this order change also solve the race condition problem itself? As per my current understanding, re-launching can happen only if `syncFromStorage` thinks that there is an active task in metadata storage and the in-memory view doesn't know about it. That leads to creation of an in-memory task, which the management thread then launches.
With this change, `syncFromStorage` can have three views:

1. `tasks` map contains the task and task-storage has it as an active task
2. `tasks` map contains the task and task-storage doesn't have it as an active task, in which case both `syncFromStorage` and `notifyStatus` will clean it up
3. `tasks` map doesn't contain the task and task-storage doesn't have it either

I think all 3 cases should be ok, but I may be missing something here.
The order switch solves the original race but creates a new one, which you have listed as case (2). It would create a situation where two threads are trying to clean up the same task at the same time. This may be fine, but I don't think it's prudent to rely on it being fine. Cleaner to ensure that only one thread tries to clean up the task.
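To illustrate the "only one thread cleans up" point, here is a hypothetical fragment (not the actual patch; `tasks`, `recentlyCompletedTasks`, and `activeTaskIdsInStorage` are stand-in names) of how a storage-sync pass could leave alone any task that `notifyStatus` is already tearing down:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class SyncFromStorageSketch
{
  private final Map<String, String> tasks = new ConcurrentHashMap<>();
  private final Set<String> recentlyCompletedTasks = ConcurrentHashMap.newKeySet();

  void syncFromStorage(Set<String> activeTaskIdsInStorage)
  {
    for (String taskId : tasks.keySet()) {
      if (recentlyCompletedTasks.contains(taskId)) {
        continue; // notifyStatus owns this task's cleanup; do not touch it here
      }
      if (!activeTaskIdsInStorage.contains(taskId)) {
        tasks.remove(taskId); // no longer active in metadata storage
      }
    }
  }
}
```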
Thanks @gianm, great catch + fix. We are running 0.20.x still on most of our stack, so are not in a good spot to test it out directly. Hoping we can get to something newer soon 🤞.