Remove giant lock from Overlord TaskQueue #14293

Closed
kfaraz wants to merge 10 commits into apache:master from kfaraz:cleanup_task_queue

Conversation

Contributor

@kfaraz kfaraz commented May 16, 2023

Description

The TaskQueue in the Overlord implements concurrency control using a giant lock. A similar technique is used in other classes such as TaskMaster and TaskLockbox. While this giant lock does guarantee thread-safe access to critical sections of code, it can be too restrictive at times and can even leave the Overlord completely stuck.

A typical scenario is described below.

  • Insertion of a sub-task of an index_parallel task fails with a SQLTransientException (say, due to an oversized payload)
  • The index_parallel task repeatedly requests the Overlord to insert this sub-task
  • Each time, the Overlord tries to insert this task up to 10 times (using RetryUtils)
  • While the Overlord is trying to insert, the calling thread holds the TaskQueue.giant lock
  • This causes the Overlord to essentially hang as no other TaskQueue operation can proceed without the lock. This includes operations like adding a new task, killing a task, submitting tasks to runner for execution, syncing from metadata, etc.

Note: The indefinite retry issue in the above scenario is also being addressed separately in #14271

Current implementation

The giant lock is a reentrant lock, which is effectively the same as the object monitor associated with any Java object. In principle, this lock could be replaced by simply making all the methods of TaskQueue synchronized.

There are several fields which can be protected with more lenient locking to improve performance.

Proposed implementation

The operations/fields currently protected by the giant lock are discussed below.

(a) Methods: start(), stop(), syncFromStorage(), manage()

Change: Make methods synchronized
Rationale: This effectively remains the same as the current implementation

(b) Field LinkedHashMap<String, Task> tasks: putIfAbsent(), get(), values(), remove()

Change: Use a ConcurrentHashMap in conjunction with a BlockingDeque<String>
Rationale:

  • The only concurrency control needed here is at a task level, which can be easily ensured by a ConcurrentHashMap.
  • The BlockingDeque is used to maintain the order in which task IDs were submitted to the TaskQueue.
  • The updates to these data structures are made atomically using compute() and computeIfAbsent().
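The structure described above can be sketched as follows. This is a minimal, hypothetical illustration under my own assumptions: the class and field names (OrderedTaskPool, activeTaskIdQueue) are mine and the payload is simplified to a String; this is not the actual TaskQueue code.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingDeque;

// Hypothetical sketch, not the actual Druid code: a ConcurrentHashMap gives
// per-task concurrency control, while a LinkedBlockingDeque preserves the
// order in which task IDs were submitted.
class OrderedTaskPool
{
  private final ConcurrentHashMap<String, String> tasks = new ConcurrentHashMap<>();
  private final LinkedBlockingDeque<String> activeTaskIdQueue = new LinkedBlockingDeque<>();

  /**
   * Adds a task only if its ID is not already present. The deque update runs
   * inside the mapping function, so both structures are updated atomically
   * with respect to other operations on the same task ID.
   */
  boolean add(String taskId, String payload)
  {
    final boolean[] added = {false};
    tasks.computeIfAbsent(taskId, id -> {
      activeTaskIdQueue.offer(id);
      added[0] = true;
      return payload;
    });
    return added[0];
  }

  /** Removes a task and its ordering entry atomically per task ID. */
  boolean remove(String taskId)
  {
    final boolean[] removed = {false};
    tasks.computeIfPresent(taskId, (id, payload) -> {
      activeTaskIdQueue.remove(id);
      removed[0] = true;
      return null; // returning null removes the mapping
    });
    return removed[0];
  }
}
```

The key property is that ConcurrentHashMap guarantees the mapping function in compute()/computeIfAbsent() runs at most once and atomically per key, which is what lets the deque stay consistent with the map without a global lock.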

(c) Field HashMap<String, Future> taskFutures: put(), remove()

Change: Replace with a Sets.newConcurrentHashSet()
Rationale: This field is just used to track task IDs already submitted to the task runner.
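A minimal sketch of the concurrent-set idea. Guava's Sets.newConcurrentHashSet() is backed by a ConcurrentHashMap; ConcurrentHashMap.newKeySet() is the JDK equivalent used here. The class and method names are illustrative, not the actual field's API.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: a concurrent set is sufficient to track which task IDs
// have already been handed to the task runner; add() returning false signals
// a duplicate submission without any external locking.
class SubmittedTaskTracker
{
  private final Set<String> submittedTaskIds = ConcurrentHashMap.newKeySet();

  /** Returns true only the first time a given task ID is marked. */
  boolean markSubmitted(String taskId)
  {
    return submittedTaskIds.add(taskId);
  }

  void forget(String taskId)
  {
    submittedTaskIds.remove(taskId);
  }
}
```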

(d) Field taskStorage: insert()

Change: Move outside critical section
Rationale: TaskStorage implementations do not maintain any state and thus don't require any concurrency control

(e) Field taskLockbox: add(), remove()

Change: Move outside critical section
Rationale: TaskLockbox has its own giant lock and can thus be safely accessed here.

Other changes

Pending items (WIP)

Unit tests for

  • cleanup of dangling locks left over by tasks that couldn't acquire their locks on leader re-election
  • proper operation under different race conditions in TaskQueue

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

return taskRunner.getRunningTasks().stream().collect(
    Collectors.toMap(TaskRunnerWorkItem::getDataSource, task -> 1L, Long::sum)
);

Check notice

Code scanning / CodeQL

Useless parameter

The parameter 'task' is never used.

return taskRunner.getPendingTasks().stream().collect(
    Collectors.toMap(TaskRunnerWorkItem::getDataSource, task -> 1L, Long::sum)
);

Check notice

Code scanning / CodeQL

Useless parameter

The parameter 'task' is never used.
synchronized (managementRequested) {
  managementRequested.set(true);
  managementRequested.notify();

Check warning

Code scanning / CodeQL

notify instead of notifyAll

Using notify rather than notifyAll.
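For context on this warning, the guarded-wait pattern with notifyAll() looks roughly like the sketch below. With a single management thread waiting, notify() happens to be sufficient, but notifyAll() is the safer default if a second waiter is ever added. All names here are illustrative, not the Druid code.

```java
// Hedged sketch of the wake-up pattern CodeQL flags above.
class ManagementSignal
{
  private final Object lock = new Object();
  private boolean requested = false;

  void request()
  {
    synchronized (lock) {
      requested = true;
      lock.notifyAll(); // wakes every waiter, not just one
    }
  }

  boolean isRequested()
  {
    synchronized (lock) {
      return requested;
    }
  }

  /** Blocks until a request arrives, then consumes it. */
  void awaitRequest() throws InterruptedException
  {
    synchronized (lock) {
      while (!requested) {  // guard against spurious wake-ups
        lock.wait();
      }
      requested = false;
    }
  }
}
```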
Contributor

@jasonk000 jasonk000 left a comment

@kfaraz , good to see -- performance of this code has been key to stability of the cluster for us.

I think there's some prior work here that's worth reviewing too, to make sure we don't reintroduce these bugs; I've tagged it at the appropriate points.

return delta;
}
Map<String, Long> total = new HashMap<>(datasourceToSuccessfulTaskCount);
datasourceToSuccessfulTaskCount.clear();
Contributor

This and the following functions have a small race if the CHM is changed between the copy and the call to clear(). There could be some alternatives, like iterating entries and using replaceAll(), or remove() or merge() in a loop, that would allow a more atomic unload of the CHM. Or, maybe make the value in the map an AtomicLong, and CAS-modify it to zero during unload?

Contributor Author

Yeah, I didn't like this either. Updated to iterate over keys and atomically remove them one by one.
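The fix discussed here (iterate over keys and remove them atomically, instead of copy-then-clear) can be sketched roughly as follows. Class and method names are mine, not the actual patch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of a race-free "unload": copying the map and then
// calling clear() can drop increments that land in between, whereas removing
// each key atomically guarantees every count is observed exactly once.
class TaskCountDrainer
{
  private final ConcurrentHashMap<String, Long> datasourceToTaskCount = new ConcurrentHashMap<>();

  void increment(String datasource)
  {
    datasourceToTaskCount.merge(datasource, 1L, Long::sum);
  }

  /** Atomically removes and returns the current counts, key by key. */
  Map<String, Long> drain()
  {
    final Map<String, Long> snapshot = new HashMap<>();
    for (String datasource : datasourceToTaskCount.keySet()) {
      final Long count = datasourceToTaskCount.remove(datasource);
      if (count != null) {
        snapshot.put(datasource, count);
      }
    }
    return snapshot;
  }
}
```

An increment that arrives after its key has been removed simply creates a fresh entry and is reported by the next drain, so no count is ever lost.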

}
requestManagement();
// Remove any unacquired locks from storage (shutdown only clears entries for which a TaskLockPosse was acquired)
// This is called after requesting management as locks need to be cleared after notifyStatus is processed
Contributor

Observation: this is an interesting comment, given that there's no guarantee the management has run by now. I wonder if it's stale, or whether there is an actual issue here, since it has the same behaviour as the previous code.

Contributor Author

Yeah, my thoughts exactly. Calling requestManagement() does not guarantee that management has actually happened. I am not 100% sure of what was going on here, so I left it as is. I will double check this code and maybe add some tests.

Contributor Author

I checked the code. This comment is outdated/irrelevant now.

Firstly, calling requestManagement() just adds a request to the queue and does not ensure that task management has actually taken place. In fact, on start-up, it typically wouldn't take place until after 1 minute (default value of start delay).

Secondly, even if, for some other reason, requestManagement() needed to be called before clearing the dangling locks, it would have already been called in the preceding for loop (shutdown calls notifyStatus, which calls requestManagement).

I am removing this comment and moving the contents of the next for loop into the first one. I am also adding some tests to ensure cleanup of such dangling locks.

Comment thread on indexing-service/src/main/java/org/apache/druid/indexing/overlord/TaskQueue.java (Outdated)
catch (Exception e) {
log.warn(e, "TaskRunner failed to clean up task: %s", taskId);
}
final Set<String> knownTaskIds = tasks.keySet();
Contributor

I'm a little wary of reintroducing this bug: #12901, in the case that tasks has changed. Can you check over it? It seems like if tasks has changed we might have an issue. But you've covered most of them with synchronized. Can you check add() and removeTasksInternal()? I think these operate on tasks without being synchronized, so it might lead to some race.

Contributor

NB: earlier PR #12099 may also be relevant here for background/history.

Contributor Author

Thanks for calling this out, @jasonk000 !

The issue described in #12901 is basically a race condition between notifyStatus and manageInternalCritical (renamed in this patch to runReadyTasks), where a task that is being shut down might get relaunched. In #12901, this was solved by using this order of events:

  • Mark the task as recentlyCompleted
  • Update metadata store
  • Finish cleanup of in-memory data structures

All recentlyCompleted tasks were ignored in manageInternalCritical.

In the new set of changes, I have retained this behaviour completely. As soon as a task is marked as recentlyCompleted, it will not be touched by runReadyTasks or killUnknownTasks. The in-memory data structures (including recentlyCompleted itself) are finally cleaned up atomically only when the task shutdown is finished.
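The fencing behaviour described above can be sketched as follows, under my own assumptions: the class name and the reduction of metadata/runner cleanup to comments are illustrative, not the actual notifyStatus implementation.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hedged sketch of the shutdown ordering: mark the task as recently completed
// first, so the management loop cannot relaunch it, then perform cleanup, and
// only at the end remove the fence itself.
class TaskShutdownSketch
{
  private final Set<String> recentlyCompletedTaskIds = ConcurrentHashMap.newKeySet();
  private final Set<String> activeTaskIds = ConcurrentHashMap.newKeySet();

  void addTask(String taskId)
  {
    activeTaskIds.add(taskId);
  }

  void notifyStatus(String taskId)
  {
    // Step 1: fence the task off from the management loop.
    recentlyCompletedTaskIds.add(taskId);
    // Step 2: the metadata store update would happen here (omitted in sketch).
    // Step 3: clean up in-memory structures, including the fence itself.
    activeTaskIds.remove(taskId);
    recentlyCompletedTaskIds.remove(taskId);
  }

  /** The management loop only considers tasks that are not mid-shutdown. */
  boolean isRunnable(String taskId)
  {
    return activeTaskIds.contains(taskId) && !recentlyCompletedTaskIds.contains(taskId);
  }
}
```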

I will try to add some tests for this scenario. Please let me know if you think of any other race conditions; it would be nice to have tests for all of those.

Contributor Author

I have tried to ensure that all the changes made in the patch #12901 are retained.

This patch does the following:

  1. Fixes the race by adding a recentlyCompletedTasks set that prevents
     the main management loop from doing anything with tasks that are
     currently being cleaned up.

The set of recentlyCompletedTaskIds is still being used for this purpose.

  2. Switches the order of the critical sections in notifyStatus, so
     metadata store updates happen first. This is useful in case of
     server failures: it ensures that if the Overlord fails in the midst
     of notifyStatus, then completed-task statuses are still available in
     ZK or on MMs for the next Overlord. (Those are cleaned up by
     taskRunner.shutdown, which formerly ran first.) This isn't related
     to the race described above, but is fixed opportunistically as part
     of the same patch.

Order of calls is still the same, i.e. metadata store update happens first.

  3. Changes the tasks list to a map. Many operations require retrieval
     or removal of individual tasks; those are now O(1) instead of O(N)
     in the number of running tasks.

  • The tasks map is now a ConcurrentHashMap, which still performs O(1) get and remove.
  • But now we also have an activeTaskIdQueue on which we do an O(n) remove. This is fine because it does not block other operations of the TaskQueue. We need a thread-safe queue(-like) structure here, so I ended up using a LinkedBlockingDeque. I am not sure of an alternative that would offer better time complexity.

  4. Changes various log messages to use task ID instead of full task
     payload, to make the logs more readable.

Retained as is.

}
tasks.computeIfAbsent(
task.getId(),
taskId -> {
Contributor

It seems the implementation here relies on an implicit lock being taken by the CHM. It might be worth noting this, given that it is implementation-specific to CHM, in case the implementation is switched in the future.

Contributor Author

Yeah, it is intentional. I will add a comment calling out why it must be a ConcurrentHashMap.

// Critical section: remove this task from all of our tracking data structures.
giant.lock();
try {
if (removeTaskInternal(task.getId())) {
Contributor

re: earlier note about possibly re-introducing #12901.

// Add new tasks and clean up removed tasks
addedTasks.forEach(this::addTaskInternal);
removedTasks.forEach(this::removeTaskInternal);
log.info(
Contributor

This existed before, but it might be wise to remove as much as possible, including logging, from the synchronized block, and tighten it down a bit.

Contributor Author

I don't think the footprint of this call would make much of a difference compared to the earlier operations, i.e. fetching from the DB and map manipulation.

But yeah, you never really know with logging. I suppose I could just return the counts from this method and log the values in the caller itself.
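The idea of returning counts from the critical section and logging in the caller could look roughly like this. A hedged sketch: the class and method names are mine and the bookkeeping is elided, so this is not the actual syncFromStorage code.

```java
import java.util.List;

// Illustrative sketch: the synchronized method does pure in-memory
// bookkeeping and returns counts; the caller logs outside the lock, keeping
// I/O out of the critical section.
class StorageSyncSketch
{
  /** Critical section: no logging, just map manipulation (elided here). */
  synchronized int[] applySync(List<String> addedTasks, List<String> removedTasks)
  {
    // ... apply adds and removes to the internal task structures ...
    return new int[]{addedTasks.size(), removedTasks.size()};
  }

  /** Caller: logging happens after the lock is released. */
  void syncFromStorage(List<String> addedTasks, List<String> removedTasks)
  {
    final int[] counts = applySync(addedTasks, removedTasks);
    System.out.printf("Synced with storage: added [%d] tasks, removed [%d] tasks.%n", counts[0], counts[1]);
  }
}
```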

Contributor

Could use synchronized (this) { ... }?

Contributor

But yes, it does raise a good question about the remainder of the workflow.

Contributor Author

kfaraz commented May 17, 2023

Thanks for the feedback, @jasonk000 ! I have responded to your comments and plan to make changes/add more tests wherever necessary.

// do not care if the item fits into the queue:
// if the queue is already full, request has been triggered anyway
managementRequestQueue.offer(new Object());

Check notice

Code scanning / CodeQL

Ignored error status of call

Method requestManagement ignores exceptional return value of BlockingQueue<Object>.offer.

// do not care if the item fits into the queue:
// if the queue is already full, request has been triggered anyway
managementRequestQueue.offer(reason);

Check notice

Code scanning / CodeQL

Ignored error status of call

Method requestManagement ignores exceptional return value of BlockingQueue<String>.offer.
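The pattern these notices refer to, where offer()'s return value is deliberately ignored, can be sketched as a coalescing flag built on a capacity-1 queue. The class and method names below are illustrative, not the actual TaskQueue fields.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative sketch: a bounded queue of capacity 1 coalesces management
// requests. Ignoring offer()'s return value is intentional here: a full queue
// means a request is already pending, so nothing is lost.
class ManagementRequestFlag
{
  private final BlockingQueue<Object> managementRequestQueue = new ArrayBlockingQueue<>(1);

  void requestManagement()
  {
    // Do not care whether the item fits into the queue:
    // if the queue is already full, a request has been triggered anyway.
    managementRequestQueue.offer(new Object());
  }

  /** Returns true if a request was pending, and consumes it. */
  boolean pollRequest()
  {
    return managementRequestQueue.poll() != null;
  }
}
```

Any number of calls to requestManagement() between polls collapses into a single pending request, which is exactly the coalescing behaviour the comment describes.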
@AmatyaAvadhanula AmatyaAvadhanula self-requested a review May 31, 2023 14:02
@kfaraz kfaraz added the WIP label Jul 17, 2023
@github-actions

This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

@github-actions github-actions Bot added the stale label Feb 12, 2024
@github-actions

This pull request/issue has been closed due to lack of activity. If you think that
is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions Bot closed this Mar 12, 2024
@kfaraz kfaraz deleted the cleanup_task_queue branch May 2, 2025 07:12