Make SegmentAllocationQueue multithreaded #18098
Conversation
```java
private boolean batchAllocationReduceMetadataIO = true;

@JsonProperty
private int batchAllocationNumThreads = 5;
```
```java
/**
 * Thread-safe list of datasources for which a segment allocation is currently in-progress.
 */
private final List<String> runningDatasources = Collections.synchronizedList(new ArrayList<>());
```
`contains` and `remove` run on this list; could it be a `Set`?
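A minimal sketch of what the reviewer is suggesting (class and method names here are hypothetical, not from the PR): a concurrent `Set` makes `contains`/`add`/`remove` O(1) instead of O(n) on a synchronized list, and `add` doubles as an atomic check-and-mark.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class RunningDatasources
{
  // A concurrent Set gives O(1) contains/add/remove, whereas contains()
  // on a Collections.synchronizedList is a linear scan.
  private final Set<String> running = ConcurrentHashMap.newKeySet();

  /** Returns false if an allocation for this datasource is already in progress. */
  public boolean tryStart(String dataSource)
  {
    // add() returns false when the element is already present, so the
    // "is it running?" check and the "mark it running" step are one atomic call.
    return running.add(dataSource);
  }

  public void finish(String dataSource)
  {
    running.remove(dataSource);
  }
}
```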
```java
final String dataSource = nextBatch.key.dataSource;
if (nextBatch.isDue()) {
  if (runningDatasources.contains(dataSource)) {
    // Skip this batch as another batch for the same datasource is in progress
```
Will this cause a busy loop where we keep retrying this skipped batch over and over? Since it remains in `processingQueue`, we'll call `scheduleQueuePoll` at the end of this function, and the default `maxWaitTimeMillis` is zero, meaning another immediate poll.
Yes, thanks for catching this! This was a mental TODO but I forgot to mark it.
I will check what we can do here.
Updated. Added a small delay of 5 millis in case anything was skipped or if all threads are busy.
Hmm. I think this would still lead to a ton of task/action/batch/skipped metrics being emitted if we have a long-running allocation in flight with another queued up. Fixing that by extending the min wait time would be bad, because that would slow down our responsiveness to allocation requests. Delays are undesirable anyway- we want everything to be as reactive as possible.
Is there an alternate approach you could go with? Maybe when we skip a batch, put it into a separate data structure keyed by datasource. Then when the current batch finishes, the worker thread running that batch could move the skipped batches back to the main queue.
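A sketch of the suggested structure (all names hypothetical): skipped batches are parked in a map keyed by datasource rather than left to spin in the main queue, and the worker that finishes a batch drains any parked batches for its datasource back to the main queue.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

public class SkippedBatches<T>
{
  // Batches skipped because another batch for the same datasource is running
  // are parked here instead of being re-polled from the main queue.
  private final Map<String, Queue<T>> parked = new ConcurrentHashMap<>();

  /** Called by the poller when a batch's datasource is busy. */
  public void park(String dataSource, T batch)
  {
    parked.computeIfAbsent(dataSource, ds -> new ConcurrentLinkedQueue<>()).add(batch);
  }

  /**
   * Called by the worker thread when its batch for this datasource completes;
   * the returned batches should be moved back to the main processing queue.
   * A real implementation must also handle the race where park() runs
   * concurrently with drain() for the same datasource.
   */
  public Queue<T> drain(String dataSource)
  {
    Queue<T> q = parked.remove(dataSource);
    return q == null ? new ConcurrentLinkedQueue<>() : q;
  }
}
```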
Thanks for the suggestion, @gianm!
I have updated the approach in the PR, with some modifications that seemed to adhere better to the current design of the class:
- Remove the delay
- When skipping a batch, mark it as "skipped" and emit the metric. Do not emit metric again if already skipped.
- Do not reschedule queue poll if all workers are busy OR if queue is empty OR if all batches were skipped.
- When a worker finishes, schedule a queue poll.
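The "emit the skip metric once per queued batch" rule above could look like this (a sketch; the class and method names are hypothetical): a flag on the batch ensures `task/action/batch/skipped` is counted once even if the batch is passed over on many consecutive polls.

```java
public class AllocationBatch
{
  // True once this batch has been skipped and counted in the
  // task/action/batch/skipped metric.
  private boolean skipped = false;

  /**
   * Returns true only on the first skip, telling the caller to emit the
   * metric; repeated skips of the same queued batch return false.
   */
  public boolean markSkipped()
  {
    if (skipped) {
      return false; // already counted, do not emit the metric again
    }
    skipped = true;
    return true;
  }
}
```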
```java
  // All remaining entries in the queue were skipped
  log.debug("Not scheduling again since datasources are already being processed.");
} else if (processingQueue.isEmpty()) {
  log.debug("Not scheduling again since queue is empty.");
```
I think this would be caught by the previous branch: if the queue is empty, `numSkippedBatches` and `processingQueue.size()` are both zero, and 0 >= 0. Consider collapsing them both into a single block like `Not scheduling again since there are no eligible batches (skipped[%d])`.
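The collapsed condition the reviewer suggests reduces to a single comparison (sketched below with hypothetical names): an empty queue is just the case where skipped count and queue size are both zero, so one branch covers both.

```java
public class PollScheduler
{
  /**
   * Returns true only if at least one batch in the queue was not skipped.
   * Covers both the "queue empty" and "everything skipped" cases with one
   * comparison, since an empty queue has numSkippedBatches == queueSize == 0.
   */
  static boolean shouldScheduleAgain(int numSkippedBatches, int queueSize)
  {
    return numSkippedBatches < queueSize;
  }
}
```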
* Make SegmentAllocationQueue multithreaded
* Do not run multiple jobs for the same datasource
* Add docs, min schedule delay to avoid busy waiting
* Trigger queue poll when worker finishes
* Emit skip metric once per queued batch
* Simplify scheduling condition

Description
Follow-up to #17390
Once we start maintaining a `TaskLockbox` for each datasource, the single-threaded design of the `SegmentAllocationQueue` would become the bottleneck. This patch makes `SegmentAllocationQueue` multithreaded so that allocation for multiple datasources can happen in parallel.

Non-batch segment allocation is already multithreaded, as each allocation runs on its own Jetty thread.
Changes
- Add config `druid.indexer.tasklock.batchAllocationNumThreads` with default value 5
- Use multiple threads in `SegmentAllocationQueue` to perform segment allocation
- Emit metric `task/action/batch/submitted` for the count of submitted jobs
- Emit metric `task/action/batch/skipped` for the count of skipped jobs

Release note
Add config `druid.indexer.tasklock.batchAllocationNumThreads` with default value 5 to control the number of segment allocation threads. This allows for concurrent segment allocations if there are segment allocations happening for several different datasources.
Note that setting this config to a very large value will put undue strain on the metadata store and only hamper performance.
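For example, an Overlord `runtime.properties` might raise the thread count modestly (the value 8 here is illustrative, not a recommendation from the PR):

```properties
# Allow up to 8 concurrent batch segment allocations (default is 5).
# Very large values put undue strain on the metadata store.
druid.indexer.tasklock.batchAllocationNumThreads=8
```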