Limit max batch size for segment allocation, add docs #13503

kfaraz merged 4 commits into apache:master from
Conversation
| ### Batching `segmentAllocate` actions
| In a cluster with a large number of concurrent tasks (say more than 1000), `segmentAllocate` actions on the Overlord may take very long to finish, causing spikes in `task/action/run/time`. This may result in lag building up while a task waits for a segment to get allocated.
I think this is going to be useful even at much lower concurrent task counts. I have seen people have issues with allocation timings around task rollover even with just 10s of tasks, requiring a scale-up of metadata store size.
| These are various Overlord actions performed by tasks during their lifecycle. Some typical actions are as follows:
| - `lockAcquire`: acquires a time-chunk lock on an interval for the task
| - `lockRelease`: releases a lock acquired by the task on an interval
| - `segmentInsertion`: inserts segments into the metadata store
Worth mentioning segmentTransactionalInsert too?
| |`task/action/log/time`|Milliseconds taken to log a task action to the audit log.|`dataSource`, `taskId`, `taskType`, `taskActionType`|< 1000 (subsecond)|
| |`task/action/run/time`|Milliseconds taken to execute a task action.|`dataSource`, `taskId`, `taskType`, `taskActionType`|Varies from subsecond to a few seconds, based on action type.|
| |`task/action/success/count`|Number of task actions that were executed successfully during the emission period. Currently only emitted for batched `segmentAllocate` actions.|`dataSource`, `taskId`, `taskType`, `taskActionType`|Varies|
| |`task/action/failed/count`|Number of task actions that failed during the emission period. Currently only emitted for batched `segmentAllocate` actions.|`dataSource`, `taskId`, `taskType`, `taskActionType`|Varies|
Would be good to link all mentions of "batched segmentAllocate actions" to ../ingestion/tasks.md#batching-segmentallocate-actions.
| |`druid.indexer.storage.recentlyFinishedThreshold`|Duration of time to store task results. Default is 24 hours. If you have hundreds of tasks running in a day, consider increasing this threshold.|PT24H|
| |`druid.indexer.tasklock.forceTimeChunkLock`|_**Setting this to false is still experimental**_<br/> If set, all tasks are enforced to use time chunk lock. If not set, each task automatically chooses a lock type to use. This configuration can be overwritten by setting `forceTimeChunkLock` in the [task context](../ingestion/tasks.md#context). See [Task Locking & Priority](../ingestion/tasks.md#context) for more details about locking in tasks.|true|
| |`druid.indexer.tasklock.batchSegmentAllocation`|If set to true, segment allocate actions are performed in batches to improve the throughput and reduce the average `task/action/run/time`.|false|
| |`druid.indexer.tasklock.batchAllocationWaitTime`|Milliseconds to wait between adding the first segment allocate action to a batch and executing that batch. The waiting time allows the batch to add more requests and thus improve the average segment allocation run time. This configuration takes effect only if `batchSegmentAllocation` is enabled.<br/> This value should be decreased __only if__ there are failures while allocating segments due to metadata operations on a very large batch.|500|
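Taken together, the two batching properties from the table above can be enabled with a small Overlord config snippet. This is an illustrative fragment, not a recommended production setting; the values shown are just the documented defaults with batching switched on:

```properties
# Enable batched segment allocation on the Overlord
druid.indexer.tasklock.batchSegmentAllocation=true
# Wait up to 500 ms after the first allocate action before executing the batch
druid.indexer.tasklock.batchAllocationWaitTime=500
```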
500 milliseconds already seems pretty short. If people are getting failures due to too-large batches, I wonder if reducing this will reliably fix their problem. Maybe it's better to suggest they turn off batch segment allocation completely.
Ideally, the batching would be robust to this case, and avoid generating oversized metadata operations. Is that possible?
Yes, we can limit the batch size to a (hard-coded) max value, say 500. We have seen clusters where batches of 700-800 have executed successfully.
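The cap being discussed can be sketched roughly as follows. Only the max value of 500 comes from the comment above; the class and method names are hypothetical, and real allocation requests would of course be richer objects than plain list elements. Once a batch reaches the cap, further requests simply spill into a new batch:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchCapSketch {
    // Hard-coded cap discussed in the review (value from the comment above)
    static final int MAX_BATCH_SIZE = 500;

    // Split incoming allocation requests into batches of at most MAX_BATCH_SIZE.
    // Requests beyond the cap roll over into subsequent batches rather than
    // producing one oversized metadata operation.
    static <T> List<List<T>> toBatches(List<T> requests) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < requests.size(); i += MAX_BATCH_SIZE) {
            batches.add(new ArrayList<>(
                requests.subList(i, Math.min(i + MAX_BATCH_SIZE, requests.size()))));
        }
        return batches;
    }
}
```

With this shape, 1200 pending requests would produce three batches of 500, 500, and 200 instead of a single 1200-row metadata operation.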
| |`druid.indexer.storage.type`|Choices are "local" or "metadata". Indicates whether incoming tasks should be stored locally (in heap) or in metadata storage. "local" is mainly for internal testing while "metadata" is recommended in production because storing incoming tasks in metadata storage allows for tasks to be resumed if the Overlord should fail.|local|
| |`druid.indexer.storage.recentlyFinishedThreshold`|Duration of time to store task results. Default is 24 hours. If you have hundreds of tasks running in a day, consider increasing this threshold.|PT24H|
| |`druid.indexer.tasklock.forceTimeChunkLock`|_**Setting this to false is still experimental**_<br/> If set, all tasks are enforced to use time chunk lock. If not set, each task automatically chooses a lock type to use. This configuration can be overwritten by setting `forceTimeChunkLock` in the [task context](../ingestion/tasks.md#context). See [Task Locking & Priority](../ingestion/tasks.md#context) for more details about locking in tasks.|true|
| |`druid.indexer.tasklock.batchSegmentAllocation`|If set to true, segment allocate actions are performed in batches to improve the throughput and reduce the average `task/action/run/time`.|false|
Would be good to link to ../ingestion/tasks.md#batching-segmentallocate-actions.
Thanks a lot for the feedback, @gianm! I have made the following changes:
I wonder if we should just deprecate and remove the
writer-jill left a comment
Added a few suggestions for style & clarity.
Co-authored-by: Jill Osborne <jill.osborne@imply.io>
| }
| }
| throw new ISE("Allocation queue is at capacity, all batches are full.");
Is MAX_QUEUE_SIZE measured in batches or allocations? What happens if it gets hit? How likely is it that we will hit it?
MAX_QUEUE_SIZE applies to batches. I am using it in the loop above because that is the maximum possible number of batches, even for the same datasource, group id, etc.
It is extremely unlikely to hit the max queue limit. It can happen if, say, we have 2000 datasources all doing streaming ingestion concurrently. If the queue is full, new incoming allocate actions are rejected.
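The reject-when-full behavior described above can be sketched like this. The class, field, and capacity value here are illustrative assumptions, not Druid's actual `SegmentAllocationQueue` code; only the error message and the "full queue rejects new batches" behavior come from the discussion:

```java
import java.util.ArrayDeque;

// Illustrative sketch of a bounded queue of pending allocation batches.
public class AllocationQueueSketch {
    private final int maxQueueSize;                         // capacity measured in batches
    private final ArrayDeque<Object> batches = new ArrayDeque<>();

    public AllocationQueueSketch(int maxQueueSize) {
        this.maxQueueSize = maxQueueSize;
    }

    // Accepts a batch, or rejects it when the queue is already at capacity,
    // mirroring the behavior described in the review comment above.
    public void addBatch(Object batch) {
        if (batches.size() >= maxQueueSize) {
            throw new IllegalStateException(
                "Allocation queue is at capacity, all batches are full.");
        }
        batches.add(batch);
    }

    public int size() {
        return batches.size();
    }
}
```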
Merging this as the failure is unrelated. The same test has passed for jdk11.
Changes: - Limit max batch size in `SegmentAllocationQueue` to 500 - Rename `batchAllocationMaxWaitTime` to `batchAllocationWaitTime` since the actual wait time may exceed this configured value. - Replace usage of `SegmentInsertAction` in `TaskToolbox` with `SegmentTransactionalInsertAction`
Follow up to #13369
Changes:
- Limit max batch size in `SegmentAllocationQueue` to 500
- Rename `batchAllocationMaxWaitTime` to `batchAllocationWaitTime` since it was a little misleading and the actual wait time may exceed this configured value.
- Remove usage of `SegmentInsertAction` from `TaskToolbox`