Is your feature request related to a problem or challenge?
The current CoalesceBatches optimization rule only creates CoalesceBatchesExec based on the batch_size configured in the config struct, which can cause issues in cases involving limit operators.
Consider the following scenario:
When the plan places a limit operator on top of a CoalesceBatchesExec and the limit value is less than batch_size, the entire computation can block until a full batch is collected, even though the limit has already been satisfied.
A possible operator tree:
SortExec: TopK(fetch=10), expr=[event_time@3 DESC]
  LocalLimitExec: fetch=100
    CoalesceBatchesExec: target_batch_size=8192
      FilterExec: event_time@3 = 10
        TableScanExec
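To make the blocking concrete, here is a toy model of what coalescing does. This is a minimal sketch, not DataFusion's actual CoalesceBatchesExec; all names are illustrative:

```rust
// Toy model of batch coalescing (illustrative only, not DataFusion code):
// buffer incoming rows until `target_batch_size` of them accumulate,
// then emit one large batch.
struct Coalescer {
    target_batch_size: usize,
    buffered: Vec<u64>, // stand-in for buffered rows
}

impl Coalescer {
    fn push(&mut self, rows: &[u64]) -> Option<Vec<u64>> {
        self.buffered.extend_from_slice(rows);
        if self.buffered.len() >= self.target_batch_size {
            // A full batch is ready: emit it.
            Some(std::mem::take(&mut self.buffered))
        } else {
            // Nothing is emitted yet. A downstream limit with fetch=100
            // keeps waiting here even after 100 rows are buffered.
            None
        }
    }
}

fn main() {
    let mut coalescer = Coalescer { target_batch_size: 8192, buffered: Vec::new() };
    // The filter passes along small batches of 10 rows each; after 10
    // batches the limit's 100 rows are buffered, but nothing is emitted.
    for _ in 0..10 {
        assert!(coalescer.push(&[0; 10]).is_none());
    }
    println!("buffered {} rows, emitted nothing yet", coalescer.buffered.len());
}
```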
My idea is:
- When a limit occurs, change the fetch of CoalesceBatchesExec to limit/partition.
- If the limit is small enough, perform no optimization at all (i.e., do not coalesce).
Of course, we also need to consider special cases: if the limit operator is above a SortExec, the limit shouldn't affect the batch_size value, since the sort must consume its entire input regardless of the limit. A sketch of this logic follows.
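A minimal sketch of the proposed rule, assuming a hypothetical MIN_COALESCE_ROWS threshold for "small enough" and representing a pipeline-breaking operator between the limit and the coalesce (e.g., SortExec) as limit = None. None of these names are actual DataFusion APIs:

```rust
// Assumed threshold below which coalescing is skipped entirely;
// the issue does not specify a concrete value.
const MIN_COALESCE_ROWS: usize = 128;

/// Hypothetical helper: returns the target batch size to use for
/// CoalesceBatchesExec, or None if the operator should not be inserted.
/// `limit` is None when no limit propagates down to this point (e.g.,
/// because a SortExec sits between the limit and the coalesce).
fn coalesce_target(limit: Option<usize>, partitions: usize, batch_size: usize) -> Option<usize> {
    match limit {
        // No effective limit: keep the configured batch_size.
        None => Some(batch_size),
        Some(limit) => {
            // Share of the limit each partition must produce (rounded up).
            let per_partition = limit.div_ceil(partitions.max(1));
            if per_partition >= batch_size {
                // The limit is large; the default batch_size still applies.
                Some(batch_size)
            } else if per_partition < MIN_COALESCE_ROWS {
                // The limit is small enough that coalescing buys nothing.
                None
            } else {
                // Cap the batch size so a batch is emitted before the
                // coalesce would otherwise block waiting for more rows.
                Some(per_partition)
            }
        }
    }
}
```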
Describe the solution you'd like
The fetch is determined from the limit operator's value and the current parallelism, roughly the limit divided by the partition count, as shown below.
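For instance, applying the sketch above to the example plan, where 4 partitions and MIN_COALESCE_ROWS = 128 are assumptions, not values from the issue:

```rust
// Worked numbers for fetch=100 and the default batch_size of 8192.
assert_eq!(coalesce_target(Some(100), 4, 8192), None);           // 100/4 = 25 rows: skip coalescing
assert_eq!(coalesce_target(Some(20_000), 4, 8192), Some(5_000)); // cap fetch at limit/partitions
assert_eq!(coalesce_target(None, 4, 8192), Some(8192));          // no limit: default batch_size
```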
Describe alternatives you've considered
When operators beneath the limit operator must consume their entire input (e.g., SortExec), batch_size is not handled specially.
Additional context
No response