-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Description
Is your feature request related to a problem or challenge?
Most (all?) nodes in DataFusion explain plans have the property that they are displayed with the same name as the struct that implements them, which makes matching explain plans to code easy
However, a notable exception is RepartitionExec which is displayed sometimes as RepartitionExec and sometimes as SortPreservingRepartitionExec to signify it preserves the input sort order, which makes finding the relevant code harder (as there is no such code as SortPreservingRepartitionExec)
For example in the following plan RepartitionExec and SortPreservingRepartitionExec` are the same ExecutionPlan node:
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| SortPreservingMergeExec: [cpu@2 ASC NULLS LAST,time@1 ASC NULLS LAST] |
| SortExec: expr=[cpu@2 ASC NULLS LAST,time@1 ASC NULLS LAST] |
| ProjectionExec: expr=[cpu as iox::measurement, time@0 as time, cpu@1 as cpu, coalesce(P@2 / 99.99, 0) as P] |
| RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=16 |
| ProjectionExec: expr=[time@1 as time, cpu@2 as cpu, (selector_max(cpu.usage_system,cpu.time)@3).[value] as P] |
| CoalesceBatchesExec: target_batch_size=8192 |
| FilterExec: AVG(cpu.usage_idle)@4 > 0.5 AND AVG(cpu.usage_idle)@4 < 1.5 AND (selector_max(cpu.usage_system,cpu.time))[value]selector_max(cpu.usage_system,cpu.time)@0 > 0 AND (selector_max(cpu.usage_system,cpu.time))[value]selector_max(cpu.usage_system,cpu.time)@0 < 1 |
| ProjectionExec: expr=[(selector_max(cpu.usage_system,cpu.time)@2).[value] as (selector_max(cpu.usage_system,cpu.time))[value]selector_max(cpu.usage_system,cpu.time), time@0 as time, cpu@1 as cpu, selector_max(cpu.usage_system,cpu.time)@2 as selector_max(cpu.usage_system,cpu.time), AVG(cpu.usage_idle)@3 as AVG(cpu.usage_idle)] |
| AggregateExec: mode=FinalPartitioned, gby=[time@0 as time, cpu@1 as cpu], aggr=[selector_max(cpu.usage_system,cpu.time), AVG(cpu.usage_idle)], ordering_mode=PartiallyOrdered |
| CoalesceBatchesExec: target_batch_size=8192 |
| SortPreservingRepartitionExec: partitioning=Hash([time@0, cpu@1], 16), input_partitions=16, sort_exprs=cpu@1 ASC |
| AggregateExec: mode=Partial, gby=[date_bin(10000000000, time@1, 0) as time, cpu@0 as cpu], aggr=[selector_max(cpu.usage_system,cpu.time), AVG(cpu.usage_idle)], ordering_mode=PartiallyOrdered |
| RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1 |
| ProjectionExec: expr=[cpu@1 as cpu, time@3 as time, usage_idle@4 as usage_idle, usage_system@5 as usage_system] |
| DeduplicateExec: [cpu@1 ASC,host@2 ASC,time@3 ASC] |
| SortExec: expr=[cpu@1 ASC,host@2 ASC,time@3 ASC,__chunk_order@0 ASC] |
| CoalesceBatchesExec: target_batch_size=8192 |
| FilterExec: time@3 <= 1699620021099501000 |
| RecordBatchesExec: chunks=1, projection=[__chunk_order, cpu, host, time, usage_idle, usage_system] |
|
Describe the solution you'd like
I would like to change the display of RepartitionExec so it always looks like RepartitionExec and has preserve_order=true to follow the same pattern as the other nodes where the output of the explain plan matches the name of the code that implements it.
So for example, name should always be RepartitionExec: https://github.com/apache/arrow-datafusion/blob/91c9d6f847eda0b5b1d01257b5c24459651d3926/datafusion/physical-plan/src/repartition/mod.rs#L373-L379
And instead of
SortPreservingRepartitionExec: partitioning=Hash([time@0, cpu@1], 16), input_partitions=16, sort_exprs=cpu@1 ASC
It should display like
RepartitionExec: partitioning=Hash([time@0, cpu@1], 16), input_partitions=16, preserve_order=true, sort_exprs=cpu@1 ASC
Describe alternatives you've considered
I recommend ONLY displaying preserve_order when it is true
so when false, display like this:
RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=16
Additional context
I think this is a good first issue as it is well specified and a mechanical change and would teach the person about datafusions tests. Beware, however, that this is likely to require a non trivial number of changes to expected test output.
Thanks in advance!