Skip to content

Change display of RepartitionExec from SortPreservingRepartitionExec to RepartitionExec preserve_order=true` #8129

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

Most (all?) nodes in DataFusion explain plans have the property that they are displayed with the same name as the struct that implements them, which makes matching explain plans to code easy

However, a notable exception is RepartitionExec which is displayed sometimes as RepartitionExec and sometimes as SortPreservingRepartitionExec to signify it preserves the input sort order, which makes finding the relevant code harder (as there is no such code as SortPreservingRepartitionExec)

For example in the following plan RepartitionExec and SortPreservingRepartitionExec` are the same ExecutionPlan node:

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan                                                                                                                                                                                                                                                                                                                                                  |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| SortPreservingMergeExec: [cpu@2 ASC NULLS LAST,time@1 ASC NULLS LAST]                                                                                                                                                                                                                                                                                 |
|   SortExec: expr=[cpu@2 ASC NULLS LAST,time@1 ASC NULLS LAST]                                                                                                                                                                                                                                                                                         |
|     ProjectionExec: expr=[cpu as iox::measurement, time@0 as time, cpu@1 as cpu, coalesce(P@2 / 99.99, 0) as P]                                                                                                                                                                                                                                       |
|       RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=16                                                                                                                                                                                                                                                                          |
|         ProjectionExec: expr=[time@1 as time, cpu@2 as cpu, (selector_max(cpu.usage_system,cpu.time)@3).[value] as P]                                                                                                                                                                                                                                 |
|           CoalesceBatchesExec: target_batch_size=8192                                                                                                                                                                                                                                                                                                 |
|             FilterExec: AVG(cpu.usage_idle)@4 > 0.5 AND AVG(cpu.usage_idle)@4 < 1.5 AND (selector_max(cpu.usage_system,cpu.time))[value]selector_max(cpu.usage_system,cpu.time)@0 > 0 AND (selector_max(cpu.usage_system,cpu.time))[value]selector_max(cpu.usage_system,cpu.time)@0 < 1                                                               |
|               ProjectionExec: expr=[(selector_max(cpu.usage_system,cpu.time)@2).[value] as (selector_max(cpu.usage_system,cpu.time))[value]selector_max(cpu.usage_system,cpu.time), time@0 as time, cpu@1 as cpu, selector_max(cpu.usage_system,cpu.time)@2 as selector_max(cpu.usage_system,cpu.time), AVG(cpu.usage_idle)@3 as AVG(cpu.usage_idle)] |
|                 AggregateExec: mode=FinalPartitioned, gby=[time@0 as time, cpu@1 as cpu], aggr=[selector_max(cpu.usage_system,cpu.time), AVG(cpu.usage_idle)], ordering_mode=PartiallyOrdered                                                                                                                                                         |
|                   CoalesceBatchesExec: target_batch_size=8192                                                                                                                                                                                                                                                                                         |
|                     SortPreservingRepartitionExec: partitioning=Hash([time@0, cpu@1], 16), input_partitions=16, sort_exprs=cpu@1 ASC                                                                                                                                                                                                                  |
|                       AggregateExec: mode=Partial, gby=[date_bin(10000000000, time@1, 0) as time, cpu@0 as cpu], aggr=[selector_max(cpu.usage_system,cpu.time), AVG(cpu.usage_idle)], ordering_mode=PartiallyOrdered                                                                                                                                  |
|                         RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1                                                                                                                                                                                                                                                         |
|                           ProjectionExec: expr=[cpu@1 as cpu, time@3 as time, usage_idle@4 as usage_idle, usage_system@5 as usage_system]                                                                                                                                                                                                             |
|                             DeduplicateExec: [cpu@1 ASC,host@2 ASC,time@3 ASC]                                                                                                                                                                                                                                                                        |
|                               SortExec: expr=[cpu@1 ASC,host@2 ASC,time@3 ASC,__chunk_order@0 ASC]                                                                                                                                                                                                                                                    |
|                                 CoalesceBatchesExec: target_batch_size=8192                                                                                                                                                                                                                                                                           |
|                                   FilterExec: time@3 <= 1699620021099501000                                                                                                                                                                                                                                                                           |
|                                     RecordBatchesExec: chunks=1, projection=[__chunk_order, cpu, host, time, usage_idle, usage_system]                                                                                                                                                                                                                |
|                                                                                                                                             

Describe the solution you'd like

I would like to change the display of RepartitionExec so it always looks like RepartitionExec and has preserve_order=true to follow the same pattern as the other nodes where the output of the explain plan matches the name of the code that implements it.

So for example, name should always be RepartitionExec: https://github.com/apache/arrow-datafusion/blob/91c9d6f847eda0b5b1d01257b5c24459651d3926/datafusion/physical-plan/src/repartition/mod.rs#L373-L379

And instead of

 SortPreservingRepartitionExec: partitioning=Hash([time@0, cpu@1], 16), input_partitions=16, sort_exprs=cpu@1 ASC

It should display like

 RepartitionExec: partitioning=Hash([time@0, cpu@1], 16), input_partitions=16, preserve_order=true, sort_exprs=cpu@1 ASC

Describe alternatives you've considered

I recommend ONLY displaying preserve_order when it is true

so when false, display like this:

RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=16                       

Additional context

I think this is a good first issue as it is well specified and a mechanical change and would teach the person about datafusions tests. Beware, however, that this is likely to require a non trivial number of changes to expected test output.

Thanks in advance!

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions