Skip to content

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Oct 2, 2023

Which issue does this PR close?

closes #7722

Rationale for this change

While debugging an issue upgrading our code to use DataFuson, @ozankabak pointed me at the following config: #7671 (comment)

This setting (I think) controls if the DataFusion planner should prefer using the existing sort order or trying to maximize paralleilsm using repartition and re-sorting

It turns out to be the right one, but I don't think I would have found it without @ozankabak 's suggestion

I think the core of my challenge is that the current name describes how it modifies DataFusion's algorithms rather than what effect it has on the plans

What changes are included in this PR?

I propose to change the config to prefer_existing_sort and update the documentation

Are these changes tested?

existing tests

Are there any user-facing changes?

yes, a config setting has a different name (and this is a breaking API change)

@alamb alamb added the api change Changes the API exposed to users of the crate label Oct 2, 2023
@github-actions github-actions bot added the core Core DataFusion crate label Oct 2, 2023
Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the new name, thanks @alamb

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Oct 2, 2023
| datafusion.optimizer.repartition_windows | true | Should DataFusion repartition data using the partitions keys to execute window functions in parallel using the provided `target_partitions` level |
| datafusion.optimizer.repartition_sorts | true | Should DataFusion execute sorts in a per-partition fashion and merge afterwards instead of coalescing first and sorting globally. With this flag is enabled, plans in the form below `text "SortExec: [a@0 ASC]", " CoalescePartitionsExec", " RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1", ` would turn into the plan below which performs better in multithreaded environments `text "SortPreservingMergeExec: [a@0 ASC]", " SortExec: [a@0 ASC]", " RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1", ` |
| datafusion.optimizer.bounded_order_preserving_variants | false | When true, DataFusion will opportunistically remove sorts by replacing `RepartitionExec` with `SortPreservingRepartitionExec`, and `CoalescePartitionsExec` with `SortPreservingMergeExec`, even when the query is bounded. |
| datafusion.optimizer.prefer_existing_sort | false | When true, DataFusion will opportunistically remove sorts when the data is already sorted, replacing `RepartitionExec` with `SortPreservingRepartitionExec`, and `CoalescePartitionsExec` with `SortPreservingMergeExec`, When false, DataFusion will prefer to maximize the parallelism using `Repartition/Coalesce` and resort the data subsequently with `SortExec` |
Copy link
Member

@viirya viirya Oct 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| datafusion.optimizer.prefer_existing_sort | false | When true, DataFusion will opportunistically remove sorts when the data is already sorted, replacing `RepartitionExec` with `SortPreservingRepartitionExec`, and `CoalescePartitionsExec` with `SortPreservingMergeExec`, When false, DataFusion will prefer to maximize the parallelism using `Repartition/Coalesce` and resort the data subsequently with `SortExec` |
| datafusion.optimizer.prefer_existing_sort | false | When true, DataFusion will opportunistically remove sorts when the data is already sorted, replacing `RepartitionExec` with `SortPreservingRepartitionExec` (i.e., `RepartitionExec` with `preserve_order` as true), and `CoalescePartitionsExec` with `SortPreservingMergeExec`. When false, DataFusion will prefer to maximize the parallelism using `Repartition/Coalesce` and resort the data subsequently with `SortExec` |

/// When true, DataFusion will opportunistically remove sorts by replacing
/// `RepartitionExec` with `SortPreservingRepartitionExec`, and
/// When true, DataFusion will opportunistically remove sorts when the data is already sorted,
/// replacing `RepartitionExec` with `SortPreservingRepartitionExec`, and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, actually there is no SortPreservingRepartitionExec operator but it is a variant of RepartitionExec with preserve_order as true. It is a little confusion at first if trying to look for SortPreservingRepartitionExec type in IDE.

Maybe:

Suggested change
/// replacing `RepartitionExec` with `SortPreservingRepartitionExec`, and
/// replacing `RepartitionExec` with `SortPreservingRepartitionExec` (i.e., `RepartitionExec` with `preserve_order` as true), and

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, thank you @viirya -- I tried to improve the wording in 35c6748. Let me know what you think

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks @alamb

@alamb alamb merged commit 0408c2b into apache:main Oct 4, 2023
@alamb alamb deleted the alamb/order_preserving_docs branch October 4, 2023 21:20
Ted-Jiang pushed a commit to Ted-Jiang/arrow-datafusion that referenced this pull request Oct 7, 2023
…sort` and update docs (apache#7723)

* Improve documentation for bounded_order_preserving_variants config

* update docs

* fmt

* update config

* fix typo :facepalm

* prettier

* Reword for clarity
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api change Changes the API exposed to users of the crate core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bounded_order_preserving_variants configuration setting is confusingly named

3 participants