Rename bounded_order_preserving_variants config to prefer_existing_sort and update docs
#7723
Conversation
ozankabak
left a comment
I like the new name, thanks @alamb
docs/source/user-guide/configs.md
Outdated
| | datafusion.optimizer.repartition_windows | true | Should DataFusion repartition data using the partitions keys to execute window functions in parallel using the provided `target_partitions` level | | ||
| | datafusion.optimizer.repartition_sorts | true | Should DataFusion execute sorts in a per-partition fashion and merge afterwards instead of coalescing first and sorting globally. With this flag is enabled, plans in the form below `text "SortExec: [a@0 ASC]", " CoalescePartitionsExec", " RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1", ` would turn into the plan below which performs better in multithreaded environments `text "SortPreservingMergeExec: [a@0 ASC]", " SortExec: [a@0 ASC]", " RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1", ` | | ||
| | datafusion.optimizer.bounded_order_preserving_variants | false | When true, DataFusion will opportunistically remove sorts by replacing `RepartitionExec` with `SortPreservingRepartitionExec`, and `CoalescePartitionsExec` with `SortPreservingMergeExec`, even when the query is bounded. | | ||
| | datafusion.optimizer.prefer_existing_sort | false | When true, DataFusion will opportunistically remove sorts when the data is already sorted, replacing `RepartitionExec` with `SortPreservingRepartitionExec`, and `CoalescePartitionsExec` with `SortPreservingMergeExec`, When false, DataFusion will prefer to maximize the parallelism using `Repartition/Coalesce` and resort the data subsequently with `SortExec` | |
| | datafusion.optimizer.prefer_existing_sort | false | When true, DataFusion will opportunistically remove sorts when the data is already sorted, replacing `RepartitionExec` with `SortPreservingRepartitionExec`, and `CoalescePartitionsExec` with `SortPreservingMergeExec`, When false, DataFusion will prefer to maximize the parallelism using `Repartition/Coalesce` and resort the data subsequently with `SortExec` | | |
| | datafusion.optimizer.prefer_existing_sort | false | When true, DataFusion will opportunistically remove sorts when the data is already sorted, replacing `RepartitionExec` with `SortPreservingRepartitionExec` (i.e., `RepartitionExec` with `preserve_order` as true), and `CoalescePartitionsExec` with `SortPreservingMergeExec`. When false, DataFusion will prefer to maximize the parallelism using `Repartition/Coalesce` and resort the data subsequently with `SortExec` | |
datafusion/common/src/config.rs
Outdated
| /// When true, DataFusion will opportunistically remove sorts by replacing | ||
| /// `RepartitionExec` with `SortPreservingRepartitionExec`, and | ||
| /// When true, DataFusion will opportunistically remove sorts when the data is already sorted, | ||
| /// replacing `RepartitionExec` with `SortPreservingRepartitionExec`, and |
Hmm, actually there is no `SortPreservingRepartitionExec` operator; it is a variant of `RepartitionExec` with `preserve_order` set to true. It is a little confusing at first if you try to look for a `SortPreservingRepartitionExec` type in an IDE.
Maybe:
| /// replacing `RepartitionExec` with `SortPreservingRepartitionExec`, and | |
| /// replacing `RepartitionExec` with `SortPreservingRepartitionExec` (i.e., `RepartitionExec` with `preserve_order` as true), and |
Looks good to me. Thanks @alamb
…sort` and update docs (apache#7723)
* Improve documentation for bounded_order_preserving_variants config
* update docs
* fmt
* update config
* fix typo :facepalm
* prettier
* Reword for clarity
Which issue does this PR close?
closes #7722
Rationale for this change
While debugging an issue upgrading our code to use DataFusion, @ozankabak pointed me at the following config: #7671 (comment)
This setting (I think) controls whether the DataFusion planner should prefer using the existing sort order or try to maximize parallelism by repartitioning and re-sorting.
It turns out to be the right one, but I don't think I would have found it without @ozankabak's suggestion.
I think the core of my challenge is that the current name describes how it modifies DataFusion's algorithms rather than what effect it has on the plans.
What changes are included in this PR?
I propose to change the config to `prefer_existing_sort` and update the documentation
Are these changes tested?
existing tests
Are there any user-facing changes?
yes, a config setting has a different name (and this is a breaking API change)
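To make the trade-off behind the renamed setting concrete, here is a minimal toy sketch of the two plan shapes the documentation describes for input that is already sorted. The `Operator` enum and the `operators_above_sorted_input` helper are hypothetical names invented for this illustration; they are not DataFusion types or APIs, and the real optimizer rule is considerably more involved.

```rust
/// Toy stand-ins for the physical operators named in the config docs.
/// Illustrative only: these are NOT DataFusion's real types.
#[derive(Debug, Clone, PartialEq)]
enum Operator {
    /// `RepartitionExec`; `preserve_order: true` models the variant the docs
    /// call `SortPreservingRepartitionExec`.
    Repartition { preserve_order: bool },
    /// `CoalescePartitionsExec`: merges partitions without keeping order.
    CoalescePartitions,
    /// `SortPreservingMergeExec`: merges sorted partitions, keeping order.
    SortPreservingMerge,
    /// `SortExec`: an explicit re-sort.
    Sort,
}

/// Hypothetical helper: which operators a simplified planner would place on
/// top of an already sorted input, depending on the flag.
fn operators_above_sorted_input(prefer_existing_sort: bool) -> Vec<Operator> {
    if prefer_existing_sort {
        // Keep the existing order, so no re-sort is needed.
        vec![
            Operator::Repartition { preserve_order: true },
            Operator::SortPreservingMerge,
        ]
    } else {
        // Maximize parallelism first, then restore the order with a sort.
        vec![
            Operator::Repartition { preserve_order: false },
            Operator::CoalescePartitions,
            Operator::Sort,
        ]
    }
}
```

In this model, `operators_above_sorted_input(true)` contains no `Operator::Sort`, while `operators_above_sorted_input(false)` ends with one, which is the effect on plans that the new name is meant to describe.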