@alamb alamb commented Apr 5, 2023

Which issue does this PR close?

Rationale for this change

While reviewing #5881 I had to look up what exactly "new_with_partitioning" did and what all the arguments meant and found the whole thing quite obtuse. Let's make this clearer.

What changes are included in this PR?

  1. Add doc comments better explaining what SortExec does and what preserve_partitioning does
  2. Add SortExec::new(), which is infallible (try_new always returned Ok)
  3. Add SortExec::with_preserve_partitioning to set the preserve partitioning flag
  4. Add SortExec::with_fetch to set the fetch
  5. Mark SortExec::try_new() and SortExec::new_with_partitioning as deprecated
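
The API shape listed above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the DataFusion source: the `String` fields stand in for the real plan and expression types, the defaults chosen in `new` and the deprecation notes are assumptions, and `build_sort` is a hypothetical helper.

```rust
// Simplified stand-in for `SortExec`. The field types are placeholders
// (assumptions for this sketch), not DataFusion's real plan and expression types.
#[derive(Debug, Clone, PartialEq)]
pub struct SortExec {
    pub input: String,     // stands in for the child execution plan
    pub expr: Vec<String>, // stands in for the sort expressions
    pub preserve_partitioning: bool,
    pub fetch: Option<usize>,
}

impl SortExec {
    /// Infallible constructor; the flag and fetch start at assumed defaults.
    pub fn new(expr: Vec<String>, input: String) -> Self {
        Self { input, expr, preserve_partitioning: false, fetch: None }
    }

    /// Set whether the sort keeps the partitioning of its input.
    pub fn with_preserve_partitioning(mut self, preserve_partitioning: bool) -> Self {
        self.preserve_partitioning = preserve_partitioning;
        self
    }

    /// Set the optional number of rows to fetch.
    pub fn with_fetch(mut self, fetch: Option<usize>) -> Self {
        self.fetch = fetch;
        self
    }

    /// Old-style constructor kept for compatibility; it can never fail.
    #[deprecated(note = "use `new` and `with_fetch` instead")]
    pub fn try_new(
        expr: Vec<String>,
        input: String,
        fetch: Option<usize>,
    ) -> Result<Self, String> {
        Ok(Self::new(expr, input).with_fetch(fetch))
    }

    /// Old-style constructor that takes every option positionally.
    #[deprecated(note = "use `new`, `with_preserve_partitioning` and `with_fetch` instead")]
    pub fn new_with_partitioning(
        expr: Vec<String>,
        input: String,
        preserve_partitioning: bool,
        fetch: Option<usize>,
    ) -> Self {
        Self::new(expr, input)
            .with_preserve_partitioning(preserve_partitioning)
            .with_fetch(fetch)
    }
}

/// Hypothetical helper: builds a node through the builder-style API.
pub fn build_sort() -> SortExec {
    SortExec::new(vec!["a ASC".to_string()], "repartition".to_string())
        .with_preserve_partitioning(true)
        .with_fetch(Some(10))
}
```

Keeping the deprecated constructors as thin wrappers over the builder methods means existing callers keep compiling (with a warning) while new call sites name each option explicitly.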

Are these changes tested?

Yes, by clippy and existing tests

Are there any user-facing changes?

The old APIs are deprecated, but they will still work

@github-actions github-actions bot added the core Core DataFusion crate label Apr 5, 2023
)) as _;
let sort2 = Arc::new(
    SortExec::new(sort_exprs.clone(), repartition_exec, None)
        .with_preserve_partitioning(true),
Contributor Author

I think this makes it clearer that the sort is being constructed to preserve the input partitioning
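
As a rough illustration of what preserving the input partitioning means for a sort, here is a toy model over plain vectors. `sort_partitions` is a hypothetical helper, not DataFusion code; it assumes that with the flag set each partition is sorted independently, and that without it the output is a single globally sorted partition. The real operator works on record batches, and how rows reach a single partition is left to the planner.

```rust
/// Toy model of sorting partitioned input (hypothetical helper, not DataFusion code).
pub fn sort_partitions(input: Vec<Vec<i32>>, preserve_partitioning: bool) -> Vec<Vec<i32>> {
    if preserve_partitioning {
        // Sort every partition on its own; the partition count is unchanged.
        input
            .into_iter()
            .map(|mut partition| {
                partition.sort();
                partition
            })
            .collect()
    } else {
        // Gather all rows into one partition, then sort them globally.
        let mut all: Vec<i32> = input.into_iter().flatten().collect();
        all.sort();
        vec![all]
    }
}
```

With the flag set, each output partition is sorted but there is no ordering across partitions, so a consumer that needs a total order still has to merge them.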

@alamb alamb requested a review from tustvold April 5, 2023 21:04

alamb commented Apr 6, 2023

While looking at the code again, I realized that fetch was also almost always None, so rather than having to provide a None at every creation site, I added a with_fetch function in 8c5bee4. Since I was changing the signature of new() anyway, I figured I might as well do it in the same PR.

}

/// Whether this `SortExec` preserves partitioning of the children
pub fn with_fetch(mut self, fetch: Option<usize>) -> Self {
Contributor

I wonder if limit might be a more standard name for this?

Contributor Author

I think the term comes from Spark (some earlier versions of the code had the word Limit). In any event I think fetch is at least consistent with the rest of the DataFusion codebase, such as in the implementation of sort:

https://github.com/apache/arrow-datafusion/blob/d9cffa6b18e0796dc718cec3ab94bc0119da747a/datafusion/core/src/physical_plan/sorts/sort.rs#L335
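
For readers unfamiliar with the term, a fetch on a sort behaves like a limit applied to the sorted output. The sketch below shows that idea on a plain vector; `sort_with_fetch` is a hypothetical helper written for illustration, not the code at the link above.

```rust
/// Sort the rows, then keep at most `fetch` of them (hypothetical helper).
pub fn sort_with_fetch(mut rows: Vec<i32>, fetch: Option<usize>) -> Vec<i32> {
    rows.sort();
    if let Some(n) = fetch {
        // `truncate` is a no-op when `n` is larger than the number of rows.
        rows.truncate(n);
    }
    rows
}
```

Passing `None` returns every row in sorted order, which matches the observation that most creation sites do not need a fetch at all.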

@alamb alamb merged commit db2fa44 into apache:main Apr 10, 2023
korowa pushed a commit to korowa/arrow-datafusion that referenced this pull request Apr 13, 2023
* Clean up SortExec creation and add doc comments

* Reduce API surface

* restore sort bench

* fix benchmark

* Add with_fetch