GH-33206: [C++] Add support for StructArray sorting and nested sort keys #35727
Note that this doesn't currently extend nested key support to |
7e0a2fb to 3360b04
@pitrou Sorry, I pushed changes to this but forgot to re-ping... There are a few things that could probably be tweaked still, but I wanted to get an opinion on the general approach here first (regarding sorting by entire struct fields).
pitrou
left a comment
This looks more flexible now @benibus, thanks! Some comments below.
Instead of leaking tmp_indices into the caller like this, can we have a toplevel AddField that doesn't take this parameter?
We could. Although doing it this way potentially avoids reallocating the vector for every path (it should retain its max capacity).
I mean the toplevel AddField can delegate to the existing functions, but hide the tmp_indices dance from its caller.
Oh, you mean across all sort keys. Fair enough.
That said, the fact that you're also passing an inout-parameter that gets appended to advocates for a struct carrying that state:
struct SortFieldPopulator {
 public:
  Result<std::vector<SortField>> FindSortKeys(...);
 protected:
  std::vector<SortField> sort_fields_;
  std::unordered_set<FieldPath> seen_;
  std::vector<int> tmp_indices_;
};
3360b04 to d9ace64
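For readers following the thread, here is a rough, self-contained sketch of the state-carrying struct idea from the review above. It is not the code from this PR: `SchemaNode`, `SortField`, `AddField`, `AddFieldImpl` and `Finish` are hypothetical stand-ins, `FieldPath` is simplified to a plain vector of child indices, and it assumes that a struct-typed sort key expands into one sort field per leaf. The point is only that the toplevel `AddField` hides the scratch vector while still reusing its capacity across sort keys.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins, not Arrow's types: a field path is a list of child
// indices, and a schema node is either a leaf or a struct with children.
using FieldPath = std::vector<int>;

struct SchemaNode {
  std::string name;
  std::vector<SchemaNode> children;  // empty for non-struct fields
};

struct SortField {
  FieldPath path;
  bool descending;
};

class SortFieldPopulator {
 public:
  // Toplevel entry point: callers never see the scratch vector.
  void AddField(const SchemaNode& field, int index, bool descending) {
    tmp_indices_.clear();  // clear() keeps the allocated capacity for reuse
    tmp_indices_.push_back(index);
    AddFieldImpl(field, descending);
  }

  // Hands the accumulated sort fields to the caller and resets the state.
  std::vector<SortField> Finish() {
    std::vector<SortField> out = std::move(sort_fields_);
    sort_fields_.clear();
    seen_.clear();
    return out;
  }

 private:
  void AddFieldImpl(const SchemaNode& field, bool descending) {
    if (field.children.empty()) {
      // Leaf: record the path once, skipping duplicates.
      if (seen_.insert(tmp_indices_).second) {
        sort_fields_.push_back({tmp_indices_, descending});
      }
      return;
    }
    // Struct: expand into one sort field per leaf, in declaration order.
    for (std::size_t i = 0; i < field.children.size(); ++i) {
      tmp_indices_.push_back(static_cast<int>(i));
      AddFieldImpl(field.children[i], descending);
      tmp_indices_.pop_back();
    }
  }

  std::vector<SortField> sort_fields_;
  std::set<FieldPath> seen_;  // std::set avoids needing a hash for FieldPath
  std::vector<int> tmp_indices_;
};
```

The sketch uses `std::set` rather than `std::unordered_set` only because the simplified `FieldPath` has no hash function here.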
pitrou
left a comment
Looking good, one minor suggestion.
Conbench analyzed the 6 benchmark runs on this commit. There were 26 benchmark results indicating a performance regression. The full Conbench report has more details.
### Rationale for this change

Fixes a regression introduced in #35727.

### What changes are included in this PR?

Re-implements a branch in the `Table` sorter that defers to the `ChunkedArray` sorter for single sort keys.

### Are these changes tested?

Covered by existing tests.

### Are there any user-facing changes?

No.

* Closes: #36176

Authored-by: benibus <bpharks@gmx.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
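As a rough illustration of the kind of branch the follow-up commit describes, the toy sketch below dispatches a single-key table sort to a dedicated single-column sorter and only falls back to a generic multi-key comparator otherwise. Everything here is an assumption made for illustration: `Column`, `Table`, `SortKey`, `SortColumn` and `SortTable` are hypothetical names over plain `std::vector`s and do not reflect Arrow's actual sorter classes.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical stand-ins: a column is a vector of integers, a table is a
// vector of equally sized columns.
using Column = std::vector<std::int64_t>;
using Table = std::vector<Column>;

struct SortKey {
  std::size_t column;
  bool descending;
};

// Specialized single-column sorter: no per-key loop inside the comparator.
std::vector<std::size_t> SortColumn(const Column& col, bool descending) {
  std::vector<std::size_t> idx(col.size());
  std::iota(idx.begin(), idx.end(), std::size_t{0});
  std::stable_sort(idx.begin(), idx.end(), [&](std::size_t l, std::size_t r) {
    return descending ? col[l] > col[r] : col[l] < col[r];
  });
  return idx;
}

std::vector<std::size_t> SortTable(const Table& table,
                                   const std::vector<SortKey>& keys) {
  // Fast path: a single sort key defers to the single-column sorter.
  if (keys.size() == 1) {
    return SortColumn(table[keys[0].column], keys[0].descending);
  }
  // Generic path: compare row pairs key by key.
  const std::size_t num_rows = table.empty() ? 0 : table[0].size();
  std::vector<std::size_t> idx(num_rows);
  std::iota(idx.begin(), idx.end(), std::size_t{0});
  std::stable_sort(idx.begin(), idx.end(), [&](std::size_t l, std::size_t r) {
    for (const SortKey& key : keys) {
      const std::int64_t a = table[key.column][l];
      const std::int64_t b = table[key.column][r];
      if (a != b) return key.descending ? a > b : a < b;
    }
    return false;  // rows are equal on all keys
  });
  return idx;
}
```

Both paths return the same kind of result (row indices in sorted order), so callers are unaffected by which branch runs.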
Rationale for this change
We don't currently support sorting StructArrays despite already having the high-level facilities to do so. For instance, we allow passing multiple sort keys (based on FieldRefs) to sort record batches and tables - but the current implementations are fairly limited since nested refs aren't allowed (due to the burden of null flattening). Since #35197, we now have an easier way to do this.
What changes are included in this PR?
* StructArray in sort_indices
* sort_indices for RecordBatch, ChunkedArray, and Table
Are these changes tested?
Yes (tests are included)
Are there any user-facing changes?
Yes
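To make the "null flattening" burden mentioned in the rationale concrete, here is a small standalone sketch. It does not use Arrow: `Inner`, `Row`, `FlattenB` and `SortIndices` are hypothetical names, the nested field is modeled with `std::optional`, and nulls are assumed to sort last. The idea is that a nested key such as `a.b` is null whenever either the parent struct or the child value is null, so the parent's validity has to be folded into the child before the usual index sort can run.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <optional>
#include <vector>

// Hypothetical model of a struct column "a" with one nullable child "b".
struct Inner {
  std::optional<int> b;
};
using Row = std::optional<Inner>;  // the parent struct itself may be null

// Flatten the nested field a.b: the result is null if either the parent
// struct or the child value is null.
std::vector<std::optional<int>> FlattenB(const std::vector<Row>& rows) {
  std::vector<std::optional<int>> out;
  out.reserve(rows.size());
  for (const Row& row : rows) {
    if (row.has_value()) {
      out.push_back(row->b);
    } else {
      out.push_back(std::nullopt);
    }
  }
  return out;
}

// Stable ascending index sort with nulls placed last (an assumption here).
std::vector<std::size_t> SortIndices(
    const std::vector<std::optional<int>>& values) {
  std::vector<std::size_t> idx(values.size());
  std::iota(idx.begin(), idx.end(), std::size_t{0});
  std::stable_sort(idx.begin(), idx.end(), [&](std::size_t l, std::size_t r) {
    const std::optional<int>& a = values[l];
    const std::optional<int>& b = values[r];
    if (a.has_value() != b.has_value()) return a.has_value();  // non-null first
    return a.has_value() && *a < *b;
  });
  return idx;
}
```

Sorting by the nested key then amounts to `SortIndices(FlattenB(rows))`; rows with a null parent and rows with a null child end up in the same trailing group, in their original relative order.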