Skip to content

Conversation

@benibus
Copy link
Contributor

@benibus benibus commented May 23, 2023

Rationale for this change

We don't currently support sorting StructArrays despite already having the high-level facilities to do so. For instance, we allow passing multiple sort keys (based on FieldRefs) to sort record batches and tables - but the current implementations are fairly limited since nested refs aren't allowed (due to the burden of null flattening). Since #35197, we now have an easier way to do this.

What changes are included in this PR?

  • Adds support for StructArray in sort_indices
  • Adds support for nested sort keys in sort_indices for RecordBatch, ChunkedArray, and Table

Are these changes tested?

Yes (tests are included)

Are there any user-facing changes?

Yes

@github-actions
Copy link

@github-actions
Copy link

⚠️ GitHub issue #33206 has been automatically assigned in GitHub to PR creator.

@benibus benibus marked this pull request as ready for review May 24, 2023 15:24
@benibus benibus requested a review from westonpace as a code owner May 24, 2023 15:24
@benibus
Copy link
Contributor Author

benibus commented May 24, 2023

Note that this doesn't currently extend nested key support to rank and selectk, but that would be a logical next step.

@github-actions github-actions bot removed the awaiting review Awaiting review label May 30, 2023
@github-actions github-actions bot added the awaiting committer review Awaiting committer review label May 30, 2023
@benibus benibus force-pushed the GH-33206-struct-sorting branch from 7e0a2fb to 3360b04 Compare June 7, 2023 14:26
@benibus benibus requested a review from pitrou June 12, 2023 05:04
@benibus
Copy link
Contributor Author

benibus commented Jun 12, 2023

@pitrou Sorry, I pushed changes to this but forgot to re-ping...

There are a few things that could probably be tweaked still, but I wanted to get an opinion on the general approach here first (regarding sorting by entire struct fields).

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks more flexible now @benibus, thanks! Some comments below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of leaking tmp_indices into the caller like this, can we have a toplevel AddField that doesn't take this parameter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could. Although doing it this way potentially avoids reallocating the vector for every path (it should retain its max capacity).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean the toplevel AddField can delegate to the existing functions, but hide the tmp_indices dance from its caller.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you mean accross all sort keys. Fair enough.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, the fact that you're also passing an inout-parameter that gets appended to advocates for a struct carrying that state:

struct SortFieldPopulator {
 public:
  Result<std::vector<SortField>> FindSortKeys(...)

 protected:
  std::vector<SortField> sort_fields_;
  std::unordered_set<FieldPath> seen_;
  std::vector<int> tmp_indices_;
};

@benibus benibus force-pushed the GH-33206-struct-sorting branch from 3360b04 to d9ace64 Compare June 19, 2023 00:36
@benibus benibus requested a review from pitrou June 19, 2023 01:45
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, one minor suggestion.

@benibus benibus requested a review from pitrou June 19, 2023 17:34
@pitrou pitrou merged commit 0f65220 into apache:main Jun 20, 2023
@conbench-apache-arrow
Copy link

Conbench analyzed the 6 benchmark runs on commit 0f652201.

There were 26 benchmark results indicating a performance regression:

The full Conbench report has more details.

pitrou pushed a commit that referenced this pull request Jun 20, 2023
### Rationale for this change

Fixes a regression introduced in #35727.

### What changes are included in this PR?

Re-implements a branch in the `Table` sorter that defers to the `ChunkedArray` sorter for single sort keys.

### Are these changes tested?

Covered by existing tests.

### Are there any user-facing changes?

No.

* Closes: #36176

Authored-by: benibus <bpharks@gmx.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[C++] Ensure sorting Table, RecordBatch can sort Struct and Dictionary cols

2 participants