ARROW-14656: [Python] Add sorting for StructArray #14369

Jedi18 · 2022-10-11T07:46:12Z

This is the first PR for #11659 (splitting up this PR into multiple PRs). This PR adds a sort helper function for the array, chunked_array and array of structs (StructArray) and chunked array of structs sorting.

A NestedValuesComparator is added which is planned to be a common comparator for anything that has nested fields and columns. Follow up PRs will update the Table, RecordBatch etc sorting implementations to use this NestedValuesComparator.

Related issue: #33206

github-actions · 2022-10-11T07:58:12Z

https://issues.apache.org/jira/browse/ARROW-14656

jorisvandenbossche

Thanks for the PR! Only looked at the Python code for now, added a few minor comments.

Can you also add some tests to test_compute.py for the new capabilities of sort_indices on a StructArray? (I know this is implicitly tested through the Array.sort() method, but I think it's good to also have an explicit test for it. I think we will also want a C++ test for it)

python/pyarrow/array.pxi

jorisvandenbossche · 2022-10-26T10:04:45Z

python/pyarrow/array.pxi

+        if fieldname is not None:
+            tosort = self.field(fieldname)


We currently don't use "fieldname" anywhere. Maybe just field as the keyword argument?

field implied a Field instance to me, that's why I went for fieldname.

jorisvandenbossche · 2022-10-26T10:05:42Z

python/pyarrow/table.pxi


+    def sort(self, order="ascending"):
+        """
+        Sort the ChunkedArray


Should we support the fieldname keyword here as well? (and raise an error in case the type of the ChunkedArray is not a struct)

The ChunkedArray doesn't provide support for accessing fields directly (IE: no ChunkedArray.field(X) method) so we would probably have to implement that too if we want to add support for sorting an individual field. Which seems to expand the scope of this change. I propose we make a separate ticket for improving support for Struct data in ChunkedArray

Yes, that's indeed a general issue that ChunkedArray can be annoying when working with a type that type-specific methods on the array (that's not only for StructArray)

pitrou · 2022-10-26T14:19:04Z

Thanks @Jedi18 for attempting this. At a high level, two things:

Can you first tackle just the C++ side and write tests on that side?

Follow up PRs will update the Table, RecordBatch etc sorting implementations to use this NestedValuesComparator.

Did you actually measure performance? I expect this to be slower that the current RecordBatch and Table sorting.

jorisvandenbossche · 2022-11-22T13:45:45Z

@Jedi18 do you have time to update this?

Jedi18 · 2022-11-22T13:59:57Z

@jorisvandenbossche so sorry about the late reply, I missed the github emails and hadn't checked this PR in a long while. Yes I'll address the review comments in the next few days when I get time.

Jedi18 · 2022-11-22T14:07:06Z

@pitrou Sorry for the late reply, just saw the comments now

Can you first tackle just the C++ side and write tests on that side?

Sure I'll do that first. There was no straightforward way to test it on the C++ side hence the python tests. I'll look into it a bit more and see how I can test it in the unit tests.

Did you actually measure performance? I expect this to be slower that the current RecordBatch and Table sorting.

I did not, and yeah that's one of the reasons for wanting seperate PRs for those so I can at the very least mirror the current implementation and performance.
Are you saying we should get those done first and measure it before we think about merging this in case it turns out to be slower than the current implementation?

pitrou · 2022-11-22T14:51:21Z

Are you saying we should get those done first and measure it before we think about merging this in case it turns out to be slower than the current implementation?

Yes, this would be better. Otherwise I would suggest directly reusing the current Table/Batch sorting for Struct array sorting (the one thing missing being handling the top-level null bitmap).

pitrou · 2022-11-22T14:54:43Z

For more context, the original Table/Batch sort was based on something similar to NestedValuesComparator, and I significantly improved its performance by switching to per-column sorting for cases with few keys.

You can look at the source code references for RadixRecordBatchSorter and MultipleKeyRecordBatchSorter.

amol- · 2022-11-23T10:21:13Z

Are you saying we should get those done first and measure it before we think about merging this in case it turns out to be slower than the current implementation?

Yes, this would be better. Otherwise I would suggest directly reusing the current Table/Batch sorting for Struct array sorting (the one thing missing being handling the top-level null bitmap).

To avoid mixing too many concerns in a single activity, I'd like to propose we focus this task on providing the capability and then tackle performance improvements and benchmarking into a dedicated activity. I agree we don't want to introduce any slowdown, but as far as we are talking about adding the capability to something that wasn't supported, I think we can live for a certain amount of time with having struct fields sorting slower than table sorting while we work on performance improvements for it.

This has been a feature under work for months at this point and I think it would provide immediate value to users while giving us the time to work on refining and improving it.

Obviously any work at migrating Tables and RecordBatches to the NestedValueComparator (or whatever will come) should be subordinate to ensuring the comparator is on pair with the current implementation.

pitrou · 2022-11-23T10:37:56Z

I disagree about introducing non-trivial C++ code that is redundant with existing code and probably sub-performant.

jorisvandenbossche · 2022-11-24T15:03:40Z

I agree with @pitrou that it seems better to first check if the implementation we already have isn't actually better to reuse. Otherwise the time spent on finalizing and reviewing a new implementation might turn out to have been unnecessary.

As far as I understand, we already have a sort implementation that can handle nested (struct-like) data, which is used for Table/RecordBatch, and which could also be used for StructArray (which additionally needs handling of a top-level validity bitmap).
So I think we can indeed focus first on providing the capability by first reusing whatever we already have, and then in subsequent tasks we can still further see if a new implementation could provide a speed-up for certain cases or not.

pitrou · 2022-11-24T15:11:41Z

To be clear, the current multi-column sorting facilities don't handle recursive nesting, only the top level of nesting. To handle recursive nesting, you need to flatten the columns. This can be done using StructArray::Flatten.

Another thing to do add is support for a top-level null bitmap. But that should be relatively easy. The most tedious part of all this is probably the moderate refactoring required.

pitrou · 2022-11-24T15:13:17Z

But I agree with quickly benchmarking the approach in this PR compared to the existing facility. It should be easy: 1) generate a RecordBatch (respectively Table); 2) convert it to a Struct Array (respectively ChunkedArray); 3) benchmark sorting one and the other using the same sorting parameters.

amol- · 2023-04-05T10:09:16Z

@benibus I know you are planning to work on the sorting, are you going to take over this PR or start fresh? My understanding is that we rejected the approach in this PR because of probable performance regressions. So if you plan to start fresh we can probably close this one

benibus · 2023-04-05T14:53:08Z

@amol- Yeah, the plan is to open a new PR with a different approach. I'll go ahead and close this.

github-actions bot added Component: C++ Component: Python labels Oct 11, 2022

Jedi18 added 4 commits October 11, 2022 16:18

tests and struct array sorting for arrays

546825d

add struct support for chunked array sorting

53fd8c7

minor simplification

0fe7ed4

add more struct chunked array tests

9aa0787

Jedi18 force-pushed the ARROW-14656_2 branch from 6dceffb to 02f3242 Compare October 11, 2022 10:48

Jedi18 marked this pull request as ready for review October 11, 2022 11:26

remove redundant compare overload

e27fb3a

Jedi18 force-pushed the ARROW-14656_2 branch from 02f3242 to e27fb3a Compare October 11, 2022 13:13

jorisvandenbossche reviewed Oct 26, 2022

View reviewed changes

jorisvandenbossche mentioned this pull request Dec 1, 2022

GH-14778: [Python] Add (Chunked)Array sort() and RecordBatch.sort_by methods #14781

Merged

asfimport mentioned this pull request Dec 2, 2022

[Python] Add sort_by method to Array, StructArray and ChunkedArray #30198

Closed

This was referenced Mar 22, 2023

ARROW-14656: [Python] sort helper for Array, ChunkedArray and StructArray #11659

Closed

[C++] Ensure sorting Table, RecordBatch can sort Struct and Dictionary cols #33206

Closed

jorisvandenbossche changed the title ~~ARROW-14656: [Python] Add sort helper function for Array, ChunkedArray and StructArray~~ ARROW-14656: [Python] Add sorting for StructArray Mar 22, 2023

benibus closed this Apr 5, 2023

ARROW-14656: [Python] Add sorting for StructArray #14369

ARROW-14656: [Python] Add sorting for StructArray #14369

Uh oh!

Conversation

Jedi18 commented Oct 11, 2022 • edited by jorisvandenbossche Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Oct 11, 2022

Uh oh!

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

jorisvandenbossche Oct 26, 2022

Choose a reason for hiding this comment

Uh oh!

amol- Dec 1, 2022

Choose a reason for hiding this comment

Uh oh!

jorisvandenbossche Oct 26, 2022

Choose a reason for hiding this comment

Uh oh!

amol- Dec 1, 2022

Choose a reason for hiding this comment

Uh oh!

jorisvandenbossche Dec 2, 2022

Choose a reason for hiding this comment

Uh oh!

pitrou commented Oct 26, 2022

Uh oh!

jorisvandenbossche commented Nov 22, 2022

Uh oh!

Jedi18 commented Nov 22, 2022

Uh oh!

Jedi18 commented Nov 22, 2022

Uh oh!

pitrou commented Nov 22, 2022

Uh oh!

pitrou commented Nov 22, 2022

Uh oh!

amol- commented Nov 23, 2022

Uh oh!

pitrou commented Nov 23, 2022

Uh oh!

jorisvandenbossche commented Nov 24, 2022

Uh oh!

pitrou commented Nov 24, 2022

Uh oh!

pitrou commented Nov 24, 2022

Uh oh!

amol- commented Apr 5, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

benibus commented Apr 5, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Jedi18 commented Oct 11, 2022 •

edited by jorisvandenbossche

Loading

amol- commented Apr 5, 2023 •

edited

Loading