-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-14656: [Python] Add sorting for StructArray #14369
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
jorisvandenbossche
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Only looked at the Python code for now, added a few minor comments.
Can you also add some tests to test_compute.py for the new capabilities of sort_indices on a StructArray? (I know this is implicitly tested through the Array.sort() method, but I think it's good to also have an explicit test for it. I think we will also want a C++ test for it)
| if fieldname is not None: | ||
| tosort = self.field(fieldname) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We currently don't use "fieldname" anywhere. Maybe just field as the keyword argument?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
field implied a Field instance to me, that's why I went for fieldname.
|
|
||
| def sort(self, order="ascending"): | ||
| """ | ||
| Sort the ChunkedArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we support the fieldname keyword here as well? (and raise an error in case the type of the ChunkedArray is not a struct)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ChunkedArray doesn't provide support for accessing fields directly (IE: no ChunkedArray.field(X) method) so we would probably have to implement that too if we want to add support for sorting an individual field. Which seems to expand the scope of this change. I propose we make a separate ticket for improving support for Struct data in ChunkedArray
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's indeed a general issue that ChunkedArray can be annoying when working with a type that type-specific methods on the array (that's not only for StructArray)
|
Thanks @Jedi18 for attempting this. At a high level, two things:
Did you actually measure performance? I expect this to be slower that the current RecordBatch and Table sorting. |
|
@Jedi18 do you have time to update this? |
|
@jorisvandenbossche so sorry about the late reply, I missed the github emails and hadn't checked this PR in a long while. Yes I'll address the review comments in the next few days when I get time. |
|
@pitrou Sorry for the late reply, just saw the comments now
Sure I'll do that first. There was no straightforward way to test it on the C++ side hence the python tests. I'll look into it a bit more and see how I can test it in the unit tests.
I did not, and yeah that's one of the reasons for wanting seperate PRs for those so I can at the very least mirror the current implementation and performance. |
Yes, this would be better. Otherwise I would suggest directly reusing the current Table/Batch sorting for Struct array sorting (the one thing missing being handling the top-level null bitmap). |
|
For more context, the original Table/Batch sort was based on something similar to NestedValuesComparator, and I significantly improved its performance by switching to per-column sorting for cases with few keys. You can look at the source code references for |
To avoid mixing too many concerns in a single activity, I'd like to propose we focus this task on providing the capability and then tackle performance improvements and benchmarking into a dedicated activity. I agree we don't want to introduce any slowdown, but as far as we are talking about adding the capability to something that wasn't supported, I think we can live for a certain amount of time with having struct fields sorting slower than table sorting while we work on performance improvements for it. This has been a feature under work for months at this point and I think it would provide immediate value to users while giving us the time to work on refining and improving it. Obviously any work at migrating Tables and RecordBatches to the NestedValueComparator (or whatever will come) should be subordinate to ensuring the comparator is on pair with the current implementation. |
|
I disagree about introducing non-trivial C++ code that is redundant with existing code and probably sub-performant. |
|
I agree with @pitrou that it seems better to first check if the implementation we already have isn't actually better to reuse. Otherwise the time spent on finalizing and reviewing a new implementation might turn out to have been unnecessary. As far as I understand, we already have a sort implementation that can handle nested (struct-like) data, which is used for Table/RecordBatch, and which could also be used for StructArray (which additionally needs handling of a top-level validity bitmap). |
|
To be clear, the current multi-column sorting facilities don't handle recursive nesting, only the top level of nesting. To handle recursive nesting, you need to flatten the columns. This can be done using Another thing to do add is support for a top-level null bitmap. But that should be relatively easy. The most tedious part of all this is probably the moderate refactoring required. |
|
But I agree with quickly benchmarking the approach in this PR compared to the existing facility. It should be easy: 1) generate a RecordBatch (respectively Table); 2) convert it to a Struct Array (respectively ChunkedArray); 3) benchmark sorting one and the other using the same sorting parameters. |
|
@benibus I know you are planning to work on the sorting, are you going to take over this PR or start fresh? My understanding is that we rejected the approach in this PR because of probable performance regressions. So if you plan to start fresh we can probably close this one |
|
@amol- Yeah, the plan is to open a new PR with a different approach. I'll go ahead and close this. |
This is the first PR for #11659 (splitting up this PR into multiple PRs). This PR adds a sort helper function for the array, chunked_array and array of structs (StructArray) and chunked array of structs sorting.
A
NestedValuesComparatoris added which is planned to be a common comparator for anything that has nested fields and columns. Follow up PRs will update the Table, RecordBatch etc sorting implementations to use thisNestedValuesComparator.Related issue: #33206