Skip to content

Conversation

@Jedi18
Copy link
Contributor

@Jedi18 Jedi18 commented Oct 11, 2022

This is the first PR for #11659 (splitting up this PR into multiple PRs). This PR adds a sort helper function for the array, chunked_array and array of structs (StructArray) and chunked array of structs sorting.

A NestedValuesComparator is added which is planned to be a common comparator for anything that has nested fields and columns. Follow up PRs will update the Table, RecordBatch etc sorting implementations to use this NestedValuesComparator.

Related issue: #33206

@github-actions
Copy link

@Jedi18 Jedi18 marked this pull request as ready for review October 11, 2022 11:26
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Only looked at the Python code for now, added a few minor comments.

Can you also add some tests to test_compute.py for the new capabilities of sort_indices on a StructArray? (I know this is implicitly tested through the Array.sort() method, but I think it's good to also have an explicit test for it. I think we will also want a C++ test for it)

Comment on lines +2781 to +2782
if fieldname is not None:
tosort = self.field(fieldname)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We currently don't use "fieldname" anywhere. Maybe just field as the keyword argument?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

field implied a Field instance to me, that's why I went for fieldname.


def sort(self, order="ascending"):
"""
Sort the ChunkedArray
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we support the fieldname keyword here as well? (and raise an error in case the type of the ChunkedArray is not a struct)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ChunkedArray doesn't provide support for accessing fields directly (IE: no ChunkedArray.field(X) method) so we would probably have to implement that too if we want to add support for sorting an individual field. Which seems to expand the scope of this change. I propose we make a separate ticket for improving support for Struct data in ChunkedArray

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's indeed a general issue that ChunkedArray can be annoying when working with a type that type-specific methods on the array (that's not only for StructArray)

@pitrou
Copy link
Member

pitrou commented Oct 26, 2022

Thanks @Jedi18 for attempting this. At a high level, two things:

  1. Can you first tackle just the C++ side and write tests on that side?

Follow up PRs will update the Table, RecordBatch etc sorting implementations to use this NestedValuesComparator.

Did you actually measure performance? I expect this to be slower that the current RecordBatch and Table sorting.

@jorisvandenbossche
Copy link
Member

@Jedi18 do you have time to update this?

@Jedi18
Copy link
Contributor Author

Jedi18 commented Nov 22, 2022

@jorisvandenbossche so sorry about the late reply, I missed the github emails and hadn't checked this PR in a long while. Yes I'll address the review comments in the next few days when I get time.

@Jedi18
Copy link
Contributor Author

Jedi18 commented Nov 22, 2022

@pitrou Sorry for the late reply, just saw the comments now

Can you first tackle just the C++ side and write tests on that side?

Sure I'll do that first. There was no straightforward way to test it on the C++ side hence the python tests. I'll look into it a bit more and see how I can test it in the unit tests.

Did you actually measure performance? I expect this to be slower that the current RecordBatch and Table sorting.

I did not, and yeah that's one of the reasons for wanting seperate PRs for those so I can at the very least mirror the current implementation and performance.
Are you saying we should get those done first and measure it before we think about merging this in case it turns out to be slower than the current implementation?

@pitrou
Copy link
Member

pitrou commented Nov 22, 2022

Are you saying we should get those done first and measure it before we think about merging this in case it turns out to be slower than the current implementation?

Yes, this would be better. Otherwise I would suggest directly reusing the current Table/Batch sorting for Struct array sorting (the one thing missing being handling the top-level null bitmap).

@pitrou
Copy link
Member

pitrou commented Nov 22, 2022

For more context, the original Table/Batch sort was based on something similar to NestedValuesComparator, and I significantly improved its performance by switching to per-column sorting for cases with few keys.

You can look at the source code references for RadixRecordBatchSorter and MultipleKeyRecordBatchSorter.

@amol-
Copy link
Member

amol- commented Nov 23, 2022

Are you saying we should get those done first and measure it before we think about merging this in case it turns out to be slower than the current implementation?

Yes, this would be better. Otherwise I would suggest directly reusing the current Table/Batch sorting for Struct array sorting (the one thing missing being handling the top-level null bitmap).

To avoid mixing too many concerns in a single activity, I'd like to propose we focus this task on providing the capability and then tackle performance improvements and benchmarking into a dedicated activity. I agree we don't want to introduce any slowdown, but as far as we are talking about adding the capability to something that wasn't supported, I think we can live for a certain amount of time with having struct fields sorting slower than table sorting while we work on performance improvements for it.

This has been a feature under work for months at this point and I think it would provide immediate value to users while giving us the time to work on refining and improving it.

Obviously any work at migrating Tables and RecordBatches to the NestedValueComparator (or whatever will come) should be subordinate to ensuring the comparator is on pair with the current implementation.

@pitrou
Copy link
Member

pitrou commented Nov 23, 2022

I disagree about introducing non-trivial C++ code that is redundant with existing code and probably sub-performant.

@jorisvandenbossche
Copy link
Member

I agree with @pitrou that it seems better to first check if the implementation we already have isn't actually better to reuse. Otherwise the time spent on finalizing and reviewing a new implementation might turn out to have been unnecessary.

As far as I understand, we already have a sort implementation that can handle nested (struct-like) data, which is used for Table/RecordBatch, and which could also be used for StructArray (which additionally needs handling of a top-level validity bitmap).
So I think we can indeed focus first on providing the capability by first reusing whatever we already have, and then in subsequent tasks we can still further see if a new implementation could provide a speed-up for certain cases or not.

@pitrou
Copy link
Member

pitrou commented Nov 24, 2022

To be clear, the current multi-column sorting facilities don't handle recursive nesting, only the top level of nesting. To handle recursive nesting, you need to flatten the columns. This can be done using StructArray::Flatten.

Another thing to do add is support for a top-level null bitmap. But that should be relatively easy. The most tedious part of all this is probably the moderate refactoring required.

@pitrou
Copy link
Member

pitrou commented Nov 24, 2022

But I agree with quickly benchmarking the approach in this PR compared to the existing facility. It should be easy: 1) generate a RecordBatch (respectively Table); 2) convert it to a Struct Array (respectively ChunkedArray); 3) benchmark sorting one and the other using the same sorting parameters.

@amol-
Copy link
Member

amol- commented Apr 5, 2023

@benibus I know you are planning to work on the sorting, are you going to take over this PR or start fresh? My understanding is that we rejected the approach in this PR because of probable performance regressions. So if you plan to start fresh we can probably close this one

@benibus
Copy link
Contributor

benibus commented Apr 5, 2023

@amol- Yeah, the plan is to open a new PR with a different approach. I'll go ahead and close this.

@benibus benibus closed this Apr 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants