GH-14778: [Python] Add (Chunked)Array sort() and RecordBatch.sort_by methods #14781
|
I left a few inline comments when reviewing #14369 (review); I think those are still relevant for this subset as well.
|
@jorisvandenbossche I should have addressed your feedback from #14369 (review).
python/pyarrow/tests/test_table.py (outdated)
    assert sorted_rb_dict["b"] == [2, 3, 4, 1]
    assert sorted_rb_dict["c"] == ["foobar", "bar", "foo", "car"]

    # test multi-key record batch sorter (> 8 sort keys)
Just wondering: is there something specific about more than 8 keys?
Do you know about this one? (or it was just copied from the other PR?) It seems overly complex for the functionality that this PR is adding
There are two record batch sorters, RadixRecordBatchSorter and MultipleKeyRecordBatchSorter. I wanted to test both implementations so that's why this was added.
    // Radix sorting is consistently faster except when there is a large number
@Jedi18 thanks for the clarification! Since this stripped-down version of the PR doesn't actually touch the RecordBatch sort implementation, it might not be necessary to add those tests (I assume those two sorter variants are already tested in C++)?
I do see C++ unit tests with more than 8 keys, so yes, I guess it's safe to assume both implementations are tested. I agree that we could probably remove these extra tests, since the Python tests should not be concerned with the details of the internal C++ implementation.
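For readers following the thread, here is a minimal pure-Python sketch of what a multi-key sort over a set of columns is expected to produce, independent of which internal sorter handles it. It is only a reference model, not pyarrow code and not the C++ implementation: the function name `sort_by` and the `(column_name, order)` key format are chosen here for illustration, and the sample data is made up.

```python
def sort_by(columns, sort_keys):
    """Return a sorted copy of `columns`, a dict of equal-length lists.

    `sort_keys` is a list of (column_name, order) pairs, where order is
    "ascending" or "descending". Earlier keys take precedence.
    This is an illustrative model only, not the pyarrow implementation.
    """
    num_rows = len(next(iter(columns.values())))
    indices = list(range(num_rows))
    # Python's sort is stable, so sorting by the least significant key first
    # and the most significant key last gives a lexicographic multi-key order.
    for name, order in reversed(sort_keys):
        indices.sort(key=lambda i: columns[name][i],
                     reverse=(order == "descending"))
    # Reorder every column with the same row permutation.
    return {name: [values[i] for i in indices]
            for name, values in columns.items()}


# Hypothetical data: sort by "a" descending, then by "b" ascending.
batch = {"a": [1, 1, 2, 2], "b": [4, 3, 2, 1]}
result = sort_by(batch, [("a", "descending"), ("b", "ascending")])
print(result)
```

A behaviour-level check like this gives the same answer for 2 keys or 9 keys, which is why the Python tests do not need to know where the internal sorter switches strategy.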
|
@jorisvandenbossche please re-review |
|
I just realized there is an issue with the simple workaround of sorting a StructArray by selecting one of its fields: it ignores top-level nulls. Consider this example: This is due to what the
Oh! Thanks for catching this. I took for granted that |
I think this would just be to use
Yes, I did as well, we were just discussing something similar for union arrays in another PR. I am planning to open an issue proposing to change this. |
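The top-level null problem discussed above can be illustrated with a small pure-Python model. This is a sketch under stated assumptions, not the example from the original comment and not pyarrow code: it assumes only that a struct array keeps its own validity mask separately from its child arrays, as in the Arrow columnar format. The helper names and the data are hypothetical.

```python
def select_field(validity, child):
    """Naive field selection: return the child values as stored,
    ignoring the struct-level validity mask."""
    return list(child)


def select_field_with_parent_nulls(validity, child):
    """Field selection that propagates top-level nulls into the result."""
    return [value if valid else None
            for valid, value in zip(validity, child)]


def sort_indices(values):
    """Indices that sort `values` ascending, with None placed last."""
    return sorted(
        range(len(values)),
        key=lambda i: (values[i] is None,
                       values[i] if values[i] is not None else 0),
    )


# Hypothetical struct array with three rows; the second row is null at the
# struct level, but the child array still holds a physical value (0) there.
struct_validity = [True, False, True]
child_a = [5, 0, 3]

naive = sort_indices(select_field(struct_validity, child_a))
masked = sort_indices(select_field_with_parent_nulls(struct_validity, child_a))
print(naive)   # the null row is ordered by its hidden physical value
print(masked)  # the null row is placed last
```

In this model the naive selection orders the null row first because of the value stored underneath it, while the masked selection places it at the end with the other nulls.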
|
@jorisvandenbossche the |
|
Benchmark runs are scheduled for baseline = 5c1044f and contender = 8a34732. 8a34732 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |