Skip to content

Conversation

@amol-
Copy link
Member

@amol- amol- commented Nov 30, 2022

@apache apache deleted a comment from github-actions bot Nov 30, 2022
@jorisvandenbossche jorisvandenbossche changed the title GH-14778: Add Python sorting helpers GH-14778: [Python] Add (Chunked)Array sort() method Dec 1, 2022
@apache apache deleted a comment from github-actions bot Dec 1, 2022
@github-actions
Copy link

github-actions bot commented Dec 1, 2022

@apache apache deleted a comment from github-actions bot Dec 1, 2022
@jorisvandenbossche
Copy link
Member

I left a few inline comment when reviewing #14369 (review), I think those are still relevant for this subset as well.

@amol-
Copy link
Member Author

amol- commented Dec 1, 2022

@jorisvandenbossche I should have addressed your feedbacks from #14369 (review)

assert sorted_rb_dict["b"] == [2, 3, 4, 1]
assert sorted_rb_dict["c"] == ["foobar", "bar", "foo", "car"]

# test multi-key record batch sorter (> 8 sort keys)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering: is there something specific about more than 8 keys?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know about this one? (or it was just copied from the other PR?) It seems overly complex for the functionality that this PR is adding

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two record batch sorters, RadixRecordBatchSorter and MultipleKeyRecordBatchSorter. I wanted to test both implementations so that's why this was added.

// Radix sorting is consistently faster except when there is a large number

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jedi18 thanks for the clarification! Since this stripped down version of the PR actually doesn't touch the RecordBatch sort implementation, it might not be needed to add those tests (I assume those two variants of sorters are also already tested in C++)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do see c++ unit tests with more than 8 keys so yes I guess it's safe to assume both implementations are tested. I agree that we could probably remove these extra tests since the python tests should not be concerned with the details of the internal cpp implementation.

@amol-
Copy link
Member Author

amol- commented Dec 2, 2022

@jorisvandenbossche please re-review

@jorisvandenbossche
Copy link
Member

I just realized an issue with the simple workaround for sorting a StructArray by selecting one of its fields, and that is that this ignores top-level nulls ..

Consider this example:

In [25]: arr = pa.StructArray.from_arrays([pa.array([5, 3, 4, 2, 1]), pa.array([1, 2, 3, 4, 5])], names=['a', 'b'], mask=pa.array([False, True, False, False, False]))

In [27]: arr.to_pylist()
Out[27]: [{'a': 5, 'b': 1}, None, {'a': 4, 'b': 3}, {'a': 2, 'b': 4}, {'a': 1, 'b': 5}]

In [30]: arr_sorted = arr.take(pc.sort_indices(arr.field('a')))

In [31]: arr_sorted.to_pylist()
Out[31]: [{'a': 1, 'b': 5}, {'a': 2, 'b': 4}, None, {'a': 4, 'b': 3}, {'a': 5, 'b': 1}]

This is due to what the field() method returns, of course. But we also have the StructArray.flatten() method, which gives the correct array to sort by (it only does this for all fields, while we only need one; but we could also directly use the StructArray::GetFlattenedField which is used by flatten() under the hood):

In [36]: arr_sorted2 = arr.take(pc.sort_indices(arr.flatten()[0]))

In [37]: arr_sorted2.to_pylist()
Out[37]: [{'a': 1, 'b': 5}, {'a': 2, 'b': 4}, {'a': 4, 'b': 3}, {'a': 5, 'b': 1}, None]

@amol-
Copy link
Member Author

amol- commented Dec 6, 2022

I just realized an issue with the simple workaround for sorting a StructArray by selecting one of its fields, and that is that this ignores top-level nulls ..

Oh! Thanks for catching this. I took for granted that field would account for the array validity bitmap. At least, that seemed a reasonable expectation. I'll make sure to update the StructArray helper to account for the base array validity bitmap.

@jorisvandenbossche
Copy link
Member

I'll make sure to update the StructArray helper to account for the base array validity bitmap.

I think this would just be to use StructArray::GetFlattenedField instead of StructArray::field / GetFieldByName, this should be straightforward to use, it's just not yet exposed in cython.

I took for granted that field would account for the array validity bitmap

Yes, I did as well, we were just discussing something similar for union arrays in another PR. I am planning to open an issue proposing to change this.

@amol-
Copy link
Member Author

amol- commented Dec 14, 2022

@jorisvandenbossche the StructArray.sort does now account for the parent array nulls. Also I simplified the test for recordbatch. While I recognise we haven't modified that in this PR, it's still good to keep the test around.

@ursabot
Copy link

ursabot commented Dec 16, 2022

Benchmark runs are scheduled for baseline = 5c1044f and contender = 8a34732. 8a34732 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.5% ⬆️0.03%] test-mac-arm
[Finished ⬇️0.82% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.34% ⬆️0.03%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 8a347323 ec2-t3-xlarge-us-east-2
[Finished] 8a347323 test-mac-arm
[Finished] 8a347323 ursa-i9-9960x
[Finished] 8a347323 ursa-thinkcentre-m75q
[Finished] 5c1044fc ec2-t3-xlarge-us-east-2
[Finished] 5c1044fc test-mac-arm
[Finished] 5c1044fc ursa-i9-9960x
[Finished] 5c1044fc ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@jorisvandenbossche jorisvandenbossche changed the title GH-14778: [Python] Add (Chunked)Array sort() method GH-14778: [Python] Add (Chunked)Array sort() and RecordBatch.sort_by methods Mar 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Python] Add sort_by helper to Table, RecordBatch, Array

4 participants