-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-12016 [C++] Implement array_sort_indices and sort_indices for BOOL type #10585
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@pitrou Can you take a look at this? |
|
@nirandaperera as part of this PR, could you please make two small changes to the R package tests to exercise this new capability?
Thanks! |
|
@ianmcook Done! :-) |
|
@ianmcook it looks like something else is missing. |
|
@nirandaperera You may want to add a benchmark in |
|
Also, if you want to work on performance, note that a dedicated counting sort for boolean should be really simple. |
@pitrou I added a simple benchmark now. I'll add the improved version and run it against that bench. Didn't have to do much for RecordBatches and Tables because they are already using |
c1ef90b to
9cffeae
Compare
|
@pitrou I added a separate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really, you needn't write this yourself. Just call null_count() and true_count().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh okay. I did this way because we can count both nulls and trues in a single pass. But sure, I'll use that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds rather weird. Are you sure about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, I didn't test this. Just following the ArrayCountSorter impl.
https://github.com/apache/arrow/blob/36d7a562f08d77dbf5ed699d486badcc9403f619/cpp/src/arrow/compute/kernels/vector_sort.cc#L438
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment was added by me very early when implementing counting sort approach.
I think it's possibly due to 32bit counter array is smaller and have more chance to stay in L1 cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless it's benchmarked here as well, maybe let's remove the comment to avoid being confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is there in the primitive type impl as well. Should that be removed as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That one was benchmarked!
More seriously…maybe at least edit them to reflect that it was done due to a benchmark. Though I worry about the comment effectively bitrotting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's worth testing the non-null case separately. This will make less test code to maintain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here.
lidavidm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. One minor comment about that confusing comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless it's benchmarked here as well, maybe let's remove the comment to avoid being confusing.
|
I made the suggested changes and I think this is ready now |
|
If we're removing the 32 vs 64 bit counter branch, can we benchmark it to make sure there's no impact? |
I'm still thinking how to do this benchmark? 😄 Because we can't call separate |
|
It would be a before vs after benchmark not side by side |
|
@nirandaperera MSVC doesn't look happy. |
|
@nirandaperera @kszucs The MSVC error seems unrelated to this PR and is cause by a timeout in a Flight test. |
|
I'm referring to these errors (appveyor not mingw). |
I believe this is due to a method being declared |
I tested this with the |
|
@ursabot please benchmark lang=C++ |
|
Thanks for checking. Let's also check with Conbench and if that's alright, then let's merge. |
|
Benchmark runs are scheduled for baseline = d7a8b46 and contender = 85445cf37da25a953bb15478938853613b64cc18. Results will be available as each benchmark for each run completes. |
|
@nirandaperera this apparently needs rebasing against master before we can run Conbench on it |
|
@ursabot please benchmark lang=C++ |
|
Benchmark runs are scheduled for baseline = 737492e and contender = 992f8dcf38caf30405f100636760a77e5e98d056. Results will be available as each benchmark for each run completes. |
|
@ursabot please benchmark lang=C++ |
|
Benchmark runs are scheduled for baseline = 737492e and contender = 257527c. Results will be available as each benchmark for each run completes. |
Adding
array_sort_indicesandpartition_nth_indicesforBooleanTypeusing existing sort and Nth-partition utils.This may be rather inefficient, since the values are traversed bit-by-bit rather than working on a byte/word.
May be we could work on it as a separate improvement?