[TOPI] Rewrite GPU argwhere using exclusive scan #7314
Could we add a column for the performance of the PR without thrust (i.e., with the TIR exclusive scan)?
mbrookhart
left a comment
I'd like to include benchmarks without thrust in the PR for posterity, but otherwise this looks great, thanks! I'd wait to merge until @zhiics can review, since he wrote the existing kernel.
OK, updated the numbers to include the TIR scan result.
👍 Not as fast as thrust, as expected, but it's good to see it's still a performance improvement.
zhiics
left a comment
Thanks for the improvement.
85a91e9 to 63469a6
Thanks @mbrookhart @zhiics
* use ex scan to write argwhere
* add doc
This PR improves the implementation of GPU `argwhere` added in #6868, using exclusive scan (see #7303).

The current implementation of `argwhere` is very inefficient, because it uses an atomic operation to update the write location. Since all threads compete for that single location, this effectively makes it a sequential kernel. Moreover, since the output indices need to be lexicographically sorted, the current implementation involves sorting along each axis.

Since `argwhere` is literally an instance of stream compaction, this is a perfect application of exclusive scan. Now, `argwhere` simply consists of two steps, both of which are highly parallel operations. Thus, both the atomic and the sort are gone, vastly simplifying the implementation. Moreover, it also brings a huge speedup, as shown below.
All numbers are in milliseconds.
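For readers unfamiliar with scan-based stream compaction, here is a minimal pure-Python sketch of the idea. It is not the TIR kernel from this PR; the function names (`exclusive_scan`, `unravel`, `argwhere_scan`) and the flat row-major input layout are assumptions made for illustration. Each loop iteration in the final step is independent, which is what lets a GPU assign one thread per element with no atomic and no sort.

```python
from itertools import accumulate


def exclusive_scan(values):
    """Exclusive prefix sum: out[i] == sum(values[:i]). Hypothetical helper."""
    inclusive = list(accumulate(values))
    return [0] + inclusive[:-1] if values else []


def unravel(flat_index, shape):
    """Convert a row-major flat index into a multi-dimensional index."""
    coords = []
    for dim in reversed(shape):
        coords.append(flat_index % dim)
        flat_index //= dim
    return tuple(reversed(coords))


def argwhere_scan(flat_data, shape):
    """Illustrative argwhere via stream compaction (assumed row-major input)."""
    # Step 1: flag nonzero elements, then scan the flags. The exclusive scan
    # gives each nonzero element the output row it should write to.
    flags = [1 if x != 0 else 0 for x in flat_data]
    write_pos = exclusive_scan(flags)
    num_nonzero = (write_pos[-1] + flags[-1]) if flags else 0

    # Step 2: compaction. Every flagged element writes its own index to its
    # own slot, so no two writers ever touch the same location.
    out = [None] * num_nonzero
    for i, flag in enumerate(flags):
        if flag:
            out[write_pos[i]] = unravel(i, shape)
    return out


result = argwhere_scan([0, 3, 0, 0, 5, 7], (2, 3))
print(result)  # [(0, 1), (1, 1), (1, 2)]
```

Because the write positions increase with the flat index, the output comes out in lexicographic order without a separate sorting pass.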
Please review @zhiics @Laurawly @mbrookhart @tkonolige @anijain2305 @trevor-m