[Perf] Add cross-GPU subgroup.ballot(predicate) primitive #600

Draft
hughperkins wants to merge 1 commit into main from hp/cross-gpu-ballot
@hughperkins
Collaborator

Implement a portable ballot operation that returns a u32 bitmask where bit i is set if lane i's predicate is non-zero. Works across CUDA (__ballot_sync), AMDGPU (amdgcn_ballot.i32), and SPIR-V/Vulkan (OpGroupNonUniformBallot).
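
To make the bit layout concrete, here is a minimal pure-Python reference model of the semantics described above. It is only a sketch for reasoning about expected results, not the code in this PR: `ballot_reference` is a hypothetical helper name, and the list of per-lane predicates stands in for the values each lane would hold on the GPU.

```python
def ballot_reference(predicates):
    """Model a subgroup ballot over at most 32 lanes.

    predicates[i] is the predicate value held by lane i.
    Returns an integer whose bit i is set when predicates[i] is non-zero.
    """
    if len(predicates) > 32:
        raise ValueError("this model covers at most 32 lanes")
    mask = 0
    for lane, predicate in enumerate(predicates):
        if predicate != 0:
            mask |= 1 << lane  # set bit `lane` for a non-zero vote
    return mask


# Lanes 0, 2, and 3 hold non-zero predicates; lane 1 holds zero.
print(bin(ballot_reference([1, 0, 5, -1])))  # 0b1101
```

Any non-zero value counts as a vote, so `5` and `-1` set their bits just as `1` does.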

Follows the same cross-backend pattern as subgroup.shuffle: a single Python API (subgroup.ballot) dispatches to the appropriate backend intrinsic at codegen time. On AMDGPU CDNA with 64-wide wavefronts, only the low 32 bits (lanes 0-31) are returned, consistent with the u32 return type.
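
The 64-wide case can be modelled the same way. The sketch below assumes the truncation behaves like masking a 64-bit ballot down to its low 32 bits. `ballot_low32` is a hypothetical name used for illustration, and how the backend actually produces the truncated value is not shown here.

```python
U32_MASK = 0xFFFFFFFF


def ballot_low32(predicates):
    """Model a ballot on a 64-wide wavefront that returns a u32.

    Builds the full per-lane mask, then keeps only bits 0..31,
    so votes from lanes 32..63 do not appear in the result.
    """
    if len(predicates) > 64:
        raise ValueError("this model covers at most 64 lanes")
    full_mask = 0
    for lane, predicate in enumerate(predicates):
        if predicate != 0:
            full_mask |= 1 << lane
    return full_mask & U32_MASK  # drop the upper 32 lanes


votes = [0] * 64
votes[3] = 1   # visible: lane 3 is in the low half
votes[40] = 1  # dropped: lane 40 is in the high half
print(hex(ballot_low32(votes)))  # 0x8
```

A caller that needs votes from all 64 lanes therefore cannot rely on this return value alone.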

Issue: #

Brief Summary

copilot:summary

Walkthrough

copilot:walkthrough
