perf(amdgpu): add subgroupDppSwapPairs for intra-wavefront pair exchange #14
michaelselehov wants to merge 5 commits into amd-integration from
Expose the AMDGPU v_mov_b32_dpp quad_perm:[1,0,3,2] instruction as a new subgroup intrinsic. This enables sub-wave parallelism patterns where adjacent lanes exchange and reduce values without going through LDS, cutting the reduction to a single VALU-pipe cycle.

- Register the op in internal_ops.inc.h and type_system.cpp
- AMDGPU codegen: emit llvm.amdgcn.update.dpp with ctrl=0xB1 (32-bit native, 64-bit via lo/hi split)
- CPU/base codegen: guard with QD_ERROR + actionable message
- Python binding: qd.simt.subgroup.dpp_swap_pairs(value)

Assisted-by: Claude Opus
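To make the lane mapping concrete, here is a minimal pure-Python emulation of a quad_perm:[1,0,3,2] exchange. It is only a sketch: the helper names (`quad_perm_ctrl`, `dpp_quad_perm`, `swap_pairs_64`) are hypothetical and do not come from this PR, and the 64-bit helper merely assumes that the lo/hi split permutes each 32-bit half independently. It also shows how packing the four 2-bit lane selectors gives the 0xB1 control value.

```python
# Illustrative emulation only; not the Quadrants or LLVM implementation.
SWAP_PAIRS = [1, 0, 3, 2]  # quad_perm selectors: lane 0 reads lane 1, and so on

def quad_perm_ctrl(sel):
    """Pack four 2-bit source-lane selectors into one control byte."""
    assert len(sel) == 4 and all(0 <= s < 4 for s in sel)
    return sel[0] | (sel[1] << 2) | (sel[2] << 4) | (sel[3] << 6)

def dpp_quad_perm(values, sel):
    """Each lane reads from lane sel[lane % 4] within its own group of four."""
    assert len(values) % 4 == 0
    return [values[(lane & ~3) | sel[lane & 3]] for lane in range(len(values))]

def swap_pairs_64(values):
    """Assumed 64-bit path: permute the low and high 32-bit halves separately."""
    mask = 0xFFFFFFFF
    lo = dpp_quad_perm([v & mask for v in values], SWAP_PAIRS)
    hi = dpp_quad_perm([(v >> 32) & mask for v in values], SWAP_PAIRS)
    return [(h << 32) | l for l, h in zip(lo, hi)]

wave = list(range(8))                        # lane i holds the value i
swapped = dpp_quad_perm(wave, SWAP_PAIRS)    # each lane now holds its neighbour's value
pair_sums = [a + b for a, b in zip(wave, swapped)]  # both lanes of a pair hold the pair sum
```

In this emulation each lane ends up with the value of lane `i ^ 1`, so a single exchange followed by an add leaves the pair sum in both lanes of every pair.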
Desugar 3-argument range() into a while-loop at the AST level.
The Quadrants IR does not natively support a step parameter in
range-for, so range(start, stop, step) is lowered to:
    i = start
    while i < stop:
        <body>
        i += step
This eliminates the need for manual while-loop workarounds when
writing strided iteration patterns (e.g. sub-wave parallelism).
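The lowering described above can be checked against Python's built-in range() with a small sketch. The helper `lowered_range` is hypothetical and simply mirrors the while-loop form from the commit message. Because that form compares with `i < stop`, the sketch assumes a positive step; the description does not say how the actual transformer treats negative steps.

```python
def lowered_range(start, stop, step):
    """Collect the indices visited by the while-loop form (hypothetical helper)."""
    visited = []
    i = start
    while i < stop:
        visited.append(i)  # stands in for <body>
        i += step
    return visited

# Positive step: same indices as the built-in range().
forward = lowered_range(0, 10, 3)

# Negative step: the `i < stop` test fails immediately, so the loop body never runs,
# whereas the built-in range(10, 0, -3) would count down.
backward = lowered_range(10, 0, -3)
```

For positive steps the two forms visit the same indices, which is the property the updated test_range_for_three_arguments is described as checking.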
Summary
Two related changes that add DPP (Data Parallel Primitives) support for AMDGPU
Commit 1: New subgroup operation that swaps values between adjacent lane pairs within a
Changes:
The op follows the same pattern as existing subgroup operations
Commit 2: The AST transformer previously rejected 3-argument
Since the C++ IR (
    i = start
    while i < stop:
        <body>
        i += step
This is fully transparent to the user and requires no C++ changes.
Test Plan
test_range_for_three_arguments now verifies correct strided iteration instead of expecting QuadrantsCompilationError. test_exception_in_node_with_body uses range() (0 args) as the invalid construct instead of range(1, 2, 3) which is now valid.
yaoliu13
left a comment
This PR has merge conflicts with amd-integration that must be resolved before it can be merged. Please read the Confluence page "Genesis PR Review Process".