perf(amdgpu): add subgroupDppSwapPairs for intra-wavefront pair exchange by michaelselehov · Pull Request #14 · ROCm/quadrants

michaelselehov · 2026-04-24T09:01:33Z

Expose the AMDGPU v_mov_b32_dpp quad_perm:[1,0,3,2] instruction as a new subgroup intrinsic. This enables sub-wave parallelism patterns where adjacent lanes exchange and reduce values without going through LDS, cutting the reduction to a single VALU-pipe cycle.

- Register the op in internal_ops.inc.h and type_system.cpp
- AMDGPU codegen: emit llvm.amdgcn.update.dpp with ctrl=0xB1 (32-bit native, 64-bit via lo/hi split)
- CPU/base codegen: guard with QD_ERROR + actionable message
- Python binding: qd.simt.subgroup.dpp_swap_pairs(value)

Assisted-by: Claude Opus

Issue: #



Desugar 3-argument range() into a while-loop at the AST level. The Quadrants IR does not natively support a step parameter in range-for, so range(start, stop, step) is lowered to:

i = start
while i < stop:
    <body>
    i += step

This eliminates the need for manual while-loop workarounds when writing strided iteration patterns (e.g. sub-wave parallelism).

michaelselehov · 2026-04-24T09:32:23Z

Summary

Two related changes that add DPP (Data Parallel Primitives) support for AMDGPU
and fix a limitation in the Python frontend range-for loops.

Commit 1 — subgroupDppSwapPairs intrinsic

New subgroup operation that swaps values between adjacent lane pairs within a
wavefront using the AMDGPU v_mov_b32_dpp quad_perm:[1,0,3,2] instruction.
This is useful for intra-wavefront communication patterns where neighboring
threads need to exchange data without going through LDS.
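
As a rough illustration of the exchange-and-reduce pattern described above, the following is a pure-Python model of the lane permutation, not the Quadrants API. The helper names `swap_pairs_model` and `pair_sum_model` are hypothetical and exist only for this sketch; it assumes each lane `i` receives the value held by lane `i ^ 1`, which is what quad_perm:[1,0,3,2] implies.

```python
def swap_pairs_model(lanes):
    """Model of the swap: lane i reads the value held by its partner lane i ^ 1."""
    if len(lanes) % 2:
        raise ValueError("lane count must be even")
    return [lanes[i ^ 1] for i in range(len(lanes))]


def pair_sum_model(lanes):
    """Each lane adds its partner's value, so both lanes of a pair end up with the pair sum."""
    swapped = swap_pairs_model(lanes)
    return [own + other for own, other in zip(lanes, swapped)]


values = list(range(8))           # stand-in for the first 8 lanes of a wavefront
print(swap_pairs_model(values))   # [1, 0, 3, 2, 5, 4, 7, 6]
print(pair_sum_model(values))     # [1, 1, 5, 5, 9, 9, 13, 13]
```

On hardware the swap is a register-to-register move, so no LDS traffic is involved; the model only captures which lane reads which value.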

Changes:

- internal_ops.inc.h, type_system.cpp: register the new polymorphic op
  subgroupDppSwapPairs(ValueT) -> ValueT
- codegen_amdgpu.cpp: AMDGPU backend emits llvm.amdgcn.update.dpp with
  dpp_ctrl=0xB1 (quad_perm:[1,0,3,2]). Native 32-bit types go through
  directly; 64-bit types are split into lo/hi halves.
- codegen_llvm.cpp: base (CPU) codegen guards against GPU-only ops with a
  clear error message pointing the user to qd.static backend guards.
- subgroup.py: Python binding qd.simt.subgroup.dpp_swap_pairs(value)
The op follows the same pattern as existing subgroup operations
(subgroupBroadcast, subgroupAdd, etc.) and slots into the same visitor
dispatch.
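
To make the codegen details above concrete, here is a small Python sketch of two things: how the selector quad_perm:[1,0,3,2] packs into the control value 0xB1 (two bits per lane of a quad), and how a 64-bit value can be moved as two independent 32-bit halves. The function names are hypothetical and the code only models the lane arithmetic; it is not the C++ in codegen_amdgpu.cpp.

```python
SWAP_PAIRS = (1, 0, 3, 2)  # quad_perm selector quoted in the PR description


def quad_perm_ctrl(sel):
    """Pack a 4-entry quad_perm selector into an 8-bit control value, 2 bits per lane."""
    if len(sel) != 4 or any(not 0 <= s <= 3 for s in sel):
        raise ValueError("selector must hold four values in 0..3")
    ctrl = 0
    for lane, source in enumerate(sel):
        ctrl |= source << (2 * lane)
    return ctrl


def apply_quad_perm(lanes, sel):
    """Within each group of 4 lanes, lane i reads from lane sel[i % 4] of the same group."""
    if len(lanes) % 4:
        raise ValueError("lane count must be a multiple of 4")
    return [lanes[(i & ~3) | sel[i & 3]] for i in range(len(lanes))]


def swap_pairs_64(lanes):
    """Model of the 64-bit path: permute the low and high 32-bit halves separately."""
    lo = apply_quad_perm([v & 0xFFFFFFFF for v in lanes], SWAP_PAIRS)
    hi = apply_quad_perm([(v >> 32) & 0xFFFFFFFF for v in lanes], SWAP_PAIRS)
    return [(h << 32) | l for h, l in zip(hi, lo)]


print(hex(quad_perm_ctrl(SWAP_PAIRS)))  # 0xb1
```

Because both halves use the same permutation, recombining them gives the same result as moving the whole 64-bit value from the partner lane.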

Commit 2 — range(start, stop, step) support

The AST transformer previously rejected 3-argument range() calls with the
error "Range should have 1 or 2 arguments". This was a pain point when writing
strided iteration patterns: users had to fall back to manual while loops.

Since the C++ IR (FrontendForStmt) does not have a step field, we desugar
for i in range(start, stop, step) into a while-loop at the AST level:

i = start
while i < stop:
    <body>
    i += step

This is fully transparent to the user and requires no C++ changes.
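
The lowering above can be sketched with Python's standard `ast` module. This is a minimal illustration of the technique, not the Quadrants AST transformer: the class name `RangeStepDesugar` and the helper `desugar` are hypothetical, and the sketch only handles a plain-name loop variable with no `else` clause.

```python
import ast


class RangeStepDesugar(ast.NodeTransformer):
    """Rewrite `for i in range(a, b, s): body` into `i = a` plus a while-loop."""

    def visit_For(self, node):
        self.generic_visit(node)  # rewrite nested loops first
        call = node.iter
        is_range3 = (
            isinstance(call, ast.Call)
            and isinstance(call.func, ast.Name)
            and call.func.id == "range"
            and len(call.args) == 3
            and not call.keywords
            and isinstance(node.target, ast.Name)
            and not node.orelse
        )
        if not is_range3:
            return node  # leave 1- and 2-argument range loops untouched
        start, stop, step = call.args
        name = node.target.id
        init = ast.Assign(targets=[ast.Name(id=name, ctx=ast.Store())], value=start)
        test = ast.Compare(
            left=ast.Name(id=name, ctx=ast.Load()), ops=[ast.Lt()], comparators=[stop]
        )
        bump = ast.AugAssign(
            target=ast.Name(id=name, ctx=ast.Store()), op=ast.Add(), value=step
        )
        loop = ast.While(test=test, body=node.body + [bump], orelse=[])
        return [init, loop]  # one statement becomes two


def desugar(source):
    """Return the source text with 3-argument range loops lowered to while-loops."""
    tree = RangeStepDesugar().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


print(desugar("for i in range(0, 7, 2):\n    out.append(i)\n"))
```

A lowering of exactly this shape has limits that this sketch shares: the `i < stop` test only matches `range` semantics for a positive step, a `continue` in the body would skip the increment, and `stop` and `step` are re-evaluated on every iteration. The PR description does not say how the actual transformer treats these cases.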

Test Plan

- Build: 83/83 targets, zero errors
- DPP smoke test on MI300X: lane swap pattern [1,0,3,2,5,4,...] verified
- CPU fallback: clear QD_ERROR message, no segfault
- Strided range: range(0, N, 2), range(start, n, 2) both work
- Existing tests: test_internal_func (6), test_intrinsics (4+2skip),
  test_type_system (3), test_lang (28); all pass, no regression

test_range_for_three_arguments now verifies correct strided iteration instead of expecting QuadrantsCompilationError. test_exception_in_node_with_body uses range() (0 args) as the invalid construct instead of range(1, 2, 3) which is now valid.

michaelselehov · 2026-04-24T14:24:40Z

- Pyright errors are in files this PR does not touch, so they are probably preexisting.
- AMD GPU Runner is misconfigured (missing aws-region).
- The rest looks green.

yaoliu13

This PR has conflicts that must be resolved before it can be merged into amd-integration. Please read the Confluence page "Genesis PR Review Process".

michaelselehov added 2 commits April 24, 2026 03:58

michaelselehov added 3 commits April 24, 2026 04:47

chore: fix linter warnings (black, clang-format, ruff, trailing-ws)

f6193e9

test: fix caret count and line offset in test_exception.py

0de54f9

yaoliu13 requested changes May 3, 2026

michaelselehov wants to merge 5 commits into amd-integration from perf/mselehov/dpp-subwave-reduce