perf(amdgpu): add subgroupDppSwapPairs for intra-wavefront pair exchange #14

Open
michaelselehov wants to merge 5 commits into amd-integration from perf/mselehov/dpp-subwave-reduce

@michaelselehov

Expose the AMDGPU v_mov_b32_dpp quad_perm:[1,0,3,2] instruction as a new subgroup intrinsic. This enables sub-wave parallelism patterns where adjacent lanes exchange and reduce values without going through LDS, cutting the reduction to a single VALU-pipe cycle.

  • Register the op in internal_ops.inc.h and type_system.cpp
  • AMDGPU codegen: emit llvm.amdgcn.update.dpp with ctrl=0xB1 (32-bit native, 64-bit via lo/hi split)
  • CPU/base codegen: guard with QD_ERROR + actionable message
  • Python binding: qd.simt.subgroup.dpp_swap_pairs(value)
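For reviewers checking the ctrl=0xB1 constant: a quad_perm selector packs four 2-bit source-lane indices into the low 8 bits of the DPP control word, with lane 0 in the lowest bits. The helper below is a hypothetical illustration (quad_perm_ctrl is not a function in this PR) that reproduces the value from the [1,0,3,2] pattern.

```python
def quad_perm_ctrl(perm):
    """Pack a 4-entry quad_perm selector into the 8-bit DPP control value.

    Hypothetical helper for illustration only; it is not part of the PR.
    """
    if len(perm) != 4 or any(not 0 <= p <= 3 for p in perm):
        raise ValueError("quad_perm needs four lane selectors in 0..3")
    ctrl = 0
    for lane, src in enumerate(perm):
        # Two bits per destination lane, lane 0 in the lowest bits.
        ctrl |= src << (2 * lane)
    return ctrl


print(hex(quad_perm_ctrl([1, 0, 3, 2])))  # 0xb1
```

The identity permutation [0,1,2,3] packs to 0xE4 under the same scheme, which is a quick sanity check on the bit order.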

Assisted-by: Claude Opus

Issue: #

Brief Summary

copilot:summary

Walkthrough

copilot:walkthrough

Desugar 3-argument range() into a while-loop at the AST level.
The Quadrants IR does not natively support a step parameter in
range-for, so range(start, stop, step) is lowered to:

    i = start
    while i < stop:
        <body>
        i += step

This eliminates the need for manual while-loop workarounds when
writing strided iteration patterns (e.g. sub-wave parallelism).
@michaelselehov
Author

Summary

Two related changes: a DPP (Data Parallel Primitives) pair-swap intrinsic for
AMDGPU, and a fix for a limitation in the Python frontend's range-for loops.

Commit 1 — subgroupDppSwapPairs intrinsic

New subgroup operation that swaps values between adjacent lane pairs within a
wavefront using the AMDGPU v_mov_b32_dpp quad_perm:[1,0,3,2] instruction.
This is useful for intra-wavefront communication patterns where neighboring
threads need to exchange data without going through LDS.

Changes:

  • internal_ops.inc.h, type_system.cpp: register the new polymorphic op
    subgroupDppSwapPairs(ValueT) -> ValueT
  • codegen_amdgpu.cpp: AMDGPU backend emits llvm.amdgcn.update.dpp with
    dpp_ctrl=0xB1 (quad_perm:[1,0,3,2]). Native 32-bit types go through
    directly; 64-bit types are split into lo/hi halves.
  • codegen_llvm.cpp: base (CPU) codegen guards against GPU-only ops with a
    clear error message pointing the user to qd.static backend guards.
  • subgroup.py: Python binding qd.simt.subgroup.dpp_swap_pairs(value)

The op follows the same pattern as existing subgroup operations
(subgroupBroadcast, subgroupAdd, etc.) and slots into the same visitor
dispatch.
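The intended lane behavior can be modeled on the host. The sketch below is a pure-Python model under stated assumptions, not the codegen itself: dpp_quad_perm and swap_pairs_64 are hypothetical names, a wavefront is represented as a plain list whose length is a multiple of 4, and the 64-bit path mirrors the lo/hi split described above by permuting each 32-bit half separately.

```python
MASK32 = 0xFFFFFFFF


def dpp_quad_perm(lanes, perm=(1, 0, 3, 2)):
    """Model a quad_perm DPP move: lane i reads lane perm[i % 4] of its own quad."""
    return [lanes[(i & ~3) + perm[i & 3]] for i in range(len(lanes))]


def swap_pairs_64(lanes):
    """Swap 64-bit values by permuting the low and high 32-bit halves separately."""
    lo = dpp_quad_perm([v & MASK32 for v in lanes])
    hi = dpp_quad_perm([(v >> 32) & MASK32 for v in lanes])
    return [(h << 32) | l for l, h in zip(lo, hi)]


lane_ids = list(range(8))
swapped = dpp_quad_perm(lane_ids)
print(swapped)  # [1, 0, 3, 2, 5, 4, 7, 6]

# One swap plus one add leaves each pair's sum in both lanes of the pair.
pair_sums = [v + s for v, s in zip(lane_ids, swapped)]
print(pair_sums)  # [1, 1, 5, 5, 9, 9, 13, 13]

# The split 64-bit path agrees with permuting the whole values.
wide = [((i + 1) << 40) | i for i in lane_ids]
assert swap_pairs_64(wide) == dpp_quad_perm(wide)
```

The pair-sum lines show the reduction pattern the intrinsic is aimed at: each lane combines its own value with its neighbor's without a round trip through LDS.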

Commit 2 — range(start, stop, step) support

The AST transformer previously rejected 3-argument range() calls with the
error "Range should have 1 or 2 arguments". This was a pain point when writing
strided iteration patterns; users had to fall back to manual while loops.

Since the C++ IR (FrontendForStmt) does not have a step field, we desugar
for i in range(start, stop, step) into a while-loop at the AST level:

    i = start
    while i < stop:
        <body>
        i += step

This is fully transparent to the user and requires no C++ changes.
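The lowering can be checked against Python's built-in range on the host. The generator below is a hypothetical stand-in for the desugared loop (desugared_range is not a name from this PR). It matches range() for positive steps. As written, the `i < stop` condition assumes a positive step; whether the transformer treats negative steps differently is not stated here.

```python
def desugared_range(start, stop, step):
    """Yield the values visited by the while-loop lowering of range(start, stop, step)."""
    i = start
    while i < stop:
        yield i
        i += step


# Positive steps agree with the built-in range().
for args in [(0, 10, 2), (3, 11, 4), (5, 5, 1)]:
    assert list(desugared_range(*args)) == list(range(*args))

# With a negative step the `i < stop` test fails immediately, so the body never runs,
# whereas the built-in range counts down.
print(list(desugared_range(10, 0, -2)))  # []
print(list(range(10, 0, -2)))            # [10, 8, 6, 4, 2]
```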

Test Plan

  • Build: 83/83 targets, zero errors
  • DPP smoke test on MI300X: lane swap pattern [1,0,3,2,5,4,...] verified
  • CPU fallback: clear QD_ERROR message, no segfault
  • Strided range: range(0, N, 2), range(start, n, 2) both work
  • Existing tests: test_internal_func (6), test_intrinsics (4 passed, 2 skipped),
    test_type_system (3), test_lang (28); all pass, no regressions

test_range_for_three_arguments now verifies correct strided iteration
instead of expecting QuadrantsCompilationError.

test_exception_in_node_with_body uses range() (0 args) as the invalid
construct instead of range(1, 2, 3) which is now valid.
@michaelselehov
Author

  • The Pyright errors are in files this PR does not touch, so they are probably pre-existing.
  • The AMD GPU runner is misconfigured (missing aws-region).
    The rest looks green.

Collaborator

@yaoliu13 left a comment

This PR has merge conflicts with amd-integration that need to be resolved before it can be merged. Please read the Confluence page "Genesis PR Review Process".
