Closed
Add an optional `moe_buf` parameter through the moe_sorting and fused_moe call chain. When provided, the sorting kernel writes directly into the caller's buffer instead of allocating a new one, eliminating a redundant copy on the output path.

Made-with: Cursor
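The optional-output-buffer pattern the PR describes can be sketched in NumPy. Here `moe_sorting_sketch` is a hypothetical stand-in for the real HIP sorting kernel, not AITER's actual API; the point is only the allocate-or-reuse contract around `moe_buf`.

```python
import numpy as np

def moe_sorting_sketch(topk_ids, moe_buf=None):
    # If the caller provides `moe_buf`, write the result directly into it;
    # otherwise allocate a fresh buffer (the pre-PR behavior).
    out = moe_buf if moe_buf is not None else np.empty_like(topk_ids)
    np.copyto(out, np.sort(topk_ids, axis=-1))  # sorting stands in for the kernel work
    return out

ids = np.array([[3, 1, 2], [0, 2, 1]])
buf = np.empty_like(ids)
res = moe_sorting_sketch(ids, moe_buf=buf)  # res aliases buf: no extra copy
```

Because the function returns the caller's buffer when one is given, the redundant copy on the output path disappears without changing behavior for callers that pass nothing.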
ba31a6a to
60a459c
Compare
nholmber added a commit to nholmber/vllm that referenced this pull request (Apr 20, 2026):
Plumb `moe_buf` through the vLLM AITER fused MoE interface so the kernel writes directly into the caller's pre-allocated output buffer. This avoids a device-to-device copy of the full MoE output on every forward pass. Requires AITER with ROCm/aiter#2687 merged. When `moe_buf` is `None` (older AITER), the existing allocation + copy behavior is preserved.

Co-authored-by: Tres Popp <tres.popp@amd.com>
Signed-off-by: nholmber <nholmber@users.noreply.github.com>
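The fallback the commit message describes can be illustrated with a small compatibility shim. This is a hypothetical sketch, not vLLM's actual dispatch logic: it probes whether the installed kernel accepts `moe_buf` via `inspect.signature`, and otherwise takes the allocate-then-copy path. NumPy arrays stand in for device tensors.

```python
import inspect
import numpy as np

def fused_moe_new(x, moe_buf=None):
    # Stand-in for an AITER build that accepts `moe_buf`.
    out = moe_buf if moe_buf is not None else np.empty_like(x)
    np.multiply(x, 2.0, out=out)
    return out

def fused_moe_old(x):
    # Stand-in for an older AITER without the parameter.
    return x * 2.0

def call_fused_moe(fused_moe, x, output_buf):
    # Pass `moe_buf` only when the installed kernel supports it; otherwise
    # fall back to allocating inside the kernel and copying the result out.
    if "moe_buf" in inspect.signature(fused_moe).parameters:
        return fused_moe(x, moe_buf=output_buf)
    output_buf[:] = fused_moe(x)  # extra device-to-device copy on the old path
    return output_buf

x = np.ones(4, dtype=np.float32)
buf_new = np.empty_like(x)
buf_old = np.empty_like(x)
r_new = call_fused_moe(fused_moe_new, x, buf_new)  # zero-copy path
r_old = call_fused_moe(fused_moe_old, x, buf_old)  # allocate + copy path
```

Either way the caller ends up with its own buffer filled, so the surrounding code stays identical across AITER versions.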
Contributor (Author)
A concern was raised about why the fix can't be done on the caller side. Output buffers are desired because, in non-HIPGraph cases, callers want a single allocation covering the intermediate workspaces and the output, rather than ~3 separate temporary allocations, to limit overhead.
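The single-allocation idea in that comment can be sketched with NumPy views: one flat arena is sliced into the per-stage workspaces and the output, so only one allocation happens per forward pass. The `carve` helper and the sizes are illustrative, not part of AITER or vLLM.

```python
import numpy as np

def carve(arena, sizes):
    # Slice one flat allocation into consecutive workspace views, so the
    # intermediate buffers and the output share a single allocation.
    views, offset = [], 0
    for n in sizes:
        views.append(arena[offset:offset + n])
        offset += n
    return views

arena = np.zeros(12, dtype=np.float32)      # one allocation for everything
ws_a, ws_b, out = carve(arena, [4, 4, 4])   # two workspaces + the output
ws_a[:] = 1.0                               # writes land in the shared arena
```

For this to work, the kernel has to accept the output view (here `out`) rather than allocating its own buffer, which is exactly what the `moe_buf` parameter enables.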
Collaborator
We have internal logic that uses this buffer in different ways, such as fused quant versus non-fused quant; the size and datatype of the buffer change accordingly, so it can't be managed from outside.
sunway513 added a commit that referenced this pull request (May 4, 2026):
…e.py

- Restore import to match main: use `from aiter import fused_dynamic_mxfp4_quant_moe_sort, mxfp4_moe_sort_fwd` instead of importing from the internal triton path and fp4_utils
- Replace all fp4_utils.moe_mxfp4_sort() calls with mxfp4_moe_sort_fwd(), using the correct parameter names (cols= instead of block_size=)
- Remove all moe_buf preallocated-buffer additions (PR #2687 rejected): parameter defaults, if-guards, and pass-throughs in _moe_sorting_impl, moe_sorting, fused_moe, fused_moe_fake, and fused_moe_
- Fix the moe_sorting_dispatch_policy type annotation: bool -> int in fused_moe_fake and fused_moe_
- Remove the moe_buf pass-through test from test_moe_sorting.py
- Preserve legitimate fp4_utils usage (mxfp4_to_f32, e8m0_to_f32) with local imports in the stage1/stage2 fallback functions
Motivation
Callers such as vLLM have abstracted their calling code to accept preallocated workspaces for use as output buffers.
Technical Details
Accept an output buffer here as an optional argument; when none is provided, allocate a new buffer as before.
Test Plan
Correctness testing is extended to check both that the values are correct and that the supplied output buffer was actually used.
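A hedged sketch of what such a check could look like, using a hypothetical NumPy stand-in for the kernel: run the default-allocation path as a reference, then assert that the buffer-passing path returns the caller's buffer and produces identical values.

```python
import numpy as np

def fused_moe_sketch(x, moe_buf=None):
    # Hypothetical stand-in for the kernel under test.
    out = moe_buf if moe_buf is not None else np.empty_like(x)
    np.multiply(x, 2.0, out=out)
    return out

x = np.arange(8, dtype=np.float32)
ref = fused_moe_sketch(x)                  # reference path: fresh allocation
buf = np.empty_like(x)
got = fused_moe_sketch(x, moe_buf=buf)     # path under test: caller's buffer
assert got is buf                          # the preallocated buffer was used
assert np.array_equal(got, ref)            # and the values still match
```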