
Allow preallocated moe sorting buffer#2687

Closed
tpopp wants to merge 2 commits into ROCm:main from tpopp:moe-buf-passthrough

Conversation

@tpopp
Contributor

@tpopp tpopp commented Apr 10, 2026

Motivation

Callers such as vLLM have abstracted their calling code to accept preallocated workspaces for use as output buffers.

Technical Details

Accept an optional output buffer argument here; when it is not provided, allocate a new buffer as before.

Test Plan

Correctness testing is extended to check both that the values are correct and that the provided output buffer was actually used.
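The pattern described in Technical Details can be sketched as follows. This is a minimal NumPy illustration with made-up names (`moe_sort`, `out`), not the actual AITER API: an optional output buffer is written into when provided, and a fresh buffer is allocated otherwise, mirroring the test plan's check that the caller's buffer is actually used.

```python
import numpy as np

def moe_sort(values, out=None):
    # Hypothetical sketch of the pattern in this PR: accept an optional
    # preallocated output buffer; allocate only when the caller did not
    # provide one. Names here are illustrative, not the AITER interface.
    n = len(values)
    if out is None:
        out = np.empty(n, dtype=np.int64)   # fallback: allocate as before
    else:
        assert out.shape[0] >= n, "caller's buffer must be large enough"
    out[:n] = np.argsort(values, kind="stable")
    return out

# Correctness check mirroring the test plan: results are right AND the
# caller's buffer was actually used (same object, no new allocation).
vals = np.array([3.0, 1.0, 2.0])
buf = np.empty(3, dtype=np.int64)
res = moe_sort(vals, out=buf)
assert res is buf                 # buffer was reused, not reallocated
assert list(res) == [1, 2, 0]     # indices in sorted order of vals
```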

@tpopp tpopp requested review from a team and ChuanLi1101 April 10, 2026 09:56
@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| `ci:triton-355` | Run Triton tests on MI355 in addition to MI325 |
| `ci:sglang` | SGLang integration tests |
| `ci:atom` | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| `ci:vllm` | vLLM benchmark |
| `ci:all` | All of the above |

Add labels via the sidebar or `gh pr edit 2687 --add-label <label>`

@tpopp tpopp force-pushed the moe-buf-passthrough branch 2 times, most recently from b3b291a to ba31a6a Compare April 13, 2026 07:04
@nholmber nholmber requested a review from valarLip April 13, 2026 11:28
tpopp added 2 commits April 14, 2026 09:05
Add an optional `moe_buf` parameter through the moe_sorting and
fused_moe call chain. When provided, the sorting kernel writes
directly into the caller's buffer instead of allocating a new one,
eliminating a redundant copy on the output path.

Made-with: Cursor
@tpopp tpopp force-pushed the moe-buf-passthrough branch from ba31a6a to 60a459c Compare April 14, 2026 07:05
nholmber added a commit to nholmber/vllm that referenced this pull request Apr 20, 2026
Plumb `moe_buf` through the vLLM AITER fused MoE interface so the
kernel writes directly into the caller's pre-allocated output buffer.
This avoids a device-to-device copy of the full MoE output on every
forward pass.

Requires AITER with ROCm/aiter#2687 merged. When `moe_buf` is `None`
(older AITER), the existing allocation + copy behavior is preserved.
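The fallback described in that commit message can be sketched as follows. The names (`call_with_optional_out`, plain Python lists standing in for device tensors) are hypothetical, not the actual vLLM/AITER interface: probe whether the kernel accepts `moe_buf`, and otherwise keep the allocate-then-copy path.

```python
import inspect

def call_with_optional_out(fn, x, out):
    # Illustrative compatibility shim: pass the preallocated buffer when
    # the kernel supports `moe_buf`, otherwise fall back to the older
    # allocate-then-copy behavior.
    if "moe_buf" in inspect.signature(fn).parameters:
        return fn(x, moe_buf=out)   # kernel writes directly into `out`
    result = fn(x)                  # older API: kernel allocates its own
    out[:] = result                 # the extra device-to-device copy
    return out

def new_kernel(x, moe_buf=None):
    buf = moe_buf if moe_buf is not None else [0] * len(x)
    for i, v in enumerate(x):
        buf[i] = v * 2
    return buf

def old_kernel(x):
    return [v * 2 for v in x]

out = [0, 0, 0]
assert call_with_optional_out(new_kernel, [1, 2, 3], out) is out
assert out == [2, 4, 6]             # written in place, no copy

out2 = [0, 0, 0]
assert call_with_optional_out(old_kernel, [1, 2, 3], out2) is out2
assert out2 == [2, 4, 6]            # same result via the copy fallback
```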

Co-authored-by: Tres Popp <tres.popp@amd.com>
Signed-off-by: nholmber <nholmber@users.noreply.github.com>
@tpopp
Contributor Author

tpopp commented Apr 21, 2026

A concern was raised about why the fix can't be done on the caller side. Output buffers are desired because, in non-HIPGraph cases, callers don't want ~3 separate temporary allocations for the intermediate workspaces and the output; they want a single allocation to limit overhead.
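The single-allocation idea can be illustrated with a small NumPy sketch (sizes and names are hypothetical, not AITER's actual layout): one arena is allocated up front and carved into zero-copy views for the workspaces and the output.

```python
import numpy as np

def carve(arena, sizes):
    # Illustrative sketch of the single-allocation idea: partition one
    # arena into named sub-buffers instead of allocating each separately.
    views, offset = {}, 0
    for name, n in sizes.items():
        views[name] = arena[offset:offset + n]  # zero-copy view
        offset += n
    return views

# Hypothetical sizes; in practice these come from the kernel's workspace
# requirements for a given problem shape.
sizes = {"workspace_a": 4, "workspace_b": 4, "output": 8}
arena = np.empty(sum(sizes.values()), dtype=np.float32)  # one allocation
bufs = carve(arena, sizes)
assert bufs["output"].base is arena   # views share the arena's storage
assert bufs["output"].size == 8
```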

@valarLip
Collaborator

We have internal logic and will use this buffer in different ways, e.g. with or without fused quantization; the size and datatype of the buffer change accordingly, so it can't be provided from outside.

@valarLip valarLip closed this May 4, 2026
sunway513 added a commit that referenced this pull request May 4, 2026
…e.py

- Restore import to match main: use `from aiter import
  fused_dynamic_mxfp4_quant_moe_sort, mxfp4_moe_sort_fwd` instead of
  importing from internal triton path and fp4_utils
- Replace all fp4_utils.moe_mxfp4_sort() calls with mxfp4_moe_sort_fwd()
  using correct parameter names (cols= instead of block_size=)
- Remove all moe_buf preallocated buffer additions (PR #2687 rejected):
  parameter defaults, if-guards, and pass-throughs in _moe_sorting_impl,
  moe_sorting, fused_moe, fused_moe_fake, and fused_moe_
- Fix moe_sorting_dispatch_policy type annotation: bool -> int in
  fused_moe_fake and fused_moe_
- Remove moe_buf pass-through test from test_moe_sorting.py
- Preserve legitimate fp4_utils usage (mxfp4_to_f32, e8m0_to_f32) with
  local imports in stage1/stage2 fallback functions