Add tuned a8w8 blockscale GEMM config for Qwen3-Next-80B-A3B on MI355X#2868

Open
nholmber wants to merge 2 commits into ROCm:main from nholmber:tuned-qwen3next-blockscale-v3

Conversation

@nholmber
Contributor

Motivation

Add tuned and untuned blockscale a8w8 GEMM configs for Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 on MI355X (gfx950, 256 CUs). The tuned config covers 1482 shapes across TP1, TP2, and TP4, delivering +3.9% to +8.5% e2e output throughput over the unmodified vLLM 0.19 baseline (heuristic default kernels).
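For readers unfamiliar with the format being tuned, here is a minimal NumPy sketch of a block-scaled a8w8 GEMM. This is an illustration only, not the CK/CK-TILE kernels; the `BLOCK` size, shapes, and variable names are made up for the example. int8 activations and weights each carry one fp32 scale per `BLOCK`-wide group along the K dimension; exact int32 partial products are dequantized per block and accumulated in fp32.

```python
import numpy as np

# Illustrative block-scaled a8w8 GEMM (NOT the CK/CK-TILE kernels).
# BLOCK, M, K, N are arbitrary toy values.
BLOCK = 4
M, K, N = 2, 8, 3
rng = np.random.default_rng(0)

A = rng.integers(-128, 128, size=(M, K), dtype=np.int8)   # int8 activations
W = rng.integers(-128, 128, size=(K, N), dtype=np.int8)   # int8 weights
a_scale = rng.random((M, K // BLOCK)).astype(np.float32)  # one scale per (row, K-block)
w_scale = rng.random((K // BLOCK, N)).astype(np.float32)  # one scale per (K-block, col)

out = np.zeros((M, N), dtype=np.float32)
for kb in range(K // BLOCK):
    s = slice(kb * BLOCK, (kb + 1) * BLOCK)
    # Exact integer GEMM on one K-block, then per-block dequantization.
    acc = A[:, s].astype(np.int32) @ W[s, :].astype(np.int32)
    out += acc.astype(np.float32) * a_scale[:, kb:kb + 1] * w_scale[kb:kb + 1, :]
```

The result matches a GEMM over fully dequantized fp32 operands; the point of the blocked form is that the integer arithmetic stays exact and scales are applied once per block rather than per element.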

Depends on:

- PR ROCm#2862 (CK bump for stride fix in CK-TILE blockscale)
- PR ROCm#2541 (splitK support for CK/CK-TILE blockscale GEMMs)
- PR ROCm#2487 (AQLayout tunable for CK-TILE blockscale 8-warp kernels)

Technical Details

Kernel distribution in the tuned CSV:

| Backend | splitK | Count | % |
| --- | --- | --- | --- |
| CK | 0 | 433 | 29.2% |
| CK | 1-3 | 613 | 41.4% |
| CK-TILE | 0 | 436 | 29.4% |

Of the 1046 CK shapes, 613 (59%) benefit from splitK > 0. CK-TILE wins 436 shapes (29%, all splitK = 0), primarily at large N (gate/up projections, MoE shared experts).
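For context, splitK partitions the reduction (K) dimension of a GEMM across workgroups and sums the partial results, which helps when M×N alone is too small to occupy the GPU. A toy NumPy illustration of the reduction (not the actual CK kernel; `gemm_split_k` is a made-up name):

```python
import numpy as np

# Toy split-K illustration (NOT the CK kernel): slice the reduction dimension
# K into `split_k` chunks, compute a partial GEMM per chunk, then sum the
# partials. splitK changes scheduling, not the mathematical result.
def gemm_split_k(A, B, split_k):
    K = A.shape[1]
    bounds = np.linspace(0, K, split_k + 1, dtype=int)
    partials = [A[:, s:e] @ B[s:e, :] for s, e in zip(bounds[:-1], bounds[1:])]
    return np.sum(partials, axis=0)

rng = np.random.default_rng(1)
A, B = rng.random((4, 64)), rng.random((64, 5))
assert np.allclose(gemm_split_k(A, B, 3), A @ B)
```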

Test Plan

  • Element-wise correctness via tuner built-in checks (all 1482 shapes)
  • GSM8K 5-shot accuracy benchmark
  • Coherence test at concurrency 1, 4, 8, 16, 32, 64, 128
  • E2E serving throughput via vllm bench serve (random dataset, --request-rate inf)

Test Result

Accuracy: GSM8K 5-shot flexible-extract 0.8522 ± 0.0098 (matches CK baseline 0.8499 ± 0.0098). Coherence PASS at all concurrency levels.

E2E throughput (output tok/s, TP1, MI355X):

| ISL | OSL | Conc | Baseline | Tuned | vs Baseline |
| --- | --- | --- | --- | --- | --- |
| 1024 | 1024 | 1 | 92.1 | 97.5 | +5.9% |
| 1024 | 1024 | 2 | 170.5 | 182.0 | +6.8% |
| 1024 | 1024 | 4 | 328.1 | 355.8 | +8.5% |
| 1024 | 1024 | 8 | 660.9 | 699.4 | +5.8% |
| 1024 | 1024 | 16 | 1223.0 | 1303.0 | +6.5% |
| 1024 | 1024 | 32 | 2126.8 | 2295.2 | +7.9% |
| 1024 | 1024 | 64 | 3419.9 | 3577.2 | +4.6% |
| 1024 | 1024 | 128 | 5243.9 | 5499.6 | +4.9% |
| 8192 | 1024 | 1 | 55.5 | 57.7 | +3.9% |
| 8192 | 1024 | 2 | 104.3 | 109.3 | +4.8% |
| 8192 | 1024 | 4 | 190.5 | 201.6 | +5.8% |
| 8192 | 1024 | 8 | 330.6 | 347.8 | +5.2% |
| 8192 | 1024 | 16 | 511.2 | 542.8 | +6.2% |
| 8192 | 1024 | 32 | 744.3 | 777.0 | +4.4% |
| 8192 | 1024 | 64 | 1107.5 | 1172.0 | +5.8% |
| 8192 | 1024 | 128 | 1493.3 | 1619.0 | +8.4% |
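The "vs Baseline" column is the plain relative gain in output throughput, (tuned − baseline) / baseline. A quick check against the first and last rows of the table (`gain_pct` is just a helper name for this note):

```python
# "vs Baseline" = relative output-throughput gain over the heuristic baseline.
def gain_pct(baseline_tok_s, tuned_tok_s):
    return 100.0 * (tuned_tok_s - baseline_tok_s) / baseline_tok_s

print(f"{gain_pct(92.1, 97.5):+.1f}%")      # 1024/1024, conc 1   -> +5.9%
print(f"{gain_pct(1493.3, 1619.0):+.1f}%")  # 8192/1024, conc 128 -> +8.4%
```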

Submission Checklist

Tuned 1482 shapes (TP1/TP2/TP4) for Qwen/Qwen3-Next-80B-A3B-Instruct-FP8
on MI355X using CK + CK-TILE backends with splitK support.

Depends on:
- PR ROCm#2862 (CK bump for stride fix in CK-TILE blockscale)
- PR ROCm#2541 (splitK support for CK/CK-TILE blockscale GEMMs)
- PR ROCm#2487 (AQLayout tunable for CK-TILE blockscale 8-warp kernels)
@nholmber nholmber requested review from a team and samremes April 22, 2026 19:46
@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| `ci:triton-355` | Run Triton tests on MI355 in addition to MI325 |
| `ci:sglang` | SGLang integration tests |
| `ci:atom` | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| `ci:vllm` | vLLM benchmark |
| `ci:all` | All of the above |

Add labels via the sidebar or `gh pr edit 2868 --add-label <label>`.

…tions

Full retune of all 1482 shapes on MI355X (gfx950, cu_num=256).
Key changes:
- SplitK usage dropped from 613 to 88 CK shapes (splitK > 0)
- All shapes validated via --run_config (1482/1482 OK)
- E2e perf: 2-8% output throughput improvement vs untuned heuristic
