Add tuned a8w8 blockscale GEMM config for Qwen3-Next-80B-A3B on MI355X#2868

Open
nholmber wants to merge 2 commits into ROCm:main from nholmber:tuned-qwen3next-blockscale-v3

Conversation

@nholmber
Contributor

Motivation

Add tuned and untuned blockscale a8w8 GEMM configs for Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 on MI355X (gfx950, 256 CUs). The tuned config covers 1482 shapes across TP1, TP2, and TP4, delivering +3.9% to +8.5% e2e output throughput over the unmodified vLLM 0.19 baseline (heuristic default kernels).
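For readers unfamiliar with the format being tuned, here is a minimal NumPy sketch of a block-scaled a8w8 GEMM. This is an illustration only, not the CK/CK-TILE kernels; the `BLOCK` size, shapes, and variable names are made up for the example. int8 activations and weights each carry one fp32 scale per `BLOCK`-wide group along the K dimension; exact int32 partial products are dequantized per block and accumulated in fp32.

```python
import numpy as np

# Illustrative block-scaled a8w8 GEMM (NOT the CK/CK-TILE kernels).
# BLOCK, M, K, N are arbitrary toy values.
BLOCK = 4
M, K, N = 2, 8, 3
rng = np.random.default_rng(0)

A = rng.integers(-128, 128, size=(M, K), dtype=np.int8)   # int8 activations
W = rng.integers(-128, 128, size=(K, N), dtype=np.int8)   # int8 weights
a_scale = rng.random((M, K // BLOCK)).astype(np.float32)  # one scale per (row, K-block)
w_scale = rng.random((K // BLOCK, N)).astype(np.float32)  # one scale per (K-block, col)

out = np.zeros((M, N), dtype=np.float32)
for kb in range(K // BLOCK):
    s = slice(kb * BLOCK, (kb + 1) * BLOCK)
    # Exact integer GEMM on one K-block, then per-block dequantization.
    acc = A[:, s].astype(np.int32) @ W[s, :].astype(np.int32)
    out += acc.astype(np.float32) * a_scale[:, kb:kb + 1] * w_scale[kb:kb + 1, :]
```

The result matches a GEMM over fully dequantized fp32 operands; the point of the blocked form is that the integer arithmetic stays exact and scales are applied once per block rather than per element.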

Depends on:

- PR ROCm#2862 (CK bump for stride fix in CK-TILE blockscale)
- PR ROCm#2541 (splitK support for CK/CK-TILE blockscale GEMMs)
- PR ROCm#2487 (AQLayout tunable for CK-TILE blockscale 8-warp kernels)

Technical Details

Kernel distribution in the tuned CSV:

| Backend | splitK | Count | % |
| --- | --- | --- | --- |
| CK | 0 | 433 | 29.2% |
| CK | 1-3 | 613 | 41.4% |
| CK-TILE | 0 | 436 | 29.4% |

Of the 1046 CK shapes, 613 (59%) benefit from splitK > 0. CK-TILE wins 436 shapes (29%, all splitK = 0), primarily at large N (gate/up projections, MoE shared experts).
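For context, splitK partitions the reduction (K) dimension of a GEMM across workgroups and sums the partial results, which helps when M×N alone is too small to occupy the GPU. A toy NumPy illustration of the reduction (not the actual CK kernel; `gemm_split_k` is a made-up name):

```python
import numpy as np

# Toy split-K illustration (NOT the CK kernel): slice the reduction dimension
# K into `split_k` chunks, compute a partial GEMM per chunk, then sum the
# partials. splitK changes scheduling, not the mathematical result.
def gemm_split_k(A, B, split_k):
    K = A.shape[1]
    bounds = np.linspace(0, K, split_k + 1, dtype=int)
    partials = [A[:, s:e] @ B[s:e, :] for s, e in zip(bounds[:-1], bounds[1:])]
    return np.sum(partials, axis=0)

rng = np.random.default_rng(1)
A, B = rng.random((4, 64)), rng.random((64, 5))
assert np.allclose(gemm_split_k(A, B, 3), A @ B)
```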

Test Plan

  • Element-wise correctness via tuner built-in checks (all 1482 shapes)
  • GSM8K 5-shot accuracy benchmark
  • Coherence test at concurrency 1, 4, 8, 16, 32, 64, 128
  • E2E serving throughput via vllm bench serve (random dataset, --request-rate inf)

Test Result

Accuracy: GSM8K 5-shot flexible-extract 0.8522 ± 0.0098 (matches CK baseline 0.8499 ± 0.0098). Coherence PASS at all concurrency levels.

E2E throughput (output tok/s, TP1, MI355X):

| ISL | OSL | Conc | Baseline | Tuned | vs Baseline |
| --- | --- | --- | --- | --- | --- |
| 1024 | 1024 | 1 | 92.1 | 97.5 | +5.9% |
| 1024 | 1024 | 2 | 170.5 | 182.0 | +6.8% |
| 1024 | 1024 | 4 | 328.1 | 355.8 | +8.5% |
| 1024 | 1024 | 8 | 660.9 | 699.4 | +5.8% |
| 1024 | 1024 | 16 | 1223.0 | 1303.0 | +6.5% |
| 1024 | 1024 | 32 | 2126.8 | 2295.2 | +7.9% |
| 1024 | 1024 | 64 | 3419.9 | 3577.2 | +4.6% |
| 1024 | 1024 | 128 | 5243.9 | 5499.6 | +4.9% |
| 8192 | 1024 | 1 | 55.5 | 57.7 | +3.9% |
| 8192 | 1024 | 2 | 104.3 | 109.3 | +4.8% |
| 8192 | 1024 | 4 | 190.5 | 201.6 | +5.8% |
| 8192 | 1024 | 8 | 330.6 | 347.8 | +5.2% |
| 8192 | 1024 | 16 | 511.2 | 542.8 | +6.2% |
| 8192 | 1024 | 32 | 744.3 | 777.0 | +4.4% |
| 8192 | 1024 | 64 | 1107.5 | 1172.0 | +5.8% |
| 8192 | 1024 | 128 | 1493.3 | 1619.0 | +8.4% |
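The "vs Baseline" column is the plain relative gain in output throughput, (tuned − baseline) / baseline. A quick check against the first and last rows of the table (`gain_pct` is just a helper name for this note):

```python
# "vs Baseline" = relative output-throughput gain over the heuristic baseline.
def gain_pct(baseline_tok_s, tuned_tok_s):
    return 100.0 * (tuned_tok_s - baseline_tok_s) / baseline_tok_s

print(f"{gain_pct(92.1, 97.5):+.1f}%")      # 1024/1024, conc 1   -> +5.9%
print(f"{gain_pct(1493.3, 1619.0):+.1f}%")  # 8192/1024, conc 128 -> +8.4%
```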

Submission Checklist

Tuned 1482 shapes (TP1/TP2/TP4) for Qwen/Qwen3-Next-80B-A3B-Instruct-FP8
on MI355X using CK + CK-TILE backends with splitK support.

Depends on:
- PR ROCm#2862 (CK bump for stride fix in CK-TILE blockscale)
- PR ROCm#2541 (splitK support for CK/CK-TILE blockscale GEMMs)
- PR ROCm#2487 (AQLayout tunable for CK-TILE blockscale 8-warp kernels)
@nholmber nholmber requested review from a team and samremes April 22, 2026 19:46
@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| `ci:triton-355` | Run Triton tests on MI355 in addition to MI325 |
| `ci:sglang` | SGLang integration tests |
| `ci:atom` | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| `ci:vllm` | vLLM benchmark |
| `ci:all` | All of the above |

Add labels via the sidebar or `gh pr edit 2868 --add-label <label>`.

…tions

Full retune of all 1482 shapes on MI355X (gfx950, cu_num=256).
Key changes:
- SplitK usage dropped from 613 to 88 CK shapes (splitK > 0)
- All shapes validated via --run_config (1482/1482 OK)
- E2e perf: 2-8% output throughput improvement vs untuned heuristic
