Conversation
@willzhou-amd willzhou-amd commented Jul 10, 2025

A few config changes to speed up the FP4xFP4 GEMM.

…_model.py

* Why? The idea is to have a master script that benchmarks the full set of associated kernels when given a model name. It is cleaner to place all kernel benchmarking scripts in /kernels and have the bench_model script call them.
* How? See `bench_model.py`. Pytests are in bench_tests/.
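The dispatch idea described above can be sketched as follows. This is an illustrative sketch only, assuming a simple registry pattern; the names here (`KERNEL_BENCHES`, `register`, `bench_model`) are hypothetical and not the actual API of `bench_model.py`:

```python
# Hypothetical sketch of a master benchmarking script: per-kernel benchmark
# scripts under /kernels self-register an entry point, and bench_model()
# runs all of them for a given model name. All names are illustrative.

from typing import Callable, Dict

# Registry mapping a kernel name to its benchmark entry point.
KERNEL_BENCHES: Dict[str, Callable[[str], float]] = {}

def register(name: str):
    """Decorator so each kernel benchmark script can self-register."""
    def wrap(fn: Callable[[str], float]) -> Callable[[str], float]:
        KERNEL_BENCHES[name] = fn
        return fn
    return wrap

@register("afp4wfp4_gemm")
def bench_fp4_gemm(model: str) -> float:
    # Placeholder timing; a real script would launch the tuned Triton kernel
    # on the GEMM shapes used by the named model.
    return 1.0

def bench_model(model: str) -> Dict[str, float]:
    """Run every registered kernel benchmark for the given model."""
    return {name: fn(model) for name, fn in KERNEL_BENCHES.items()}

if __name__ == "__main__":
    print(bench_model("llama"))
```

The registry keeps the master script decoupled from individual kernels: adding a new benchmark under /kernels requires no change to the dispatcher itself.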
@willzhou-amd willzhou-amd self-assigned this Jul 10, 2025
@willzhou-amd willzhou-amd requested a review from azaidy July 11, 2025 20:28
@azaidy azaidy left a comment

LGTM!

@willzhou-amd

Note: This kernel was tuned on the rocm/triton compiler; the upstream Triton compiler yields fairly poor performance.

@willzhou-amd willzhou-amd merged commit 53647f6 into main Jul 18, 2025
13 checks passed
@willzhou-amd willzhou-amd deleted the willz/fp4-gemm-tuning branch July 18, 2025 19:10
cagrikymk pushed a commit that referenced this pull request Jul 30, 2025
* Finish tuning afp4wfp4 GEMM (~2x speedup)

* Improve performance on standard model (e.g. Llama) shapes

* Add new shape tune

* Optimize performance of the default config for standard model shapes

* Tune performance on N=1280, K=8192 with compiler flags
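The tuning pattern described in these commit messages can be sketched as a shape-keyed config lookup with a default fallback. This is a minimal illustration only: the block sizes and warp counts below are made up, and only the N=1280, K=8192 shape is taken from the PR itself:

```python
# Illustrative sketch of shape-keyed GEMM config selection: hand-tuned
# configs for specific (N, K) shapes, with a default tuned for standard
# model (e.g. Llama) shapes. All config values here are hypothetical.

# Default config, assumed tuned for common model shapes.
DEFAULT_CONFIG = {"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 256, "num_warps": 8}

# Hand-tuned per-shape entries; N=1280, K=8192 is the shape mentioned
# in the commit message, the config values are placeholders.
TUNED_CONFIGS = {
    (1280, 8192): {"BLOCK_M": 64, "BLOCK_N": 128, "BLOCK_K": 256, "num_warps": 4},
}

def pick_config(N: int, K: int) -> dict:
    """Return the hand-tuned config for this (N, K) shape, else the default."""
    return TUNED_CONFIGS.get((N, K), DEFAULT_CONFIG)
```

Keeping a small table of per-shape overrides on top of one well-chosen default is a common way to tune a kernel for a few hot shapes without regressing the general case.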


3 participants