
Conversation

willzhou-amd (Contributor) commented on Jul 18, 2025

A few of our GEMM kernels don't have corresponding benchmarks.

Added GEMM benchmarking scripts:

  • Batched a8w8 GEMM
  • Batched a16w16 GEMM
  • afp4wfp4 pre_quant_atomic GEMM
  • a8wfp4 GEMM
  • --atomic flag for a16w16 GEMM

Additional changes:

  • Previously, some benchmarking scripts generated their inputs internally and broke whenever the kernel API changed. This change makes every GEMM benchmark call a helper function from the test scripts (e.g. generate_gemm_axwx_inputs) to get its inputs; a sketch of this pattern follows below.
  • Added --print_vgpr flag to all benchmarking scripts
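
To illustrate the input-helper pattern from the first bullet above, here is a minimal, hypothetical sketch. The helper name, module paths, and kernel entry point are assumptions for illustration, not the exact repository API.

```python
# Hypothetical sketch of a benchmark that reuses a test-side input helper
# instead of building its own tensors. Import paths, the helper name, and the
# kernel entry point below are illustrative assumptions.
import triton

from op_tests.test_gemm_a8w8 import generate_gemm_a8w8_inputs  # assumed test helper
from aiter.ops.gemm_a8w8 import gemm_a8w8                      # assumed kernel under test


def bench_gemm_a8w8(M: int, N: int, K: int) -> None:
    # Inputs come from the shared helper, so a kernel API change only has to
    # be reflected in one place (the test script), not in every benchmark.
    x, w, x_scale, w_scale, out_dtype = generate_gemm_a8w8_inputs(M, N, K)

    ms = triton.testing.do_bench(
        lambda: gemm_a8w8(x, w, x_scale, w_scale, dtype=out_dtype),
        warmup=25,
        rep=100,
    )
    tflops = 2.0 * M * N * K / (ms * 1e-3) / 1e12
    print(f"M={M} N={N} K={K}: {ms:.3f} ms, {tflops:.2f} TFLOP/s")
```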

Testing/validation:

  • Run GEMM test scripts to ensure they still pass after changes.
    • Passes on 300x.
    • Passes on 350x.
  • Test the benchmarking scripts with all flag combinations (a small sweep sketch follows below).
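
As a rough illustration of that flag sweep (not the exact commands used to validate this PR), the script name, flag spellings, and layout values below are assumptions drawn from the flags mentioned in this PR:

```python
# Hypothetical sweep over the flags added in this PR for one of the new
# benchmarks. The real CLI may differ.
import itertools
import subprocess
import sys

SCRIPT = "bench_gemm_a16w16.py"        # one of the benchmarks touched here (assumed filename)
LAYOUTS = ["TN", "NT", "NN"]           # assumed values accepted by --layout
EXTRAS = [[], ["--atomic"], ["--print_vgpr"]]

for layout, extra in itertools.product(LAYOUTS, EXTRAS):
    cmd = [sys.executable, SCRIPT, "--layout", layout, *extra]
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```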

Modify op_benchmark directory structure to add bench_tests/ and bench_model.py

* Why? The idea is to have a master script that benchmarks the full set of associated kernels when given a model name. It's a little cleaner to place all kernel benchmarking scripts in /kernels and have the bench_model script call them (a rough sketch follows below).
* How? See `bench_model.py`. Pytests are in bench_tests/
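
A minimal sketch of that master-script idea, assuming a kernels/ directory of bench_*.py scripts and a JSON model-config file; the paths, config format, and file names are illustrative assumptions, not the exact repository layout:

```python
# Hypothetical bench_model.py: given a model name, look up its config and
# dispatch to each per-kernel benchmark script under kernels/ via --model.
import json
import subprocess
import sys
from pathlib import Path

KERNELS_DIR = Path(__file__).parent / "kernels"               # assumed location of bench_*.py scripts
MODEL_CONFIGS = Path(__file__).parent / "model_configs.json"  # assumed model -> shape config file


def bench_model(model_name: str) -> None:
    configs = json.loads(MODEL_CONFIGS.read_text())
    if model_name not in configs:
        raise SystemExit(f"Unknown model '{model_name}'")

    # Run every kernel benchmark with the model flag so each script pulls the
    # shapes it needs for that model.
    for script in sorted(KERNELS_DIR.glob("bench_*.py")):
        subprocess.run([sys.executable, str(script), "--model", model_name], check=True)


if __name__ == "__main__":
    bench_model(sys.argv[1])
```
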
willzhou-amd self-assigned this on Jul 18, 2025
rahulbatra85 merged commit dbfe43b into main on Jul 23, 2025 (13 checks passed)
rahulbatra85 deleted the willz/additional-gemm-benchmarks branch on July 23, 2025 at 21:48
cagrikymk pushed a commit that referenced this pull request on Jul 30, 2025
* Modify op_benchmark directory structure to add bench_tests/ and bench_model.py

* Why? The idea is to have a master script that benchmarks the full set of associated kernels when given a model name. It's a little cleaner to place all kernel benchmarking scripts in /kernels and have the bench_model script call them.
* How? See `bench_model.py`. Pytests are in bench_tests/

* Update table formatting for bench_gemm_a8w8 and add tests for bench_gemm_a8w8 benchmarking.

* Add tensor parallel in bench_gemm_a8w8.py

* Add -no_glu arg, fix error in tensor parallelism, and reset folder structure

* Fix argparse & tensor parallel bug

* Update bench_gemm_a8w8_blockscale.py and add repeated code to benchmark_utils

* Consolidate bench fn

* Consolidate bench fn: int8 blockscale

* Unify argparse for MHA benchmarking

* Update configs for mha bench

* Broadcast updates to bench_batched_gemm_afp4wfp4.py

* Fix issue with arg names in bench_batched_gemm_afp4wfp4

* Add stride shape upcasting

* Broadcast changes to batch_gemm_afp4wfp4_pre_quant

* Improve code reuse + fix benchmarking FLOP computation bug

* Fix shape order to allow plots to display properly

* Sweep through moe, extend_attn, prefill, rmsnorm, rope to fix bugs and add --model arg

* Add --model and --shape support to bench_routing.py

* Add MOE information to deepseek model config

* Revert linting changes in the CK dir

* Revert linting changes to ck dir

* Black linting change

* Fix f-string issue

* Add --model support to bench_topk.py & set int64 stride flag in mha

* Undo linting changes to csrc

* Add informative error when trying to benchmark non-MoE models

* Format with Black

* Support model flag for bench_gemm_a16w16

* Add --layout flag support to int8 and fp16 GEMMs + set graph axes to logscale

* Add --layout support to afp4wfp4 GEMM

* Fix function naming in bench_gemm_afp4wfp4

* Replace missing comma

* Add --layout support to batched afp4wfp4 pre quant gemm

* Remove linting changes that removed CN comments

* Remove merge duplicates

* Undo linting changes that removed CN comments

* Fix bug with -M flag

* Add --layout support to a8w8 blockscale gemm

* Add --layout support to batched afp4wfp4 GEMM

* Formatting changes

* Formatting changes

* Debug shape issue that causes segfault when K > M

* Black linting change

* Fix issue where running batched GEMM benchmarking scripts with no args would yield a shape failure

* Add batched a8w8 benchmark

* Add batched bf16 benchmark

* Update a8fp4 tests to add input generating function

* Update test shapes

* Add benchmarking script for afp4wfp4 pre quant GEMM

* Linting changes

* Linting changes

* Stash changes

* Add -o flag and other fixes for benchmark scripts

* Fix moe_routing_sigmoid benchmark

* add Mi350 config json for extend attention

* Linting fixes

* More formatting fixes

* batched_gemm mxfp4 fixes

* Linting changes

* Update mla decode benchmark

* Update argparse for mla decode rope benchmark

* Stashing changes

* Fix kpack bug on MI350x

* Complete support for --model flag for mla_decode_rope benchmarking

* Linting changes

* Slight tune

* Revert unintentional changes from main in merge

* Remove MLA decode from PR - will be in next one

* Remove changes made to attention benchmarking scripts for this PR

* Undo accidental deletions & fix linting errors

* Add a8wfp4 benchmarking script & fix minimal test error

* Add --atomic flag for a16w16 GEMM benchmark

* Linting fix

* Update a8w8 benchmark - fix --shape flag for 4 args

* Update a16w16 benchmark - misc fixes for GEMM layout flag, -B flag, dtype conversions.

* Fix bug with -M flag for all new batched GEMM kernels

* Fix errors with afp4wfp4_pre_quant_atomic tests (change accum from bfloat16 to fp32)

* Arg changes to batched benchmarking scripts to support model name display when using --model all

* Fold common batched model benchmarking config code into utility script

* Add --get_vgpr flag to all GEMM benchmarking scripts

* Fix .json.json config (whoops)

* Fix issue with vgpr table output generation parsing where tables with single spacing would fail

* Fix vgpr bug with --model on benched batched gemm afp4wfp4 pre quant

---------

Co-authored-by: Rahul Batra <rahbatra@amd.com>