
Conversation

@willzhou-amd (Contributor)

Feature left over from #594.

Example usage (TN is the default layout):
python op_tests/op_benchmarks/triton/bench_gemm_afp4wfp4.py --model llama3-8B

GEMM MXFP4 x MXFP4 Benchmark:
          M  hidden_dim  intermediate_dim          fc1          fc2
0       1.0      4096.0           14336.0     2.178623     1.067121
1       2.0      4096.0           14336.0     4.361116     2.192569
2       4.0      4096.0           14336.0     8.585310     4.366088
3       8.0      4096.0           14336.0    17.086730     8.803934
4      16.0      4096.0           14336.0    35.114516    17.421856
5      32.0      4096.0           14336.0    71.882269    34.751653
6      64.0      4096.0           14336.0   230.024845    75.359828
7     128.0      4096.0           14336.0   463.229775   116.112126
8     256.0      4096.0           14336.0   980.903823   340.895389
9     512.0      4096.0           14336.0  1840.106305   399.608586
10   1024.0      4096.0           14336.0  2057.388603   815.243167
11   2048.0      4096.0           14336.0  2312.689157  1728.132125
12   4096.0      4096.0           14336.0  2522.819179  2896.957587
13   8192.0      4096.0           14336.0  2460.754141  2830.013473
14  16384.0      4096.0           14336.0  2637.954270  2771.990673

python op_tests/op_benchmarks/triton/bench_gemm_afp4wfp4.py --model llama3-8B --layout NN

GEMM MXFP4 x MXFP4 Benchmark:
          M  hidden_dim  intermediate_dim          fc1          fc2
0       1.0      4096.0           14336.0     2.245305     1.108923
1       2.0      4096.0           14336.0     4.619294     2.232645
2       4.0      4096.0           14336.0     9.049059     4.464549
3       8.0      4096.0           14336.0    18.133700     9.005187
4      16.0      4096.0           14336.0    35.591152    17.875154
5      32.0      4096.0           14336.0    70.403234    35.310685
6      64.0      4096.0           14336.0   237.979005    47.589362
7     128.0      4096.0           14336.0   382.201763    62.635731
8     256.0      4096.0           14336.0   840.920849   209.801130
9     512.0      4096.0           14336.0  1069.735320   189.859473
10   1024.0      4096.0           14336.0  1119.609327   374.350702
11   2048.0      4096.0           14336.0  1151.436867   733.300420
12   4096.0      4096.0           14336.0  1301.814199  1379.689386
13   8192.0      4096.0           14336.0  1303.733058  1391.629535
14  16384.0      4096.0           14336.0  1308.855620  1388.704204
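
For reference, below is a minimal sketch of how a --layout flag like this can be mapped onto operand memory layouts for a GEMM benchmark. It uses torch.matmul as a stand-in for the MXFP4 kernel, and the letter-to-operand convention is an assumption for illustration only; only the --layout flag name itself mirrors the benchmark scripts.

```python
# Hypothetical sketch: map a --layout string (TN/TT/NT/NN) onto operand
# memory layouts for a GEMM benchmark. torch.matmul stands in for the real
# MXFP4 kernel; the letter convention below is assumed, not taken from aiter.
import argparse
import torch


def make_operands(M, N, K, layout, dtype=torch.float32, device="cpu"):
    # First letter describes A (M, K), second describes B (K, N).
    # Assumed convention: "T" keeps the reduction dim K contiguous in memory,
    # "N" keeps the other dim contiguous (same logical shapes either way).
    a = torch.randn(M, K, dtype=dtype, device=device)  # K contiguous
    b = torch.randn(K, N, dtype=dtype, device=device)  # N contiguous
    if layout[0] == "N":
        a = a.T.contiguous().T  # (M, K) view over M-contiguous storage
    if layout[1] == "T":
        b = b.T.contiguous().T  # (K, N) view over K-contiguous storage
    return a, b


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--layout", default="TN", choices=["TN", "TT", "NT", "NN"])
    args = parser.parse_args()
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a, b = make_operands(16, 14336, 4096, args.layout, device=device)
    c = torch.matmul(a, b)  # placeholder for the fp4 x fp4 GEMM under test
    print(args.layout, a.stride(), b.stride(), c.shape)
```

Sweeping layouts matters for the reason the two tables above show: the same M/K/N shapes can reach noticeably different throughput depending on how A and B are laid out in memory.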

@willzhou-amd self-assigned this on Jul 11, 2025
@willzhou-amd changed the title from "[TRITON]: Add memory layout flag (TN, TT, NT, NN) to benchmarking scripts." to "[TRITON]: Add memory layout flag (TN, TT, NT, NN) to GEMM benchmarking scripts." on Jul 11, 2025
@willzhou-amd requested review from @azaidy and @vgokhale on Jul 11, 2025
@rahulbatra85 changed the title from "[TRITON]: Add memory layout flag (TN, TT, NT, NN) to GEMM benchmarking scripts." to "[TRITON]: Benchmarking scripts updates" on Jul 17, 2025
@rahulbatra85 force-pushed the willz/benchmarking-memory-layout branch from fb3b15e to 54f9478 on Jul 17, 2025
@rahulbatra85 requested a review from @zhanglx13 on Jul 18, 2025
@azaidy (Contributor) left a comment:

LGTM!

@rahulbatra85 merged commit 43c2a7f into main on Jul 18, 2025 (13 checks passed).
@rahulbatra85 deleted the willz/benchmarking-memory-layout branch on Jul 18, 2025.
@cagrikymk pushed a commit that referenced this pull request on Jul 30, 2025:
* Modify op_benchmark directory structure to add bench_tests/ and bench_model.py

* Why? The idea is to have a master script that benchmarks the full set of associated kernels when given a model name. It's a little cleaner to place all kernel benchmarking scripts in /kernels and have the bench_model script call them (a rough dispatcher sketch follows after this commit log).
* How? See `bench_model.py`. Pytests are in bench_tests/

* Update table formatting for bench_gemm_a8w8 and add tests for bench_gemm_a8w8 benchmarking.

* Add tensor parallel in bench_gemm_a8w8.py

* Add -no_glu arg, fix error in tensor parallelism, and reset folder structure

* Fix argparse & tensor parallel bug

* Update bench_gemm_a8w8_blockscale.py and add repeated code to benchmark_utils

* Consolidate bench fn

* Consolidate bench fn: int8 blockscale

* Unify argparse for MHA benchmarking

* Update configs for mha bench

* Broadcast updates to bench_batched_gemm_afp4wfp4.py

* Fix issue with arg names in bench_batched_gemm_afp4wfp4

* Add stride shape upcasting

* Broadcast changes to batch_gemm_afp4wfp4_pre_quant

* Improve code reuse + fix benchmarking FLOP computation bug

* Fix shape order to allow plots to display properly

* Sweep through moe, extend_attn, prefill, rmsnorm, rope to fix bugs and add --model arg

* Add --model and --shape support to bench_routing.py

* Add MOE information to deepseek model config

* Revert linting changes in the CK dir

* Revert linting changes to ck dir

* Black linting change

* Fix f-string issue

* Add --model support to bench_topk.py & set int64 stride flag in mha

* Undo linting changes to csrc

* Add informative error when trying to benchmark non-MoE models

* Format with Black

* Support model flag for bench_gemm_a16w16

* Add --layout flag support to int8 and fp16 GEMMs + set graph axes to logscale

* Add --layout support to afp4wfp4 GEMM

* Fix function naming in bench_gemm_afp4wfp4

* Replace missing comma

* Add --layout support to batched afp4wfp4 pre quant gemm

* Remove merge duplicates

* Undo linting changes that removed CN comments

* Fix bug with -M flag

* Add --layout support to a8w8 blockscale gemm

* Add --layout support to batched afp4wfp4 GEMM

* Formatting changes

* Formatting changes

* Debug shape issue that causes segfault when K > M

* Black linting change

* Fix issue where running batched GEMM benchmarking scripts with no args would yield a shape failure

* Linting changes

* Add -o flag and other fixes for benchmark scripts

* Fix moe_routing_sigmoid benchmark

* Add MI350 config JSON for extend attention

* Linting fixes

* More formatting fixes

* batched_gemm mxfp4 fixes

* Linting changes

* Fix batched_gemm_afp4wfp4_pre_quant benchmark

---------

Co-authored-by: Rahul Batra <rahbatra@amd.com>
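
As a companion to the first commit above, here is a rough, illustrative-only sketch of what a bench_model.py-style dispatcher can look like: given a model name, it invokes each per-kernel benchmark script under kernels/ with the shared flags. The kernel list, script paths, and forwarded flags are assumptions for illustration, not the repository's actual interface.

```python
# Illustrative sketch of a bench_model.py-style dispatcher. The kernel list,
# script paths, and forwarded flags are assumptions, not aiter's actual API.
import argparse
import subprocess
import sys

# Hypothetical mapping from kernel name to its benchmark script under kernels/.
KERNEL_SCRIPTS = {
    "gemm_a8w8": "kernels/bench_gemm_a8w8.py",
    "gemm_afp4wfp4": "kernels/bench_gemm_afp4wfp4.py",
    "mha": "kernels/bench_mha.py",
}


def bench_model(model: str, layout: str) -> None:
    """Run every per-kernel benchmark for the given model name."""
    for name, script in KERNEL_SCRIPTS.items():
        print(f"=== {name} ===")
        # Each kernel script keeps its own argparse interface; the master
        # script just forwards the shared --model / --layout flags.
        subprocess.run(
            [sys.executable, script, "--model", model, "--layout", layout],
            check=True,
        )


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True, help="e.g. llama3-8B")
    parser.add_argument("--layout", default="TN", choices=["TN", "TT", "NT", "NN"])
    args = parser.parse_args()
    bench_model(args.model, args.layout)
```

Keeping each per-kernel script independently runnable while a thin dispatcher forwards the shared flags is the design the commit message describes; the pytest files under bench_tests/ can then exercise either entry point.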