
Conversation

@willzhou-amd (Contributor)

Feature left over from #594.

Example usage (TN is the default layout):
python op_tests/op_benchmarks/triton/bench_gemm_afp4wfp4.py --model llama3-8B

GEMM MXFP4 x MXFP4 Benchmark:
          M  hidden_dim  intermediate_dim          fc1          fc2
0       1.0      4096.0           14336.0     2.178623     1.067121
1       2.0      4096.0           14336.0     4.361116     2.192569
2       4.0      4096.0           14336.0     8.585310     4.366088
3       8.0      4096.0           14336.0    17.086730     8.803934
4      16.0      4096.0           14336.0    35.114516    17.421856
5      32.0      4096.0           14336.0    71.882269    34.751653
6      64.0      4096.0           14336.0   230.024845    75.359828
7     128.0      4096.0           14336.0   463.229775   116.112126
8     256.0      4096.0           14336.0   980.903823   340.895389
9     512.0      4096.0           14336.0  1840.106305   399.608586
10   1024.0      4096.0           14336.0  2057.388603   815.243167
11   2048.0      4096.0           14336.0  2312.689157  1728.132125
12   4096.0      4096.0           14336.0  2522.819179  2896.957587
13   8192.0      4096.0           14336.0  2460.754141  2830.013473
14  16384.0      4096.0           14336.0  2637.954270  2771.990673

python op_tests/op_benchmarks/triton/bench_gemm_afp4wfp4.py --model llama3-8B --layout NN

GEMM MXFP4 x MXFP4 Benchmark:
          M  hidden_dim  intermediate_dim          fc1          fc2
0       1.0      4096.0           14336.0     2.245305     1.108923
1       2.0      4096.0           14336.0     4.619294     2.232645
2       4.0      4096.0           14336.0     9.049059     4.464549
3       8.0      4096.0           14336.0    18.133700     9.005187
4      16.0      4096.0           14336.0    35.591152    17.875154
5      32.0      4096.0           14336.0    70.403234    35.310685
6      64.0      4096.0           14336.0   237.979005    47.589362
7     128.0      4096.0           14336.0   382.201763    62.635731
8     256.0      4096.0           14336.0   840.920849   209.801130
9     512.0      4096.0           14336.0  1069.735320   189.859473
10   1024.0      4096.0           14336.0  1119.609327   374.350702
11   2048.0      4096.0           14336.0  1151.436867   733.300420
12   4096.0      4096.0           14336.0  1301.814199  1379.689386
13   8192.0      4096.0           14336.0  1303.733058  1391.629535
14  16384.0      4096.0           14336.0  1308.855620  1388.704204
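
For reference, below is a minimal sketch of how a --layout flag like this can be mapped onto operand memory layouts for a GEMM benchmark. It uses torch.matmul as a stand-in for the MXFP4 kernel, and the letter-to-operand convention is an assumption for illustration only; only the --layout flag name itself mirrors the benchmark scripts.

```python
# Hypothetical sketch: map a --layout string (TN/TT/NT/NN) onto operand
# memory layouts for a GEMM benchmark. torch.matmul stands in for the real
# MXFP4 kernel; the letter convention below is assumed, not taken from aiter.
import argparse
import torch


def make_operands(M, N, K, layout, dtype=torch.float32, device="cpu"):
    # First letter describes A (M, K), second describes B (K, N).
    # Assumed convention: "T" keeps the reduction dim K contiguous in memory,
    # "N" keeps the other dim contiguous (same logical shapes either way).
    a = torch.randn(M, K, dtype=dtype, device=device)  # K contiguous
    b = torch.randn(K, N, dtype=dtype, device=device)  # N contiguous
    if layout[0] == "N":
        a = a.T.contiguous().T  # (M, K) view over M-contiguous storage
    if layout[1] == "T":
        b = b.T.contiguous().T  # (K, N) view over K-contiguous storage
    return a, b


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--layout", default="TN", choices=["TN", "TT", "NT", "NN"])
    args = parser.parse_args()
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a, b = make_operands(16, 14336, 4096, args.layout, device=device)
    c = torch.matmul(a, b)  # placeholder for the fp4 x fp4 GEMM under test
    print(args.layout, a.stride(), b.stride(), c.shape)
```

Sweeping layouts matters for the reason the two tables above show: the same M/K/N shapes can reach noticeably different throughput depending on how A and B are laid out in memory.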

@willzhou-amd self-assigned this on Jul 11, 2025
@willzhou-amd changed the title from "[TRITON]: Add memory layout flag (TN, TT, NT, NN) to benchmarking scripts." to "[TRITON]: Add memory layout flag (TN, TT, NT, NN) to GEMM benchmarking scripts." on Jul 11, 2025
@willzhou-amd requested review from @azaidy and @vgokhale on Jul 11, 2025
@rahulbatra85 changed the title from "[TRITON]: Add memory layout flag (TN, TT, NT, NN) to GEMM benchmarking scripts." to "[TRITON]: Benchmarking scripts updates" on Jul 17, 2025
@rahulbatra85 force-pushed the willz/benchmarking-memory-layout branch from fb3b15e to 54f9478 on Jul 17, 2025
@rahulbatra85 requested a review from @zhanglx13 on Jul 18, 2025
@azaidy (Contributor) left a comment:

LGTM!

@rahulbatra85 merged commit 43c2a7f into main on Jul 18, 2025 (13 checks passed).
@rahulbatra85 deleted the willz/benchmarking-memory-layout branch on Jul 18, 2025.
@cagrikymk pushed a commit that referenced this pull request on Jul 30, 2025:
* Modify op_benchmark directory structure to add bench_tests/ and bench_model.py

* Why? The idea is to have a master script that benchmarks the full set of associated kernels when given a model name. It's a little cleaner to place all kernel benchmarking scripts in /kernels and have the bench_model script call them (a rough dispatcher sketch follows after this commit log).
* How? See `bench_model.py`. Pytests are in bench_tests/

* Update table formatting for bench_gemm_a8w8 and add tests for bench_gemm_a8w8 benchmarking.

* Add tensor parallel in bench_gemm_a8w8.py

* Add -no_glu arg, fix error in tensor parallelism, and reset folder structure

* Fix argparse & tensor parallel bug

* Update bench_gemm_a8w8_blockscale.py and add repeated code to benchmark_utils

* Consolidate bench fn

* Consolidate bench fn: int8 blockscale

* Unify argparse for MHA benchmarking

* Update configs for mha bench

* Broadcast updates to bench_batched_gemm_afp4wfp4.py

* Fix issue with arg names in bench_batched_gemm_afp4wfp4

* Add stride shape upcasting

* Broadcast changes to batch_gemm_afp4wfp4_pre_quant

* Improve code reuse + fix benchmarking FLOP computation bug

* Fix shape order to allow plots to display properly

* Sweep through moe, extend_attn, prefill, rmsnorm, rope to fix bugs and add --model arg

* Add --model and --shape support to bench_routing.py

* Add MOE information to deepseek model config

* Revert linting changes in the CK dir

* Revert linting changes to ck dir

* Black linting change

* Fix f-string issue

* Add --model support to bench_topk.py & set int64 stride flag in mha

* Undo linting changes to csrc

* Add informative error when trying to benchmark non-MoE models

* Format with Black

* Support model flag for bench_gemm_a16w16

* Add --layout flag support to int8 and fp16 GEMMs + set graph axes to logscale

* Add --layout support to afp4wfp4 GEMM

* Fix function naming in bench_gemm_afp4wfp4

* Replace missing comma

* Add --layout support to batched afp4wfp4 pre quant gemm

* Remove merge duplicates

* Undo linting changes that removed CN comments

* Fix bug with -M flag

* Add --layout support to a8w8 blockscale gemm

* Add --layout support to batched afp4wfp4 GEMM

* Formatting changes

* Formatting changes

* Debug shape issue that causes segfault when K > M

* Black linting change

* Fix issue where running batched GEMM benchmarking scripts with no args would yield a shape failure

* Linting changes

* Add -o flag and other fixes for benchmark scripts

* Fix moe_routing_sigmoid benchmark

* Add MI350 config JSON for extend attention

* Linting fixes

* More formatting fixes

* batched_gemm mxfp4 fixes

* Linting changes

* Fix batched_gemm_afp4wfp4_pre_quant benchmark

---------

Co-authored-by: Rahul Batra <rahbatra@amd.com>
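
As a companion to the first commit above, here is a rough, illustrative-only sketch of what a bench_model.py-style dispatcher can look like: given a model name, it invokes each per-kernel benchmark script under kernels/ with the shared flags. The kernel list, script paths, and forwarded flags are assumptions for illustration, not the repository's actual interface.

```python
# Illustrative sketch of a bench_model.py-style dispatcher. The kernel list,
# script paths, and forwarded flags are assumptions, not aiter's actual API.
import argparse
import subprocess
import sys

# Hypothetical mapping from kernel name to its benchmark script under kernels/.
KERNEL_SCRIPTS = {
    "gemm_a8w8": "kernels/bench_gemm_a8w8.py",
    "gemm_afp4wfp4": "kernels/bench_gemm_afp4wfp4.py",
    "mha": "kernels/bench_mha.py",
}


def bench_model(model: str, layout: str) -> None:
    """Run every per-kernel benchmark for the given model name."""
    for name, script in KERNEL_SCRIPTS.items():
        print(f"=== {name} ===")
        # Each kernel script keeps its own argparse interface; the master
        # script just forwards the shared --model / --layout flags.
        subprocess.run(
            [sys.executable, script, "--model", model, "--layout", layout],
            check=True,
        )


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True, help="e.g. llama3-8B")
    parser.add_argument("--layout", default="TN", choices=["TN", "TT", "NT", "NN"])
    args = parser.parse_args()
    bench_model(args.model, args.layout)
```

Keeping each per-kernel script independently runnable while a thin dispatcher forwards the shared flags is the design the commit message describes; the pytest files under bench_tests/ can then exercise either entry point.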