
Conversation

willzhou-amd (Contributor) commented on Jul 18, 2025

A few of our GEMM kernels don't have corresponding benchmarks.

Added GEMM benchmarking scripts:

  • Batched a8w8 GEMM
  • Batched a16w16 GEMM
  • afp4wfp4 pre_quant_atomic GEMM
  • a8wfp4 GEMM
  • --atomic flag for a16w16 GEMM

Additional changes:

  • Previously, some benchmarking scripts generated their inputs internally and broke whenever the kernel API changed. This change makes every GEMM benchmark call a helper function from the test scripts (e.g. generate_gemm_axwx_inputs) to get its inputs; a sketch of this pattern follows below.
  • Added --print_vgpr flag to all benchmarking scripts
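
To illustrate the input-helper pattern from the first bullet above, here is a minimal, hypothetical sketch. The helper name, module paths, and kernel entry point are assumptions for illustration, not the exact repository API.

```python
# Hypothetical sketch of a benchmark that reuses a test-side input helper
# instead of building its own tensors. Import paths, the helper name, and the
# kernel entry point below are illustrative assumptions.
import triton

from op_tests.test_gemm_a8w8 import generate_gemm_a8w8_inputs  # assumed test helper
from aiter.ops.gemm_a8w8 import gemm_a8w8                      # assumed kernel under test


def bench_gemm_a8w8(M: int, N: int, K: int) -> None:
    # Inputs come from the shared helper, so a kernel API change only has to
    # be reflected in one place (the test script), not in every benchmark.
    x, w, x_scale, w_scale, out_dtype = generate_gemm_a8w8_inputs(M, N, K)

    ms = triton.testing.do_bench(
        lambda: gemm_a8w8(x, w, x_scale, w_scale, dtype=out_dtype),
        warmup=25,
        rep=100,
    )
    tflops = 2.0 * M * N * K / (ms * 1e-3) / 1e12
    print(f"M={M} N={N} K={K}: {ms:.3f} ms, {tflops:.2f} TFLOP/s")
```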

Testing/validation:

  • Run GEMM test scripts to ensure they still pass after changes.
    • Passes on 300x.
    • Passes on 350x.
  • Test the benchmarking scripts with all flag combinations (a small sweep sketch follows below).
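
As a rough illustration of that flag sweep (not the exact commands used to validate this PR), the script name, flag spellings, and layout values below are assumptions drawn from the flags mentioned in this PR:

```python
# Hypothetical sweep over the flags added in this PR for one of the new
# benchmarks. The real CLI may differ.
import itertools
import subprocess
import sys

SCRIPT = "bench_gemm_a16w16.py"        # one of the benchmarks touched here (assumed filename)
LAYOUTS = ["TN", "NT", "NN"]           # assumed values accepted by --layout
EXTRAS = [[], ["--atomic"], ["--print_vgpr"]]

for layout, extra in itertools.product(LAYOUTS, EXTRAS):
    cmd = [sys.executable, SCRIPT, "--layout", layout, *extra]
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```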

Modify op_benchmark directory structure to add bench_tests/ and bench_model.py

* Why? The idea is to have a master script that benchmarks the full set of associated kernels when given a model name. It's a little cleaner to place all kernel benchmarking scripts in /kernels and have the bench_model script call them (a rough sketch follows below).
* How? See `bench_model.py`. Pytests are in bench_tests/
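
A minimal sketch of that master-script idea, assuming a kernels/ directory of bench_*.py scripts and a JSON model-config file; the paths, config format, and file names are illustrative assumptions, not the exact repository layout:

```python
# Hypothetical bench_model.py: given a model name, look up its config and
# dispatch to each per-kernel benchmark script under kernels/ via --model.
import json
import subprocess
import sys
from pathlib import Path

KERNELS_DIR = Path(__file__).parent / "kernels"               # assumed location of bench_*.py scripts
MODEL_CONFIGS = Path(__file__).parent / "model_configs.json"  # assumed model -> shape config file


def bench_model(model_name: str) -> None:
    configs = json.loads(MODEL_CONFIGS.read_text())
    if model_name not in configs:
        raise SystemExit(f"Unknown model '{model_name}'")

    # Run every kernel benchmark with the model flag so each script pulls the
    # shapes it needs for that model.
    for script in sorted(KERNELS_DIR.glob("bench_*.py")):
        subprocess.run([sys.executable, str(script), "--model", model_name], check=True)


if __name__ == "__main__":
    bench_model(sys.argv[1])
```
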
willzhou-amd self-assigned this on Jul 18, 2025
rahulbatra85 merged commit dbfe43b into main on Jul 23, 2025 (13 checks passed)
rahulbatra85 deleted the willz/additional-gemm-benchmarks branch on July 23, 2025 at 21:48
cagrikymk pushed a commit that referenced this pull request on Jul 30, 2025
* Modify op_benchmark directory structure to add bench_tests/ and bench_model.py

* Why? The idea is to have a master script that benchmarks the full set of associated kernels when given a model name. It's a little cleaner to place all kernel benchmarking scripts in /kernels and have the bench_model script call them.
* How? See `bench_model.py`. Pytests are in bench_tests/

* Update table formatting for bench_gemm_a8w8 and add tests for bench_gemm_a8w8 benchmarking.

* Add tensor parallel in bench_gemm_a8w8.py

* Add -no_glu arg, fix error in tensor parallelism, and reset folder structure

* Fix argparse & tensor parallel bug

* Update bench_gemm_a8w8_blockscale.py and add repeated code to benchmark_utils

* Consolidate bench fn

* Consolidate bench fn: int8 blockscale

* Unify argparse for MHA benchmarking

* Update configs for mha bench

* Broadcast updates to bench_batched_gemm_afp4wfp4.py

* Fix issue with arg names in bench_batched_gemm_afp4wfp4

* Add stride shape upcasting

* Broadcast changes to batch_gemm_afp4wfp4_pre_quant

* Improve code reuse + fix benchmarking FLOP computation bug

* Fix shape order to allow plots to display properly

* Sweep through moe, extend_attn, prefill, rmsnorm, rope to fix bugs and add --model arg

* Add --model and --shape support to bench_routing.py

* Add MOE information to deepseek model config

* Revert linting changes in the CK dir

* Revert linting changes to ck dir

* Black linting change

* Fix f-string issue

* Add --model support to bench_topk.py & set int64 stride flag in mha

* Undo linting changes to csrc

* Add informative error when trying to benchmark non-MoE models

* Format with Black

* Support model flag for bench_gemm_a16w16

* Add --layout flag support to int8 and fp16 GEMMs + set graph axes to logscale

* Add --layout support to afp4wfp4 GEMM

* Fix function naming in bench_gemm_afp4wfp4

* Replace missing comma

* Add --layout support to batched afp4wfp4 pre quant gemm

* Remove linting changes that removed CN comments

* Remove merge duplicates

* Undo linting changes that removed CN comments

* Fix bug with -M flag

* Add --layout support to a8w8 blockscale gemm

* Add --layout support to batched afp4wfp4 GEMM

* Formatting changes

* Formatting changes

* Debug shape issue that causes segfault when K > M

* Black linting change

* Fix issue where running batched GEMM benchmarking scripts with no args would yield a shape failure

* Add batched a8w8 benchmark

* Add batched bf16 benchmark

* Update a8fp4 tests to add input generating function

* Update test shapes

* Add benchmarking script for afp4wfp4 pre quant GEMM

* Linting changes

* Linting changes

* Stash changes

* Add -o flag and other fixes for benchmark scripts

* Fix moe_routing_sigmoid benchmark

* add Mi350 config json for extend attention

* Linting fixes

* More formatting fixes

* batched_gemm mxfp4 fixes

* Linting changes

* Update mla decode benchmark

* Update argparse for mla decode rope benchmark

* Stashing changes

* Fix kpack bug on MI350x

* Complete support for --model flag for mla_decode_rope benchmarking

* Linting changes

* Slight tune

* Revert unintentional changes from main in merge

* Remove MLA decode from PR - will be in next one

* Remove changes made to attention benchmarking scripts for this PR

* Undo accidental deletions & fix linting errors

* Add a8wfp4 benchmarking script & fix minimal test error

* Add --atomic flag for a16w16 GEMM benchmark

* Linting fix

* Update a8w8 benchmark - fix --shape flag for 4 args

* Update a16w16 benchmark - misc fixes for GEMM layout flag, -B flag, dtype conversions.

* Fix bug with -M flag for all new batched GEMM kernels

* Fix errors with afp4wfp4_pre_quant_atomic tests (change accum from bfloat16 to fp32)

* Arg changes to batched benchmarking scripts to support model name display when using --model all

* Fold common batched model benchmarking config code into utility script

* Add --get_vgpr flag to all GEMM benchmarking scripts

* Fix .json.json config (whoops)

* Fix issue with vgpr table output generation parsing where tables with single spacing would fail

* Fix vgpr bug with --model on benched batched gemm afp4wfp4 pre quant

---------

Co-authored-by: Rahul Batra <rahbatra@amd.com>