[TRITON]: Add missing GEMM benchmarks #680
Merged
Conversation
rahulbatra85 approved these changes on Jul 23, 2025
cagrikymk pushed a commit that referenced this pull request on Jul 30, 2025
* Modify op_benchmark directory structure to add bench_tests/ and bench_model.py
  * Why? The idea is to have a master script that benchmarks the full set of associated kernels when given a model name. It's a little cleaner to place all kernel benchmarking scripts in /kernels and have the bench_model script call them.
  * How? See `bench_model.py`. Pytests are in bench_tests/
* Update table formatting for bench_gemm_a8w8 and add tests for bench_gemm_a8w8 benchmarking.
* Add tensor parallel in bench_gemm_a8w8.py
* Add -no_glu arg, fix error in tensor parallelism, and reset folder structure
* Fix argparse & tensor parallel bug
* Update bench_gemm_a8w8_blockscale.py and add repeated code to benchmark_utils
* Consolidate bench fn
* Consolidate bench fn: int8 blockscale
* Unify argparse for MHA benchmarking
* Update configs for mha bench
* Broadcast updates to bench_batched_gemm_afp4wfp4.py
* Fix issue with arg names in bench_batched_gemm_afp4wfp4
* Add stride shape upcasting
* Broadcast changes to batch_gemm_afp4wfp4_pre_quant
* Improve code reuse + fix benchmarking FLOP computation bug
* Fix shape order to allow plots to display properly
* Sweep through moe, extend_attn, prefill, rmsnorm, rope to fix bugs and add --model arg
* Add --model and --shape support to bench_routing.py
* Add MOE information to deepseek model config
* Revert linting changes in the CK dir
* Revert linting changes to ck dir
* Black linting change
* Fix f-string issue
* Add --model support to bench_topk.py & set int64 stride flag in mha
* Undo linting changes to csrc
* Add informative error when trying to benchmark non-MoE models
* Format with Black
* Support model flag for bench_gemm_a16w16
* Add --layout flag support to int8 and fp16 GEMMs + set graph axes to logscale
* Add --layout support to afp4wfp4 GEMM
* Fix function naming in bench_gemm_afp4wfp4
* Replace missing comma
* Add --layout support to batched afp4wfp4 pre quant gemm
* Remove linting changes that removed CN comments
* Remove merge duplicates
* Undo linting changes that removed CN comments
* Fix bug with -M flag
* Add --layout support to a8w8 blockscale gemm
* Add --layout support to batched afp4wfp4 GEMM
* Formatting changes
* Formatting changes
* Debug shape issue that causes segfault when K > M
* Black linting change
* Fix issue where running batched GEMM benchmarking scripts with no args would yield a shape failure
* Add batched a8w8 benchmark
* Add batched bf16 benchmark
* Update a8fp4 tests to add input generating function
* Update test shapes
* Add benchmarking script for afp4wfp4 pre quant GEMM
* Linting changes
* Linting changes
* Stash changes
* Add -o flag and other fixes for benchmark scripts
* Fix moe_routing_sigmoid benchmark
* add Mi350 config json for extend attention
* Linting fixes
* More formatting fixes
* batched_gemm mxfp4 fixes
* Linting changes
* Update mla decode benchmark
* Update argparse for mla decode rope benchmark
* Stashing changes
* Fix kpack bug on MI350x
* Complete support for --model flag for mla_decode_rope benchmarking
* Linting changes
* Slight tune
* Revert unintentional changes from main in merge
* Remove MLA decode from PR - will be in next one
* Remove changes made to attention benchmarking scripts for this PR
* Undo accidental deletions & fix linting errors
* Add a8wfp4 benchmarking script & fix minimal test error
* Add --atomic flag for a16w16 GEMM benchmark
* Linting fix
* Update a8w8 benchmark - fix --shape flag for 4 args
* Update a16w16 benchmark - misc fixes for GEMM layout flag, -B flag, dtype conversions.
* Fix bug with -M flag for all new batched GEMM kernels
* Fix errors with afp4wfp4_pre_quant_atomic tests (change accum from bfloat16 to fp32)
* Arg changes to batched benchmarking scripts to support model name display when using --model all
* Fold common batched model benchmarking config code into utility script
* Add --get_vgpr flag to all GEMM benchmarking scripts
* Fix .json.json config (whoops)
* Fix issue with vgpr table output generation parsing where tables with single spacing would fail
* Fix vgpr bug with --model on benched batched gemm afp4wfp4 pre quant

---------

Co-authored-by: Rahul Batra <rahbatra@amd.com>
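The first commit above describes the layout this PR converges on: per-kernel benchmarking scripts live under kernels/, and a bench_model.py entry point runs the full set of kernel benchmarks associated with a given model name. The sketch below only illustrates that dispatch pattern; the model-to-kernel mapping, module paths, and `run_benchmark` entry point are assumptions made for illustration, not code from this PR.

```python
import argparse
import importlib

# Hypothetical mapping from model name to the per-kernel benchmark modules that
# cover it; the real bench_model.py may derive this from model config files.
MODEL_KERNELS = {
    "llama": ["kernels.bench_gemm_a16w16", "kernels.bench_gemm_a8w8"],
    "deepseek": ["kernels.bench_gemm_a8w8_blockscale", "kernels.bench_moe"],
}


def run_model_benchmarks(model: str) -> None:
    """Import each kernel benchmark module mapped to `model` and run it."""
    for module_name in MODEL_KERNELS.get(model, []):
        module = importlib.import_module(module_name)
        # Assumes each kernel script exposes a run_benchmark(model=...) entry
        # point; the scripts in this PR may use a different interface.
        module.run_benchmark(model=model)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Benchmark all kernels for a model")
    parser.add_argument("--model", required=True, help="Model name, or 'all'")
    args = parser.parse_args()
    targets = list(MODEL_KERNELS) if args.model == "all" else [args.model]
    for name in targets:
        run_model_benchmarks(name)
```

Under these assumptions, `python bench_model.py --model deepseek` would run every benchmark mapped to that model, and `--model all` would sweep every configured model.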
A few of our GEMM kernels don't have corresponding benchmarks.

Added GEMM benchmarking scripts:

- `--atomic` flag for a16w16 GEMM

Additional changes:

- GEMM benchmarking scripts use a shared input-generating function (`generate_gemm_axwx_inputs`) to get their inputs (see the sketch after this section).
- `--print_vgpr` flag to all benchmarking scripts

Testing/validation:
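As an illustration of the "Additional changes" items above, here is one plausible shape for the shared input-generating helper and for an argument parser carrying the flags this PR mentions (`--model`, `--shape`, `--layout`, `--atomic`, `--print_vgpr`). This is a hedged sketch under assumed signatures, defaults, and help text, not the code in the repository.

```python
import argparse

import torch


def generate_gemm_axwx_inputs(M, N, K, dtype=torch.bfloat16, device="cuda"):
    """Create an activation (M x K) and a weight (N x K) tensor for a GEMM benchmark.

    Stand-in for the helper named above: the real function likely also produces
    quantization scales and respects the --layout flag's stride conventions.
    """
    x = torch.randn((M, K), device=device).to(dtype)
    w = torch.randn((N, K), device=device).to(dtype)
    return x, w


def get_parser():
    """Flags mentioned in this PR; exact names, defaults, and help text are guesses."""
    parser = argparse.ArgumentParser(description="GEMM benchmark")
    parser.add_argument("--model", type=str, help="Take shapes from a named model config, or 'all'")
    parser.add_argument("--shape", type=int, nargs="+", help="Explicit GEMM shape, e.g. M N K")
    parser.add_argument("--layout", type=str, default="TN", help="Operand memory layout")
    parser.add_argument("--atomic", action="store_true", help="Benchmark the atomic a16w16 variant")
    parser.add_argument("--print_vgpr", action="store_true", help="Report VGPR usage per kernel")
    return parser
```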