Conversation
@willzhou-amd willzhou-amd commented Jul 10, 2025

A few config changes to speed up the FP4xFP4 GEMM.

…_model.py

* Why? The idea is to have a master script that benchmarks the full set of associated kernels when given a model name. It is cleaner to place all kernel benchmarking scripts in /kernels and have the bench_model script call them.
* How? See `bench_model.py`. Pytests are in bench_tests/.
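The dispatch idea described above can be sketched as follows. This is an illustrative sketch only, assuming a simple registry pattern; the names here (`KERNEL_BENCHES`, `register`, `bench_model`) are hypothetical and not the actual API of `bench_model.py`:

```python
# Hypothetical sketch of a master benchmarking script: per-kernel benchmark
# scripts under /kernels self-register an entry point, and bench_model()
# runs all of them for a given model name. All names are illustrative.

from typing import Callable, Dict

# Registry mapping a kernel name to its benchmark entry point.
KERNEL_BENCHES: Dict[str, Callable[[str], float]] = {}

def register(name: str):
    """Decorator so each kernel benchmark script can self-register."""
    def wrap(fn: Callable[[str], float]) -> Callable[[str], float]:
        KERNEL_BENCHES[name] = fn
        return fn
    return wrap

@register("afp4wfp4_gemm")
def bench_fp4_gemm(model: str) -> float:
    # Placeholder timing; a real script would launch the tuned Triton kernel
    # on the GEMM shapes used by the named model.
    return 1.0

def bench_model(model: str) -> Dict[str, float]:
    """Run every registered kernel benchmark for the given model."""
    return {name: fn(model) for name, fn in KERNEL_BENCHES.items()}

if __name__ == "__main__":
    print(bench_model("llama"))
```

The registry keeps the master script decoupled from individual kernels: adding a new benchmark under /kernels requires no change to the dispatcher itself.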
@willzhou-amd willzhou-amd self-assigned this Jul 10, 2025
@willzhou-amd willzhou-amd requested a review from azaidy July 11, 2025 20:28
@azaidy azaidy left a comment

LGTM!

@willzhou-amd

Note: This kernel was tuned on the rocm/triton compiler; the upstream Triton compiler yields fairly poor performance.

@willzhou-amd willzhou-amd merged commit 53647f6 into main Jul 18, 2025
13 checks passed
@willzhou-amd willzhou-amd deleted the willz/fp4-gemm-tuning branch July 18, 2025 19:10
cagrikymk pushed a commit that referenced this pull request Jul 30, 2025
* Finish tuning afp4wfp4 GEMM (~2x speedup)

* Improve performance on standard model (e.g. Llama) shapes

* Add new shape tune

* Optimize performance of the default config for standard model shapes

* Tune performance on N=1280, K=8192 with compiler flags
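The tuning pattern described in these commit messages can be sketched as a shape-keyed config lookup with a default fallback. This is a minimal illustration only: the block sizes and warp counts below are made up, and only the N=1280, K=8192 shape is taken from the PR itself:

```python
# Illustrative sketch of shape-keyed GEMM config selection: hand-tuned
# configs for specific (N, K) shapes, with a default tuned for standard
# model (e.g. Llama) shapes. All config values here are hypothetical.

# Default config, assumed tuned for common model shapes.
DEFAULT_CONFIG = {"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 256, "num_warps": 8}

# Hand-tuned per-shape entries; N=1280, K=8192 is the shape mentioned
# in the commit message, the config values are placeholders.
TUNED_CONFIGS = {
    (1280, 8192): {"BLOCK_M": 64, "BLOCK_N": 128, "BLOCK_K": 256, "num_warps": 4},
}

def pick_config(N: int, K: int) -> dict:
    """Return the hand-tuned config for this (N, K) shape, else the default."""
    return TUNED_CONFIGS.get((N, K), DEFAULT_CONFIG)
```

Keeping a small table of per-shape overrides on top of one well-chosen default is a common way to tune a kernel for a few hot shapes without regressing the general case.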


3 participants