
HIP: fix RDNA3 FP16/BF16 matrix multiplication #17817

Merged
JohannesGaessler merged 1 commit into ggml-org:master from JohannesGaessler:hip-fix-rdna3-mmf
Dec 6, 2025

Conversation

@JohannesGaessler
Contributor

Fixes #17797 by simply adding an explicit RDNA4 requirement to MMF. @jiachengjason, as outlined in https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#pull-requests-for-contributors--collaborators, please test changes to the CUDA/HIP backend for correctness using test-backend-ops.

Contributor

@Beinsezii left a comment


Full test-backend-ops green now on gfx1100

@Beinsezii
Contributor

Though, interestingly, according to llama-bench I'm down 40% on pp8192 now compared to before #17576. Is it not using the rocWMMA path anymore?

@Beinsezii
Contributor

Looking at the build command in #17576 (comment), they have GGML_HIP_ROCWMMA_FATTN=OFF, so I assume the ON configuration was never tested, and it's no longer functional.

@Beinsezii
Contributor

Beinsezii commented Dec 6, 2025

yeah.

Using a realistic workload:

    bin/llama-bench -m ~/.cache/llama.cpp/Beinsezii_Mistral-Small-3.2-24B-Instruct-2506-Q6F-Q8A-GGUF_mistral-small-3.2-24b-instruct-2506-q6f-q8a.gguf -fa 1 -p 8192 -n 512 -pg 16384,1024 -r 1

All MMQ commits plus this one: https://github.com/Beinsezii/llama.cpp/tree/rdna3_perf_mmq
GGML_HIP_ROCWMMA_FATTN=OFF

| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |          pp8192 |        708.72 ± 0.00 |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |           tg512 |         36.72 ± 0.00 |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |  pp16384+tg1024 |        308.02 ± 0.00 |

GGML_HIP_ROCWMMA_FATTN=ON

| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |          pp8192 |        726.87 ± 0.00 |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |           tg512 |         36.74 ± 0.00 |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |  pp16384+tg1024 |        314.46 ± 0.00 |

All recent MMQ commits reverted: https://github.com/Beinsezii/llama.cpp/tree/rdna3_perf
GGML_HIP_ROCWMMA_FATTN=ON

| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |          pp8192 |       1042.45 ± 0.00 |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |           tg512 |         36.68 ± 0.00 |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |  pp16384+tg1024 |        357.65 ± 0.00 |

I still think this should get merged first to stop the failures, but maybe I should open a new issue for perf? I assume it'll eventually be fixed by #17495, since that will replace the rocWMMA path.

@github-actions bot added the labels Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) Dec 6, 2025
@JohannesGaessler
Contributor Author

Any differences you see should be from MMQ vs. rocBLAS. If you compile with GGML_CUDA_FORCE_CUBLAS=ON, you can use rocBLAS unconditionally.

@Beinsezii
Contributor

> Any differences you see should be from MMQ vs. rocBLAS. If you compile with GGML_CUDA_FORCE_CUBLAS=ON you can use rocBLAS unconditionally.

Rebuilt against this PR; confirmed perf is good with cuBLAS. Might be worth making it the default again until the other PR is ready, since people will really notice 1/3 of their throughput gone.
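For reference, a HIP build that forces the cuBLAS/rocBLAS path unconditionally might look like the following. The flags are the ones mentioned in this thread; the gfx1100 target is an example for a 7900-class GPU, so adjust it to your hardware:

```shell
# Configure a HIP build that always uses rocBLAS for matrix multiplication
# instead of MMQ, via the GGML_CUDA_FORCE_CUBLAS flag mentioned above.
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build -DGGML_HIP=ON -DGGML_CUDA_FORCE_CUBLAS=ON \
      -DGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -- -j 16
```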

@JohannesGaessler JohannesGaessler merged commit f334b79 into ggml-org:master Dec 6, 2025
71 of 75 checks passed
@arch-btw
Contributor

arch-btw commented Dec 6, 2025

I don't know if it was within the scope of this PR, but building with GGML_HIP_ROCWMMA_FATTN=ON is still broken.

    HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DGGML_HIP_ROCWMMA_FATTN=ON -DGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 16

build.log

JayZenith pushed a commit to JayZenith/llama.cpp that referenced this pull request Dec 7, 2025
0Marble pushed a commit to 0Marble/llama.cpp that referenced this pull request Dec 18, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Dec 20, 2025
Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026

Labels

ggml (changes relating to the ggml tensor library for machine learning), Nvidia GPU (Issues specific to Nvidia GPUs)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Eval bug: ROCm-compiled program outputs gibberish

4 participants