
HIP: fix RDNA3 FP16/BF16 matrix multiplication #17817

Merged
JohannesGaessler merged 1 commit into ggml-org:master from JohannesGaessler:hip-fix-rdna3-mmf
Dec 6, 2025

Conversation

@JohannesGaessler
Contributor

Fixes #17797 by simply adding an explicit RDNA4 requirement to MMF. @jiachengjason, as outlined in https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#pull-requests-for-contributors--collaborators, please test changes to the CUDA/HIP backend for correctness using test-backend-ops.

Contributor

@Beinsezii left a comment


Full test-backend-ops green now on gfx1100

@Beinsezii
Contributor

Though, interestingly, according to llama-bench I'm down 40% on pp8192 now compared to before #17576. Is it not using the rocWMMA path anymore?

@Beinsezii
Contributor

Looking at the build command in #17576 (comment), they have GGML_HIP_ROCWMMA_FATTN=OFF, so I assume the ON configuration was never tested, and it's no longer functional.

@Beinsezii
Contributor

Beinsezii commented Dec 6, 2025

yeah.

Using a realistic workload:

    bin/llama-bench -m ~/.cache/llama.cpp/Beinsezii_Mistral-Small-3.2-24B-Instruct-2506-Q6F-Q8A-GGUF_mistral-small-3.2-24b-instruct-2506-q6f-q8a.gguf -fa 1 -p 8192 -n 512 -pg 16384,1024 -r 1

All MMQ commits plus this one: https://github.com/Beinsezii/llama.cpp/tree/rdna3_perf_mmq
GGML_HIP_ROCWMMA_FATTN=OFF

| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |          pp8192 |        708.72 ± 0.00 |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |           tg512 |         36.72 ± 0.00 |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |  pp16384+tg1024 |        308.02 ± 0.00 |

GGML_HIP_ROCWMMA_FATTN=ON

| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |          pp8192 |        726.87 ± 0.00 |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |           tg512 |         36.74 ± 0.00 |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |  pp16384+tg1024 |        314.46 ± 0.00 |

All recent MMQ commits reverted: https://github.com/Beinsezii/llama.cpp/tree/rdna3_perf
GGML_HIP_ROCWMMA_FATTN=ON

| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |          pp8192 |       1042.45 ± 0.00 |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |           tg512 |         36.68 ± 0.00 |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |  pp16384+tg1024 |        357.65 ± 0.00 |

I still think this should get merged first to stop the failures, but maybe I should open a new issue for perf? I assume it'll eventually be fixed by #17495, since that will replace the rocWMMA path.

@github-actions bot added the labels Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) Dec 6, 2025
@JohannesGaessler
Contributor Author

Any differences you see should be from MMQ vs. rocBLAS. If you compile with GGML_CUDA_FORCE_CUBLAS=ON, you can use rocBLAS unconditionally.

@Beinsezii
Contributor

> Any differences you see should be from MMQ vs. rocBLAS. If you compile with GGML_CUDA_FORCE_CUBLAS=ON you can use rocBLAS unconditionally.

Rebuilt against this PR; confirmed perf is good with cuBLAS. Might be worth making it the default again until the other PR is ready, since people will really notice 1/3 of their throughput gone.
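For reference, a HIP build that forces the cuBLAS/rocBLAS path unconditionally might look like the following. The flags are the ones mentioned in this thread; the gfx1100 target is an example for a 7900-class GPU, so adjust it to your hardware:

```shell
# Configure a HIP build that always uses rocBLAS for matrix multiplication
# instead of MMQ, via the GGML_CUDA_FORCE_CUBLAS flag mentioned above.
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build -DGGML_HIP=ON -DGGML_CUDA_FORCE_CUBLAS=ON \
      -DGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -- -j 16
```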

@JohannesGaessler JohannesGaessler merged commit f334b79 into ggml-org:master Dec 6, 2025
71 of 75 checks passed
@arch-btw
Contributor

arch-btw commented Dec 6, 2025

I don't know if it was within the scope of this PR, but building with GGML_HIP_ROCWMMA_FATTN=ON is still broken.

    HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DGGML_HIP_ROCWMMA_FATTN=ON -DGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 16

build.log

JayZenith pushed a commit to JayZenith/llama.cpp that referenced this pull request Dec 7, 2025
0Marble pushed a commit to 0Marble/llama.cpp that referenced this pull request Dec 18, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Dec 20, 2025
Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026

Labels

ggml (changes relating to the ggml tensor library for machine learning), Nvidia GPU (Issues specific to Nvidia GPUs)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Eval bug: ROCm-compiled program outputs gibberish

4 participants