Vulkan: Optimize Matmul parameters for AMD GPUs with Coopmat support by 0cc4m · Pull Request #18749 · ggml-org/llama.cpp

0cc4m · 2026-01-11T08:27:02Z

I tuned this on AMD Radeon 8060S, but a brief test also showed improvements on AMD RX 9060 XT.

model	size	params	ngl	test	t/s (ROCm)	t/s (before)	t/s (after)	diff
llama 8B Q4_0	4.33 GiB	8.03 B	99	pp512	815.22 ± 16.88	644.84 ± 4.49	875.24 ± 17.99	+35.7%
llama 8B Q4_0	4.33 GiB	8.03 B	99	pp512 @ d8192	321.66 ± 3.02	384.27 ± 0.29	463.72 ± 1.32	+20.7%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	99	pp512	995.81 ± 32.32	529.50 ± 13.07	793.51 ± 9.67	+49.9%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	99	pp512 @ d8192	352.08 ± 3.17	341.86 ± 1.96	435.55 ± 2.63	+27.4%
llama 8B Q8_0	7.95 GiB	8.03 B	99	pp512	755.86 ± 21.11	422.37 ± 21.61	742.46 ± 15.49	+75.8%
llama 8B Q8_0	7.95 GiB	8.03 B	99	pp512 @ d8192	317.44 ± 2.31	306.97 ± 0.36	419.24 ± 4.59	+36.6%

model	size	params	ngl	test	t/s (ROCm)	t/s (before)	t/s (after)	diff
llama 8B Q4_K - Small	4.36 GiB	8.03 B	99	pp512	648.41 ± 14.37	1437.13 ± 1.77	1902.23 ± 1.65	+32.4%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	99	pp512 @ d8192	410.01 ± 6.22	757.59 ± 2.71	841.76 ± 3.02	+11.1%
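For readers checking the numbers: the diff column in the tables above appears to be the relative change of the Vulkan "after" mean over the Vulkan "before" mean. A minimal sketch of that arithmetic follows; the function name `percent_change` is an illustrative assumption, not something taken from llama-bench.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    if before <= 0:
        raise ValueError("baseline throughput must be positive")
    return (after - before) / before * 100.0


# Mean t/s values copied from the Q4_0 pp512 row of the first table
# (Vulkan before vs. Vulkan after). The ± spread is ignored here.
before_tps, after_tps = 644.84, 875.24
print(f"{percent_change(before_tps, after_tps):+.1f}%")
```

For that row this reproduces the reported +35.7%. The same formula gives negative values for regressions.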

daniandtheweb · 2026-01-11T15:22:34Z

There are some small but consistent improvements on RDNA3 as well with this.

model	size	params	backend	ngl	fa	test	t/s (ROCm)	t/s (before)	t/s (after)	diff
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	pp512	2213.37 ± 25.38	1964.06 ± 17.79	1992.17 ± 24.98	+1.43%
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	pp512	2480.36 ± 2.56	2057.06 ± 6.28	2120.18 ± 2.38	+3.06%

0cc4m · 2026-01-11T15:24:38Z

Thank you for testing it! I was hoping for more, but I guess it is too different from the 8060S. There's probably some more tuning for RDNA3/4 dGPUs that can be done, but I don't have the hardware for that.

netrunnereve · 2026-01-11T18:04:31Z

You should try this for mul_mat_id as well!

characharm · 2026-01-11T19:14:21Z

9070 XT:

model	test	master t/s (±)	PR t/s (±)	diff t/s	diff %
gpt-oss 20B MXFP4 MoE	pp512	4646.61 ± 74.94	3464.81 ± 61.21	-1181.80	-25.4%
gpt-oss 20B MXFP4 MoE	tg128	175.24 ± 0.50	178.45 ± 1.65	+3.21	+1.8%
gpt-oss 20B MXFP4 MoE	pp512 @ d8192	2077.60 ± 19.08	1688.42 ± 15.70	-389.18	-18.7%
gpt-oss 20B MXFP4 MoE	tg128 @ d8192	149.41 ± 0.73	149.97 ± 0.50	+0.56	+0.37%

0cc4m · 2026-01-11T19:34:03Z

> You should try this for mul_mat_id as well!

I tried, but didn't find a good parameter set yet. I'll keep trying.

@characharm Is that Windows or Linux?

characharm · 2026-01-11T19:37:10Z

@0cc4m Windows. I rebooted between tests for accuracy. The numbers are stable.

0cc4m · 2026-01-11T19:43:42Z

Thank you for testing. I wish the drivers behaved more similarly. I'll disable the change on Windows.

0cc4m · 2026-01-11T19:48:24Z

Can you also test a dense model, though? Those should be more affected than MoE.

characharm · 2026-01-11T20:02:39Z

model	test	master t/s (±)	PR t/s (±)	diff t/s	diff %
llama 8B Q4_0	pp512 @ d8192	1659.33 ± 20.44	588.10 ± 1.59	-1071.23	-64.56%
llama 8B Q4_0	tg128 @ d8192	87.20 ± 0.48	87.20 ± 0.92	0.00	0.0%

0cc4m · 2026-01-11T20:37:00Z

@characharm Please check if #18763 restores your performance.

characharm · 2026-01-11T20:59:51Z

No, the performance is the same. I compared it with the CI build, so the problem is not in my build.
To clarify: the driver exclusion fix didn't work. I thought it might be a local build issue, but I verified with b7707 and got the same results.

jeffbolznv · 2026-01-11T21:06:17Z

I think these tile sizes are also used for matmul id in some cases, so that could explain the effect on gpt-oss.

0cc4m · 2026-01-11T21:16:10Z

@characharm Sorry, I missed disabling the large tile size. Try again, please.

> I think these tile sizes are also used for matmul id in some cases, so that could explain the effect on gpt-oss.

No, I didn't enable the large tile for mul_mat_id, so unless the check somewhere is wrong, it should not be used at all.

characharm · 2026-01-11T21:40:08Z

@0cc4m Yes, now dense and MoE models show the same numbers as before #18749.
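The gating that emerged from this exchange can be sketched roughly as below. This is only an illustration of the conditions described in the thread (AMD, coopmat support, not the proprietary driver, not mul_mat_id), not the actual ggml-vulkan code. The `Device` class and `use_large_tile` function are hypothetical. The driver strings mirror Vulkan `VkDriverId` enum names.

```python
from dataclasses import dataclass

# Driver identifiers named after Vulkan VkDriverId enum values.
AMD_PROPRIETARY = "VK_DRIVER_ID_AMD_PROPRIETARY"
MESA_RADV = "VK_DRIVER_ID_MESA_RADV"


@dataclass(frozen=True)
class Device:
    """Hypothetical stand-in for a device description, not a ggml type."""
    vendor: str
    coopmat_support: bool
    driver_id: str


def use_large_tile(dev: Device, op: str) -> bool:
    """Decide whether the tuned large matmul tile should be considered."""
    if op != "mul_mat":
        # Per the thread, the large tile was not enabled for mul_mat_id.
        return False
    if dev.vendor != "AMD" or not dev.coopmat_support:
        # The tuning was limited to AMD GPUs with coopmat support.
        return False
    # Follow-up #18763: skip the proprietary AMD driver.
    return dev.driver_id != AMD_PROPRIETARY


radv = Device("AMD", True, MESA_RADV)
proprietary = Device("AMD", True, AMD_PROPRIETARY)
print(use_large_tile(radv, "mul_mat"), use_large_tile(proprietary, "mul_mat"))
```

Under these assumptions the RADV device keeps the tuned configuration while the proprietary-driver device falls back to the previous parameters, which matches the restored numbers reported above.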

acbits · 2026-01-12T16:48:21Z

Not sure whether this helps, but Vulkan performance on the RX 7600 has been going down. I don't normally use such a big model on this GPU, but it is interesting that it has degraded.

model	size	params	backend	ngl	threads	type_k	type_v	fa	test	t/s
qwen3 14B Q4_K - Medium	8.53 GiB	14.77 B	Vulkan	99	8	q8_0	q8_0	1	pp512	12.76 ± 0.00
qwen3 14B Q4_K - Medium	8.53 GiB	14.77 B	Vulkan	99	8	q8_0	q8_0	1	tg128	1.35 ± 0.00

build: 7d77f07 (7108)

model	size	params	backend	ngl	type_k	type_v	fa	test	t/s
qwen3 14B Q4_K - Medium	8.53 GiB	14.77 B	Vulkan	99	q8_0	q8_0	1	pp512	4.67 ± 0.00
qwen3 14B Q4_K - Medium	8.53 GiB	14.77 B	Vulkan	99	q8_0	q8_0	1	tg128	1.35 ± 0.00

build: c9ced49 (7710)

0cc4m · 2026-01-12T18:44:38Z

Can you test a model that actually fits into your GPU? That would likely give more usable data.

acbits · 2026-01-12T22:09:16Z

> Can you test a model that actually fits into your GPU? That would likely give more usable data.

Luckily, I had copied the results from an old build. Yeah, even for smaller models, there has been degradation. Not sure whether the kernel upgrade played a role.

Kernel: 6.6 (don't remember the exact patch)

model	size	params	backend	ngl	threads	type_k	type_v	fa	test	t/s
qwen3 8B Q5_K - Medium	5.44 GiB	8.19 B	Vulkan	99	8	q8_0	q8_0	1	pp512	498.47 ± 0.00
qwen3 8B Q5_K - Medium	5.44 GiB	8.19 B	Vulkan	99	8	q8_0	q8_0	1	tg128	34.83 ± 0.00

build: dd5e8ca (6916)

Kernel: 6.17.9-200

model	size	params	backend	ngl	type_k	type_v	fa	test	t/s
qwen3 8B Q5_K - Medium	5.44 GiB	8.19 B	Vulkan	99	q8_0	q8_0	1	pp512	240.94 ± 0.00
qwen3 8B Q5_K - Medium	5.44 GiB	8.19 B	Vulkan	99	q8_0	q8_0	1	tg128	12.13 ± 0.00

build: c9ced49 (7710)
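To put the two runs side by side, the `t/s` cells can be reduced to their means and compared as a ratio. A small sketch, assuming the `mean ± stddev` cell format shown in the tables; the helper name `mean_tps` is made up for illustration.

```python
def mean_tps(cell: str) -> float:
    """Extract the mean from a 't/s' cell such as '498.47 ± 0.00'."""
    return float(cell.split("±")[0].strip())


# Values copied from the two qwen3 8B Q5_K - Medium tables above.
old = {"pp512": "498.47 ± 0.00", "tg128": "34.83 ± 0.00"}  # build dd5e8ca (6916)
new = {"pp512": "240.94 ± 0.00", "tg128": "12.13 ± 0.00"}  # build c9ced49 (7710)

for test in old:
    ratio = mean_tps(new[test]) / mean_tps(old[test])
    print(f"{test}: {ratio:.2f}x of the old throughput")
```

On these numbers, prompt processing drops to roughly half and token generation to roughly a third of the old throughput. Note that the two runs differ in both llama.cpp build and kernel version, so the ratio alone does not say which change is responsible.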

0cc4m · 2026-01-13T05:13:50Z

Can you add more information about your setup? What OS, what driver, what does your device info string say, etc?

acbits · 2026-01-13T17:32:20Z

> Can you add more information about your setup? What OS, what driver, what does your device info string say, etc?

OS: Fedora 42
Kernel: 6.17.9-200
MESA: 25.1.9
RX7600-vulkaninfo.json

0cc4m · 2026-01-13T18:03:40Z

My guess would be that your driver is too old. For good Mesa coopmat performance you usually want 25.3 or higher. But I didn't want to cause an issue for older versions.

acbits · 2026-01-13T23:04:19Z

> My guess would be that your driver is too old. For good Mesa coopmat performance you usually want 25.3 or higher. But I didn't want to cause an issue for older versions.

25.1.9 is the latest. No updates are available for Fedora 42.

Nindaleth · 2026-01-14T08:38:21Z

25.1.9, despite being the newest for Fedora 42, is not good enough. If upgrading to Fedora 43 is not an option for you at the moment, you could try a newer Mesa build from the che/mesa COPR repo.

For example, a merge request providing a significant PP speed improvement was merged into the Mesa repo in August and has been available since release 25.2.x or 25.3.x (not sure which).
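The version advice in this thread (Mesa 25.3 or higher for good coopmat performance) boils down to a simple comparison. A minimal sketch, assuming the version is available as a dotted string; `parse_version`, `mesa_is_recent_enough`, and the `RECOMMENDED` constant are illustrative names, and the threshold is only the rule of thumb quoted above, not a hard requirement.

```python
import re

# Rule-of-thumb minimum quoted in the thread, as (major, minor).
RECOMMENDED = (25, 3)


def parse_version(text: str) -> tuple[int, ...]:
    """Pull the first dotted version number out of a string."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups() if part is not None)


def mesa_is_recent_enough(version: str, minimum: tuple[int, ...] = RECOMMENDED) -> bool:
    """Compare versions component by component using tuple ordering."""
    return parse_version(version) >= minimum


print(mesa_is_recent_enough("25.1.9"))  # the Fedora 42 version from this thread
print(mesa_is_recent_enough("25.3.0"))
```

Comparing integer tuples avoids the pitfall of comparing version strings lexically.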

0cc4m added 3 commits January 10, 2026 23:08

vulkan: Enable and optimize large matmul parameter combination for AMD

bf2ee19

limit tuning to AMD GPUs with coopmat support

1fec364

use tx_m values instead of _l

e1a33aa

0cc4m requested a review from jeffbolznv January 11, 2026 08:27

github-actions bot added the labels Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) Jan 11, 2026

loci-dev mentioned this pull request Jan 11, 2026

UPSTREAM PR #18749: Vulkan: Optimize Matmul parameters for AMD GPUs with Coopmat support auroralabs-loci/llama.cpp#884

Open

jeffbolznv approved these changes Jan 11, 2026

0cc4m merged commit 0e76501 into master Jan 11, 2026
75 of 76 checks passed

0cc4m deleted the 0cc4m/vulkan-amd-coopmat-opt branch January 11, 2026 16:33

0cc4m mentioned this pull request Jan 11, 2026

vulkan: Disable large coopmat matmul configuration on proprietary AMD driver #18763

Merged

loci-dev mentioned this pull request Jan 11, 2026

UPSTREAM PR #18763: vulkan: Disable large coopmat matmul configuration on proprietary AMD driver auroralabs-loci/llama.cpp#890

Open
