
vulkan: use graphics queue on AMD#20551

Merged
0cc4m merged 2 commits into master from 0cc4m/vulkan-amd-queue
Mar 15, 2026

Conversation

@0cc4m
Contributor

@0cc4m 0cc4m commented Mar 14, 2026

I'm not sure why, but the graphics queue is slightly faster for tg on AMD than the compute queue, and this also fixes the partial offload issue I addressed in #19976, so the second queue no longer has to be enabled by default. I got the idea from @zedbytes reporting that tg goes up when running with RADV_DEBUG=nocompute.

AMD RX 9070 XT
| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 20 | 1 | pp512 | 2288.04 ± 2.42 | 2225.76 ± 2.31 | -2.7% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 20 | 1 | tg128 | 24.33 ± 0.04 | 24.58 ± 0.05 | +1.0% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 | 4886.26 ± 105.08 | 4901.77 ± 102.66 | +0.3% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 | 115.78 ± 0.02 | 121.39 ± 0.02 | +4.8% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 20 | 1 | pp512 | 736.21 ± 9.37 | 735.19 ± 7.51 | -0.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 20 | 1 | tg128 | 39.53 ± 0.10 | 40.36 ± 0.21 | +2.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 | 3383.58 ± 29.26 | 3425.38 ± 28.68 | +1.2% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 200.45 ± 1.89 | 220.41 ± 1.46 | +10.0% |
AMD Radeon Pro VII
| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 20 | 1 | pp512 | 636.62 ± 9.07 | 615.62 ± 0.79 | -3.3% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 20 | 1 | tg128 | 38.35 ± 0.09 | 38.20 ± 0.01 | -0.4% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 | 830.30 ± 1.51 | 834.44 ± 1.05 | +0.5% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 | 102.45 ± 0.64 | 100.28 ± 0.24 | -2.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 20 | 1 | pp512 | 289.76 ± 3.59 | 287.75 ± 3.10 | -0.7% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 20 | 1 | tg128 | 34.57 ± 0.32 | 34.05 ± 1.20 | -1.5% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 | 749.65 ± 5.42 | 762.52 ± 5.89 | +1.7% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 94.70 ± 0.46 | 97.55 ± 0.20 | +3.0% |
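For readers unfamiliar with Vulkan queue selection, the change described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual llama.cpp code: the function name and structure are invented, and the flag constants merely mirror the values of Vulkan's `VK_QUEUE_GRAPHICS_BIT` (0x1) and `VK_QUEUE_COMPUTE_BIT` (0x2) so the sketch stays self-contained.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Values mirror VK_QUEUE_GRAPHICS_BIT and VK_QUEUE_COMPUTE_BIT.
constexpr uint32_t QUEUE_GRAPHICS_BIT = 0x1;
constexpr uint32_t QUEUE_COMPUTE_BIT  = 0x2;

// Hypothetical helper: pick a queue family for compute work.
// `flags[i]` holds the capability bits of family i, as a Vulkan backend
// would obtain from vkGetPhysicalDeviceQueueFamilyProperties.
// With prefer_graphics = true (this PR's behavior on AMD), a family that
// supports both graphics and compute wins over a compute-only (async
// compute) family; with prefer_graphics = false it is the other way round.
// Returns UINT32_MAX if no family supports compute at all.
uint32_t pick_compute_family(const std::vector<uint32_t> & flags, bool prefer_graphics) {
    uint32_t fallback = UINT32_MAX;
    for (uint32_t i = 0; i < flags.size(); i++) {
        if (!(flags[i] & QUEUE_COMPUTE_BIT)) {
            continue; // family cannot run compute shaders
        }
        const bool has_graphics = (flags[i] & QUEUE_GRAPHICS_BIT) != 0;
        if (has_graphics == prefer_graphics) {
            return i; // the preferred kind of family
        }
        if (fallback == UINT32_MAX) {
            fallback = i; // remember the other kind in case nothing better exists
        }
    }
    return fallback;
}
```

On a typical AMD layout (family 0 = graphics+compute, family 1 = compute-only), `prefer_graphics = true` selects family 0, which is roughly what `RADV_DEBUG=nocompute` forced before this PR by hiding the compute-only family entirely.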

@0cc4m 0cc4m requested a review from jeffbolznv March 14, 2026 16:16
@github-actions github-actions Bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Mar 14, 2026
@zedbytes

Nice one! I am guessing that after this change RADV_DEBUG=nocompute is set by default for AMD GPUs?

@0cc4m
Contributor Author

0cc4m commented Mar 15, 2026

It won't be needed after this change. nocompute disables the compute queue, so the backend used the graphics queue instead; that is what made it faster (but don't ask me why). This PR changes the backend to use the graphics queue even when a compute-only queue is available.

@0cc4m 0cc4m merged commit 1a3d8ed into master Mar 15, 2026
81 of 82 checks passed
@0cc4m 0cc4m deleted the 0cc4m/vulkan-amd-queue branch March 15, 2026 07:18
@winstonma

winstonma commented Mar 15, 2026

I just installed this version and I am using llama-server with the Immersive Translate addon to translate web articles for me. I set the addon to send an API request to llama-server every 10 seconds to do some translation.

With this change, my KDE desktop now lags when the LLM is processing. I also tried playing a YouTube video in Firefox and it dropped about 30% of the frames while playing 1080p video. Does this happen on your side?

Sorry, I forgot to provide more info: my laptop is running an AMD Ryzen AI 360 with an 880M iGPU.

@Neutralized

I downloaded the latest release and my tg t/s dropped about 40% compared to release b8333! I use two Radeon cards, a 9070 XT and a 6700 XT. I would suggest reverting this change until it has been tested further.

@0cc4m
Copy link
Copy Markdown
Contributor Author

0cc4m commented Mar 15, 2026

> I downloaded the latest release and my tg t/s dropped about 40% compared to release b8333! I use two Radeon cards, a 9070 XT and a 6700 XT. I would suggest reverting this change until it has been tested further.

Linux or Windows?

@Neutralized

Windows

@Neutralized

#20597

Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request Mar 20, 2026
* vulkan: use graphics queue on AMD for slightly better performance

* disable async transfer queue on AMD
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
* vulkan: use graphics queue on AMD for slightly better performance

* disable async transfer queue on AMD
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
* vulkan: use graphics queue on AMD for slightly better performance

* disable async transfer queue on AMD

Labels

ggml (changes relating to the ggml tensor library for machine learning), Vulkan (Issues specific to the Vulkan backend)

5 participants