vulkan: improve partial offloading performance on AMD #19976

Merged
0cc4m merged 7 commits into master from 0cc4m/vulkan-partial-offload-fix on Mar 1, 2026

Conversation

@0cc4m (Contributor) commented Feb 28, 2026

I saw a big difference between Vulkan and ROCm performance with partial offloads. I narrowed it down to the speed of transferring weights from CPU to GPU for offloaded ops. One possible explanation was that the dedicated transfer queue on AMD may be faster than a compute queue, so I implemented using a transfer queue for async transfers as well, synchronizing the transfers with a timeline semaphore. This does improve performance.
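
For reference, here is a minimal sketch of that pattern with the Vulkan C++ bindings: record a buffer copy on a separate transfer queue and signal a timeline semaphore that the compute side later waits on. This is illustrative only, not the actual ggml-vulkan code, and it assumes the device, transfer queue, command buffer, semaphore and buffers have already been created:

```cpp
#include <vulkan/vulkan.hpp>
#include <cstdint>

// Sketch: submit a host-to-device weight copy on a separate transfer queue and
// synchronize with a timeline semaphore instead of serializing on the compute queue.
void async_copy_with_timeline(vk::Device device,
                              vk::Queue transfer_queue,
                              vk::CommandBuffer cmd,
                              vk::Buffer staging, vk::Buffer weights,
                              vk::DeviceSize size,
                              vk::Semaphore timeline,   // created with vk::SemaphoreType::eTimeline
                              uint64_t signal_value) {
    // Record the copy on the transfer queue's command buffer.
    cmd.begin(vk::CommandBufferBeginInfo{vk::CommandBufferUsageFlagBits::eOneTimeSubmit});
    cmd.copyBuffer(staging, weights, vk::BufferCopy{0, 0, size});
    cmd.end();

    // Attach the timeline signal value to the submission.
    vk::TimelineSemaphoreSubmitInfo timeline_info{};
    timeline_info.signalSemaphoreValueCount = 1;
    timeline_info.pSignalSemaphoreValues    = &signal_value;

    vk::SubmitInfo submit{};
    submit.pNext                = &timeline_info;
    submit.commandBufferCount   = 1;
    submit.pCommandBuffers      = &cmd;
    submit.signalSemaphoreCount = 1;
    submit.pSignalSemaphores    = &timeline;

    transfer_queue.submit(submit);

    // Before the weights are used, wait for the timeline value on the host
    // (in practice this would usually be a wait semaphore on the compute submission).
    vk::SemaphoreWaitInfo wait_info{};
    wait_info.semaphoreCount = 1;
    wait_info.pSemaphores    = &timeline;
    wait_info.pValues        = &signal_value;
    (void)device.waitSemaphores(wait_info, UINT64_MAX);
}
```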

Then I checked and found that the dedicated transfer queue on AMD is not exposed by the Linux driver by default, so it's not actually being used. The difference comes from using a second queue (the graphics queue) for transfers, so I assume the issue was the compute queue being congested with other work.
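
A quick way to check what a driver actually exposes is to enumerate the queue families and look for one that supports transfer without graphics or compute. A small diagnostic sketch (not part of this PR), assuming a vk::PhysicalDevice from an already-created instance:

```cpp
#include <vulkan/vulkan.hpp>
#include <cstdio>
#include <vector>

// List the queue families a device exposes and flag any dedicated transfer family,
// i.e. one that supports transfer but neither graphics nor compute.
void print_queue_families(vk::PhysicalDevice phys_dev) {
    std::vector<vk::QueueFamilyProperties> families = phys_dev.getQueueFamilyProperties();
    for (size_t i = 0; i < families.size(); i++) {
        const vk::QueueFamilyProperties & f = families[i];
        const bool graphics = static_cast<bool>(f.queueFlags & vk::QueueFlagBits::eGraphics);
        const bool compute  = static_cast<bool>(f.queueFlags & vk::QueueFlagBits::eCompute);
        const bool transfer = static_cast<bool>(f.queueFlags & vk::QueueFlagBits::eTransfer);
        const bool dedicated_transfer = transfer && !graphics && !compute;
        printf("family %zu: count=%u graphics=%d compute=%d transfer=%d%s\n",
               i, f.queueCount, graphics, compute, transfer,
               dedicated_transfer ? " (dedicated transfer)" : "");
    }
}
```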

This helps on AMD RDNA4, but not on GCN and not on Nvidia. I couldn't test Intel because the Linux driver only exposes a single queue.

I also changed the Vulkan backend's offload_op function to work like the CUDA backend's, to cover more cases, and I moved some redundant context code into a function to reduce code duplication.
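
To illustrate the offload_op idea: the CUDA backend only offloads an op to the GPU when its batch dimension is large enough to amortize uploading the weights. A rough sketch of that kind of heuristic is below; the op list, dimension indices and threshold are assumptions for illustration, not the exact code added here:

```cpp
#include "ggml.h"

// Sketch of a CUDA-style offload_op heuristic: offload an op whose weights live in
// host memory only when its batch dimension is large enough to amortize the upload.
static bool offload_op_sketch(const struct ggml_tensor * op) {
    const int64_t min_batch_size = 32;  // illustrative threshold

    switch (op->op) {
        case GGML_OP_MUL_MAT:
            return op->ne[1] >= min_batch_size;  // batch = columns of the activation
        case GGML_OP_MUL_MAT_ID:
            return op->ne[2] >= min_batch_size;  // MoE matmul keeps the batch in ne[2]
        default:
            return false;                        // don't offload other ops in this sketch
    }
}
```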

[two attached images]

0cc4m requested a review from jeffbolznv on February 28, 2026 at 09:23
github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Feb 28, 2026
@inforithmics

Would this help with --cpu-moe (weight offloading to CPU)?

@0cc4m (Contributor, Author) commented Feb 28, 2026

It should, yes. Any partial offload.

@inforithmics commented Feb 28, 2026

I checked out the pull request, and it was indeed faster with --n-cpu-moe too (on an 890M integrated GPU).
Even the TG performance went up from 13 to 15 t/s with --n-cpu-moe 20 and gpt-oss:20B.

@characharm (Contributor)

common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
S:\LLM\Vulkan\llama.cpp\ggml\src\ggml-vulkan\ggml-vulkan.cpp:6816: GGML_ASSERT(src->device == dst->device) failed

Disabling the Intel GPU avoids the crash.

@0cc4m (Contributor, Author) commented Feb 28, 2026

@characharm Thanks for testing it, there was a check missing. Can you try again?

@characharm (Contributor)

@characharm Thanks for testing it, there was a check missing. Can you try again?

Yeah, looks like it's not crashing anymore.

Comment thread on ggml/src/ggml-vulkan/ggml-vulkan.cpp
@rhjdvsgsgks (Contributor)

Then I checked and found that the dedicated transfer queue on AMD is not exposed by the Linux driver by default, so it's not actually being used.

The transfer queue is supported since Mesa 26, but hidden behind the flag RADV_PERFTEST=transfer_queue.

I couldn't test Intel because the Linux driver only exposes a single queue.

It is also supported on Intel, but only available when using the xe DRM driver (rather than i915) and gen > 12.5.

@0cc4m (Contributor, Author) commented Mar 1, 2026

Then I checked and found that the dedicated transfer queue on AMD is not exposed by the Linux driver by default, so it's not actually being used.

The transfer queue is supported since Mesa 26, but hidden behind the flag RADV_PERFTEST=transfer_queue.

I know, I've tested it. It's slightly faster than using the graphics queue. The majority of the improvement comes from using any second queue.

I couldn't test Intel because the Linux driver only exposes a single queue.

It is also supported on Intel, but only available when using the xe DRM driver (rather than i915) and gen > 12.5.

That's good to know, thank you! I'll give it a try with Xe.

@0cc4m (Contributor, Author) commented Mar 1, 2026

I get mixed results with Xe: slightly improved pp, slightly worse tg. I'll leave it off there for now.

0cc4m merged commit 3191462 into master on Mar 1, 2026
78 checks passed
0cc4m deleted the 0cc4m/vulkan-partial-offload-fix branch on March 1, 2026 at 16:32
@acbits commented Mar 1, 2026

Then I checked and found that the dedicated transfer queue on AMD is not exposed by the Linux driver by default, so it's not actually being used.

The transfer queue is supported since Mesa 26, but hidden behind the flag RADV_PERFTEST=transfer_queue.

I couldn't test Intel because the Linux driver only exposes a single queue.

It is also supported on Intel, but only available when using the xe DRM driver (rather than i915) and gen > 12.5.

So do we have to upgrade to Mesa 26, or is the RADV_PERFTEST flag also available on older versions?

@0cc4m (Contributor, Author) commented Mar 1, 2026

It's available on older versions as well. I'm not sure why it was reported again for 26.0.

bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 2, 2026
* vulkan: fix and enable cpy_tensor_async function

* use transfer_queue for async transfers on AMD, synchronize with timeline semaphore

* update offload_op logic

* fix missing transfer submission

* disable async transfer queue on AMD GCN

* revert op batch size change

* fix cpy_tensor_async checks
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Mar 3, 2026
@HumerousGorgon

This merged PR seems to be causing issues in the Vulkan backend: gibberish output on Intel GPUs with -fa on.

@tvall43 commented Mar 8, 2026

This merged PR seems to be causing issues in the Vulkan backend: gibberish output on Intel GPUs with -fa on.

I'm also just getting gibberish on my setup with this: qwen3.5-35b on two Radeon Pro V340Ls. Reverting this fixed it; I haven't done any other debugging.

@0cc4m (Contributor, Author) commented Mar 8, 2026

Please see if #20233 helped.

@tvall43 commented Mar 9, 2026

Please see if #20233 helped.

Seems like that fixed it!
Edit: never mind, I also needed #20097 (comment), but I think it's working now.

Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request Mar 20, 2026
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
Labels

ggml (changes relating to the ggml tensor library for machine learning), Vulkan (issues specific to the Vulkan backend)

8 participants