vulkan: improve partial offloading performance on AMD #19976

Merged
0cc4m merged 7 commits into master from 0cc4m/vulkan-partial-offload-fix on Mar 1, 2026

Conversation

@0cc4m (Contributor) commented Feb 28, 2026

I saw a big difference between Vulkan and ROCm performance with partial offloads. I narrowed it down to the speed of transferring weights from CPU to GPU for offloaded ops. One possible explanation was that the dedicated transfer queue on AMD may be faster than a compute queue, so I implemented using a transfer queue for async transfers as well, synchronizing the transfers with a timeline semaphore. This does improve performance.
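
For reference, here is a minimal sketch of that pattern with the Vulkan C++ bindings: record a buffer copy on a separate transfer queue and signal a timeline semaphore that the compute side later waits on. This is illustrative only, not the actual ggml-vulkan code, and it assumes the device, transfer queue, command buffer, semaphore and buffers have already been created:

```cpp
#include <vulkan/vulkan.hpp>
#include <cstdint>

// Sketch: submit a host-to-device weight copy on a separate transfer queue and
// synchronize with a timeline semaphore instead of serializing on the compute queue.
void async_copy_with_timeline(vk::Device device,
                              vk::Queue transfer_queue,
                              vk::CommandBuffer cmd,
                              vk::Buffer staging, vk::Buffer weights,
                              vk::DeviceSize size,
                              vk::Semaphore timeline,   // created with vk::SemaphoreType::eTimeline
                              uint64_t signal_value) {
    // Record the copy on the transfer queue's command buffer.
    cmd.begin(vk::CommandBufferBeginInfo{vk::CommandBufferUsageFlagBits::eOneTimeSubmit});
    cmd.copyBuffer(staging, weights, vk::BufferCopy{0, 0, size});
    cmd.end();

    // Attach the timeline signal value to the submission.
    vk::TimelineSemaphoreSubmitInfo timeline_info{};
    timeline_info.signalSemaphoreValueCount = 1;
    timeline_info.pSignalSemaphoreValues    = &signal_value;

    vk::SubmitInfo submit{};
    submit.pNext                = &timeline_info;
    submit.commandBufferCount   = 1;
    submit.pCommandBuffers      = &cmd;
    submit.signalSemaphoreCount = 1;
    submit.pSignalSemaphores    = &timeline;

    transfer_queue.submit(submit);

    // Before the weights are used, wait for the timeline value on the host
    // (in practice this would usually be a wait semaphore on the compute submission).
    vk::SemaphoreWaitInfo wait_info{};
    wait_info.semaphoreCount = 1;
    wait_info.pSemaphores    = &timeline;
    wait_info.pValues        = &signal_value;
    (void)device.waitSemaphores(wait_info, UINT64_MAX);
}
```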

Then I checked and found that the dedicated transfer queue on AMD is not exposed by the Linux driver by default, so it's not actually being used. The difference comes from using a second queue (the graphics queue) for transfers, so I assume the issue was the compute queue being congested with other work.
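
A quick way to check what a driver actually exposes is to enumerate the queue families and look for one that supports transfer without graphics or compute. A small diagnostic sketch (not part of this PR), assuming a vk::PhysicalDevice from an already-created instance:

```cpp
#include <vulkan/vulkan.hpp>
#include <cstdio>
#include <vector>

// List the queue families a device exposes and flag any dedicated transfer family,
// i.e. one that supports transfer but neither graphics nor compute.
void print_queue_families(vk::PhysicalDevice phys_dev) {
    std::vector<vk::QueueFamilyProperties> families = phys_dev.getQueueFamilyProperties();
    for (size_t i = 0; i < families.size(); i++) {
        const vk::QueueFamilyProperties & f = families[i];
        const bool graphics = static_cast<bool>(f.queueFlags & vk::QueueFlagBits::eGraphics);
        const bool compute  = static_cast<bool>(f.queueFlags & vk::QueueFlagBits::eCompute);
        const bool transfer = static_cast<bool>(f.queueFlags & vk::QueueFlagBits::eTransfer);
        const bool dedicated_transfer = transfer && !graphics && !compute;
        printf("family %zu: count=%u graphics=%d compute=%d transfer=%d%s\n",
               i, f.queueCount, graphics, compute, transfer,
               dedicated_transfer ? " (dedicated transfer)" : "");
    }
}
```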

This helps on AMD RDNA4, but not on GCN and not on Nvidia. I couldn't test Intel because the Linux driver only exposes a single queue.

I also changed the Vulkan backend's offload_op function to work like the CUDA backend's, to cover more cases, and I moved some redundant context code into a function to reduce code duplication.
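
To illustrate the offload_op idea: the CUDA backend only offloads an op to the GPU when its batch dimension is large enough to amortize uploading the weights. A rough sketch of that kind of heuristic is below; the op list, dimension indices and threshold are assumptions for illustration, not the exact code added here:

```cpp
#include "ggml.h"

// Sketch of a CUDA-style offload_op heuristic: offload an op whose weights live in
// host memory only when its batch dimension is large enough to amortize the upload.
static bool offload_op_sketch(const struct ggml_tensor * op) {
    const int64_t min_batch_size = 32;  // illustrative threshold

    switch (op->op) {
        case GGML_OP_MUL_MAT:
            return op->ne[1] >= min_batch_size;  // batch = columns of the activation
        case GGML_OP_MUL_MAT_ID:
            return op->ne[2] >= min_batch_size;  // MoE matmul keeps the batch in ne[2]
        default:
            return false;                        // don't offload other ops in this sketch
    }
}
```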

[two attached images]

0cc4m requested a review from jeffbolznv on February 28, 2026 at 09:23
github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Feb 28, 2026
@inforithmics

Would this help with --cpu-moe (weight offloading to CPU)?

@0cc4m (Contributor, Author) commented Feb 28, 2026

It should, yes. Any partial offload.

@inforithmics commented Feb 28, 2026

I checked out the pull request, and it was indeed faster with --n-cpu-moe too (on an 890M integrated GPU).
Even the TG performance went up from 13 to 15 t/s with --n-cpu-moe 20 and gpt-oss:20B.

@characharm (Contributor)

common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
S:\LLM\Vulkan\llama.cpp\ggml\src\ggml-vulkan\ggml-vulkan.cpp:6816: GGML_ASSERT(src->device == dst->device) failed

Disabling the Intel GPU avoids the crash.

@0cc4m (Contributor, Author) commented Feb 28, 2026

@characharm Thanks for testing it, there was a check missing. Can you try again?

@characharm (Contributor)

@characharm Thanks for testing it, there was a check missing. Can you try again?

Yeah, looks like it's not crashing anymore.

Comment thread on ggml/src/ggml-vulkan/ggml-vulkan.cpp
@rhjdvsgsgks (Contributor)

Then I checked and found that the dedicated transfer queue on AMD is not exposed by the Linux driver by default, so it's not actually being used.

The transfer queue is supported since Mesa 26, but hidden behind the flag RADV_PERFTEST=transfer_queue.

I couldn't test Intel because the Linux driver only exposes a single queue.

It is also supported on Intel, but only available when using the xe DRM driver (rather than i915) and gen > 12.5.

@0cc4m (Contributor, Author) commented Mar 1, 2026

Then I checked and found that the dedicated transfer queue on AMD is not exposed by the Linux driver by default, so it's not actually being used.

The transfer queue is supported since Mesa 26, but hidden behind the flag RADV_PERFTEST=transfer_queue.

I know, I've tested it. It's slightly faster than using the graphics queue. The majority of the improvement comes from using any second queue.

I couldn't test Intel because the Linux driver only exposes a single queue.

It is also supported on Intel, but only available when using the xe DRM driver (rather than i915) and gen > 12.5.

That's good to know, thank you! I'll give it a try with Xe.

@0cc4m (Contributor, Author) commented Mar 1, 2026

I get mixed results with Xe: slightly improved pp, slightly worse tg. I'll leave it off there for now.

0cc4m merged commit 3191462 into master on Mar 1, 2026
78 checks passed
0cc4m deleted the 0cc4m/vulkan-partial-offload-fix branch on March 1, 2026 at 16:32
@acbits commented Mar 1, 2026

Then I checked and found that the dedicated transfer queue on AMD is not exposed by the Linux driver by default, so it's not actually being used.

The transfer queue is supported since Mesa 26, but hidden behind the flag RADV_PERFTEST=transfer_queue.

I couldn't test Intel because the Linux driver only exposes a single queue.

It is also supported on Intel, but only available when using the xe DRM driver (rather than i915) and gen > 12.5.

So do we have to upgrade to Mesa 26, or is the RADV_PERFTEST flag also available on older versions?

@0cc4m (Contributor, Author) commented Mar 1, 2026

It's available on older versions as well. I'm not sure why it was reported again for 26.0.

bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 2, 2026
* vulkan: fix and enable cpy_tensor_async function

* use transfer_queue for async transfers on AMD, synchronize with timeline semaphore

* update offload_op logic

* fix missing transfer submission

* disable async transfer queue on AMD GCN

* revert op batch size change

* fix cpy_tensor_async checks
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Mar 3, 2026
@HumerousGorgon

This merged PR seems to be causing issues in the Vulkan backend: gibberish output on Intel GPUs with -fa on.

@tvall43 commented Mar 8, 2026

This merged PR seems to be causing issues in the Vulkan backend: gibberish output on Intel GPUs with -fa on.

I'm also just getting gibberish on my setup with this: qwen3.5-35b on two Radeon Pro V340Ls. Reverting this fixed it; I haven't done any other debugging.

@0cc4m (Contributor, Author) commented Mar 8, 2026

Please see if #20233 helped.

@tvall43 commented Mar 9, 2026

Please see if #20233 helped.

Seems like that fixed it!
Edit: never mind, I also needed #20097 (comment), but I think it's working now.

Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request Mar 20, 2026
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
Labels

ggml (changes relating to the ggml tensor library for machine learning), Vulkan (issues specific to the Vulkan backend)

8 participants