Optimize Vulkan buffer transfers on UMA (Unified Memory Architecture) devices #22462

Open
winstonma wants to merge 2 commits into ggml-org:master from winstonma:winston/vk-uma-read-threshold

Conversation

@winstonma

@winstonma winstonma commented Apr 28, 2026

Overview

This PR optimizes Vulkan buffer transfers on UMA (Unified Memory Architecture) devices by bypassing GPU staging buffers when possible and using direct CPU memory access instead. The changes target situations where GPU and CPU memory are physically the same, making direct copies more efficient.
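At a high level, the fast path is just a memcpy from the persistently mapped UMA allocation once the buffer is known to be host-visible and host-coherent. A minimal sketch of the idea (illustrative only; the struct and helper names below are placeholders, not the actual code in this PR):

```cpp
// Illustrative sketch only: the names below are placeholders, not the PR's actual code.
#include <cstddef>
#include <cstdint>
#include <cstring>

struct vk_buffer_stub {          // hypothetical stand-in for the backend's vk_buffer
    bool   uma_host_coherent;    // device is UMA and memory is host-visible + host-coherent
    void * mapped_ptr;           // persistently mapped pointer into the allocation
};

// Below a size threshold, a plain memcpy from the mapped pointer avoids recording
// a GPU copy through a staging buffer and waiting for it to complete.
static bool try_direct_uma_read(const vk_buffer_stub & buf, size_t offset,
                                void * dst, size_t size, size_t threshold) {
    if (buf.uma_host_coherent && size <= threshold) {
        std::memcpy(dst, (const uint8_t *) buf.mapped_ptr + offset, size);
        return true;   // handled on the CPU, no staging buffer involved
    }
    return false;      // caller falls back to the usual staging-buffer path
}
```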

Additional information

These are the benchmark results:

| Metric | Baseline | GGML_VK_UMA_NON_CACHED_DIRECT_READ_THRESHOLD_DEFAULT = 512 KiB | Gain (vs. Baseline) |
| ------ | -------: | --------------------------------------------------------------: | ------------------: |
| Avg. (14 CPY cases) | 33.16 GB/s | 60.67 GB/s | +82.9% |
| Small CPY case | 85.46 GB/s | 87.18 GB/s | +2.0% |

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, for finding/implementing/benchmarking the UMA optimization

@winstonma winstonma requested a review from a team as a code owner April 28, 2026 08:48
@ggml-gh-bot

ggml-gh-bot Bot commented Apr 28, 2026

Hi @winstonma, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 2 open PRs.

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.


Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@winstonma
Author

So how can I get this PR reviewed? Thanks

@github-actions github-actions Bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Apr 28, 2026
@engrtipusultan

engrtipusultan commented Apr 28, 2026

I have a Ryzen 7 5825U with Vega 8. I am seeing almost a 200% prompt processing increase. Thank you very much.

Master with #22462, #22455, and #21751 merged.

bash  ./llama-bench -m /home/tipu/AI/models/unsloth/Qwen3-Coder-Next/Qwen3-Coder-Next-UD-Q5_K_S-00001-of-00003.gguf -m /home/tipu/AI/models/unsloth/Qwen36-35-A3B/Qwen36-35B-A3B-Q8.gguf -ngl 100 --ubatch-size 1088 --batch-size 2048 --mmap 0 -fa 1 -d 0,8096 -r 3
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none

| model                          |       size |     params | backend     | threads | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ----------- | ------: | -------: | -: | ---: | --------------: | -------------------: |
| qwen3next 80B.A3B Q5_K - Small |  51.98 GiB |    79.67 B | Vulkan,BLAS |       8 |     1088 |  1 |    0 |           pp512 |        141.67 ± 0.03 |
| qwen3next 80B.A3B Q5_K - Small |  51.98 GiB |    79.67 B | Vulkan,BLAS |       8 |     1088 |  1 |    0 |           tg128 |         10.72 ± 0.00 |
| qwen3next 80B.A3B Q5_K - Small |  51.98 GiB |    79.67 B | Vulkan,BLAS |       8 |     1088 |  1 |    0 |   pp512 @ d8096 |        105.16 ± 0.05 |
| qwen3next 80B.A3B Q5_K - Small |  51.98 GiB |    79.67 B | Vulkan,BLAS |       8 |     1088 |  1 |    0 |   tg128 @ d8096 |         10.17 ± 0.01 |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | Vulkan,BLAS |       8 |     1088 |  1 |    0 |           pp512 |        200.42 ± 0.83 |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | Vulkan,BLAS |       8 |     1088 |  1 |    0 |           tg128 |         11.98 ± 0.03 |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | Vulkan,BLAS |       8 |     1088 |  1 |    0 |   pp512 @ d8096 |        142.51 ± 0.13 |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | Vulkan,BLAS |       8 |     1088 |  1 |    0 |   tg128 @ d8096 |         11.47 ± 0.00 |

build: 4e522bfe4 (8961)

Original Master:

bash  ./llama-bench -m /home/tipu/AI/models/unsloth/Qwen3-Coder-Next/Qwen3-Coder-Next-UD-Q5_K_S-00001-of-00003.gguf -m /home/tipu/AI/models/unsloth/Qwen36-35-A3B/Qwen36-35B-A3B-Q8.gguf -ngl 100 --ubatch-size 1088 --batch-size 2048 --mmap 0 -fa 1 -d 0,8096 -r 3
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none

| model                          |       size |     params | backend     | threads | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ----------- | ------: | -------: | -: | ---: | --------------: | -------------------: |
| qwen3next 80B.A3B Q5_K - Small |  51.98 GiB |    79.67 B | Vulkan,BLAS |       8 |     1088 |  1 |    0 |           pp512 |         70.44 ± 1.19 |
| qwen3next 80B.A3B Q5_K - Small |  51.98 GiB |    79.67 B | Vulkan,BLAS |       8 |     1088 |  1 |    0 |           tg128 |         10.78 ± 0.01 |
| qwen3next 80B.A3B Q5_K - Small |  51.98 GiB |    79.67 B | Vulkan,BLAS |       8 |     1088 |  1 |    0 |   pp512 @ d8096 |         60.12 ± 0.65 |
| qwen3next 80B.A3B Q5_K - Small |  51.98 GiB |    79.67 B | Vulkan,BLAS |       8 |     1088 |  1 |    0 |   tg128 @ d8096 |         10.22 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | Vulkan,BLAS |       8 |     1088 |  1 |    0 |           pp512 |        120.22 ± 2.95 |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | Vulkan,BLAS |       8 |     1088 |  1 |    0 |           tg128 |         12.01 ± 0.02 |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | Vulkan,BLAS |       8 |     1088 |  1 |    0 |   pp512 @ d8096 |         90.34 ± 2.31 |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | Vulkan,BLAS |       8 |     1088 |  1 |    0 |   tg128 @ d8096 |         11.37 ± 0.04 |

build: b1a5bd4 (8938)

@engrtipusultan

I was too quick to get excited. Benchmarks are wild but output is gibberish on all models. Reverting.

[screenshots of the gibberish output]

@winstonma
Author

Okay, I will take a look.

I only ran llama-bench and didn't run llama-cli to check the output.

@winstonma
Author

winstonma commented Apr 28, 2026

@engrtipusultan I ran the LLM model but I couldn't reproduce what you saw.

  1. The LLM output is fine
  2. The llama-bench result on my machine is more or less the same

Did you get good results after reverting only this commit?

Here is the llama-bench result on my machine:

Using version 8966:

❯ llama-bench -m ~/model/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf -ngl 100 --ubatch-size 1088 --batch-size 2048 --mmap 0 -fa 1 -d 0,8096 -r 3
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 890M Graphics (RADV STRIX1) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| gemma4 26B.A4B Q4_K - Medium   |  15.90 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |           pp512 |        343.84 ± 1.55 |
| gemma4 26B.A4B Q4_K - Medium   |  15.90 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |           tg128 |         20.88 ± 0.09 |
| gemma4 26B.A4B Q4_K - Medium   |  15.90 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |   pp512 @ d8096 |        280.71 ± 0.91 |
| gemma4 26B.A4B Q4_K - Medium   |  15.90 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |   tg128 @ d8096 |         18.56 ± 0.03 |

build: 7b8443ac7 (8966)

With this PR:

❯ llama-bench -m ~/model/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf -ngl 100 --ubatch-size 1088 --batch-size 2048 --mmap 0 -fa 1 -d 0,8096 -r 3
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 890M Graphics (RADV STRIX1) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| gemma4 26B.A4B Q4_K - Medium   |  15.90 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |           pp512 |        342.38 ± 2.64 |
| gemma4 26B.A4B Q4_K - Medium   |  15.90 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |           tg128 |         21.20 ± 0.04 |
| gemma4 26B.A4B Q4_K - Medium   |  15.90 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |   pp512 @ d8096 |        283.93 ± 1.25 |
| gemma4 26B.A4B Q4_K - Medium   |  15.90 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |   tg128 @ d8096 |         18.07 ± 0.52 |

build: 7b8443ac7 (8966)

@winstonma winstonma force-pushed the winston/vk-uma-read-threshold branch from e95b92d to da5e315 Compare April 28, 2026 13:43
@engrtipusultan

Yes, reverting to the latest master resolves the issue, so it is one of your two PRs that caused it. I checked with llama-server, as shown in the screenshots.

@engrtipusultan

If you want, I can check both PRs one by one tomorrow.

Adds a configurable threshold via env var: GGML_VK_UMA_NON_CACHED_DIRECT_READ_THRESHOLD (default now 512 * 1024).

Introduces ggml_vk_uma_non_cached_direct_read_threshold() to parse/cache that env var once, with validation and warning logs on invalid/overflow values.

Introduces ggml_vk_use_uma_direct_read(vk_buffer &, size_t) to centralize the direct-read decision logic.

Replaces duplicated inline heuristics in three read paths with the shared helper:

- ggml_vk_buffer_read_2d_async()

- ggml_vk_buffer_read()

- ggml_backend_vk_get_tensor_async()

Keeps the small non-cached UMA async behavior explicit: if a direct read is not preferred and sync staging is unavailable, it returns false so the caller falls back.

Adds needed headers for parsing/error handling: <cstdlib> and <cerrno>.
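For illustration, the parse-once threshold helper described above could be sketched roughly like this (a sketch only; the exact naming, logging text, and validation details in the PR may differ):

```cpp
// Rough sketch of a parse-once, cached env-var threshold (not the exact PR code).
#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

static size_t uma_direct_read_threshold() {
    static const size_t threshold = [] {
        constexpr size_t default_threshold = 512 * 1024;  // 512 KiB default
        const char * env = std::getenv("GGML_VK_UMA_NON_CACHED_DIRECT_READ_THRESHOLD");
        if (env == nullptr || env[0] == '\0') {
            return default_threshold;
        }
        errno = 0;
        char * end = nullptr;
        unsigned long long value = std::strtoull(env, &end, 10);
        if (errno == ERANGE || end == env || *end != '\0') {
            std::fprintf(stderr, "ggml_vulkan: invalid GGML_VK_UMA_NON_CACHED_DIRECT_READ_THRESHOLD, using default\n");
            return default_threshold;
        }
        return (size_t) value;  // interpreted as the direct-read cutoff in bytes
    }();
    return threshold;
}
```

The decision helper then just compares the requested copy size against this cached value for host-visible UMA buffers.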
@winstonma
Author

winstonma commented Apr 28, 2026

@engrtipusultan I just updated the PR code. Could you please check whether it still breaks on your side?

From a performance perspective I don't see a huge difference in pp and tg numbers. I would consider this a micro-optimization for UMA devices.

@arch-btw
Contributor

This PR (#22462)

./llama-bench -m google_gemma-4-26B-A4B-it-Q4_K_L.gguf -ngl 100 --ubatch-size 1088 --batch-size 2048 --mmap 0 -fa 1 -d 0,8096 -r 3
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 780M Graphics (AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| gemma4 26B.A4B Q4_K - Medium   |  16.02 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |           pp512 |        170.95 ± 0.88 |
| gemma4 26B.A4B Q4_K - Medium   |  16.02 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |           tg128 |         20.60 ± 0.04 |
| gemma4 26B.A4B Q4_K - Medium   |  16.02 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |   pp512 @ d8096 |         75.79 ± 0.32 |
| gemma4 26B.A4B Q4_K - Medium   |  16.02 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |   tg128 @ d8096 |         16.72 ± 0.00 |

Current master (commit 5d56eff)

./llama-bench -m google_gemma-4-26B-A4B-it-Q4_K_L.gguf -ngl 100 --ubatch-size 1088 --batch-size 2048 --mmap 0 -fa 1 -d 0,8096 -r 3
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 780M Graphics (AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| gemma4 26B.A4B Q4_K - Medium   |  16.02 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |           pp512 |        176.07 ± 1.37 |
| gemma4 26B.A4B Q4_K - Medium   |  16.02 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |           tg128 |         22.33 ± 0.05 |
| gemma4 26B.A4B Q4_K - Medium   |  16.02 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |   pp512 @ d8096 |         76.54 ± 0.34 |
| gemma4 26B.A4B Q4_K - Medium   |  16.02 GiB |    25.23 B | Vulkan     | 100 |     1088 |  1 |    0 |   tg128 @ d8096 |         18.19 ± 0.08 |

@winstonma
Author

winstonma commented Apr 28, 2026

@arch-btw Could you try llama-cli and see if the output is gibberish? Thanks

Also, if you have the time, could you test my other PR #22455 alongside this one?

@arch-btw
Contributor

@winstonma the output is good:

> What is llama.cpp?

[Start thinking]
Here's a thinking process that leads to the explanation of llama.cpp:

1.  **Analyze the Request:**
    *   **Question:** "What is llama.cpp?"
    *   **Intent:** The user wants a clear, concise, and informative overview of the software library known as `llama.cpp`.
    *   **Target Audience:** Likely someone interested in AI, LLMs (Large Language Models), open-source software, or local AI deployment. The explanation should cover technical aspects without being overly dense, and highlight *why* it matters.

etc.

I will try the other commit next! Thank you for these PRs.

@winstonma
Author

winstonma commented Apr 29, 2026

Good to hear the results. I actually only start seeing benchmark improvements when both PRs are merged together.

Apart from these two commits, AI also identified several smaller optimizations for the UMA Vulkan path, so I will implement and test them, and create PRs if the benchmarks show promising results.

if (dst->device->uma && (dst->memory_property_flags & vk::MemoryPropertyFlagBits::eHostVisible)) {
    GGML_ASSERT(dst->memory_property_flags & vk::MemoryPropertyFlagBits::eHostCoherent);
    if (width == spitch) {
        deferred_memcpy((uint8_t *) dst->ptr + offset, src, width * height, &subctx->in_memcpys);
Contributor

I don't think this is correct for the same reasons I commented in #20018. The async copies need to run on the queue to stay in order with other commands.

Author

@winstonma winstonma Apr 29, 2026

Thanks for the review. I am not familiar with these internals. I asked Codex to write a test case to verify the async copies, and it passes. Here is its answer to the follow-up question I asked:

Yes, the code is implemented to stay ordered with other backend work.

  1. In the UMA host-visible branch at if (dst->device->uma && (dst->memory_property_flags & vk::MemoryPropertyFlagBits::eHostVisible)), the copy is not executed immediately. It is queued via deferred_memcpy into subctx->in_memcpys.
  2. Those queued host writes are flushed only when the context is submitted, in ggml_vk_run_deferred_uploads and ggml_vk_submit_transfer_ctx.
  3. For compute-path submission, deferred uploads are run right before submit in ggml_vk_run_deferred_uploads(compute_ctx). For transfer-path submission, the same behavior is in ggml_vk_run_deferred_uploads(cpy_ctx).
  4. The async tensor API routes into this path from ggml_backend_vk_set_tensor_async, so these copies participate in the same submission/sync chain as other backend commands.
  5. If the transfer queue is enabled, cross-queue ordering is linked by a timeline semaphore signal/wait via ctx->transfer_semaphore.value++ and result->s->wait_semaphores.push_back(ctx->transfer_semaphore).

So for the code specifically, ordering is preserved because writes are deferred and then flushed at queue-submit boundaries, not applied out-of-band.
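To illustrate what I mean by deferred-then-flushed copies, here is a simplified sketch of the pattern (illustrative only, not the actual backend code):

```cpp
// Simplified sketch of the deferred host-copy pattern described above.
#include <cstddef>
#include <cstring>
#include <vector>

struct pending_memcpy {
    void       * dst;
    const void * src;
    size_t       n;
};

// Instead of copying immediately, record the copy so it is applied at submit time.
static void record_deferred_memcpy(void * dst, const void * src, size_t n,
                                   std::vector<pending_memcpy> * pending) {
    pending->push_back({dst, src, n});
}

// Flushed right before the command buffer is submitted, so the host writes land
// before the queue work that consumes them starts executing.
static void flush_pending(std::vector<pending_memcpy> & pending) {
    for (const auto & cpy : pending) {
        std::memcpy(cpy.dst, cpy.src, cpy.n);
    }
    pending.clear();
}
```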

Contributor

You need to be familiar with it. Copy-pasting AI responses into maintainer questions is not allowed because we do not have time or patience to debate an AI that can make up wrong claims way faster than any human could debunk them.

Author

Frankly, I'm not quite sure I follow the concern, but I added some logging to see whether it answers the question. This is the debug log:

❯ ./build-vk-debug/bin/llama-cli -m ~/model/gemma-4-E4B-it-UD-Q4_K_XL.gguf -p "Hello" -n 16 2>&1 | grep VK_TIMELINE_HANDSHAKE

Loading model...
VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=1 source=ggml_vk_submit_transfer_ctx
VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=1 last_waited=0 source=ggml_vk_synchronize
VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=2 source=ggml_vk_submit_transfer_ctx
VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=2 last_waited=1 source=ggml_vk_synchronize
VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=3 source=ggml_vk_submit_transfer_ctx
VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=3 last_waited=2 source=ggml_vk_synchronize
VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=4 source=ggml_vk_submit_transfer_ctx
VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=4 last_waited=3 source=ggml_vk_synchronize                                                                         


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8960-fe1eb0302
model      : gemma-4-E4B-it-UD-Q4_K_XL.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern


> Hello

VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=5 source=ggml_vk_submit_transfer_ctx
VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=5 last_waited=4 source=ggml_vk_synchronize
VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=6 source=ggml_vk_submit_transfer_ctx
VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=6 last_waited=5 source=ggml_vk_synchronize
VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=7 source=ggml_vk_submit_transfer_ctx
VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=7 last_waited=6 source=ggml_vk_synchronize
VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=8 source=ggml_vk_submit_transfer_ctx
VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=8 last_waited=7 source=ggml_vk_synchronize
Hello
VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=9 source=ggml_vk_submit_transfer_ctx
VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=9 last_waited=8 source=ggml_vk_synchronize
VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=10 source=ggml_vk_submit_transfer_ctx
VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=10 last_waited=9 source=ggml_vk_synchronize
!
VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=11 source=ggml_vk_submit_transfer_ctx
VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=11 last_waited=10 source=ggml_vk_synchronize
VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=12 source=ggml_vk_submit_transfer_ctx
VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=12 last_waited=11 source=ggml_vk_synchronize
 How
VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=13 source=ggml_vk_submit_transfer_ctx
VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=13 last_waited=12 source=ggml_vk_synchronize
VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=14 source=ggml_vk_submit_transfer_ctx
VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=14 last_waited=13 source=ggml_vk_synchronize
 can
VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=15 source=ggml_vk_submit_transfer_ctx
VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=15 last_waited=14 source=ggml_vk_synchronize
VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=16 source=ggml_vk_submit_transfer_ctx
VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=16 last_waited=15 source=ggml_vk_synchronize
 I
VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=17 source=ggml_vk_submit_transfer_ctx
VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=17 last_waited=16 source=ggml_vk_synchronize
VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=18 source=ggml_vk_submit_transfer_ctx
VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=18 last_waited=17 source=ggml_vk_synchronize
 help
VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=19 source=ggml_vk_submit_transfer_ctx
VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=19 last_waited=18 source=ggml_vk_synchronize
VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=20 source=ggml_vk_submit_transfer_ctx
VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=20 last_waited=19 source=ggml_vk_synchronize
 you
VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=21 source=ggml_vk_submit_transfer_ctx
VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=21 last_waited=20 source=ggml_vk_synchronize
VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=22 source=ggml_vk_submit_transfer_ctx
VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=22 last_waited=21 source=ggml_vk_synchronize
 today
VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=23 source=ggml_vk_submit_transfer_ctx
VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=23 last_waited=22 source=ggml_vk_synchronize
VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=24 source=ggml_vk_submit_transfer_ctx
VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=24 last_waited=23 source=ggml_vk_synchronize
?
VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=25 source=ggml_vk_submit_transfer_ctx
VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=25 last_waited=24 source=ggml_vk_synchronize
VK_TIMELINE_HANDSHAKE SIGNAL TQ->CQ: signal_value=26 source=ggml_vk_submit_transfer_ctx
VK_TIMELINE_HANDSHAKE WAIT_SUBMIT CQ<-TQ: wait_value=26 last_waited=25 source=ggml_vk_synchronize


[ Prompt: 71.1 t/s | Generation: 18.2 t/s ]

According to the log, the Vulkan timeline semaphore creates a handshake where the compute queue cannot outrun the data being moved by the transfer queue, so ordering is maintained. Because the compute queue is blocked by a timeline semaphore wait until the transfer queue signals completion, there is no risk of the GPU reading stale or partially written memory.
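For reference, the signal/wait pattern seen in the log maps onto a standard Vulkan timeline-semaphore handshake, roughly like this (an illustrative Vulkan-Hpp sketch assuming the timelineSemaphore feature is enabled; not the backend's exact code):

```cpp
// Illustrative Vulkan-Hpp sketch of a transfer-queue -> compute-queue timeline
// semaphore handshake (simplified; not the ggml Vulkan backend's exact code).
#include <vulkan/vulkan.hpp>
#include <cstdint>

// The transfer queue signals `value` when its copies are done; the compute queue waits for it.
static void submit_with_timeline_handshake(vk::Queue transfer_q, vk::CommandBuffer transfer_cmd,
                                           vk::Queue compute_q,  vk::CommandBuffer compute_cmd,
                                           vk::Semaphore timeline, uint64_t value) {
    // Signal the timeline value on the transfer queue submission.
    vk::TimelineSemaphoreSubmitInfo signal_timeline{};
    signal_timeline.signalSemaphoreValueCount = 1;
    signal_timeline.pSignalSemaphoreValues    = &value;

    vk::SubmitInfo transfer_submit{};
    transfer_submit.commandBufferCount   = 1;
    transfer_submit.pCommandBuffers      = &transfer_cmd;
    transfer_submit.signalSemaphoreCount = 1;
    transfer_submit.pSignalSemaphores    = &timeline;
    transfer_submit.pNext                = &signal_timeline;
    transfer_q.submit(transfer_submit);

    // Make the compute queue wait for that value before its commands execute.
    vk::PipelineStageFlags wait_stage = vk::PipelineStageFlagBits::eAllCommands;
    vk::TimelineSemaphoreSubmitInfo wait_timeline{};
    wait_timeline.waitSemaphoreValueCount = 1;
    wait_timeline.pWaitSemaphoreValues    = &value;

    vk::SubmitInfo compute_submit{};
    compute_submit.commandBufferCount = 1;
    compute_submit.pCommandBuffers    = &compute_cmd;
    compute_submit.waitSemaphoreCount = 1;
    compute_submit.pWaitSemaphores    = &timeline;
    compute_submit.pWaitDstStageMask  = &wait_stage;
    compute_submit.pNext              = &wait_timeline;
    compute_q.submit(compute_submit);
}
```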

Disabling Transfer Queue on AMD UMA

I also submitted another PR to disable the transfer queue on AMD UMA devices. If the transfer queue is disabled, the code path naturally falls back to a single-queue model where all operations are submitted to the compute queue. In that scenario, ordering is maintained by default due to the sequential nature of command submission within a single Vulkan queue.

Contributor

Regardless of the transfer queue or compute queue, ordering is maintained for commands you submit to the queue. That does not apply to deferred memcpys. in_memcpys run on queue submission. out_memcpys run (in specific cases) after a fence wait that makes sure all queue commands are done. This will not work with the backend async read/write functions because those assume that the commands run in the right order in the queue.

It may work in your tests because you get lucky and the order works out, but this is not guaranteed. This change is fundamentally unsafe.
