
Sched: Reintroduce less synchronizations between token, with fixed pipeline parallelism. #20793

Open
aendk wants to merge 3 commits into ggml-org:master from aendk:akieslinger/rework-reduce-per-token-syncs

Conversation

@aendk
Contributor

@aendk aendk commented Mar 20, 2026

Follow-up to #20463 (comment).

#17795 improved performance in the single-GPU setting on CUDA, but it was rolled back due to a bug that surfaced in multi-GPU pipeline-parallel settings.

For the single-GPU setting, it moved the scheduling from the sassassasg pattern to the more efficient saaasg pattern, where s = sync, a = async copy, g = graph execution.
Previously, each asynchronous copy was enclosed in two synchronizations. Removing the superfluous ones improved performance, especially on Windows: the change was to do only a single synchronization between the memory copies and the graph execution.
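To make the before/after concrete, here is a minimal, self-contained sketch of the two patterns (placeholder functions only; this is not the actual ggml-backend scheduler code):

```cpp
// Minimal sketch of the two scheduling patterns (placeholder functions,
// not the real ggml-backend API).
#include <cstdio>
#include <vector>

static void sync_backend()        { std::puts("s  (synchronize backend)"); }
static void copy_input_async(int) { std::puts("a  (async input copy)"); }
static void compute_graph()       { std::puts("g  (graph execution)"); }

int main() {
    std::vector<int> inputs = {0, 1, 2};

    // before (sassassasg): every async input copy is bracketed by two syncs
    for (int in : inputs) { sync_backend(); copy_input_async(in); sync_backend(); }
    compute_graph();

    // after (saaasg): one sync before the copies, one sync before graph execution
    sync_backend();
    for (int in : inputs) { copy_input_async(in); }
    sync_backend();
    compute_graph();
    return 0;
}
```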

However, in multi-GPU settings we saw llama-perplexity regressions indicating incorrect scheduling (#20463).

I found that the event-based pipeline-parallelism scheduling mechanism very likely relies implicitly on synchronous copies, as (i) in my testing copy_from_host worked as intended, and (ii) disabling it, and thereby reintroducing synchronous copies, fixed the bug; llama-perplexity results were then identical to master.

The proposed fix here is therefore to enroll pipeline parallelism into the same synchronization between async copies and graph execution that the single-GPU case already has.
I think this is a good solution: it keeps the scheduling similar between single GPU and multi GPU, and it is simpler and safer than reworking the event-driven pipeline-parallelism logic.

In my testing, this proposal has the same performance benefits as the initial PR, and it yields correct perplexity scores in both single- and multi-GPU settings.
Since this bug surfaced in the community, with its more diverse hardware setups and usage scenarios, it would be awesome if you could test-drive this change with both llama-bench and llama-perplexity, using your usual models and launch options!
@mxxm-t @slavap @Superbobo75 @thejacer

If you can, check out this branch and compare it against its master base (git checkout HEAD~2). Let me know if you run into performance or accuracy issues!

@ggml-gh-bot

This comment was marked as outdated.

@github-actions github-actions bot added the Nvidia GPU and ggml labels Mar 20, 2026
@mxxm-t

mxxm-t commented Mar 22, 2026

Will test soon.

@aendk
Contributor Author

aendk commented Mar 24, 2026

@mxxm-t don't bother right now.
I will keep this open for now, but the solution is incomplete. It just adds a "speed bump" of sorts so the race condition does not appear; it is not a real scheduling fix.

Once I think I have the correct solution, I'll force-push and ping you again.

@aendk aendk force-pushed the akieslinger/rework-reduce-per-token-syncs branch from a48fd3b to 06e8b36 Compare March 24, 2026 14:38
@aendk
Contributor Author

aendk commented Mar 24, 2026

With #20927 merged, I now see identical PPL on master and on the reapplied original PR on my RTX PRO 6000 Max-Q / RTX PRO 4500 setup: Final estimate: PPL = 28.8008 +/- 1.50705

@mxxm-t feel free to try it if time allows.

Linux Performance
scripts$ python compare-llama-bench.py -c akieslinger/rework-reduce-per-token-syncs -b master -i ../llama-bench.sqlite
| Model                    | Test   |   t/s master |   t/s akieslinger/rework-reduce-per-token-syncs |   Speedup |
|:-------------------------|:-------|-------------:|------------------------------------------------:|----------:|
| gpt-oss 20B MXFP4 MoE    | tg128  |       285.17 |                                          286.79 |      1.01 |
| gpt-oss 20B MXFP4 MoE    | tg256  |       285.33 |                                          287.04 |      1.01 |
| gpt-oss 20B MXFP4 MoE    | tg512  |       278.70 |                                          281.61 |      1.01 |
| qwen3next 80B.A3B Q4_K_M | tg128  |       163.12 |                                          163.91 |      1.00 |
| qwen3next 80B.A3B Q4_K_M | tg256  |       163.15 |                                          164.56 |      1.01 |
| qwen3next 80B.A3B Q4_K_M | tg512  |       163.67 |                                          164.46 |      1.00 |

@sjoerdmaessen

Benchmark: 2x NVIDIA L40S (sm_89 / Lovelace)

Tested with model: Qwen3.5-122B-A10B Q5_K_S

Hardware: 2x NVIDIA L40S 48GB, AMD EPYC 9354P, Linux
Model: Qwen3.5-122B-A10B Q5_K_S (80.44 GiB, split across both GPUs)
Flags: -ngl 99 -fa 1 -t 4, 3 repetitions per test

Master (dc8d14c, b8537)

| test   | t/s                 |
|:-------|--------------------:|
| pp512  | 2103.93 ± 42.88     |
| pp1024 | 2472.55 ± 11.66     |
| pp2048 | 2694.09 ± 6.61      |
| tg128  | 62.05, 62.13, 62.14 |

PR branch (06e8b36, b8508)

| test   | t/s                 |
|:-------|--------------------:|
| pp512  | 2122.19 ± 6.42      |
| pp1024 | 2454.35 ± 5.46      |
| pp2048 | 2677.33 ± 8.63      |
| tg128  | 62.20, 62.20, 62.23 |

Summary

All results are within noise on this setup. No regression, no measurable improvement; tg128 is ~+0.1 t/s (within margin of error). I think this is consistent with your observation that the benefit is primarily on Windows; Linux multi-GPU pipeline-parallelism scheduling appears unaffected by this change on my setup.

@aendk aendk marked this pull request as ready for review March 31, 2026 09:44
@aendk aendk requested a review from a team as a code owner March 31, 2026 09:44
@aendk
Contributor Author

aendk commented Mar 31, 2026

@mxxm-t @slavap @Superbobo75 @thejacer if you have the time, please test PPL (as outlined in #20463 (comment)) and performance on your setups.

@thejacer

thejacer commented Apr 1, 2026

PR #20793

perplexity: calculating perplexity over 8 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 3.82 seconds per pass - ETA 0.12 minutes
[1]2.7627,[2]2.0251,[3]2.4352,[4]2.2139,[5]2.3001,[6]2.3767,[7]2.3087,[8]2.3420,
Final estimate: PPL = 2.3420 +/- 0.10001
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.27 GiB |    34.66 B | ROCm       |  99 |  1 |    0 |           pp512 |        870.66 ± 4.24 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.27 GiB |    34.66 B | ROCm       |  99 |  1 |    0 |           tg128 |         55.22 ± 0.11 |

build: 06e8b36a6 (8508)

Master

perplexity: calculating perplexity over 8 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 3.84 seconds per pass - ETA 0.12 minutes
[1]2.7627,[2]2.0251,[3]2.4352,[4]2.2139,[5]2.3001,[6]2.3767,[7]2.3087,[8]2.3420,
Final estimate: PPL = 2.3420 +/- 0.10001
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.27 GiB |    34.66 B | ROCm       |  99 |  1 |    0 |           pp512 |        872.38 ± 2.69 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.27 GiB |    34.66 B | ROCm       |  99 |  1 |    0 |           tg128 |         55.27 ± 0.09 |

build: 6de97b9d3 (8623)

@aendk
Contributor Author

aendk commented Apr 7, 2026

@ggerganov
Looks like my initial PR works with @am17an's multi-GPU fix, as it now yields correct PPL in the multi-GPU case, too.
Do you think additional testing is required before merging this again?

Contributor

@JohannesGaessler JohannesGaessler left a comment

Sorry, I'm confused. The way I read the linked PR in which the original one got reverted, @ggerganov is saying that the original PR had to be reverted to restore correct results. But isn't this PR making the exact same changes as before?

@aendk
Contributor Author

aendk commented Apr 8, 2026

> Sorry, I'm confused. The way I read the linked PR in which the original one got reverted, @ggerganov is saying that the original PR had to be reverted to restore correct results. But isn't this PR making the exact same changes as before?

Correct in both aspects. Since then, @am17an has merged his fix. Reapplying my changes now still yields correct results, since the scheduling bug was never part of my PR; my PR just removed a lot of "speed bumps" in the form of synchronizations, which exposed the bug.

@JohannesGaessler
Contributor

Can you also link the fix in question?

@am17an
Contributor

am17an commented Apr 8, 2026

I think we should wait for the TP PR (#13776) to be merged and stable before merging this again. Debugging synchronization issues is a pain.

@am17an
Contributor

am17an commented Apr 8, 2026

@JohannesGaessler it's #20927

@aendk
Contributor Author

aendk commented Apr 14, 2026

@am17an @JohannesGaessler do we have a rough timeline when we consider #19378 to be stable?

Since this closes a major perf gap between Windows and Linux, most of the user base is on Windows, and there is no real benefit to leaving it sitting, revisiting/merging this sooner rather than later makes sense to me.

@IMbackK
Collaborator

IMbackK commented Apr 14, 2026

I need to find the time to test this on HIP first for sure, since the previous attempt at this broke HIP entirely.

@JohannesGaessler
Contributor

Sorry, I forgot about this PR. Please rebase on top of master and I'll check whether there are issues (though the synchronization logic should be backend-agnostic).

aendk and others added 2 commits April 14, 2026 13:20
…ml-org#17795)

* Adds CPU-to-CUDA copy capability to
ggml_backend_cuda_cpy_tensor_async()

* Adds function to relax sync requirements between input copies on
supported backends (CUDA for now)

* Exchanges synchronous copy with async copy function.

* Adds macro guards to allow compilation in non-CUDA builds

* Reworked backend detection in ggml-backend.cpp to avoid linking
conflicts

* Relax requirement of checks in async CUDA copies from backend and buffer type to just buffer type, to avoid linking issues

* Minor cleanup

* Makes opt-in to relax use of explicit syncs more general. Backends like
vulkan which require a synchronization between HtoD copies and graph
execution could also adopt this change now.

* Reintroduces stricter check for CPU->CUDA backend async copy via
GGML_DEVICE_TYPE_CPU.

* Corrects initialization of ggml_backend_sync_mode in
ggml_backend_sched_split initialization

* Simplifies synchronizations to adhere to `saaasg` pattern.

* Apply suggestion from @ggerganov (src->buffer to buf_src)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Apply suggestion from @ggerganov (src->buffer to buf_src) v2

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
@aendk aendk force-pushed the akieslinger/rework-reduce-per-token-syncs branch from 5002405 to 38a6f1e Compare April 14, 2026 11:41
@aendk
Contributor Author

aendk commented Apr 14, 2026

I rebased and spot-checked single and multi-GPU again. In both cases, PPL is identical to master.
Also note that only cosmetics changed in the rebase (dst->buffer -> buf_dst), so the findings from the community above should still apply.
@IMbackK good call. It should not break HIP; otherwise, let me know.
@JohannesGaessler thanks, I consider the PR to be ready now.

@JohannesGaessler
Contributor

Using 4x RTX 4090 I am unable to provoke issues; the PPL values I get are bit-for-bit identical for both --split-mode layer and --split-mode tensor.

@thejacer

> I need to find the time to test this on HIP first for sure, since the previous attempt at this broke HIP entirely.

My testing above was done on 2x MI50. Does that satisfy HIP testing?

@IMbackK
Collaborator

IMbackK commented Apr 14, 2026

@thejacer that helps, yeah

@IMbackK
Collaborator

IMbackK commented Apr 14, 2026

Unfortunately, this reintroduces #20433 on HIP. Reproducer from that issue:

This PR:
[1]281.8696,[2]238.4758,[3]252.0735,[4]237.7406,[5]251.8294,[6]256.8782,[7]245.3641,[8]239.9153,[9]239.4646,[10]237.5819,[11]235.9763,[12]238.7037,[13]236.2493,

Master @b8785

[1]271.8923,[2]241.0099,[3]242.2813,[4]233.9529,[5]234.6784,[6]236.7585,[7]240.0973,[8]239.3741,[9]238.8985,[10]237.3241,[11]239.6173,[12]239.7815,[13]237.2340,

@aendk
Contributor Author

aendk commented Apr 15, 2026

@IMbackK I guess there is a subtle difference in the inherent ordering of memcpy and compute between CUDA and HIP. That means it makes sense to make the saaasg pattern (s = sync, a = async memcpy, g = graph compute) used in the single-GPU case explicit in the multi-GPU case, too.

Just out of curiosity, can you check whether unguarding the ggml_backend_synchronize(split_backend); in L1547-L1556 of this diff, in L1670-L1673, or in both is required for HIP consistency? That is, replace those two if-conditions with if (true) and test these three cases: both modified to true; top true, bottom default; top default, bottom true.
Can you check both performance and PPL?

@IMbackK
Collaborator

IMbackK commented Apr 19, 2026

llama-bench -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -fa 1 -p 4096 -n 128 -ub 256 -sm layer,tensor

patch1.patch
patch2.patch

Master:

| model                  | size     | params | backend | ngl | sm     | fa | test   | t/s             |
|------------------------|---------:|-------:|---------|----:|--------|---:|--------|----------------:|
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | layer  |  1 | pp4096 | 4359.42 ± 6.17  |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | layer  |  1 | tg128  | 109.77 ± 0.56   |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | tensor |  1 | pp4096 | 3490.72 ± 24.08 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | tensor |  1 | tg128  | 98.57 ± 0.45    |

PR (rebased):
PPL: fail

| model                  | size     | params | backend | ngl | sm     | fa | test   | t/s             |
|------------------------|---------:|-------:|---------|----:|--------|---:|--------|----------------:|
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | layer  |  1 | pp4096 | 4388.10 ± 9.35  |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | layer  |  1 | tg128  | 110.74 ± 0.66   |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | tensor |  1 | pp4096 | 3544.33 ± 9.81  |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | tensor |  1 | tg128  | 98.67 ± 0.37    |

Patch1:
PPL: pass

| model                  | size     | params | backend | ngl | sm     | fa | test   | t/s             |
|------------------------|---------:|-------:|---------|----:|--------|---:|--------|----------------:|
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | layer  |  1 | pp4096 | 4294.21 ± 9.30  |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | layer  |  1 | tg128  | 109.74 ± 0.66   |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | tensor |  1 | pp4096 | 3505.95 ± 14.07 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | tensor |  1 | tg128  | 98.47 ± 0.36    |

Patch2:
PPL: pass

| model                  | size     | params | backend | ngl | sm     | fa | test   | t/s             |
|------------------------|---------:|-------:|---------|----:|--------|---:|--------|----------------:|
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | layer  |  1 | pp4096 | 4352.22 ± 8.19  |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | layer  |  1 | tg128  | 109.03 ± 0.45   |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | tensor |  1 | pp4096 | 3476.17 ± 14.28 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | tensor |  1 | tg128  | 97.45 ± 0.37    |

Both:

PPL: pass

| model                  | size     | params | backend | ngl | sm     | fa | test   | t/s             |
|------------------------|---------:|-------:|---------|----:|--------|---:|--------|----------------:|
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | layer  |  1 | pp4096 | 4312.89 ± 11.18 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | layer  |  1 | tg128  | 109.52 ± 0.73   |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | tensor |  1 | pp4096 | 3559.21 ± 4.11  |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | tensor |  1 | tg128  | 98.42 ± 0.29    |

All of the performance values are within run-to-run variance. Purely from the perspective of the HIP backend, this PR is not worthwhile in the first place.

@aendk
Contributor Author

aendk commented Apr 20, 2026

Thanks @IMbackK for the exhaustive testing.
Since the single-GPU setup works with the saaasg pattern [0], I decided to enroll the multi-GPU setting into it, too.
This reduces the scheduling differences between single-GPU and multi-GPU and even simplifies the scheduling code.

The original change was:
single-GPU: sassassasg -> saaasg
multi-GPU: sassassasg + event-based logic -> aaag + event-based logic

The new change proposed now is:
single-GPU: sassassasg -> saaasg
multi-GPU: sassassasg + event-based logic -> saaasg + event-based logic
(so even if commit b1993f1 appears to "add" syncs to multi-GPU, there is still a net reduction in synchronizations compared to master)

So in both cases, we reduce the number of synchronizations to the same minimum. As mentioned in the previous PR linked above, this brings a measurable/significant speed-up for the majority of llama.cpp users (Windows + CUDA).

The PR is now ready for review @ggerganov @JohannesGaessler @ORippler @IMbackK.

[0] s = synchronization, a = async copy, g = graph execution

@IMbackK
Collaborator

IMbackK commented Apr 20, 2026

I mean, ok, but really we should understand why the sync in this position is required, and have some kind of spec for what does or does not cause implicit sync or ordering in a ggml backend, rather than just sort of guessing based on observed behavior.

@aendk
Contributor Author

aendk commented Apr 20, 2026

@IMbackK the core answer is that the multi-GPU event-based synchronization mechanism is not suitable as the only scheduling mechanism for multi-GPU. It is not clear whether it was even designed for that, or whether it was and there is a bug.

The good side is that this PR proposes to exchange some implicit synchronizations for two explicit synchronizations. This makes the scheduling mechanism more explicit, and easier to understand and maintain.

Right now, the multi-GPU scheduling on master only works because it implicitly relies on the fact that there are some synchronizations for asynchronous copies in its scheduling stages.
In the future, these synchronizations could be optimized away, or the scheduling stages could change contents. Multi-GPU could thus break as an unintended side effect.
In my opinion, keeping master as-is amounts to leaving the door open for bugs down the line.

Contributor

@JohannesGaessler JohannesGaessler left a comment

Looking at the code on master, I don't think the events are being used correctly. The backend scheduler has events per backend and per copy. At the end of the loop in ggml_backend_sched_compute_splits, an event is recorded for the ggml backend with its own ID split_backend_id. At the beginning of the loop, a backend waits for the event of its own ID. This is correct if the synchronization is between copies of the same backend, but incorrect if the synchronization is between different backends, such as for multiple GPUs. If, however, the copies between backends are synchronous, then this defect does not manifest as a bug.

Comment thread ggml/src/ggml-backend.cpp
Comment on lines -1557 to -1558
} else {
ggml_backend_synchronize(split_backend);
Contributor

Why is the else branch being removed here? In ggml_backend_sched_new the events are created unconditionally. If the event is null here, that would imply that events are not supported by the backend. However, because that backend could still have implemented asynchronous execution and/or tensor copies (as is, I think, the case for Vulkan), we could get a race condition here.
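To spell out the concern, the code around the removed branch looks roughly like this (a paraphrased sketch, not the verbatim diff):

```cpp
// Paraphrased scheduler logic around the removed else branch (not verbatim).
if (sched->events[split_backend_id][sched->cur_copy] != NULL) {
    // event-based path: wait for the split backend to finish using the input
    ggml_backend_event_wait(split_backend, sched->events[split_backend_id][sched->cur_copy]);
} else {
    // removed branch: without it, a backend that has no events but does
    // implement asynchronous execution/copies could start overwriting the
    // input while the previous graph is still reading it -> potential race
    ggml_backend_synchronize(split_backend);
}
ggml_backend_tensor_copy_async(input_backend, split_backend, input, input_cpy);
```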

Contributor Author

From my notes from when I worked on this part of the code: if (sched->events[split_backend_id][sched->cur_copy] != NULL) determines whether the single-device/GPU or the multi-GPU (pipeline parallelism) scheduling logic should be applied. The else case is thus always taken in single-GPU settings.

I removed the else case because I determined it to be unnecessary. It adds a single synchronization between async copies to the same backend.
In the previous PR, we clarified that an ideal backend design should support multiple concurrent async copies, similar to a CUDA stream or a Vulkan command queue.
If a backend doesn't, it should not implement async copies anyway; in that case, the fallback is a fully synchronous copy.

So this was an extra synchronization applied only in single-GPU settings. I removed it because the backend design does not require it, and no bugs appeared for this part of the PR.
Additionally, @ggerganov also indicated that removing it is ok (by suggesting the saaasg pattern).
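For reference, the fallback path is roughly the following (a paraphrased sketch of ggml_backend_tensor_copy_async in ggml-backend.cpp; the exact checks may differ):

```cpp
// Paraphrased sketch, not the verbatim ggml-backend.cpp code: if the
// destination backend does not implement (or rejects) the async copy,
// both backends are synchronized and a blocking copy is performed instead.
if (backend_dst->iface.cpy_tensor_async == NULL ||
    !backend_dst->iface.cpy_tensor_async(backend_src, backend_dst, src, dst)) {
    ggml_backend_synchronize(backend_src);
    ggml_backend_synchronize(backend_dst);
    ggml_backend_tensor_copy(src, dst);
}
```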

@JohannesGaessler
Contributor

> This is correct if the synchronization is between copies of the same backend

Actually, I think it would also be incorrect for variations in the copy index, but because ggml backends are supposed to have the same synchronization behavior as CUDA streams, it should not be necessary to synchronize between copies.
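As a concrete illustration of that stream semantic, here is a small standalone CUDA sketch (illustrative only, not ggml code): operations issued to the same stream execute in issue order, so no synchronization is needed between back-to-back async copies, only before the host consumes the result.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float h_a = 1.0f, h_b = 2.0f, h_out = 0.0f;
    float * d_buf;
    cudaMalloc(&d_buf, sizeof(float));

    // two async copies to the SAME location on the SAME stream:
    // stream ordering guarantees the second one wins, no sync needed in between
    cudaMemcpyAsync(d_buf, &h_a, sizeof(float), cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(d_buf, &h_b, sizeof(float), cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(&h_out, d_buf, sizeof(float), cudaMemcpyDeviceToHost, stream);

    // a single synchronization at the end is enough before the host reads h_out
    cudaStreamSynchronize(stream);
    printf("h_out = %.1f (expected 2.0)\n", h_out);

    cudaFree(d_buf);
    cudaStreamDestroy(stream);
    return 0;
}
```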

@aendk
Contributor Author

aendk commented Apr 21, 2026

@JohannesGaessler if I understood you correctly:

  • you agree that the event-driven scheduling mechanism (multi-GPU only) might be incorrect / unsuitable on its own.
  • you think that if copies are synchronous, possible bugs of the event-driven scheduling do not appear.
  • if backends behave like CUDA streams, there is no need to synchronize.

My stance here is that:

  • if backends do not behave like CUDA streams / Vulkan command queues, they do not (and should not) implement cpy_tensor_async. The copies are then fully synchronous due to the fallback logic.
  • With the current proposal, individual copies to the same backend are asynchronous. In my eyes, this should be ok with CUDA-stream / Vulkan-command-queue-like behavior, unless they write to the same location (which would be bad design and a race condition).
  • Other than that, there is a strict synchronization before the next copies to the same backend, in the case where this backend is reused for another split in the same inference pass, or for the next µ-batch.

Do you agree? What do you think should be the next steps to get this merged?

@JohannesGaessler
Contributor

First and foremost, I would suggest this patch:

diff --git a/ggml/src/ggml-backend.cpp b/ggml/src/ggml-backend.cpp
index d9f8aaec5..60d8939dc 100644
--- a/ggml/src/ggml-backend.cpp
+++ b/ggml/src/ggml-backend.cpp
@@ -1553,22 +1553,23 @@ static enum ggml_status ggml_backend_sched_compute_splits(ggml_backend_sched_t s
 
         // copy the input tensors to the split backend
         for (int input_id = 0; input_id < split->n_inputs; input_id++) {
+            int input_backend_id = tensor_backend_id(split->inputs[input_id]);
             ggml_backend_t input_backend = ggml_backend_sched_get_tensor_backend(sched, split->inputs[input_id]);
             struct ggml_tensor * input = split->inputs[input_id];
             struct ggml_tensor * input_cpy = tensor_copy(input, split_backend_id, sched->cur_copy);
 
             if (input->flags & GGML_TENSOR_FLAG_INPUT) {
                 // inputs from the user must be copied immediately to prevent the user overwriting the data before the copy is done
-                if (sched->events[split_backend_id][sched->cur_copy] != NULL) {
-                    ggml_backend_event_synchronize(sched->events[split_backend_id][sched->cur_copy]);
+                if (sched->events[input_backend_id][sched->cur_copy] != NULL) {
+                    ggml_backend_event_synchronize(sched->events[input_backend_id][sched->cur_copy]);
                 } else {
                     ggml_backend_synchronize(split_backend);
                 }
                 ggml_backend_tensor_copy(input, input_cpy);
             } else {
                 // wait for the split backend to finish using the input before overwriting it
-                if (sched->events[split_backend_id][sched->cur_copy] != NULL) {
-                    ggml_backend_event_wait(split_backend, sched->events[split_backend_id][sched->cur_copy]);
+                if (sched->events[input_backend_id][sched->cur_copy] != NULL) {
+                    ggml_backend_event_wait(split_backend, sched->events[input_backend_id][sched->cur_copy]);
                 } else {
                     ggml_backend_synchronize(split_backend);
                 }

I did not write the backend scheduler code but to my understanding this is how events should be handled. I have no particular preference whether we fix this as part of this PR or as a standalone one.

@aendk
Contributor Author

aendk commented Apr 21, 2026

I'll dig into your proposed fix.

From my perspective, I think it makes sense to fix the event-based multi-GPU scheduling in a standalone PR. This PR is very beneficial in single-GPU Windows environments, and has been in flight for a long time now.

@JohannesGaessler
Contributor

Just so there's no misunderstanding: by "standalone PR" I meant a standalone PR that would be a precondition for this one, not one after the fact.

@aendk
Contributor Author

aendk commented Apr 24, 2026

I took the time to look into the event-driven mechanism of 38a6f1e. My findings are the following:

  • CPU->GPU synchronization and GPU->GPU synchronization are two different things.
  • GPU->GPU: event recordings and synchronizations are localized in ggml_backend_cuda_cpy_tensor_async.
    • This looks watertight to me: cudaMemcpyPeerAsync + cudaEventRecord on the src stream, with cudaStreamWaitEvent on the dst stream. This is how synchronization between two streams should be done (see the standalone sketch after this list). It is also implicitly synced with the preceding graph execution (because it is on the same src stream), and explicitly synced with the preceding graph execution via cudaStreamWaitEvent, which waits on the cudaEventRecord called after the graph execution.
  • CPU->GPU:
    [screenshot] Unchanged: all the events (shown in grey) wait/check on the same event, which is dispatched **after** them by the following graph execution. This is incorrect; it only guarantees order between executions on the same backend. They should be waiting on the previous graph execution (likely on another backend) for stricter scheduling.
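For reference, a minimal standalone CUDA sketch of that cross-stream pattern (illustrative only, not the ggml-cuda code; for simplicity it uses a same-device cudaMemcpyAsync where the real code uses cudaMemcpyPeerAsync between devices):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaStream_t stream_src, stream_dst;
    cudaEvent_t  copy_done;
    cudaStreamCreate(&stream_src);
    cudaStreamCreate(&stream_dst);
    cudaEventCreateWithFlags(&copy_done, cudaEventDisableTiming);

    const size_t n = 1 << 20;
    float *src_buf, *dst_buf;
    cudaMalloc(&src_buf, n * sizeof(float));
    cudaMalloc(&dst_buf, n * sizeof(float));

    // 1. the async copy is enqueued on the source stream, so it is implicitly
    //    ordered after whatever work (e.g. the previous graph) is already there
    cudaMemcpyAsync(dst_buf, src_buf, n * sizeof(float), cudaMemcpyDeviceToDevice, stream_src);
    // 2. record an event on the source stream right after the copy
    cudaEventRecord(copy_done, stream_src);
    // 3. the destination stream waits on that event before consuming dst_buf
    cudaStreamWaitEvent(stream_dst, copy_done, 0);
    // ... kernels launched on stream_dst after this point see the copied data ...

    cudaStreamSynchronize(stream_dst);
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaEventDestroy(copy_done);
    cudaStreamDestroy(stream_src);
    cudaStreamDestroy(stream_dst);
    cudaFree(src_buf);
    cudaFree(dst_buf);
    return 0;
}
```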

Your fix fixes it, but leads to zero syncs in the first GPU split:
[screenshot]
We discussed the same pattern in the single-GPU setting.
To ensure correctness, we implemented syncs for the single-GPU case for non-CUDA backends. We should therefore keep b1993f1 to have the same syncs in multi-GPU as we do in single-GPU:
[screenshot]
Above, we see the status without the syncs added in b1993f1 (analogous to single GPU). On the very left, there are no barriers between the asynchronous memcpys (the group of red bubbles) and the subsequent graph computation (green bar) on GPU0. This is fine for CUDA, but as discussed in #17795, we should separate the memcpy operations from the preceding operations and the subsequent graph execution.

Regarding the bug still surfacing on HIP: @IMbackK could you try 38a6f1e with the patch suggested in #20793 (comment)? And what is your exact hardware and software setup?
Note that @thejacer also runs on HIP/AMD and reported no bugs, so it might be something specific to your setup. I also asked an AI about bugs in hipMemcpyPeerAsync. Depending on the ROCm version and AMD hardware, different bugs could surface:

| ROCm range | Correctness on XGMI | Correctness on PCIe P2P | Correctness on host-staging fallback | Notes |
|--------------|-----------------------|--------------------------|--------------------------------------|-------|
| 4.x        | broken              | broken                  | broken                                | upgrade |
| 5.0 – 5.4  | mostly ok           | mostly ok               | under-synced in some cases            | fixes landing incrementally |
| 5.5 – 5.7  | ok                  | ok                      | fixed by 5.7                          | recommended minimum |
| 6.0 – 6.2  | ok                  | ok                      | ok but slow (serialized)              | consumer RDNA falls here |
| 6.3+       | ok                  | ok                      | ok                                    | current target |

My stance is therefore:

  1. Apply @JohannesGaessler's patch for tighter and more correct scheduling.
  2. Keep b1993f1, for the same reasons as in the single-GPU case.
  3. Compare the hardware and software stacks of @thejacer and @IMbackK. Both run HIP on AMD, but only one sees faulty scheduling without b1993f1 and the suggested patch. There is a non-zero chance that this is a HIP/ROCm bug.
  4. Regardless of step 3, we should all see correct results with b1993f1 and the patch from @JohannesGaessler (will push this shortly). If that is the case (please test if time allows @IMbackK @JohannesGaessler), I think this PR is ready to merge.

@aendk
Contributor Author

aendk commented Apr 24, 2026

Regarding @JohannesGaessler's patch, I need to think about this again.
The original implementation might be ok as well, since we only need to order CPU->GPU copies and graph execution on the same backend; so once the graph-execution end event has taken place, the new indices and masks for the next graph execution can be copied to this backend, regardless of what the other backends are currently doing.
It might not be necessary to wait/synchronize these copies with graph executions on other backends/GPUs.

@aendk
Contributor Author

aendk commented Apr 24, 2026

Still need to give it more thought, but I now think the truth is in the middle:

  • CPU-to-GPU transfer of input tensors:
    • these memcpys only need to be synced with the previous graph execution on the same backend, so that indices/masks are only updated when no graph execution is in flight.
  • GPU-to-GPU weight tensor memcpys:
    • these need to be synced with the graph-execution finalization event of the previous backend (their source backend), so that they do not start before that graph execution is done (a bug possibility for non-CUDA backends only).

@JohannesGaessler
Contributor

The logic should be in terms of ggml_backend_event_t, not CUDA-specific constructs. Ideally, the code in the backend scheduler should unconditionally try to create and use those ggml backend events. If a backend does not support events and thus returns nullptr, or if ggml_backend_event::device is incompatible with the backend used in ggml_backend_event_wait, then the code should fall back to synchronization that does not rely on events.
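A rough sketch of that fallback shape (paraphrased; the compatibility check is illustrative and not part of the current code):

```cpp
// Paraphrased idea, not the current ggml-backend.cpp code.
ggml_backend_event_t ev = sched->events[input_backend_id][sched->cur_copy];

if (ev != NULL /* && the event's device is compatible with split_backend */) {
    // event-based path: the split backend waits only on the producer's event
    ggml_backend_event_wait(split_backend, ev);
} else {
    // fallback for backends without (compatible) events: full synchronization
    ggml_backend_synchronize(split_backend);
}
```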
