
Sched: Reintroduce less synchronizations between token, with fixed pipeline parallelism. #20793

Open
aendk wants to merge 3 commits into ggml-org:master from aendk:akieslinger/rework-reduce-per-token-syncs

Conversation

@aendk
Contributor

@aendk aendk commented Mar 20, 2026

Follow-up to #20463 (comment).

#17795 improved performance in the single-GPU setting on CUDA, but it was rolled back due to a bug that surfaced in multi-GPU pipeline-parallel settings.

For the single-GPU setting, it moved the scheduling from the sassassasg pattern to the more efficient saaasg pattern, where s = sync, a = async copy, g = graph execution.
Previously, each asynchronous copy was enclosed in two synchronizations. Removing the superfluous ones improved performance, especially on Windows: the change was to do only a single synchronization between the memory copies and the graph execution.
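To make the before/after concrete, here is a minimal, self-contained sketch of the two patterns (placeholder functions only; this is not the actual ggml-backend scheduler code):

```cpp
// Minimal sketch of the two scheduling patterns (placeholder functions,
// not the real ggml-backend API).
#include <cstdio>
#include <vector>

static void sync_backend()        { std::puts("s  (synchronize backend)"); }
static void copy_input_async(int) { std::puts("a  (async input copy)"); }
static void compute_graph()       { std::puts("g  (graph execution)"); }

int main() {
    std::vector<int> inputs = {0, 1, 2};

    // before (sassassasg): every async input copy is bracketed by two syncs
    for (int in : inputs) { sync_backend(); copy_input_async(in); sync_backend(); }
    compute_graph();

    // after (saaasg): one sync before the copies, one sync before graph execution
    sync_backend();
    for (int in : inputs) { copy_input_async(in); }
    sync_backend();
    compute_graph();
    return 0;
}
```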

However, in multi-GPU settings we saw llama-perplexity regressions indicating incorrect scheduling (#20463).

I found that the event-based pipeline-parallelism scheduling mechanism very likely relies implicitly on synchronous copies, as (i) in my testing copy_from_host worked as intended, and (ii) disabling it, and thereby reintroducing synchronous copies, fixed the bug; llama-perplexity results were then identical to master.

The proposed fix here is therefore to enroll pipeline parallelism into the same synchronization between async copies and graph execution that the single-GPU case already has.
I think this is a good solution: it keeps the scheduling similar between single GPU and multi GPU, and it is simpler and safer than reworking the event-driven pipeline-parallelism logic.

In my testing, this proposal has the same performance benefits as the initial PR, and it yields correct perplexity scores in both single- and multi-GPU settings.
Since this bug surfaced in the community, with its more diverse hardware setups and usage scenarios, it would be awesome if you could test-drive this change with both llama-bench and llama-perplexity, using your usual models and launch options!
@mxxm-t @slavap @Superbobo75 @thejacer

If you can, check out this branch and compare it against its master base (git checkout HEAD~2). Let me know if you run into performance or accuracy issues!

@ggml-gh-bot

This comment was marked as outdated.

@github-actions github-actions bot added the Nvidia GPU and ggml labels Mar 20, 2026
@mxxm-t

mxxm-t commented Mar 22, 2026

Will test soon.

@aendk
Contributor Author

aendk commented Mar 24, 2026

@mxxm-t don't bother right now.
I will keep this open for now, but the solution is incomplete. It just adds a "speed bump" of sorts so the race condition does not appear; it is not a real scheduling fix.

Once I think I have the correct solution, I'll force-push and ping you again.

@aendk aendk force-pushed the akieslinger/rework-reduce-per-token-syncs branch from a48fd3b to 06e8b36 Compare March 24, 2026 14:38
@aendk
Contributor Author

aendk commented Mar 24, 2026

With #20927 merged, I now see identical PPL on master and on the reapplied original PR on my RTX PRO 6000 Max-Q / RTX PRO 4500 setup: Final estimate: PPL = 28.8008 +/- 1.50705

@mxxm-t feel free to try it if time allows.

Linux Performance
scripts$ python compare-llama-bench.py -c akieslinger/rework-reduce-per-token-syncs -b master -i ../llama-bench.sqlite
| Model                    | Test   |   t/s master |   t/s akieslinger/rework-reduce-per-token-syncs |   Speedup |
|:-------------------------|:-------|-------------:|------------------------------------------------:|----------:|
| gpt-oss 20B MXFP4 MoE    | tg128  |       285.17 |                                          286.79 |      1.01 |
| gpt-oss 20B MXFP4 MoE    | tg256  |       285.33 |                                          287.04 |      1.01 |
| gpt-oss 20B MXFP4 MoE    | tg512  |       278.70 |                                          281.61 |      1.01 |
| qwen3next 80B.A3B Q4_K_M | tg128  |       163.12 |                                          163.91 |      1.00 |
| qwen3next 80B.A3B Q4_K_M | tg256  |       163.15 |                                          164.56 |      1.01 |
| qwen3next 80B.A3B Q4_K_M | tg512  |       163.67 |                                          164.46 |      1.00 |

@sjoerdmaessen

Benchmark: 2x NVIDIA L40S (sm_89 / Lovelace)

Tested with model: Qwen3.5-122B-A10B Q5_K_S

Hardware: 2x NVIDIA L40S 48GB, AMD EPYC 9354P, Linux
Model: Qwen3.5-122B-A10B Q5_K_S (80.44 GiB, split across both GPUs)
Flags: -ngl 99 -fa 1 -t 4, 3 repetitions per test

Master (dc8d14c, b8537)

| test   | t/s                 |
|:-------|--------------------:|
| pp512  | 2103.93 ± 42.88     |
| pp1024 | 2472.55 ± 11.66     |
| pp2048 | 2694.09 ± 6.61      |
| tg128  | 62.05, 62.13, 62.14 |

PR branch (06e8b36, b8508)

| test   | t/s                 |
|:-------|--------------------:|
| pp512  | 2122.19 ± 6.42      |
| pp1024 | 2454.35 ± 5.46      |
| pp2048 | 2677.33 ± 8.63      |
| tg128  | 62.20, 62.20, 62.23 |

Summary

All results are within noise on this setup. No regression, no measurable improvement; tg128 is ~+0.1 t/s (within margin of error). I think this is consistent with your observation that the benefit is primarily on Windows; Linux multi-GPU pipeline-parallelism scheduling appears unaffected by this change on my setup.

@aendk aendk marked this pull request as ready for review March 31, 2026 09:44
@aendk aendk requested a review from a team as a code owner March 31, 2026 09:44
@aendk
Contributor Author

aendk commented Mar 31, 2026

@mxxm-t @slavap @Superbobo75 @thejacer if you have the time, please test PPL (as outlined in #20463 (comment)) and performance on your setups.

@thejacer

thejacer commented Apr 1, 2026

PR #20793

perplexity: calculating perplexity over 8 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 3.82 seconds per pass - ETA 0.12 minutes
[1]2.7627,[2]2.0251,[3]2.4352,[4]2.2139,[5]2.3001,[6]2.3767,[7]2.3087,[8]2.3420,
Final estimate: PPL = 2.3420 +/- 0.10001
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.27 GiB |    34.66 B | ROCm       |  99 |  1 |    0 |           pp512 |        870.66 ± 4.24 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.27 GiB |    34.66 B | ROCm       |  99 |  1 |    0 |           tg128 |         55.22 ± 0.11 |

build: 06e8b36a6 (8508)

Master

perplexity: calculating perplexity over 8 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 3.84 seconds per pass - ETA 0.12 minutes
[1]2.7627,[2]2.0251,[3]2.4352,[4]2.2139,[5]2.3001,[6]2.3767,[7]2.3087,[8]2.3420,
Final estimate: PPL = 2.3420 +/- 0.10001
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.27 GiB |    34.66 B | ROCm       |  99 |  1 |    0 |           pp512 |        872.38 ± 2.69 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.27 GiB |    34.66 B | ROCm       |  99 |  1 |    0 |           tg128 |         55.27 ± 0.09 |

build: 6de97b9d3 (8623)

@aendk
Contributor Author

aendk commented Apr 7, 2026

@ggerganov
Looks like my initial PR works with @am17an's multi-GPU fix, as it now yields correct PPL in the multi-GPU case, too.
Do you think additional testing is required before merging this again?

Contributor

@JohannesGaessler JohannesGaessler left a comment

Sorry, I'm confused. The way I read the linked PR in which the original one got reverted, @ggerganov is saying that the original PR had to be reverted to restore correct results. But isn't this PR making the exact same changes as before?

@aendk
Contributor Author

aendk commented Apr 8, 2026

> Sorry, I'm confused. The way I read the linked PR in which the original one got reverted, @ggerganov is saying that the original PR had to be reverted to restore correct results. But isn't this PR making the exact same changes as before?

Correct in both aspects. Since then, @am17an has merged his fix. Reapplying my changes now still yields correct results, since the scheduling bug was never part of my PR; my PR just removed a lot of "speed bumps" in the form of synchronizations, which exposed the bug.

@JohannesGaessler
Contributor

Can you also link the fix in question?

@am17an
Contributor

am17an commented Apr 8, 2026

I think we should wait for the TP PR (#13776) to be merged and stable before merging this again. Debugging synchronization issues is a pain.

@am17an
Contributor

am17an commented Apr 8, 2026

@JohannesGaessler it's #20927

@aendk
Contributor Author

aendk commented Apr 14, 2026

@am17an @JohannesGaessler do we have a rough timeline when we consider #19378 to be stable?

Since this closes a major perf gap between Windows and Linux, most of the user base is on Windows, and there is no real benefit to leaving it sitting, revisiting/merging this sooner rather than later makes sense to me.

@IMbackK
Collaborator

IMbackK commented Apr 14, 2026

I need to find the time to test this on HIP first for sure, since the previous attempt at this broke HIP entirely.

@JohannesGaessler
Contributor

Sorry, I forgot about this PR. Please rebase on top of master and I'll check whether there are issues (though the synchronization logic should be backend-agnostic).

aendk and others added 2 commits April 14, 2026 13:20
…ml-org#17795)

* Adds CPU-to-CUDA copy capability to
ggml_backend_cuda_cpy_tensor_async()

* Adds function to relax sync requirements between input copies on
supported backends (CUDA for now)

* Exchanges synchronous copy with async copy function.

* Adds macro guards to allow compilation in non-CUDA builds

* Reworked backend detection in ggml-backend.cpp to avoid linking
conflicts

* Relax requirement of checks in async CUDA copies from backend and buffer type to just buffer type, to avoid linking issues

* Minor cleanup

* Makes opt-in to relax use of explicit syncs more general. Backends like
vulkan which require a synchronization between HtoD copies and graph
execution could also adopt this change now.

* Reintroduces stricter check for CPU->CUDA backend async copy via
GGML_DEVICE_TYPE_CPU.

* Corrects initialization of ggml_backend_sync_mode in
ggml_backend_sched_split initialization

* Simplifies synchronizations to adhere to `saaasg` pattern.

* Apply suggestion from @ggerganov (src->buffer to buf_src)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Apply suggestion from @ggerganov (src->buffer to buf_src) v2

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
@aendk aendk force-pushed the akieslinger/rework-reduce-per-token-syncs branch from 5002405 to 38a6f1e Compare April 14, 2026 11:41
@aendk
Contributor Author

aendk commented Apr 14, 2026

I rebased and spot-checked single and multi-GPU again. In both cases, PPL is identical to master.
Also note that only cosmetics changed in the rebase (dst->buffer -> buf_dst), so the findings from the community above should still apply.
@IMbackK good call. It should not break HIP; otherwise, let me know.
@JohannesGaessler thanks, I consider the PR to be ready now.

@JohannesGaessler
Contributor

Using 4x RTX 4090 I am unable to provoke issues; the PPL values I get are bit-for-bit identical for both --split-mode layer and --split-mode tensor.

@thejacer

> I need to find the time to test this on HIP first for sure, since the previous attempt at this broke HIP entirely.

My testing above was done on 2x MI50. Does that satisfy HIP testing?

@IMbackK
Collaborator

IMbackK commented Apr 14, 2026

@thejacer that helps, yeah

@IMbackK
Collaborator

IMbackK commented Apr 14, 2026

Unfortunately, this reintroduces #20433 on HIP. Reproducer from that issue:

This PR:
[1]281.8696,[2]238.4758,[3]252.0735,[4]237.7406,[5]251.8294,[6]256.8782,[7]245.3641,[8]239.9153,[9]239.4646,[10]237.5819,[11]235.9763,[12]238.7037,[13]236.2493,

Master @b8785

[1]271.8923,[2]241.0099,[3]242.2813,[4]233.9529,[5]234.6784,[6]236.7585,[7]240.0973,[8]239.3741,[9]238.8985,[10]237.3241,[11]239.6173,[12]239.7815,[13]237.2340,

@aendk
Contributor Author

aendk commented Apr 15, 2026

@IMbackK I guess there is a subtle difference in the inherent ordering of memcpy and compute between CUDA and HIP. That means it makes sense to make the saaasg pattern (s = sync, a = async memcpy, g = graph compute) used in the single-GPU case explicit in the multi-GPU case, too.

Just out of curiosity, can you check whether unguarding the ggml_backend_synchronize(split_backend); in L1547-L1556 of this diff, in L1670-L1673, or in both is required for HIP consistency? That is, replace those two if-conditions with if (true) and test these three cases: both modified to true; top true, bottom default; top default, bottom true.
Can you check both performance and PPL?

@IMbackK
Collaborator

IMbackK commented Apr 19, 2026

llama-bench -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -fa 1 -p 4096 -n 128 -ub 256 -sm layer,tensor

patch1.patch
patch2.patch

Master:

| model                  | size     | params | backend | ngl | sm     | fa | test   | t/s             |
|------------------------|---------:|-------:|---------|----:|--------|---:|--------|----------------:|
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | layer  |  1 | pp4096 | 4359.42 ± 6.17  |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | layer  |  1 | tg128  | 109.77 ± 0.56   |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | tensor |  1 | pp4096 | 3490.72 ± 24.08 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | tensor |  1 | tg128  | 98.57 ± 0.45    |

PR (rebased):
PPL: fail

| model                  | size     | params | backend | ngl | sm     | fa | test   | t/s             |
|------------------------|---------:|-------:|---------|----:|--------|---:|--------|----------------:|
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | layer  |  1 | pp4096 | 4388.10 ± 9.35  |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | layer  |  1 | tg128  | 110.74 ± 0.66   |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | tensor |  1 | pp4096 | 3544.33 ± 9.81  |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | tensor |  1 | tg128  | 98.67 ± 0.37    |

Patch1:
PPL: pass

| model                  | size     | params | backend | ngl | sm     | fa | test   | t/s             |
|------------------------|---------:|-------:|---------|----:|--------|---:|--------|----------------:|
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | layer  |  1 | pp4096 | 4294.21 ± 9.30  |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | layer  |  1 | tg128  | 109.74 ± 0.66   |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | tensor |  1 | pp4096 | 3505.95 ± 14.07 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | tensor |  1 | tg128  | 98.47 ± 0.36    |

Patch2:
PPL: pass

| model                  | size     | params | backend | ngl | sm     | fa | test   | t/s             |
|------------------------|---------:|-------:|---------|----:|--------|---:|--------|----------------:|
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | layer  |  1 | pp4096 | 4352.22 ± 8.19  |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | layer  |  1 | tg128  | 109.03 ± 0.45   |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | tensor |  1 | pp4096 | 3476.17 ± 14.28 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | tensor |  1 | tg128  | 97.45 ± 0.37    |

Both:

PPL: pass

| model                  | size     | params | backend | ngl | sm     | fa | test   | t/s             |
|------------------------|---------:|-------:|---------|----:|--------|---:|--------|----------------:|
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | layer  |  1 | pp4096 | 4312.89 ± 11.18 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | layer  |  1 | tg128  | 109.52 ± 0.73   |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | tensor |  1 | pp4096 | 3559.21 ± 4.11  |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm    |  99 | tensor |  1 | tg128  | 98.42 ± 0.29    |

All of the performance values are within run-to-run variance. Purely from the perspective of the HIP backend, this PR is not worthwhile in the first place.

@aendk
Contributor Author

aendk commented Apr 20, 2026

Thanks @IMbackK for the exhaustive testing.
Since the single-GPU setup works with the saaasg pattern [0], I decided to enroll the multi-GPU setting into it, too.
This reduces the scheduling differences between single-GPU and multi-GPU and even simplifies the scheduling code.

The original change was:
single-GPU: sassassasg -> saaasg
multi-GPU: sassassasg + event-based logic -> aaag + event-based logic

The new change proposed now is:
single-GPU: sassassasg -> saaasg
multi-GPU: sassassasg + event-based logic -> saaasg + event-based logic
(so even if commit b1993f1 appears to "add" syncs to multi-GPU, there is still a net reduction in synchronizations compared to master)

So in both cases, we reduce the number of synchronizations to the same minimum. As mentioned in the previous PR linked above, this brings a measurable/significant speed-up for the majority of llama.cpp users (Windows + CUDA).

The PR is now ready for review @ggerganov @JohannesGaessler @ORippler @IMbackK.

[0] s = synchronization, a = async copy, g = graph execution

@IMbackK
Collaborator

IMbackK commented Apr 20, 2026

I mean, ok, but really we should understand why the sync in this position is required, and have some kind of spec for what does or does not cause implicit sync or ordering in a ggml backend, rather than just sort of guessing based on observed behavior.

@aendk
Contributor Author

aendk commented Apr 20, 2026

@IMbackK the core answer is that the multi-GPU event-based synchronization mechanism is not suitable as the only scheduling mechanism for multi-GPU. It is not clear whether it was even designed for that, or whether it was and there is a bug.

The good side is that this PR proposes to exchange some implicit synchronizations for two explicit synchronizations. This makes the scheduling mechanism more explicit, and easier to understand and maintain.

Right now, the multi-GPU scheduling on master only works because it implicitly relies on the fact that there are some synchronizations for asynchronous copies in its scheduling stages.
In the future, these synchronizations could be optimized away, or the scheduling stages could change contents. Multi-GPU could thus break as an unintended side effect.
In my opinion, keeping master as-is amounts to leaving the door open for bugs down the line.

Contributor

@JohannesGaessler JohannesGaessler left a comment

Looking at the code on master, I don't think the events are being used correctly. The backend scheduler has events per backend and per copy. At the end of the loop in ggml_backend_sched_compute_splits, an event is recorded for the ggml backend with its own ID split_backend_id. At the beginning of the loop, a backend waits for the event of its own ID. This is correct if the synchronization is between copies of the same backend, but incorrect if the synchronization is between different backends, such as for multiple GPUs. If, however, the copies between backends are synchronous, then this defect does not manifest as a bug.

Comment thread ggml/src/ggml-backend.cpp
Comment on lines -1557 to -1558
} else {
ggml_backend_synchronize(split_backend);
Contributor

Why is the else branch being removed here? In ggml_backend_sched_new the events are created unconditionally. If the event is null here, that would imply that events are not supported by the backend. However, because that backend could still have implemented asynchronous execution and/or tensor copies (as is, I think, the case for Vulkan), we could get a race condition here.
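To spell out the concern, the code around the removed branch looks roughly like this (a paraphrased sketch, not the verbatim diff):

```cpp
// Paraphrased scheduler logic around the removed else branch (not verbatim).
if (sched->events[split_backend_id][sched->cur_copy] != NULL) {
    // event-based path: wait for the split backend to finish using the input
    ggml_backend_event_wait(split_backend, sched->events[split_backend_id][sched->cur_copy]);
} else {
    // removed branch: without it, a backend that has no events but does
    // implement asynchronous execution/copies could start overwriting the
    // input while the previous graph is still reading it -> potential race
    ggml_backend_synchronize(split_backend);
}
ggml_backend_tensor_copy_async(input_backend, split_backend, input, input_cpy);
```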

Contributor Author

From my notes from when I worked on this part of the code: if (sched->events[split_backend_id][sched->cur_copy] != NULL) determines whether the single-device/GPU or the multi-GPU (pipeline parallelism) scheduling logic should be applied. The else case is thus always taken in single-GPU settings.

I removed the else case because I determined it to be unnecessary. It adds a single synchronization between async copies to the same backend.
In the previous PR, we clarified that an ideal backend design should support multiple concurrent async copies, similar to a CUDA stream or a Vulkan command queue.
If a backend doesn't, it should not implement async copies anyway; in that case, the fallback is a fully synchronous copy.

So this was an extra synchronization applied only in single-GPU settings. I removed it because the backend design does not require it, and no bugs appeared for this part of the PR.
Additionally, @ggerganov also indicated that removing it is ok (by suggesting the saaasg pattern).
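For reference, the fallback path is roughly the following (a paraphrased sketch of ggml_backend_tensor_copy_async in ggml-backend.cpp; the exact checks may differ):

```cpp
// Paraphrased sketch, not the verbatim ggml-backend.cpp code: if the
// destination backend does not implement (or rejects) the async copy,
// both backends are synchronized and a blocking copy is performed instead.
if (backend_dst->iface.cpy_tensor_async == NULL ||
    !backend_dst->iface.cpy_tensor_async(backend_src, backend_dst, src, dst)) {
    ggml_backend_synchronize(backend_src);
    ggml_backend_synchronize(backend_dst);
    ggml_backend_tensor_copy(src, dst);
}
```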

@JohannesGaessler
Contributor

> This is correct if the synchronization is between copies of the same backend

Actually, I think it would also be incorrect for variations in the copy index, but because ggml backends are supposed to have the same synchronization behavior as CUDA streams, it should not be necessary to synchronize between copies.
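As a concrete illustration of that stream semantic, here is a small standalone CUDA sketch (illustrative only, not ggml code): operations issued to the same stream execute in issue order, so no synchronization is needed between back-to-back async copies, only before the host consumes the result.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float h_a = 1.0f, h_b = 2.0f, h_out = 0.0f;
    float * d_buf;
    cudaMalloc(&d_buf, sizeof(float));

    // two async copies to the SAME location on the SAME stream:
    // stream ordering guarantees the second one wins, no sync needed in between
    cudaMemcpyAsync(d_buf, &h_a, sizeof(float), cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(d_buf, &h_b, sizeof(float), cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(&h_out, d_buf, sizeof(float), cudaMemcpyDeviceToHost, stream);

    // a single synchronization at the end is enough before the host reads h_out
    cudaStreamSynchronize(stream);
    printf("h_out = %.1f (expected 2.0)\n", h_out);

    cudaFree(d_buf);
    cudaStreamDestroy(stream);
    return 0;
}
```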

@aendk
Contributor Author

aendk commented Apr 21, 2026

@JohannesGaessler if I understood you correctly:

  • you agree that the event-driven scheduling mechanism (multi-GPU only) might be incorrect / unsuitable on its own.
  • you think that if copies are synchronous, possible bugs of the event-driven scheduling do not appear.
  • if backends behave like CUDA streams, there is no need to synchronize.

My stance here is that:

  • if backends do not behave like CUDA streams / Vulkan command queues, they do not (and should not) implement cpy_tensor_async. The copies are then fully synchronous due to the fallback logic.
  • With the current proposal, individual copies to the same backend are asynchronous. In my eyes, this should be ok with CUDA-stream / Vulkan-command-queue-like behavior, unless they write to the same location (which would be bad design and a race condition).
  • Other than that, there is a strict synchronization before the next copies to the same backend, in the case where this backend is reused for another split in the same inference pass, or for the next µ-batch.

Do you agree? What do you think should be the next steps to get this merged?

@JohannesGaessler
Contributor

First and foremost, I would suggest this patch:

diff --git a/ggml/src/ggml-backend.cpp b/ggml/src/ggml-backend.cpp
index d9f8aaec5..60d8939dc 100644
--- a/ggml/src/ggml-backend.cpp
+++ b/ggml/src/ggml-backend.cpp
@@ -1553,22 +1553,23 @@ static enum ggml_status ggml_backend_sched_compute_splits(ggml_backend_sched_t s
 
         // copy the input tensors to the split backend
         for (int input_id = 0; input_id < split->n_inputs; input_id++) {
+            int input_backend_id = tensor_backend_id(split->inputs[input_id]);
             ggml_backend_t input_backend = ggml_backend_sched_get_tensor_backend(sched, split->inputs[input_id]);
             struct ggml_tensor * input = split->inputs[input_id];
             struct ggml_tensor * input_cpy = tensor_copy(input, split_backend_id, sched->cur_copy);
 
             if (input->flags & GGML_TENSOR_FLAG_INPUT) {
                 // inputs from the user must be copied immediately to prevent the user overwriting the data before the copy is done
-                if (sched->events[split_backend_id][sched->cur_copy] != NULL) {
-                    ggml_backend_event_synchronize(sched->events[split_backend_id][sched->cur_copy]);
+                if (sched->events[input_backend_id][sched->cur_copy] != NULL) {
+                    ggml_backend_event_synchronize(sched->events[input_backend_id][sched->cur_copy]);
                 } else {
                     ggml_backend_synchronize(split_backend);
                 }
                 ggml_backend_tensor_copy(input, input_cpy);
             } else {
                 // wait for the split backend to finish using the input before overwriting it
-                if (sched->events[split_backend_id][sched->cur_copy] != NULL) {
-                    ggml_backend_event_wait(split_backend, sched->events[split_backend_id][sched->cur_copy]);
+                if (sched->events[input_backend_id][sched->cur_copy] != NULL) {
+                    ggml_backend_event_wait(split_backend, sched->events[input_backend_id][sched->cur_copy]);
                 } else {
                     ggml_backend_synchronize(split_backend);
                 }

I did not write the backend scheduler code but to my understanding this is how events should be handled. I have no particular preference whether we fix this as part of this PR or as a standalone one.

@aendk
Contributor Author

aendk commented Apr 21, 2026

I'll dig into your proposed fix.

From my perspective, I think it makes sense to fix the event-based multi-GPU scheduling in a standalone PR. This PR is very beneficial in single-GPU Windows environments, and has been in flight for a long time now.

@JohannesGaessler
Contributor

Just so there's no misunderstanding: by "standalone PR" I meant a standalone PR that would be a precondition for this one, not one after the fact.

@aendk
Contributor Author

aendk commented Apr 24, 2026

I took the time to look into the event-driven mechanism of 38a6f1e. My findings are the following:

  • CPU->GPU synchronization and GPU->GPU synchronization are two different things.
  • GPU->GPU: event recordings and synchronizations are localized in ggml_backend_cuda_cpy_tensor_async.
    • This looks watertight to me: cudaMemcpyPeerAsync + cudaEventRecord on the src stream, with cudaStreamWaitEvent on the dst stream. This is how synchronization between two streams should be done (see the standalone sketch after this list). It is also implicitly synced with the preceding graph execution (because it is on the same src stream), and explicitly synced with the preceding graph execution via cudaStreamWaitEvent, which waits on the cudaEventRecord called after the graph execution.
  • CPU->GPU:
    [screenshot] Unchanged: all the events (shown in grey) wait/check on the same event, which is dispatched **after** them by the following graph execution. This is incorrect; it only guarantees order between executions on the same backend. They should be waiting on the previous graph execution (likely on another backend) for stricter scheduling.
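For reference, a minimal standalone CUDA sketch of that cross-stream pattern (illustrative only, not the ggml-cuda code; for simplicity it uses a same-device cudaMemcpyAsync where the real code uses cudaMemcpyPeerAsync between devices):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaStream_t stream_src, stream_dst;
    cudaEvent_t  copy_done;
    cudaStreamCreate(&stream_src);
    cudaStreamCreate(&stream_dst);
    cudaEventCreateWithFlags(&copy_done, cudaEventDisableTiming);

    const size_t n = 1 << 20;
    float *src_buf, *dst_buf;
    cudaMalloc(&src_buf, n * sizeof(float));
    cudaMalloc(&dst_buf, n * sizeof(float));

    // 1. the async copy is enqueued on the source stream, so it is implicitly
    //    ordered after whatever work (e.g. the previous graph) is already there
    cudaMemcpyAsync(dst_buf, src_buf, n * sizeof(float), cudaMemcpyDeviceToDevice, stream_src);
    // 2. record an event on the source stream right after the copy
    cudaEventRecord(copy_done, stream_src);
    // 3. the destination stream waits on that event before consuming dst_buf
    cudaStreamWaitEvent(stream_dst, copy_done, 0);
    // ... kernels launched on stream_dst after this point see the copied data ...

    cudaStreamSynchronize(stream_dst);
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaEventDestroy(copy_done);
    cudaStreamDestroy(stream_src);
    cudaStreamDestroy(stream_dst);
    cudaFree(src_buf);
    cudaFree(dst_buf);
    return 0;
}
```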

Your fix fixes it, but leads to zero syncs in the first GPU split:
[screenshot]
We discussed the same pattern in the single-GPU setting.
To ensure correctness, we implemented syncs for the single-GPU case for non-CUDA backends. We should therefore keep b1993f1 to have the same syncs in multi-GPU as we do in single-GPU:
[screenshot]
Above, we see the status without the syncs added in b1993f1 (analogous to single GPU). On the very left, there are no barriers between the asynchronous memcpys (the group of red bubbles) and the subsequent graph computation (green bar) on GPU0. This is fine for CUDA, but as discussed in #17795, we should separate the memcpy operations from the preceding operations and the subsequent graph execution.

Regarding the bug still surfacing on HIP: @IMbackK could you try 38a6f1e with the patch suggested in #20793 (comment)? And what is your exact hardware and software setup?
Note that @thejacer also runs on HIP/AMD and reported no bugs, so it might be something specific to your setup. I also asked an AI about bugs in hipMemcpyPeerAsync. Depending on the ROCm version and AMD hardware, different bugs could surface:

| ROCm range | Correctness on XGMI | Correctness on PCIe P2P | Correctness on host-staging fallback | Notes |
|--------------|-----------------------|--------------------------|--------------------------------------|-------|
| 4.x        | broken              | broken                  | broken                                | upgrade |
| 5.0 – 5.4  | mostly ok           | mostly ok               | under-synced in some cases            | fixes landing incrementally |
| 5.5 – 5.7  | ok                  | ok                      | fixed by 5.7                          | recommended minimum |
| 6.0 – 6.2  | ok                  | ok                      | ok but slow (serialized)              | consumer RDNA falls here |
| 6.3+       | ok                  | ok                      | ok                                    | current target |

My stance is therefore:

  1. Apply @JohannesGaessler's patch for tighter and more correct scheduling.
  2. Keep b1993f1, for the same reasons as in the single-GPU case.
  3. Compare the hardware and software stacks of @thejacer and @IMbackK. Both run HIP on AMD, but only one sees faulty scheduling without b1993f1 and the suggested patch. There is a non-zero chance that this is a HIP/ROCm bug.
  4. Regardless of step 3, we should all see correct results with b1993f1 and the patch from @JohannesGaessler (will push this shortly). If that is the case (please test if time allows @IMbackK @JohannesGaessler), I think this PR is ready to merge.

@aendk
Contributor Author

aendk commented Apr 24, 2026

Regarding @JohannesGaessler's patch, I need to think about this again.
The original implementation might be ok as well, since we only need to order CPU->GPU copies and graph execution on the same backend; so once the graph-execution end event has taken place, the new indices and masks for the next graph execution can be copied to this backend, regardless of what the other backends are currently doing.
It might not be necessary to wait/synchronize these copies with graph executions on other backends/GPUs.

@aendk
Contributor Author

aendk commented Apr 24, 2026

Still need to give it more thought, but I now think the truth is in the middle:

  • CPU-to-GPU transfer of input tensors:
    • these memcpys only need to be synced with the previous graph execution on the same backend, so that indices/masks are only updated when no graph execution is in flight.
  • GPU-to-GPU weight tensor memcpys:
    • these need to be synced with the graph-execution finalization event of the previous backend (their source backend), so that they do not start before that graph execution is done (a bug possibility for non-CUDA backends only).

@JohannesGaessler
Contributor

The logic should be in terms of ggml_backend_event_t, not CUDA-specific constructs. Ideally, the code in the backend scheduler should unconditionally try to create and use those ggml backend events. If a backend does not support events and thus returns nullptr, or if ggml_backend_event::device is incompatible with the backend used in ggml_backend_event_wait, then the code should fall back to synchronization that does not rely on events.
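A rough sketch of that fallback shape (paraphrased; the compatibility check is illustrative and not part of the current code):

```cpp
// Paraphrased idea, not the current ggml-backend.cpp code.
ggml_backend_event_t ev = sched->events[input_backend_id][sched->cur_copy];

if (ev != NULL /* && the event's device is compatible with split_backend */) {
    // event-based path: the split backend waits only on the producer's event
    ggml_backend_event_wait(split_backend, ev);
} else {
    // fallback for backends without (compatible) events: full synchronization
    ggml_backend_synchronize(split_backend);
}
```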
