Sched: Reintroduce fewer synchronizations between tokens, with fixed pipeline parallelism. #20793
aendk wants to merge 3 commits into ggml-org:master
Conversation
Will test soon.
@mxxm-t don't bother right now. Once I think I have the correct solution, I'll force-push and ping you again.
Force-pushed a48fd3b to 06e8b36
With #20927 being merged, I now see identical PPL on master and the reapplied original PR on my RTX PRO 6000 Max-Q / RTX PRO 4500 setup. @mxxm-t feel free to try it if time allows.

Linux performance:
Benchmark: 2x NVIDIA L40S (sm_89 / Lovelace)
Model: Qwen3.5-122B-A10B Q5_K_S
Hardware: 2x NVIDIA L40S 48GB, AMD EPYC 9354P, Linux
Master (dc8d14c, b8537)
PR branch (06e8b36, b8508)

Summary: all results are within noise on this setup. No regression, no measurable improvement. tg128 ~+0.1 t/s (within margin of error). This is consistent with your observation that the benefit is primarily on Windows; I think Linux multi-GPU pipeline parallelism scheduling is unaffected by this change on my setup.
@mxxm-t @slavap @Superbobo75 @thejacer if you have the time, please test PPL (as outlined in #20463 (comment)) and performance on your setups.
PR #20793 vs Master
@ggerganov |
JohannesGaessler left a comment:
Sorry, I'm confused. The way I read the linked PR in which the original one got reverted, @ggerganov is saying that the original PR had to be reverted to restore correct results. But isn't this PR making the exact same changes as before?
Correct in both aspects. Since then, @am17an merged his fix. Reapplying my changes now still yields correct results, since the scheduling bug was not part of my PR. My PR just removed a lot of "speed bumps" in the form of synchronizations, which exposed the bug.
Can you also link the fix in question?
I think we should wait for the TP PR (#13776) to be merged and stable before merging this again. Debugging synchronization issues is a pain.
@JohannesGaessler it's #20927
@am17an @JohannesGaessler do we have a rough timeline for when we consider #19378 to be stable? Since this closes a major perf gap between Windows and Linux, most of the user base is on Windows, and there is no real benefit to leaving it sitting, revisiting/merging this sooner rather than later makes sense to me.
I need to find the time to test this on HIP first for sure, since the previous attempt at this broke HIP entirely.
Sorry, I forgot about this PR. Please rebase on top of master and I'll check whether there are issues (though the synchronization logic should be backend-agnostic). |
…ml-org#17795)
* Adds CPU-to-CUDA copy capability to ggml_backend_cuda_cpy_tensor_async()
* Adds function to relax sync requirements between input copies on supported backends (CUDA for now)
* Exchanges synchronous copy with async copy function.
* Adds macro guards to allow compilation in non-CUDA builds
* Reworked backend detection in ggml-backend.cpp to avoid linking conflicts
* Relax requirement of checks in async CUDA copies from backend and buffer type to just buffer type, to avoid linking issues
* Minor cleanup
* Makes opt-in to relax use of explicit syncs more general. Backends like Vulkan which require a synchronization between HtoD copies and graph execution could also adopt this change now.
* Reintroduces stricter check for CPU->CUDA backend async copy via GGML_DEVICE_TYPE_CPU.
* Corrects initialization of ggml_backend_sync_mode in ggml_backend_sched_split initialization
* Simplifies synchronizations to adhere to `saaasg` pattern.
* Apply suggestion from @ggerganov (src->buffer to buf_src)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Apply suggestion from @ggerganov (src->buffer to buf_src) v2

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Force-pushed 5002405 to 38a6f1e
I rebased and spot-checked single and multi-GPU again. In both cases, PPL is identical to master.
Using 4x RTX 4090 I am unable to provoke issues; the PPL values I get are bit-for-bit identical for both master and this PR.
My testing above was done on 2x MI50. Does that satisfy HIP testing?

@thejacer that helps, yeah
Unfortunately this reintroduces #20433 on HIP. Reproducer from that issue:

This PR:
Master @b8785
@IMbackK I guess there is a subtle difference in the inherent ordering of memcpy and compute between CUDA and HIP. That means it makes sense to make the …

Just out of curiosity, can you check if unguarding the …
Master:
PR (rebased):
Patch1:
Patch2:
Both: PPL: pass

All of the performance values are within run-to-run variance. From the perspective of purely the HIP backend, this PR is not worthwhile in the first place.
…kend pipeline parallel bugs.
Thanks @IMbackK for the exhaustive testing.

The original change was:
The new change proposed now is:

So in both cases, we reduce the number of synchronizations to the same minimum. As mentioned in the previous PR linked above, this brings a measurable/significant speed-up for the majority of llama.cpp users (Windows + CUDA). The PR is now ready for review @ggerganov @JohannesGaessler @ORippler @IMbackK.

[0]
I mean OK, but really we should understand why the sync in this position is required and have some kind of spec for what does or does not cause implicit sync or ordering in a ggml backend, rather than just sort of guessing based on observed behavior.
@IMbackK the core answer is that the multi-GPU event-based synchronization mechanism is not suitable as the only scheduling mechanism for multi-GPU. It is not clear if it was even designed for that, or if it was and there is a bug. The good side is that this PR proposes to exchange some implicit synchronizations with two explicit synchronizations. This makes the scheduling mechanism more explicit, and easier to understand and maintain. Right now, the multi-GPU scheduling on …
JohannesGaessler left a comment:
Looking at the code on master, I don't think the events are being used correctly. The backend scheduler has events per backend and per copy. At the end of the loop in `ggml_backend_sched_compute_splits` an event is recorded for the ggml backend with its own ID `split_backend_id`. At the beginning of the loop a backend waits for the event of its own ID. This is correct if the synchronization is between copies of the same backend, but incorrect if the synchronization is between different backends, such as for multiple GPUs. If however the copies between backends are synchronous, then this defect does not manifest as a bug.
```cpp
} else {
    ggml_backend_synchronize(split_backend);
```
Why is the else branch being removed here? In `ggml_backend_sched_new` the events are created unconditionally. If the event is null here, that would imply that events are not supported for a backend. However, because that backend could still have implemented asynchronous execution and/or tensor copies (as is, I think, the case for Vulkan), we could get a race condition here.
From my notes when I worked on this part of the code, `if (sched->events[split_backend_id][sched->cur_copy] != NULL)` determines whether the single-device/GPU or the multi-GPU (pipeline parallelism) scheduling logic should be applied. The else-case is thus always applied in single-GPU settings.
I removed the else-case because I determined it to be unnecessary. It adds a single synchronization between async copies to the same backend.
In the previous PR, we clarified that the ideal backend design should support multiple concurrent async copies, similar to a CUDA stream or a Vulkan command queue.
If they don't, they should not implement async copies anyway. In that case, the fallback is a fully synchronous copy.
So this was an extra synchronization which is only applied in single-GPU settings. I removed it as the backend design did not require it, and no bugs appeared for this part of the PR.
Additionally, @ggerganov also indicated that removing it is OK (by suggesting the saasg-pattern).
Actually, I think it would also be incorrect for variations in the copy index, but because ggml backends are supposed to have the same synchronization behavior as CUDA streams, it should not be necessary to synchronize between copies.
@JohannesGaessler if I understood you correctly:

My stance here is that:

Do you agree? What do you think should be the next steps to get this merged?
First and foremost, I would suggest this patch:

```diff
diff --git a/ggml/src/ggml-backend.cpp b/ggml/src/ggml-backend.cpp
index d9f8aaec5..60d8939dc 100644
--- a/ggml/src/ggml-backend.cpp
+++ b/ggml/src/ggml-backend.cpp
@@ -1553,22 +1553,23 @@ static enum ggml_status ggml_backend_sched_compute_splits(ggml_backend_sched_t s
         // copy the input tensors to the split backend
         for (int input_id = 0; input_id < split->n_inputs; input_id++) {
+            int input_backend_id = tensor_backend_id(split->inputs[input_id]);
             ggml_backend_t input_backend = ggml_backend_sched_get_tensor_backend(sched, split->inputs[input_id]);
             struct ggml_tensor * input = split->inputs[input_id];
             struct ggml_tensor * input_cpy = tensor_copy(input, split_backend_id, sched->cur_copy);

             if (input->flags & GGML_TENSOR_FLAG_INPUT) {
                 // inputs from the user must be copied immediately to prevent the user overwriting the data before the copy is done
-                if (sched->events[split_backend_id][sched->cur_copy] != NULL) {
-                    ggml_backend_event_synchronize(sched->events[split_backend_id][sched->cur_copy]);
+                if (sched->events[input_backend_id][sched->cur_copy] != NULL) {
+                    ggml_backend_event_synchronize(sched->events[input_backend_id][sched->cur_copy]);
                 } else {
                     ggml_backend_synchronize(split_backend);
                 }
                 ggml_backend_tensor_copy(input, input_cpy);
             } else {
                 // wait for the split backend to finish using the input before overwriting it
-                if (sched->events[split_backend_id][sched->cur_copy] != NULL) {
-                    ggml_backend_event_wait(split_backend, sched->events[split_backend_id][sched->cur_copy]);
+                if (sched->events[input_backend_id][sched->cur_copy] != NULL) {
+                    ggml_backend_event_wait(split_backend, sched->events[input_backend_id][sched->cur_copy]);
                 } else {
                     ggml_backend_synchronize(split_backend);
                 }
```

I did not write the backend scheduler code but to my understanding this is how events should be handled. I have no particular preference whether we fix this as part of this PR or as a standalone one.
I'll dig into your proposed fix. From my perspective, I think it makes sense to fix the event-based multi-GPU scheduling in a standalone PR. This PR is very beneficial in single-GPU Windows environments, and has been in flight for a long time now.
Just so there's no misunderstanding: by "standalone PR" I meant a standalone PR that would be a precondition for this one, not one after the fact. |
I took the time to look into the event-driven mechanism of 38a6f1e. My findings are the following:

Unchanged, all the events (shown in grey) wait/check on the same event, which is dispatched **after** them by the following graph execution. This is incorrect; it only guarantees order between executions on the same backend. They should be waiting on the previous graph execution (likely on another backend) for stricter scheduling.

Your fix fixes it, but leads to zero syncs in the first GPU split:

Regarding the bug still surfacing on HIP, @IMbackK …

My stance is therefore:
Regarding @JohannesGaessler's patch, I need to think about this again.
Still need to give it more thought, but I now think the truth is in the middle:
The logic should be in terms of …



Follow-up to #20463 (comment).
#17795 improved performance in the single-GPU setting on CUDA, but it was rolled back due to a bug surfacing in multi-GPU pipeline-parallel settings.

For the single-GPU setting, it moved the scheduling from `sassassasg` to the more efficient `saaasg` pattern, where `s` = sync, `a` = async copy, `g` = graph execution. Each asynchronous copy was enclosed in two synchronizations. Removing some superfluous synchronizations improved performance, especially on Windows. The change was to only do a single synchronization between memory copies and graph execution.
However, in multi-GPU settings, we saw `llama-perplexity` regressions indicating incorrect scheduling (#20463). I found that the event-based pipeline parallelism scheduling mechanism very likely implicitly relies on synchronous copies, as (i) in my testing `copy_from_host` worked as intended, and (ii) disabling it and therefore introducing synchronous copies fixed the bug; `llama-perplexity` perplexity was then identical to master.

The proposed fix here is therefore to enroll pipeline parallelism into the same synchronization between async copies and graph execution as the single-GPU case already has.
I think this can be a good solution, as it keeps scheduling similar between single GPU and multi GPU, and because it is simpler and safer than reworking the event-driven pipeline parallelism logic. In my testing, this proposal has the same performance benefits as the initial PR, and it yields correct perplexity scores both in single- and multi-GPU settings.
As this bug surfaced in the community with their more diverse hardware setups and usage scenarios, it would be awesome if you could test-drive this change both with `llama-bench` and `llama-perplexity`, with your usual model and launch options! @mxxm-t @slavap @Superbobo75 @thejacer

If you can, check out this branch and compare it against its master base (`git checkout HEAD~2`). Let me know if you run into performance or accuracy issues!