ggml: add graph_reused by am17an · Pull Request #21764 · ggml-org/llama.cpp

am17an · 2026-04-11T08:37:16Z

Overview

Add reused member variable to ggml_cgraph so backends can take advantage of the graph reuse functionality. Currently when graph_reuse in invoked, the CUDA backend still does the props change check to figure out if the graph has changed or not, where in fact graph_reuse (to my understanding) guarantees this to be true. This helps bypass a mildly expensive O(n) check.

Additional information

Testing: I tested various combinations like --n-cpu-moe, -nkvo and -ngl and verified it works. Additional testing would be welcome.

Results on a 5090:

Model	Test	t/s cuda_mul_fused	t/s ggml_graph_reuse	Speedup
gemma4 ?B Q4_0	tg128	219.43	231.50	1.06
gemma4 ?B Q4_0	tg128@d16384	183.23	191.91	1.05
gemma4 ?B Q4_0	tg128@d32768	175.15	182.58	1.04
gpt-oss 20B MXFP4 MoE	tg128	313.29	323.01	1.03
gpt-oss 20B MXFP4 MoE	tg128@d16384	279.59	288.05	1.03
gpt-oss 20B MXFP4 MoE	tg128@d32768	258.80	266.12	1.03
qwen35 27B Q4_K_M	tg128	65.04	66.61	1.02
qwen35 27B Q4_K_M	tg128@d16384	62.45	63.61	1.02
qwen35 27B Q4_K_M	tg128@d32768	59.65	60.70	1.02
qwen35moe 35B.A3B Q4_K_S	tg128	194.52	206.56	1.06
qwen35moe 35B.A3B Q4_K_S	tg128@d16384	184.11	196.36	1.07
qwen35moe 35B.A3B Q4_K_S	tg128@d32768	175.95	187.60	1.07

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES, for general understanding of the scheduler

JohannesGaessler

This would work from a llama.cpp perspective but I'm not convinced it's the right way to handle this from a ggml perspective. In my opinion we should attach a ggml_guid and write the code around that. With a GUID the contract for "user code" would be "if you change the graph you must change the GUID", it's not clear to me what the correct way to use a reused flag would be.

am17an · 2026-04-11T14:14:26Z

I added a version number, currently we don't have a guid generator in master. If the current version is acceptable, I can add a guid generator however I think a simple static version counter is also fine imo. The idea of versioning that anything that mutates the cgraph should have increment the version number.

jeffbolznv · 2026-04-11T14:19:48Z

    }

+    for (int i = 0; i < sched->n_splits; i++) {
+        sched->splits[i].graph.version = graph->version;


Should all splits really be the same uuid?

I don't see any additional benefit of having a per split uuid as of now

I think you're right though, it makes sense to have a per split uuid

Splits should definitely have different identifiers than the main graph.

Splits should definitely have different identifiers than the main graph.

Do you have something in mind? I think if all splits have the same uuid as the main graph, the logic would be simpler. Just not sure if you see something that would prevent doing so.

To clarify, I think that the cuda graph key (i.e. nodes[0] ptr) + the common uuid would uniquely identify the splits.

Since we allow arbitrary --tensor-overrides we can have multiple splits per backend, no? If a user assigns some tensor in the middle to the CPU you will end up with 2 splits running on the GPU beforehand and afterwards. And if the graphs for those splits have the same UID then they can be incorrectly re-used.

We could also simplify the logic by not requiring that the UIDs of the split graphs share bits with the original graph. I think that is only really useful for debugging, otherwise we can just call the function to get a UID again.

In CUDA it will not be an issue because we store the node ptrs as the key, but for other backend it might cause an issue that you arrive at graph_compute_async with the same uid as before (e.g. in case of the CPU split), even though it's a different split. I like the split being part of the id as it ties it to a particular graph, as it is not a separate from the main compute graph.

JohannesGaessler · 2026-04-12T08:51:54Z

I suppose an index that is incremented per graph would also work, even at 1000 new graphs / second it would take like 500 million years until a 64 bit integer is exhausted.

am17an · 2026-04-12T09:39:34Z

1000 new graphs / second it would take like 500 million years until a 64 bit integer is exhausted.

I've shortened this time frame by using the top 12 bits for the split index. It would only last ~500 years now. The counter needs to move to a .c file though, otherwise each TU will get it's own copy

JohannesGaessler

The function ggml_backend_graph_optimize should regenerate the graph identifier.
In ggml-backend.cpp there are graph manipulations where the sources of nodes are replaced, this should also result in a change in the graph identifier.
I think that ggml-opt.cpp should be fine with these changes.
ggml-backend-meta.cpp the manually created and re-used graphs cgraphs_aux will need to have their identifiers regenerated.

Please let me know if any of these are unclear or if you want me to take over some of these.

am17an · 2026-04-12T15:20:07Z

In ggml-backend.cpp there are graph manipulations where the sources of nodes are replaced, this should also result in a change in the graph identifier.

Could you point me where? From what I understand bumping the counter only when sched_reset is called should be good. Since everything is reset from this point and the graph's id should remain the same till the next sched_reset

JohannesGaessler · 2026-04-12T16:02:32Z

Consider this line. On that line node is just a pointer to one of the nodes of the original graph. So before and after calling ggml_backend_split_graph the passed graph is different and the corresponding identifier should also be different to reflect this.

Also looking at the code again, I think the correct place to set the split graph identifiers would be at the end of ggml_backend_split_graph rather than in ggml_backend_sched_graph_compute_async.

am17an · 2026-04-12T16:16:12Z

I see, since ggml_backend_sched_split_graph is a part of the public API it can be used like you said. So we should also increment the id there (or maybe only there? because I don't see how you can create a graph without calling that function)

JohannesGaessler · 2026-04-12T20:55:01Z

So we should also increment the id there (or maybe only there?

That would I think be the correct way to do it.

JohannesGaessler

From my end these changes would be good otherwise.

JohannesGaessler · 2026-04-13T12:15:00Z

ggml-backend-meta.cpp the manually created and re-used graphs cgraphs_aux will need to have their identifiers regenerated.

Sorry, I had misremembered. At some WIP version I had used ggml_cgraph references but at some later point I replaced these with ggml graphs that are properly created from a ggml_context * so no changes were necessary after all.

gaugarg-nv · 2026-04-16T01:07:57Z

@am17an @JohannesGaessler It will be great to have this PR merged, as it improves perf quite a bit and is also a prerequisite to reducing CPU overhead in the TP implementation.

My understanding is that this will break CUDA graphs in the Meta backend as we recreate subgraphs in every compute call. This can be fixed if sub-graphs are cached into the meta backend, similar to JohannesGaessler@543e30d. I am happy to help fix it either within this PR or a separate PR.

am17an · 2026-04-16T03:43:59Z

This is still pending @ggerganov's review, after that we can merge. The TP stuff can be in a follow-up PR

JohannesGaessler · 2026-04-16T06:53:45Z

My understanding is that this will break CUDA graphs in the Meta backend as we recreate subgraphs in every compute call.

This PR will not provide any benefit but it should not break anything because the fallback for a UID mismatch is the behavior on master.

am17an · 2026-04-16T07:25:11Z

@JohannesGaessler can you also approve?

* ggml: add graph_reused * use versioning instead of reuse flag * increment version with atomic * use top bits for split numbering * add assert * move counter to ggml.c * set uid in split_graph only * fix windows * address further review comments * get next_uid rather than doing bit manipulation * rename + add comment about uid

am17an requested review from a team and ggerganov as code owners April 11, 2026 08:37

github-actions Bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Apr 11, 2026

JohannesGaessler reviewed Apr 11, 2026

View reviewed changes

gaugarg-nv reviewed Apr 11, 2026

View reviewed changes

Comment thread ggml/src/ggml-cuda/ggml-cuda.cu Outdated

jeffbolznv reviewed Apr 11, 2026

View reviewed changes

am17an force-pushed the ggml_graph_reuse branch from 72a96cc to 5dd2823 Compare April 12, 2026 08:21

JohannesGaessler reviewed Apr 12, 2026

View reviewed changes

Comment thread ggml/src/ggml-impl.h Outdated

Comment thread ggml/src/ggml.c Outdated

am17an added 6 commits April 12, 2026 23:48

ggml: add graph_reused

6c80ace

use versioning instead of reuse flag

b510b59

increment version with atomic

6d0a18b

use top bits for split numbering

4bb088b

add assert

fd54724

move counter to ggml.c

5d967fb

set uid in split_graph only

266c3e5

am17an force-pushed the ggml_graph_reuse branch from e795977 to 266c3e5 Compare April 13, 2026 02:51

fix windows

ce409bc

JohannesGaessler reviewed Apr 13, 2026

View reviewed changes

Comment thread ggml/src/ggml.c Outdated

JohannesGaessler reviewed Apr 13, 2026

View reviewed changes

Comment thread ggml/src/ggml.c

am17an force-pushed the ggml_graph_reuse branch from eebe3f0 to b674f92 Compare April 13, 2026 13:35

gaugarg-nv reviewed Apr 13, 2026

View reviewed changes

Comment thread ggml/src/ggml.c

Comment thread ggml/src/ggml.c Outdated

Comment thread ggml/src/ggml.c Outdated

am17an force-pushed the ggml_graph_reuse branch from b674f92 to 001e33a Compare April 13, 2026 17:11

address further review comments

1816cc7

am17an force-pushed the ggml_graph_reuse branch from 001e33a to 1816cc7 Compare April 13, 2026 17:16

ggerganov reviewed Apr 16, 2026

View reviewed changes

Comment thread ggml/include/ggml.h Outdated

Comment thread ggml/src/ggml-backend.cpp

am17an force-pushed the ggml_graph_reuse branch from 7449178 to cf9dee0 Compare April 16, 2026 05:53

get next_uid rather than doing bit manipulation

b3a370f

am17an force-pushed the ggml_graph_reuse branch from cf9dee0 to b3a370f Compare April 16, 2026 05:54

ggerganov approved these changes Apr 16, 2026

View reviewed changes

Comment thread ggml/src/ggml-impl.h

Comment thread ggml/src/ggml-cuda/common.cuh Outdated

rename + add comment about uid

d656780

JohannesGaessler approved these changes Apr 16, 2026

View reviewed changes

am17an merged commit 3f7c29d into ggml-org:master Apr 16, 2026
50 of 51 checks passed

am17an deleted the ggml_graph_reuse branch April 16, 2026 09:21

Conversation

am17an commented Apr 11, 2026

Overview

Additional information

Requirements

Uh oh!

JohannesGaessler left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

am17an commented Apr 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeffbolznv Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

am17an Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

am17an Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

JohannesGaessler Apr 12, 2026

Choose a reason for hiding this comment

Uh oh!

ggerganov Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JohannesGaessler Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

am17an Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

JohannesGaessler commented Apr 12, 2026

Uh oh!

am17an commented Apr 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JohannesGaessler left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

am17an commented Apr 12, 2026

Uh oh!

JohannesGaessler commented Apr 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

am17an commented Apr 12, 2026

Uh oh!

JohannesGaessler commented Apr 12, 2026

Uh oh!

JohannesGaessler left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

JohannesGaessler commented Apr 13, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gaugarg-nv commented Apr 16, 2026

Uh oh!

am17an commented Apr 16, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

JohannesGaessler commented Apr 16, 2026

Uh oh!

am17an commented Apr 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

am17an commented Apr 11, 2026 •

edited

Loading

ggerganov Apr 14, 2026 •

edited

Loading

am17an commented Apr 12, 2026 •

edited

Loading

JohannesGaessler commented Apr 12, 2026 •

edited

Loading