
ggml: add graph_reused #21764

Merged
am17an merged 11 commits into ggml-org:master from am17an:ggml_graph_reuse on Apr 16, 2026
Conversation

@am17an
Contributor

@am17an am17an commented Apr 11, 2026

Overview

Add a reused member variable to ggml_cgraph so backends can take advantage of the graph-reuse functionality. Currently, when graph_reuse is invoked, the CUDA backend still runs its properties-change check to determine whether the graph has changed, even though graph_reuse (to my understanding) guarantees that it has not. Skipping it bypasses a mildly expensive O(n) check.
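The idea above can be sketched roughly as follows. This is a hypothetical illustration, not the actual CUDA-backend code: all names (node_props, cgraph_sketch, graph_unchanged) are made up for the example. On master the backend walks every node and compares cached properties, an O(n) pass; a scheduler-set reuse flag would make the common case O(1).

```c
#include <stdbool.h>

// illustrative stand-in for the per-node properties the backend caches
struct node_props {
    void * data;
    int    op;
};

// illustrative stand-in for a compute graph carrying a reuse flag
struct cgraph_sketch {
    int  n_nodes;
    bool reused;   // assumed flag: set when the scheduler reuses the graph as-is
    struct node_props * nodes;
};

static bool graph_unchanged(const struct cgraph_sketch * g,
                            const struct node_props   * cached) {
    if (g->reused) {
        return true;  // scheduler guarantee: nothing to re-check
    }
    for (int i = 0; i < g->n_nodes; i++) {  // O(n) fallback, as on master
        if (g->nodes[i].data != cached[i].data || g->nodes[i].op != cached[i].op) {
            return false;
        }
    }
    return true;
}
```

With the flag set, the node walk is skipped entirely; without it, the behavior falls back to the full comparison.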

Additional information

Testing: I tested various combinations like --n-cpu-moe, -nkvo and -ngl and verified it works. Additional testing would be welcome.

Results on a 5090:

| Model | Test | t/s (cuda_mul_fused) | t/s (ggml_graph_reuse) | Speedup |
| --- | --- | --- | --- | --- |
| gemma4 ?B Q4_0 | tg128 | 219.43 | 231.50 | 1.06 |
| gemma4 ?B Q4_0 | tg128@d16384 | 183.23 | 191.91 | 1.05 |
| gemma4 ?B Q4_0 | tg128@d32768 | 175.15 | 182.58 | 1.04 |
| gpt-oss 20B MXFP4 MoE | tg128 | 313.29 | 323.01 | 1.03 |
| gpt-oss 20B MXFP4 MoE | tg128@d16384 | 279.59 | 288.05 | 1.03 |
| gpt-oss 20B MXFP4 MoE | tg128@d32768 | 258.80 | 266.12 | 1.03 |
| qwen35 27B Q4_K_M | tg128 | 65.04 | 66.61 | 1.02 |
| qwen35 27B Q4_K_M | tg128@d16384 | 62.45 | 63.61 | 1.02 |
| qwen35 27B Q4_K_M | tg128@d32768 | 59.65 | 60.70 | 1.02 |
| qwen35moe 35B.A3B Q4_K_S | tg128 | 194.52 | 206.56 | 1.06 |
| qwen35moe 35B.A3B Q4_K_S | tg128@d16384 | 184.11 | 196.36 | 1.07 |
| qwen35moe 35B.A3B Q4_K_S | tg128@d32768 | 175.95 | 187.60 | 1.07 |

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, for general understanding of the scheduler

@am17an am17an requested review from a team and ggerganov as code owners April 11, 2026 08:37
@github-actions bot added the labels Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) on Apr 11, 2026
@JohannesGaessler (Contributor) left a comment

This would work from a llama.cpp perspective, but I'm not convinced it's the right way to handle this from a ggml perspective. In my opinion we should attach a ggml_guid and write the code around that. With a GUID the contract for "user code" would be "if you change the graph you must change the GUID"; it's not clear to me what the correct way to use a reused flag would be.

Comment thread ggml/src/ggml-cuda/ggml-cuda.cu Outdated
@am17an
Contributor Author

am17an commented Apr 11, 2026

I added a version number; currently we don't have a guid generator in master. If the current version is acceptable I can add a guid generator, but I think a simple static version counter is also fine. The idea of versioning is that anything that mutates the cgraph should increment the version number.
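A minimal sketch of that versioning idea (illustrative names, not the merged code): a process-wide counter lives in a single translation unit, and every operation that mutates a cgraph stamps it with a fresh version, so a backend only needs to compare the last version it saw.

```c
#include <stdatomic.h>
#include <stdint.h>

// process-wide counter; defined once (in a .c file) so every TU shares it
static atomic_uint_least64_t g_graph_version = 1;

// illustrative stand-in for a cgraph carrying a version stamp
struct cgraph_v {
    uint64_t version;
};

// call whenever the graph is (re)built or otherwise mutated
static void cgraph_bump_version(struct cgraph_v * g) {
    g->version = atomic_fetch_add(&g_graph_version, 1);
}

// a backend caches the last version it compiled for and compares
static int cgraph_changed(const struct cgraph_v * g, uint64_t last_seen) {
    return g->version != last_seen;
}
```

The atomic increment keeps the stamp unique even when multiple threads build graphs concurrently.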

Comment thread ggml/src/ggml-backend.cpp Outdated
}

for (int i = 0; i < sched->n_splits; i++) {
    sched->splits[i].graph.version = graph->version;
Contributor

Should all splits really be the same uuid?

Contributor Author

I don't see any additional benefit of having a per split uuid as of now

Contributor Author

I think you're right though, it makes sense to have a per split uuid

Contributor

Splits should definitely have different identifiers than the main graph.

@ggerganov (Member) Apr 14, 2026

Splits should definitely have different identifiers than the main graph.

Do you have something in mind? I think if all splits have the same uuid as the main graph, the logic would be simpler. Just not sure if you see something that would prevent doing so.

To clarify, I think that the cuda graph key (i.e. nodes[0] ptr) + the common uuid would uniquely identify the splits.
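The key proposed above can be sketched as a simple struct; this is a hypothetical illustration (the names cuda_graph_key and cuda_graph_key_eq are made up), not the actual CUDA-backend cache code. The point is that the nodes[0] pointer distinguishes splits within one graph, while the shared uid distinguishes graph generations.

```c
#include <stdbool.h>
#include <stdint.h>

// hypothetical cache key for a compiled CUDA graph of one split
struct cuda_graph_key {
    const void * first_node;  // nodes[0] pointer of the split
    uint64_t     graph_uid;   // uid common to all splits of the scheduler graph
};

static bool cuda_graph_key_eq(struct cuda_graph_key a, struct cuda_graph_key b) {
    return a.first_node == b.first_node && a.graph_uid == b.graph_uid;
}
```

Two splits of the same graph share the uid but differ in first_node; the same split across two graph generations shares first_node but differs in uid.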

Contributor

Since we allow arbitrary --tensor-overrides we can have multiple splits per backend, no? If a user assigns some tensor in the middle to the CPU you will end up with 2 splits running on the GPU beforehand and afterwards. And if the graphs for those splits have the same UID then they can be incorrectly re-used.

We could also simplify the logic by not requiring that the UIDs of the split graphs share bits with the original graph. I think that is only really useful for debugging, otherwise we can just call the function to get a UID again.

Contributor Author

In CUDA it will not be an issue because we store the node ptrs as the key, but for other backends it might cause an issue that you arrive at graph_compute_async with the same uid as before (e.g. in the case of the CPU split), even though it's a different split. I like the split index being part of the id, as it ties the split to a particular graph; it is not separate from the main compute graph.

Comment thread ggml/src/ggml-impl.h Outdated
@am17an am17an force-pushed the ggml_graph_reuse branch from 72a96cc to 5dd2823 Compare April 12, 2026 08:21
@JohannesGaessler
Contributor

I suppose an index that is incremented per graph would also work, even at 1000 new graphs / second it would take like 500 million years until a 64 bit integer is exhausted.

@am17an
Contributor Author

am17an commented Apr 12, 2026

1000 new graphs / second it would take like 500 million years until a 64 bit integer is exhausted.

I've shortened this time frame by using the top 12 bits for the split index, which leaves 52 bits for the counter. The counter needs to move to a .c file though, otherwise each TU will get its own copy.
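The packing described here can be sketched as follows. This is a hedged illustration under the stated assumptions (top 12 bits for the split index, 52 counter bits); the names are made up, and per the later commits the merged code ended up calling a next-uid helper rather than doing bit manipulation at the call sites.

```c
#include <stdint.h>

#define SPLIT_BITS  12
#define SPLIT_SHIFT (64 - SPLIT_BITS)   // 52 bits remain for the counter

// one definition in a single .c file; incremented atomically in real code
static uint64_t g_uid_counter = 0;

// fresh base uid for a newly built graph (masked to the low 52 bits)
static uint64_t next_graph_uid(void) {
    return ++g_uid_counter & ((1ULL << SPLIT_SHIFT) - 1);
}

// derive the uid of split `split_index` from the graph's base uid
static uint64_t split_uid(uint64_t graph_uid, uint64_t split_index) {
    return ((split_index & ((1ULL << SPLIT_BITS) - 1)) << SPLIT_SHIFT) | graph_uid;
}
```

Because the split index occupies disjoint bits, two splits of the same graph get distinct uids while still sharing the counter part, which is handy for debugging.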

@JohannesGaessler (Contributor) left a comment

  • The function ggml_backend_graph_optimize should regenerate the graph identifier.
  • In ggml-backend.cpp there are graph manipulations where the sources of nodes are replaced, this should also result in a change in the graph identifier.
  • I think that ggml-opt.cpp should be fine with these changes.
  • In ggml-backend-meta.cpp, the manually created and re-used cgraphs_aux graphs will need to have their identifiers regenerated.

Please let me know if any of these are unclear or if you want me to take over some of these.

Comment thread ggml/src/ggml-impl.h Outdated
Comment thread ggml/src/ggml.c Outdated
@am17an
Contributor Author

am17an commented Apr 12, 2026

In ggml-backend.cpp there are graph manipulations where the sources of nodes are replaced, this should also result in a change in the graph identifier.

Could you point me to where? From what I understand, bumping the counter only when sched_reset is called should be enough, since everything is reset from that point and the graph's id should remain the same until the next sched_reset.

@JohannesGaessler
Contributor

JohannesGaessler commented Apr 12, 2026

Consider this line. On that line node is just a pointer to one of the nodes of the original graph. So before and after calling ggml_backend_split_graph the passed graph is different and the corresponding identifier should also be different to reflect this.

Also looking at the code again, I think the correct place to set the split graph identifiers would be at the end of ggml_backend_split_graph rather than in ggml_backend_sched_graph_compute_async.

@am17an
Contributor Author

am17an commented Apr 12, 2026

I see, since ggml_backend_sched_split_graph is a part of the public API it can be used like you said. So we should also increment the id there (or maybe only there? because I don't see how you can create a graph without calling that function)

@JohannesGaessler
Contributor

So we should also increment the id there (or maybe only there?

That would I think be the correct way to do it.

@am17an am17an force-pushed the ggml_graph_reuse branch from e795977 to 266c3e5 Compare April 13, 2026 02:51
@JohannesGaessler (Contributor) left a comment

From my end these changes would be good otherwise.

Comment thread ggml/src/ggml.c Outdated
@JohannesGaessler
Contributor

ggml-backend-meta.cpp the manually created and re-used graphs cgraphs_aux will need to have their identifiers regenerated.

Sorry, I had misremembered. At some WIP version I had used ggml_cgraph references but at some later point I replaced these with ggml graphs that are properly created from a ggml_context * so no changes were necessary after all.

Comment thread ggml/src/ggml.c
@am17an am17an force-pushed the ggml_graph_reuse branch from eebe3f0 to b674f92 Compare April 13, 2026 13:35
Comment thread ggml/src/ggml.c
Comment thread ggml/src/ggml.c Outdated
Comment thread ggml/src/ggml.c Outdated
@am17an am17an force-pushed the ggml_graph_reuse branch from b674f92 to 001e33a Compare April 13, 2026 17:11
@am17an am17an force-pushed the ggml_graph_reuse branch from 001e33a to 1816cc7 Compare April 13, 2026 17:16
@gaugarg-nv
Contributor

@am17an @JohannesGaessler It will be great to have this PR merged, as it improves perf quite a bit and is also a prerequisite to reducing CPU overhead in the TP implementation.

My understanding is that this will break CUDA graphs in the Meta backend as we recreate subgraphs in every compute call. This can be fixed if sub-graphs are cached into the meta backend, similar to JohannesGaessler@543e30d. I am happy to help fix it either within this PR or a separate PR.

@am17an
Contributor Author

am17an commented Apr 16, 2026

This is still pending @ggerganov's review, after that we can merge. The TP stuff can be in a follow-up PR

Comment thread ggml/include/ggml.h Outdated
Comment thread ggml/src/ggml-backend.cpp
@am17an am17an force-pushed the ggml_graph_reuse branch from 7449178 to cf9dee0 Compare April 16, 2026 05:53
@am17an am17an force-pushed the ggml_graph_reuse branch from cf9dee0 to b3a370f Compare April 16, 2026 05:54
Comment thread ggml/src/ggml-impl.h
Comment thread ggml/src/ggml-cuda/common.cuh Outdated
@JohannesGaessler
Contributor

My understanding is that this will break CUDA graphs in the Meta backend as we recreate subgraphs in every compute call.

This PR will not provide any benefit but it should not break anything because the fallback for a UID mismatch is the behavior on master.

@am17an
Contributor Author

am17an commented Apr 16, 2026

@JohannesGaessler can you also approve?

@am17an am17an merged commit 3f7c29d into ggml-org:master Apr 16, 2026
50 of 51 checks passed
@am17an am17an deleted the ggml_graph_reuse branch April 16, 2026 09:21
cnsiva pushed a commit to saas-home/llama.cpp that referenced this pull request Apr 17, 2026
* ggml: add graph_reused

* use versioning instead of reuse flag

* increment version with atomic

* use top bits for split numbering

* add assert

* move counter to ggml.c

* set uid in split_graph only

* fix windows

* address further review comments

* get next_uid rather than doing bit manipulation

* rename + add comment about uid
mengqin pushed a commit to mengqin/llama.cpp that referenced this pull request Apr 20, 2026
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Apr 21, 2026
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Apr 23, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
jimbothigpen pushed a commit to jimbothigpen/frankenturbo2 that referenced this pull request May 2, 2026

Labels

ggml changes relating to the ggml tensor library for machine learning Nvidia GPU Issues specific to Nvidia GPUs


5 participants