ggml: add graph_reused #21764
Conversation
JohannesGaessler
left a comment
This would work from a llama.cpp perspective but I'm not convinced it's the right way to handle this from a ggml perspective. In my opinion we should attach a ggml_guid and write the code around that. With a GUID the contract for "user code" would be "if you change the graph you must change the GUID", it's not clear to me what the correct way to use a reused flag would be.
I added a version number; currently we don't have a guid generator in master. If the current version is acceptable I can add a guid generator, but I think a simple static version counter is also fine. The idea of versioning is that anything that mutates the cgraph should increment the version number.
    for (int i = 0; i < sched->n_splits; i++) {
        sched->splits[i].graph.version = graph->version;
    }
Should all splits really be the same uuid?
I don't see any additional benefit of having a per-split uuid as of now.

I think you're right though, it makes sense to have a per-split uuid.
Splits should definitely have different identifiers than the main graph.
> Splits should definitely have different identifiers than the main graph.
Do you have something in mind? I think if all splits have the same uuid as the main graph, the logic would be simpler; I'm just not sure whether you see something that would prevent doing so.
To clarify, I think that the cuda graph key (i.e. nodes[0] ptr) + the common uuid would uniquely identify the splits.
Since we allow arbitrary --tensor-overrides we can have multiple splits per backend, no? If a user assigns some tensor in the middle to the CPU you will end up with 2 splits running on the GPU beforehand and afterwards. And if the graphs for those splits have the same UID then they can be incorrectly re-used.
We could also simplify the logic by not requiring that the UIDs of the split graphs share bits with the original graph. I think that is only really useful for debugging, otherwise we can just call the function to get a UID again.
In CUDA it will not be an issue because we store the node ptrs as the key, but for other backends it might cause an issue that you arrive at graph_compute_async with the same uid as before (e.g. in the case of the CPU split), even though it's a different split. I like the split index being part of the id as it ties it to a particular graph, since a split is not separate from the main compute graph.
I suppose an index that is incremented per graph would also work; even at 1000 new graphs / second it would take like 500 million years until a 64-bit integer is exhausted.
I've shortened this time frame by using the top 12 bits for the split index; it would still last on the order of 100,000 years. The counter needs to move to a …
JohannesGaessler
left a comment
- The function `ggml_backend_graph_optimize` should regenerate the graph identifier.
- In `ggml-backend.cpp` there are graph manipulations where the sources of nodes are replaced; this should also result in a change of the graph identifier.
- I think that `ggml-opt.cpp` should be fine with these changes.
- In `ggml-backend-meta.cpp` the manually created and re-used graphs `cgraphs_aux` will need to have their identifiers regenerated.
Please let me know if any of these are unclear or if you want me to take over some of these.
Could you point me where? From what I understand bumping the counter only when …
Consider this line. On that line … Also, looking at the code again, I think the correct place to set the split graph identifiers would be at the end of …
I see, since …
That would, I think, be the correct way to do it.
JohannesGaessler
left a comment
From my end these changes would be good otherwise.
Sorry, I had misremembered. At some WIP version I had used …
@am17an @JohannesGaessler It would be great to have this PR merged, as it improves performance quite a bit and is also a prerequisite to reducing CPU overhead in the TP implementation. My understanding is that this will break CUDA graphs in the Meta backend, as we recreate subgraphs in every compute call. This can be fixed if sub-graphs are cached in the meta backend, similar to JohannesGaessler@543e30d. I am happy to help fix it either within this PR or in a separate PR.

This is still pending @ggerganov's review; after that we can merge. The TP stuff can be in a follow-up PR.
This PR will not provide any benefit, but it should not break anything, because the fallback for a UID mismatch is the behavior on master.

@JohannesGaessler can you also approve?
* ggml: add graph_reused
* use versioning instead of reuse flag
* increment version with atomic
* use top bits for split numbering
* add assert
* move counter to ggml.c
* set uid in split_graph only
* fix windows
* address further review comments
* get next_uid rather than doing bit manipulation
* rename + add comment about uid
Overview
Add a `reused` member variable to `ggml_cgraph` so backends can take advantage of the graph-reuse functionality. Currently, when graph_reuse is invoked, the CUDA backend still performs the properties-change check to figure out whether the graph has changed, when in fact `graph_reuse` (to my understanding) guarantees that it has not. This bypasses a mildly expensive O(n) check.

Additional information
Testing: I tested various combinations like `--n-cpu-moe`, `-nkvo` and `-ngl` and verified it works. Additional testing would be welcome.

Results on a 5090:
Requirements