ggml-cuda: enable concurrent streams for linear attention#21897
am17an wants to merge 2 commits into ggml-org:master from
Conversation
Force-pushed from 006809f to 2c2c8e4.
```cpp
// unconditionally recreate the flag if any node has NO_ALLOC_FREE set
bool has_no_alloc_free = false;
for (int i = 0; i < sched->graph.n_nodes && !has_no_alloc_free; i++) {
    has_no_alloc_free |= (sched->graph.nodes[i]->flags & GGML_TENSOR_FLAG_NO_ALLOC_FREE) != 0;
}
```
To my understanding, the reason this check is needed here but not for the OUTPUT flag is that the graph optimization step can set the NO_ALLOC_FREE flag. I'm wondering whether there is any legitimate use case where a user would want to set the NO_ALLOC_FREE flag themselves (right now I can't think of any), because this check would then cause an unconditional re-allocation every time. If there isn't one, we should clearly state in the corresponding comment that the NO_ALLOC_FREE flag is for internal bookkeeping only.
I think graph_optimize could have an extension to the API that allows users to trigger a re-allocation based on some parameter (apart from the other cases). That is essentially what this flag is doing.
```cpp
for (size_t branch_idx = 0; branch_idx < nodes_per_branch.size(); branch_idx++) {
    for (const ggml_tensor * n : nodes_per_branch[branch_idx]) {
        concurrent_event.stream_mapping[n] = branch_idx + 1;
        // tag branch node and its sources so the allocator doesn't recycle
        // their memory while concurrent streams still read/write it
        const_cast<ggml_tensor *>(n)->flags |= GGML_TENSOR_FLAG_NO_ALLOC_FREE;
        for (int si = 0; si < GGML_MAX_SRC; ++si) {
            const ggml_tensor * s = n->src[si];
            if (!s) continue;
            const_cast<ggml_tensor *>(s)->flags |= GGML_TENSOR_FLAG_NO_ALLOC_FREE;
            if (s->view_src) {
                const_cast<ggml_tensor *>(s->view_src)->flags |= GGML_TENSOR_FLAG_NO_ALLOC_FREE;
            }
        }
    }
}
```
I don't think adding the GGML_TENSOR_FLAG_NO_ALLOC_FREE is a good idea. The allocator memory recycling should be solved in a different way.
Right now we are determining memory re-use under the assumption that nodes in a graph are executed in the exact order as their indices. In principle we could generalize this to also consider possible out-of-order execution.
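That generalization could be sketched roughly as follows. This is a hypothetical standalone model, not the ggml_gallocr API: `Region`, `extend_lifetimes`, and the index-based liveness bookkeeping are illustrative names only.

```cpp
#include <vector>

// Hypothetical model of allocator liveness: in the sequential scheme, a
// tensor's buffer can be recycled right after its last-use node index.
// To tolerate out-of-order execution inside a concurrent region, every
// last-use that falls inside the region is extended to the region's join
// index, so no stream can observe a prematurely recycled buffer.
struct Region { int begin, join; };  // nodes in [begin, join) may run out of order

std::vector<int> extend_lifetimes(std::vector<int> last_use,
                                  const std::vector<Region> & regions) {
    for (int & lu : last_use) {
        for (const Region & r : regions) {
            if (lu >= r.begin && lu < r.join) {
                lu = r.join;  // keep alive until all streams re-join
            }
        }
    }
    return last_use;
}
```

With one concurrent region over nodes [4, 8), a tensor last used at node 5 would stay allocated until node 8, while tensors last used outside the region keep their sequential lifetimes.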
#16991 (comment) was the earlier discussion around making the allocator stream-aware. Say we take a node and add a non-zero stream id to it, and the allocator then doesn't recycle nodes which have non-zero stream ids; that would be the same as this PR. It would still need to be realloc'd after graph optimize, though. The advantage of doing this would be that we could recycle nodes after each join rather than keeping the memory pinned as we do now.
All of this requires changes to the backend scheduler, though. If those are not wanted, we have to think of something else, and would likely need to retire this code path altogether, as it's broken even in current master.
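A minimal sketch of that stream-aware recycling rule (hypothetical names throughout; stream id 0 stands for the serial part of the graph, and `fork_joined` stands in for whatever join bookkeeping the scheduler would keep):

```cpp
#include <unordered_map>

// Hypothetical stream-aware recycling rule: a tensor tagged with a non-zero
// stream id belongs to a concurrent fork, and its block may only be recycled
// once that fork's join has been reached; stream 0 follows the normal
// sequential liveness rules.
struct TensorMeta { int stream_id; };

bool can_recycle(const TensorMeta & t,
                 const std::unordered_map<int, bool> & fork_joined) {
    if (t.stream_id == 0) return true;             // serial: usual rules
    auto it = fork_joined.find(t.stream_id);
    return it != fork_joined.end() && it->second;  // only after the join
}
```

Compared to pinning memory for the whole graph, this is what would let blocks become recyclable again at each join point.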
I tried doing this the Metal way of re-ordering nodes, but it's even slower than master.
> Right now we are determining memory re-use under the assumption that nodes in a graph are executed in the exact order as their indices. In principle we could generalize this to also consider possible out-of-order execution.
This is definitely useful for running smaller models on bigger NVGPUs where we want to parallelize vertically. Resolving this would also allow for the graph optimization to leave the experimental state for the cuda backend.
> I don't think adding the GGML_TENSOR_FLAG_NO_ALLOC_FREE is a good idea. The allocator memory recycling should be solved in a different way.
An alternative idea could be for graph_optimize to be allowed to also return a list of graphs, which the backend sched has to dispatch in succession. We could then split the parallelizable segments into individual graphs, which should be issuable independently by the backend scheduler (and we could synchronize as needed via events). This may involve changes in the cuda backend to expose multi-stream parallel execution properly (I have not spent too much thought on this yet). Also, if I track #20793 correctly, we are still missing an officially agreed-on formalization of how a backend should behave when multiple calls to async_compute are issued against it (is it allowed to parallelize them, or do we require it to process them sequentially?).
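The "list of graphs" idea could look roughly like this. Purely illustrative: `Segment`, `split_graph`, and the fork/join indices are hypothetical helpers, not a proposed ggml API.

```cpp
#include <vector>

// Hypothetical split of a graph's node list at a fork/join pair: the middle
// segment is marked concurrent and could be dispatched as an independent
// async_compute call, with events providing the synchronization; the outer
// segments stay sequential.
struct Segment { std::vector<int> node_ids; bool concurrent; };

std::vector<Segment> split_graph(const std::vector<int> & nodes, int fork, int join) {
    std::vector<Segment> out;
    out.push_back({ std::vector<int>(nodes.begin(),        nodes.begin() + fork), false });
    out.push_back({ std::vector<int>(nodes.begin() + fork, nodes.begin() + join), true  });
    out.push_back({ std::vector<int>(nodes.begin() + join, nodes.end()),          false });
    return out;
}
```

The scheduler would then only need to dispatch segments in order, parallelizing internally where a segment is marked concurrent.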
Overview
Enable concurrent streams for linear models (e.g. Qwen3.5). Linear attention layers also have parallelizable kernels; this PR enables them to run on different streams, as was done for traditional attention layers in #16991. Picture below is from `nsys profile`.

This PR also fixes a long-standing bug in the stream concurrency code which prevented it from being enabled by default and from scaling beyond batch_size = 1. The bug was that the graph allocator would re-use the `node->src` tensors under the assumption of sequential execution. #16991 had an intricate interleaving pattern to "fool" the allocator into extending the lifetimes of the tensors, but it did not do so for the `src` tensors, which happened to work for bs=1 but is not guaranteed to.

This PR simply introduces a flag `GGML_TENSOR_FLAG_NO_ALLOC_FREE` to prevent the allocator from re-using that memory, and removes the complex reshuffling of nodes. Since we reserve space for the worst-case graph, this does not increase the size of the compute buffer.

Additional information
On a 5090 with `GGML_CUDA_GRAPH_OPT=1`.

Note: the slowdown on Qwen3.5 at bs=1 seems to be a quirk of llama-bench, as it is the first model and the first run. Running a TG benchmark shows a speed-up there as well.
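The effect of the `GGML_TENSOR_FLAG_NO_ALLOC_FREE` flag described above can be sketched as follows; the bit position and the `allocator_may_free` helper are illustrative, not the actual ggml definitions.

```cpp
#include <cstdint>

// Illustrative sketch of how a NO_ALLOC_FREE-style flag short-circuits the
// allocator's "free after last use" step: flagged tensors stay pinned for
// the lifetime of the graph, everything else follows sequential liveness.
constexpr uint32_t FLAG_NO_ALLOC_FREE = 1u << 4;  // hypothetical bit position

bool allocator_may_free(uint32_t tensor_flags, bool past_last_use) {
    if (tensor_flags & FLAG_NO_ALLOC_FREE) {
        return false;  // pinned: concurrent streams may still touch it
    }
    return past_last_use;
}
```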