
ggml-cuda: enable concurrent streams for linear attention #21897

Open
am17an wants to merge 2 commits into ggml-org:master from am17an:linear-attn-conc

Conversation


am17an (Contributor) commented Apr 14, 2026

Overview

Enable concurrent streams for linear models (e.g. Qwen3.5). Linear attention layers also have parallelizable kernels; this PR enables them to run on different streams, as was done for traditional attention layers in #16991. The picture below is from an nsys profile.

[nsys profile screenshot: sync2]

This PR also fixes a long-standing bug in the stream concurrency code which prevented it from being enabled by default and from scaling beyond batch_size = 1. The bug was that the graph allocator would re-use the node->src tensors under the assumption of sequential execution. #16991 used an intricate interleaving pattern to "fool" the allocator into extending the lifetimes of the tensors, but it did not do so for the src tensors; this happened to work for bs=1 but is not guaranteed to.

This PR simply introduces a flag GGML_TENSOR_FLAG_NO_ALLOC_FREE to prevent the allocator from re-using that memory, and removes the complex reshuffling of nodes. Since we reserve space for the worst-case graph, this does not increase the size of the compute buffer.
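
For context, a minimal sketch of how such a flag could sit alongside the existing tensor flags and be honored by the allocator. The flag value and the free-pass snippet below are illustrative assumptions, not the exact code in this PR:

```cpp
// Illustrative sketch only (exact value assumed): the existing ggml_tensor_flag values
// in ggml.h occupy bits 1/2/4/8, so a new flag would take the next free bit.
enum ggml_tensor_flag {
    GGML_TENSOR_FLAG_INPUT         =  1,
    GGML_TENSOR_FLAG_OUTPUT        =  2,
    GGML_TENSOR_FLAG_PARAM         =  4,
    GGML_TENSOR_FLAG_LOSS          =  8,
    GGML_TENSOR_FLAG_NO_ALLOC_FREE = 16, // assumed value: keep this allocation alive for the whole graph
};

// In the graph allocator, the free pass would then skip tagged tensors so their memory is
// never recycled while another stream may still read or write it, e.g.:
//
//     if (node->flags & GGML_TENSOR_FLAG_NO_ALLOC_FREE) {
//         continue; // do not return this tensor's block to the free list
//     }
```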

Additional information

On a 5090 with GGML_CUDA_GRAPH_OPT=1:

| Model | Microbatch size | Test | t/s e21cdc1 | t/s linear-attn-conc | Speedup |
| --- | --- | --- | --- | --- | --- |
| gemma4 ?B Q4_K_M | 1 | pp512 | 213.63 | 229.26 | 1.07 |
| gemma4 ?B Q4_K_M | 2 | pp512 | 364.59 | 391.76 | 1.07 |
| gemma4 ?B Q4_K_M | 4 | pp512 | 571.11 | 603.30 | 1.06 |
| gemma4 ?B Q4_K_M | 8 | pp512 | 868.01 | 904.15 | 1.04 |
| qwen35 27B Q4_K_M | 1 | pp512 | 69.45 | 72.89 | 1.05 |
| qwen35 27B Q4_K_M | 2 | pp512 | 129.81 | 137.60 | 1.06 |
| qwen35 27B Q4_K_M | 4 | pp512 | 210.49 | 222.65 | 1.06 |
| qwen35 27B Q4_K_M | 8 | pp512 | 280.24 | 296.04 | 1.06 |
| qwen35moe 35B.A3B Q4_K_S | 1 | pp512 | 241.87 | 231.01 | 0.96 |
| qwen35moe 35B.A3B Q4_K_S | 2 | pp512 | 332.47 | 355.24 | 1.07 |
| qwen35moe 35B.A3B Q4_K_S | 4 | pp512 | 532.37 | 561.42 | 1.05 |
| qwen35moe 35B.A3B Q4_K_S | 8 | pp512 | 860.46 | 899.61 | 1.05 |

Note: the slowdown on qwen3.5 at bs=1 appears to be a quirk of llama-bench; it is the first model and the first run. Running a TG benchmark shows a speed-up there as well.


am17an requested review from a team and ggerganov as code owners — April 14, 2026 13:11
github-actions bot added the labels Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) — Apr 14, 2026
Comment thread ggml/src/ggml-backend.cpp
am17an force-pushed the linear-attn-conc branch from 006809f to 2c2c8e4 — April 15, 2026 06:09
Comment thread ggml/include/ggml.h Outdated
Comment thread ggml/src/ggml-backend.cpp
```cpp
    // unconditionally recreate the graph allocation if any node has NO_ALLOC_FREE set
    bool has_no_alloc_free = false;
    for (int i = 0; i < sched->graph.n_nodes && !has_no_alloc_free; i++) {
        has_no_alloc_free |= (sched->graph.nodes[i]->flags & GGML_TENSOR_FLAG_NO_ALLOC_FREE) != 0;
    }
```
Contributor

To my understanding, the reason this check is needed here but not for the OUTPUT flag is that the graph optimization step can set the NO_ALLOC_FREE flag. I'm wondering whether there is any legitimate use case where a user would want to set the NO_ALLOC_FREE flag themselves (right now I can't think of any), because then this check would cause an unconditional re-allocation every time. But then we should clearly mention in the corresponding comment that the NO_ALLOC_FREE flag is for internal bookkeeping only.

Contributor Author

am17an — Apr 15, 2026

I think graph_optimize could have an extension to the API that allows users to request a realloc based on some parameter (apart from the other cases). That is essentially what this flag is doing.
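
To make that idea concrete, a rough sketch of what such an extension could look like; all names and the shape of the hook are purely hypothetical, nothing here exists in ggml today:

```cpp
// Purely hypothetical API sketch: the backend's graph_optimize pass reports back whether
// it changed tensor lifetimes in a way that requires a reallocation, instead of the
// scheduler having to rescan all node flags afterwards.
struct ggml_graph_optimize_info {
    bool needs_realloc; // set when e.g. nodes were tagged for concurrent-stream execution
};

// scheduler side (pseudocode):
//   ggml_graph_optimize_info info = {};
//   backend_graph_optimize(backend, graph, &info);   // hypothetical hook signature
//   if (info.needs_realloc) {
//       // re-plan the compute buffer with the extended lifetimes
//   }
```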

Comment on lines +4329 to +4343
```cpp
for (size_t branch_idx = 0; branch_idx < nodes_per_branch.size(); branch_idx++) {
    for (const ggml_tensor * n : nodes_per_branch[branch_idx]) {
        concurrent_event.stream_mapping[n] = branch_idx + 1;
        // tag branch node and its sources so the allocator doesn't recycle
        // their memory while concurrent streams still read/write it
        const_cast<ggml_tensor *>(n)->flags |= GGML_TENSOR_FLAG_NO_ALLOC_FREE;
        for (int si = 0; si < GGML_MAX_SRC; ++si) {
            const ggml_tensor * s = n->src[si];
            if (!s) continue;
            const_cast<ggml_tensor *>(s)->flags |= GGML_TENSOR_FLAG_NO_ALLOC_FREE;
            if (s->view_src) {
                const_cast<ggml_tensor *>(s->view_src)->flags |= GGML_TENSOR_FLAG_NO_ALLOC_FREE;
            }
        }
    }
}
```
Member


I don't think adding the GGML_TENSOR_FLAG_NO_ALLOC_FREE flag is a good idea. The allocator memory recycling should be solved in a different way.

Contributor


Right now we are determining memory re-use under the assumption that nodes in a graph are executed in the exact order of their indices. In principle we could generalize this to also consider possible out-of-order execution.
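
As a simplified illustration of that assumption (not the actual ggml-alloc code), liveness is effectively "a tensor can be recycled once the highest-index node that reads it has run", which only holds if execution follows node-index order:

```cpp
#include <unordered_map>
#include "ggml.h"

// Simplified illustration of index-order liveness; not the real ggml-alloc implementation.
// Returns, for each source tensor, the highest node index that reads it. Its memory could be
// recycled right after that node runs -- but only if nodes actually execute in index order.
static std::unordered_map<const ggml_tensor *, int> last_use_by_index(ggml_cgraph * graph) {
    std::unordered_map<const ggml_tensor *, int> last_use;
    for (int i = 0; i < ggml_graph_n_nodes(graph); i++) {
        const ggml_tensor * node = ggml_graph_node(graph, i);
        for (int s = 0; s < GGML_MAX_SRC; s++) {
            if (node->src[s]) {
                last_use[node->src[s]] = i;
            }
        }
    }
    return last_use;
}
// With concurrent streams, a branch of lower-index nodes may still be reading a tensor when a
// higher-index node is reached, so "recycle after last_use" needs either pinning (this PR's
// flag) or a stream-aware generalization of the analysis.
```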

Contributor Author


#16991 (comment) was the earlier discussion around making the allocator stream-aware. Say we take a node, assign it a non-zero stream-id, and have the allocator not recycle nodes with non-zero stream-ids; that would be equivalent to this PR. It would still need to be realloc'd after graph_optimize, though. The advantage of doing this would be that we could recycle nodes after each join rather than keeping the memory pinned as we do now.

All of this requires changes to the backend scheduler, though. If those are not wanted, we have to think of something else, and would likely need to retire this code path altogether, as it is broken even on current master.

Contributor Author


I tried doing this the Metal way, by re-ordering nodes, but it's even slower than master.

Collaborator

ORippler — Apr 30, 2026


> Right now we are determining memory re-use under the assumption that nodes in a graph are executed in the exact order of their indices. In principle we could generalize this to also consider possible out-of-order execution.

This is definitely useful for running smaller models on bigger NVIDIA GPUs, where we want to parallelize vertically. Resolving this would also allow graph optimization to leave its experimental state in the CUDA backend.

> I don't think adding the GGML_TENSOR_FLAG_NO_ALLOC_FREE flag is a good idea. The allocator memory recycling should be solved in a different way.

An alternative idea could be to allow graph_optimize to also return a list of graphs, which the backend scheduler has to dispatch in succession. That way we could split the parallelizable segments into individual graphs that should be issuable independently by the backend scheduler (and we could synchronize as needed via events). Though this may involve changes in the CUDA backend to expose multi-stream parallel execution properly (I have not given this too much thought yet). Also, if I track #20793 correctly, we are still missing an officially agreed-on formalization of how a backend should behave when multiple calls to async_compute are issued against it (is it allowed to parallelize them, or do we require it to process them sequentially?). A rough sketch of the dispatch side follows below.
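
The sketch below only makes the dispatch idea concrete; the splitting helper is hypothetical (it would come out of the extended graph_optimize API), while the compute/synchronize calls are existing ggml-backend API:

```cpp
#include <vector>
#include "ggml-backend.h"

// hypothetical helper -- would be produced by an extended graph_optimize API
std::vector<ggml_cgraph *> split_into_concurrent_segments(ggml_cgraph * graph);

// Purely hypothetical sketch of "graph_optimize returns a list of graphs": parallelizable
// segments become separate graphs that the scheduler dispatches back-to-back.
void dispatch_segments(ggml_backend_t backend, ggml_cgraph * graph) {
    std::vector<ggml_cgraph *> segments = split_into_concurrent_segments(graph);
    for (ggml_cgraph * seg : segments) {
        // whether the backend may overlap several in-flight async computes is exactly the
        // open question referenced above (#20793)
        ggml_backend_graph_compute_async(backend, seg);
    }
    ggml_backend_synchronize(backend); // coarse join; finer-grained events could sit between segments
}
```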
