ggml-cuda: enable concurrent streams for linear attention#21897
am17an wants to merge 2 commits into ggml-org:master from
Conversation
Force-pushed from 006809f to 2c2c8e4.
```cpp
// unconditionally recreate the flag if any node has NO_ALLOC_FREE set
bool has_no_alloc_free = false;
for (int i = 0; i < sched->graph.n_nodes && !has_no_alloc_free; i++) {
    has_no_alloc_free |= (sched->graph.nodes[i]->flags & GGML_TENSOR_FLAG_NO_ALLOC_FREE) != 0;
}
```
To my understanding, the reason this check is needed here but not for the OUTPUT flag is that the graph optimization step can set the NO_ALLOC_FREE flag. I'm wondering whether there is any legitimate use case where a user would want to set the NO_ALLOC_FREE flag themselves (right now I can't think of any), because this check would then cause an unconditional re-allocation every time. If there isn't one, we should clearly state in the corresponding comment that the NO_ALLOC_FREE flag is for internal bookkeeping only.
I think graph_optimize could have an extension to the API that allows users to trigger a re-allocation based on some parameter (apart from the other cases). That is essentially what this flag is doing.
```cpp
for (size_t branch_idx = 0; branch_idx < nodes_per_branch.size(); branch_idx++) {
    for (const ggml_tensor * n : nodes_per_branch[branch_idx]) {
        concurrent_event.stream_mapping[n] = branch_idx + 1;
        // tag branch node and its sources so the allocator doesn't recycle
        // their memory while concurrent streams still read/write it
        const_cast<ggml_tensor *>(n)->flags |= GGML_TENSOR_FLAG_NO_ALLOC_FREE;
        for (int si = 0; si < GGML_MAX_SRC; ++si) {
            const ggml_tensor * s = n->src[si];
            if (!s) continue;
            const_cast<ggml_tensor *>(s)->flags |= GGML_TENSOR_FLAG_NO_ALLOC_FREE;
            if (s->view_src) {
                const_cast<ggml_tensor *>(s->view_src)->flags |= GGML_TENSOR_FLAG_NO_ALLOC_FREE;
            }
        }
    }
}
```
I don't think adding the GGML_TENSOR_FLAG_NO_ALLOC_FREE is a good idea. The allocator memory recycling should be solved in a different way.
Right now we are determining memory re-use under the assumption that nodes in a graph are executed in the exact order as their indices. In principle we could generalize this to also consider possible out-of-order execution.
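That generalization could be sketched roughly as follows. This is a hypothetical standalone model, not the ggml_gallocr API: `Region`, `extend_lifetimes`, and the index-based liveness bookkeeping are illustrative names only.

```cpp
#include <vector>

// Hypothetical model of allocator liveness: in the sequential scheme, a
// tensor's buffer can be recycled right after its last-use node index.
// To tolerate out-of-order execution inside a concurrent region, every
// last-use that falls inside the region is extended to the region's join
// index, so no stream can observe a prematurely recycled buffer.
struct Region { int begin, join; };  // nodes in [begin, join) may run out of order

std::vector<int> extend_lifetimes(std::vector<int> last_use,
                                  const std::vector<Region> & regions) {
    for (int & lu : last_use) {
        for (const Region & r : regions) {
            if (lu >= r.begin && lu < r.join) {
                lu = r.join;  // keep alive until all streams re-join
            }
        }
    }
    return last_use;
}
```

With one concurrent region over nodes [4, 8), a tensor last used at node 5 would stay allocated until node 8, while tensors last used outside the region keep their sequential lifetimes.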
#16991 (comment) was the earlier discussion around making the allocator stream-aware. Say we take a node and add a non-zero stream id to it, and the allocator then doesn't recycle nodes which have non-zero stream ids; that would be the same as this PR. It would still need to be realloc'd after graph optimize, though. The advantage of doing this would be that we could recycle nodes after each join rather than keeping the memory pinned as we do now.
All of this requires changes to the backend scheduler, though. If those are not wanted, we have to think of something else, and would likely need to retire this code path altogether, as it's broken even in current master.
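A minimal sketch of that stream-aware recycling rule (hypothetical names throughout; stream id 0 stands for the serial part of the graph, and `fork_joined` stands in for whatever join bookkeeping the scheduler would keep):

```cpp
#include <unordered_map>

// Hypothetical stream-aware recycling rule: a tensor tagged with a non-zero
// stream id belongs to a concurrent fork, and its block may only be recycled
// once that fork's join has been reached; stream 0 follows the normal
// sequential liveness rules.
struct TensorMeta { int stream_id; };

bool can_recycle(const TensorMeta & t,
                 const std::unordered_map<int, bool> & fork_joined) {
    if (t.stream_id == 0) return true;             // serial: usual rules
    auto it = fork_joined.find(t.stream_id);
    return it != fork_joined.end() && it->second;  // only after the join
}
```

Compared to pinning memory for the whole graph, this is what would let blocks become recyclable again at each join point.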
I tried doing this the Metal way of re-ordering nodes, but it's even slower than master.
> Right now we are determining memory re-use under the assumption that nodes in a graph are executed in the exact order as their indices. In principle we could generalize this to also consider possible out-of-order execution.
This is definitely useful for running smaller models on bigger NVGPUs where we want to parallelize vertically. Resolving this would also allow for the graph optimization to leave the experimental state for the cuda backend.
> I don't think adding the GGML_TENSOR_FLAG_NO_ALLOC_FREE is a good idea. The allocator memory recycling should be solved in a different way.
An alternative idea could be for graph_optimize to be allowed to also return a list of graphs, which the backend sched has to dispatch in succession. We could then split the parallelizable segments into individual graphs, which should be issuable independently by the backend scheduler (and we could synchronize as needed via events). This may involve changes in the cuda backend to expose multi-stream parallel execution properly (I have not spent too much thought on this yet). Also, if I track #20793 correctly, we are still missing an officially agreed-on formalization of how a backend should behave when multiple calls to async_compute are issued against it (is it allowed to parallelize them, or do we require it to process them sequentially?).
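The "list of graphs" idea could look roughly like this. Purely illustrative: `Segment`, `split_graph`, and the fork/join indices are hypothetical helpers, not a proposed ggml API.

```cpp
#include <vector>

// Hypothetical split of a graph's node list at a fork/join pair: the middle
// segment is marked concurrent and could be dispatched as an independent
// async_compute call, with events providing the synchronization; the outer
// segments stay sequential.
struct Segment { std::vector<int> node_ids; bool concurrent; };

std::vector<Segment> split_graph(const std::vector<int> & nodes, int fork, int join) {
    std::vector<Segment> out;
    out.push_back({ std::vector<int>(nodes.begin(),        nodes.begin() + fork), false });
    out.push_back({ std::vector<int>(nodes.begin() + fork, nodes.begin() + join), true  });
    out.push_back({ std::vector<int>(nodes.begin() + join, nodes.end()),          false });
    return out;
}
```

The scheduler would then only need to dispatch segments in order, parallelizing internally where a segment is marked concurrent.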
Overview
Enable concurrent streams for linear models (e.g. Qwen3.5). Linear attention layers also have parallelizable kernels; this PR enables them to run on different streams, as was done for traditional attention layers in #16991. Picture below is from `nsys profile`.

This PR also fixes a long-standing bug in the stream concurrency code which prevented it from being enabled by default and from scaling beyond batch_size = 1. The bug was that the graph allocator would re-use the `node->src` tensors under the assumption of sequential execution. #16991 had an intricate interleaving pattern to "fool" the allocator into extending the lifetimes of the tensors, but it did not do so for the `src` tensors, which happened to work for bs=1 but is not guaranteed to.

This PR simply introduces a flag `GGML_TENSOR_FLAG_NO_ALLOC_FREE` to prevent the allocator from re-using that memory, and removes the complex reshuffling of nodes. Since we reserve space for the worst-case graph, this does not increase the size of the compute buffer.

Additional information
On a 5090 with `GGML_CUDA_GRAPH_OPT=1`.

Note: the slowdown on Qwen3.5 at bs=1 seems to be a quirk of llama-bench, as it is the first model and the first run. Running a TG benchmark shows a speed-up there as well.
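The effect of the `GGML_TENSOR_FLAG_NO_ALLOC_FREE` flag described above can be sketched as follows; the bit position and the `allocator_may_free` helper are illustrative, not the actual ggml definitions.

```cpp
#include <cstdint>

// Illustrative sketch of how a NO_ALLOC_FREE-style flag short-circuits the
// allocator's "free after last use" step: flagged tensors stay pinned for
// the lifetime of the graph, everything else follows sequential liveness.
constexpr uint32_t FLAG_NO_ALLOC_FREE = 1u << 4;  // hypothetical bit position

bool allocator_may_free(uint32_t tensor_flags, bool past_last_use) {
    if (tensor_flags & FLAG_NO_ALLOC_FREE) {
        return false;  // pinned: concurrent streams may still touch it
    }
    return past_last_use;
}
```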