
CUDA: use LRU based eviction for cuda graphs #21611

Merged: am17an merged 5 commits into ggml-org:master from am17an:cuda_ring_buffer, Apr 17, 2026

Conversation

@am17an (Contributor) commented Apr 8, 2026

Overview

Since #18934 introduced per-node graphs to enable multiple splits to have CUDA graphs, there are cases where the node pointers in ggml_cgraph keep changing. Because the cached graphs are keyed on these pointers, the map grows without bound, which leads to memory leaks (e.g. #20315).

This PR fixes the memory leaks by evicting the least recently used CUDA graphs once a fixed limit is reached.
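For illustration, here is a minimal sketch of a bounded, LRU-evicting graph cache along these lines (all names and structure here are assumptions for the example, not the actual ggml-cuda code):

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <unordered_map>

struct cuda_graph_stub { };                 // stand-in for the real CUDA graph object

constexpr size_t MAX_GRAPHS = 256;          // cap; 256 is the limit suggested in the review discussion below

struct graph_cache {
    struct entry {
        std::unique_ptr<cuda_graph_stub> graph;
        uint64_t last_used = 0;             // monotonically increasing use counter
    };
    std::unordered_map<const void *, entry> graphs; // keyed by graph node pointer
    uint64_t clock = 0;

    cuda_graph_stub * get(const void * key) {
        auto it = graphs.find(key);
        if (it == graphs.end()) {
            return nullptr;
        }
        it->second.last_used = ++clock;     // touch on every reference
        return it->second.graph.get();
    }

    void put(const void * key, std::unique_ptr<cuda_graph_stub> g) {
        if (graphs.size() >= MAX_GRAPHS) {
            evict_lru();
        }
        graphs[key] = entry{std::move(g), ++clock};
    }

    void evict_lru() {
        // linear scan for the smallest use counter; fine for a few hundred entries
        auto victim = graphs.begin();
        for (auto it = graphs.begin(); it != graphs.end(); ++it) {
            if (it->second.last_used < victim->second.last_used) {
                victim = it;
            }
        }
        graphs.erase(victim);               // destroying the entry frees the graph
    }
};
```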


@am17an am17an marked this pull request as ready for review April 16, 2026 11:53
@am17an am17an requested a review from a team as a code owner April 16, 2026 11:53
@am17an am17an force-pushed the cuda_ring_buffer branch from 14d4c00 to 193b81e on April 16, 2026 11:55
@ggerganov (Member) left a comment:


I think we probably need some time-based LRU eviction logic. But let's merge this now as a stopgap.

Comment thread on ggml/src/ggml-cuda/common.cuh (outdated):

```cpp
#define MATRIX_ROW_PADDING 512 // last row of quant. matrices is a multiple of this to avoid out-of-bounds memory accesses

#define GGML_CUDA_MAX_STREAMS 8
#define GGML_CUDA_MAX_GRAPHS 128
```
Contributor:

I think 128 might be too low.

The TP implementation breaks each decoder layer into 2 sub-graphs. So, for larger models, it will easily hit this limit.

@am17an (author):

How many layers do large models have? Is 256 a better limit?

Contributor:

I have seen up to 80 layers. So, yes, 256 will be a better limit.


Contributor:

Without NCCL the number of ggml graphs is currently even higher. However, these graphs only have a single node so using CUDA graphs may not be worthwhile in the first place.

@am17an (author):

Okay, implemented this. Since std::priority_queue doesn't support random access, we need to do a little bit of book-keeping on the side and keep some stale entries in the queue, but I think it should be okay.
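The book-keeping described here could look like the following (a hypothetical sketch, not the PR's actual code): a min-heap ordered by last-use time plus a map holding the authoritative timestamps. Re-touching a key pushes a fresh heap entry; older entries for that key become stale and are skipped lazily at pop time.

```cpp
#include <cstdint>
#include <queue>
#include <unordered_map>
#include <utility>
#include <vector>

struct lru_index {
    using item = std::pair<uint64_t, const void *>; // (last_used, key)
    // std::greater turns the default max-heap into a min-heap: oldest timestamp on top
    std::priority_queue<item, std::vector<item>, std::greater<item>> heap;
    std::unordered_map<const void *, uint64_t> last_used; // source of truth
    uint64_t clock = 0;

    void touch(const void * key) {
        last_used[key] = ++clock;
        heap.push({clock, key}); // any older heap pairs for this key are now stale
    }

    // Returns the least recently used key, discarding stale heap entries along the way.
    const void * pop_lru() {
        while (!heap.empty()) {
            auto [ts, key] = heap.top();
            heap.pop();
            auto it = last_used.find(key);
            if (it != last_used.end() && it->second == ts) {
                last_used.erase(it);
                return key;      // timestamp matches: this entry is current
            }
            // stale entry (key was re-touched or already evicted); skip it
        }
        return nullptr;
    }
};
```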

Member:

I was thinking about a much simpler implementation:

  • Add a timestamp to ggml_cuda_graph
  • LRU purging is a loop over cuda_graphs that removes outdated entries
  • Update the timestamp each time a graph is referenced

I highly doubt the priority queue has any performance advantage here, since we are dealing with a very small number of entries, while it makes the logic quite complicated.
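A minimal sketch of this timestamp scheme (illustrative names, not the merged code): each cached graph carries a last-use timestamp that is refreshed on every reference, and a purge pass is one loop over the map that drops entries older than a cutoff.

```cpp
#include <chrono>
#include <unordered_map>

struct cuda_graph_stub { };                           // stand-in for ggml_cuda_graph

struct timed_graph {
    cuda_graph_stub graph;
    std::chrono::steady_clock::time_point last_used;  // refreshed on each use
};

using graph_map = std::unordered_map<const void *, timed_graph>;

// Update the timestamp each time a graph is referenced.
void touch(timed_graph & g) {
    g.last_used = std::chrono::steady_clock::now();
}

// LRU purge: a single loop over the map, erasing entries unused for max_age.
void purge(graph_map & graphs, std::chrono::seconds max_age) {
    const auto now = std::chrono::steady_clock::now();
    for (auto it = graphs.begin(); it != graphs.end(); ) {
        if (now - it->second.last_used > max_age) {
            it = graphs.erase(it);  // erase returns the next valid iterator
        } else {
            ++it;
        }
    }
}
```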

@am17an (author):

With -sm tensor, the peak was 800 CUDA graphs on 4x 4090s running gpt-oss-120b, so I don't think we can loop over all entries on every use. We have to keep them sorted somehow.

Member:

We don't need to purge on each iteration. For example, we can purge only if X seconds have passed since the last purge, so even with 800 graphs or more in the container, it should not be a problem to loop over all of them from time to time.
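The amortization could look like this (a hypothetical sketch building on the purge() sketch above, with assumed interval and age values): remember when the last purge ran and skip the full scan until X seconds have elapsed.

```cpp
#include <chrono>

// Gate the O(n) purge so it runs at most once every purge_interval,
// regardless of how many graphs (800 or more) are in the container.
void maybe_purge(graph_map & graphs) {
    using clock = std::chrono::steady_clock;
    static constexpr std::chrono::seconds purge_interval{10}; // the "X seconds" (assumed value)
    static constexpr std::chrono::seconds max_age{60};        // assumed eviction age

    static clock::time_point last_purge = clock::now();
    const auto now = clock::now();
    if (now - last_purge < purge_interval) {
        return;                 // too soon: skip the full scan
    }
    last_purge = now;
    purge(graphs, max_age);     // full loop, but only every purge_interval
}
```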

@am17an (author):

Ok, I simplified this and did not see any performance degradation.

@ORippler (Collaborator) commented Apr 16, 2026

> I think we probably need some time-based LRU eviction logic. But let's merge this now as a stopgap.

I feel ring-buffers on their own are not going to handle more complex orchestration loops well (e.g. libmtmd), where a lot of "never initialized" ggml_cuda_graph objects will be generated (CUDA graphs are only used for decode).

There is a parallel LRU-based proposal at #21673 that addresses this (but is of course more complex); instead of being time-based, it assigns a VRAM budget.

@github-actions bot added the Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Apr 16, 2026
@am17an am17an changed the title from "CUDA: use a ring-buffer for cuda graphs" to "CUDA: use LRU based eviction for cuda graphs" on Apr 16, 2026
@am17an (author) commented Apr 16, 2026

@JohannesGaessler can you share the command for which you saw memory increasing? I can test with the current PR

@JohannesGaessler (Contributor):

@am17an just start the llama.cpp server with any model and incrementally fill up the context with generations.

@am17an am17an force-pushed the cuda_ring_buffer branch from bab7d29 to 958fd9f on April 17, 2026 04:49
@am17an (author) commented Apr 17, 2026

I tested with 4x 4090 GPUs: on master, memory steadily increases, while with this PR it stays constant after a point.

@am17an am17an force-pushed the cuda_ring_buffer branch from b156682 to 579972b on April 17, 2026 09:22
@am17an (author) commented Apr 17, 2026

@ggml-org/ggml-cuda can I get another approval?

@gaugarg-nv (Contributor):

Looks good to me.

@am17an am17an merged commit b94050e into ggml-org:master Apr 17, 2026
46 of 48 checks passed
@am17an am17an deleted the cuda_ring_buffer branch April 17, 2026 15:24
cnsiva pushed a commit to saas-home/llama.cpp that referenced this pull request Apr 17, 2026
* CUDA: use a ring-buffer for cuda graphs

* bump limit to 128

* use LRU eviction

* better naming

* do periodic clean-up
samuraieng pushed a commit to samuraieng/llama.cpp that referenced this pull request Apr 19, 2026
mengqin pushed a commit to mengqin/llama.cpp that referenced this pull request Apr 20, 2026
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Apr 21, 2026
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Apr 23, 2026

Labels

Nvidia GPU (issues specific to Nvidia GPUs), ggml (changes relating to the ggml tensor library for machine learning)


6 participants