
CUDA: use LRU based eviction for cuda graphs #21611

Merged: am17an merged 5 commits into ggml-org:master from am17an:cuda_ring_buffer, Apr 17, 2026

Conversation

@am17an (Contributor) commented Apr 8, 2026

Overview

Since #18934 introduced per-node graphs to enable multiple splits to have CUDA graphs, there are cases where the node pointers in ggml_cgraph keep changing. Because the cached graphs are keyed on these pointers, the map grows without bound, which leads to memory leaks (e.g. #20315).

This PR fixes the memory leaks by evicting the least recently used CUDA graphs once a fixed limit is reached.
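For illustration, here is a minimal sketch of a bounded, LRU-evicting graph cache along these lines (all names and structure here are assumptions for the example, not the actual ggml-cuda code):

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <unordered_map>

struct cuda_graph_stub { };                 // stand-in for the real CUDA graph object

constexpr size_t MAX_GRAPHS = 256;          // cap; 256 is the limit suggested in the review discussion below

struct graph_cache {
    struct entry {
        std::unique_ptr<cuda_graph_stub> graph;
        uint64_t last_used = 0;             // monotonically increasing use counter
    };
    std::unordered_map<const void *, entry> graphs; // keyed by graph node pointer
    uint64_t clock = 0;

    cuda_graph_stub * get(const void * key) {
        auto it = graphs.find(key);
        if (it == graphs.end()) {
            return nullptr;
        }
        it->second.last_used = ++clock;     // touch on every reference
        return it->second.graph.get();
    }

    void put(const void * key, std::unique_ptr<cuda_graph_stub> g) {
        if (graphs.size() >= MAX_GRAPHS) {
            evict_lru();
        }
        graphs[key] = entry{std::move(g), ++clock};
    }

    void evict_lru() {
        // linear scan for the smallest use counter; fine for a few hundred entries
        auto victim = graphs.begin();
        for (auto it = graphs.begin(); it != graphs.end(); ++it) {
            if (it->second.last_used < victim->second.last_used) {
                victim = it;
            }
        }
        graphs.erase(victim);               // destroying the entry frees the graph
    }
};
```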


@am17an am17an marked this pull request as ready for review April 16, 2026 11:53
@am17an am17an requested a review from a team as a code owner April 16, 2026 11:53
@am17an am17an force-pushed the cuda_ring_buffer branch from 14d4c00 to 193b81e on April 16, 2026 11:55
@ggerganov (Member) left a comment:


I think we probably need some time-based LRU eviction logic. But let's merge this now as a stopgap.

Comment thread on ggml/src/ggml-cuda/common.cuh (outdated):

```cpp
#define MATRIX_ROW_PADDING 512 // last row of quant. matrices is a multiple of this to avoid out-of-bounds memory accesses

#define GGML_CUDA_MAX_STREAMS 8
#define GGML_CUDA_MAX_GRAPHS 128
```
Contributor:

I think 128 might be too low.

The TP implementation breaks each decoder layer into 2 sub-graphs. So, for larger models, it will easily hit this limit.

@am17an (author):

How many layers do large models have? Is 256 a better limit?

Contributor:

I have seen up to 80 layers. So, yes, 256 will be a better limit.


Contributor:

Without NCCL the number of ggml graphs is currently even higher. However, these graphs only have a single node so using CUDA graphs may not be worthwhile in the first place.

@am17an (author):

Okay, implemented this. Since std::priority_queue doesn't support random access, we need to do a little bit of book-keeping on the side and keep some stale entries in the queue, but I think it should be okay.
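The book-keeping described here could look like the following (a hypothetical sketch, not the PR's actual code): a min-heap ordered by last-use time plus a map holding the authoritative timestamps. Re-touching a key pushes a fresh heap entry; older entries for that key become stale and are skipped lazily at pop time.

```cpp
#include <cstdint>
#include <queue>
#include <unordered_map>
#include <utility>
#include <vector>

struct lru_index {
    using item = std::pair<uint64_t, const void *>; // (last_used, key)
    // std::greater turns the default max-heap into a min-heap: oldest timestamp on top
    std::priority_queue<item, std::vector<item>, std::greater<item>> heap;
    std::unordered_map<const void *, uint64_t> last_used; // source of truth
    uint64_t clock = 0;

    void touch(const void * key) {
        last_used[key] = ++clock;
        heap.push({clock, key}); // any older heap pairs for this key are now stale
    }

    // Returns the least recently used key, discarding stale heap entries along the way.
    const void * pop_lru() {
        while (!heap.empty()) {
            auto [ts, key] = heap.top();
            heap.pop();
            auto it = last_used.find(key);
            if (it != last_used.end() && it->second == ts) {
                last_used.erase(it);
                return key;      // timestamp matches: this entry is current
            }
            // stale entry (key was re-touched or already evicted); skip it
        }
        return nullptr;
    }
};
```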

Member:

I was thinking about a much simpler implementation:

  • Add a timestamp to ggml_cuda_graph
  • LRU purging is a loop over cuda_graphs that removes outdated entries
  • Update the timestamp each time a graph is referenced

I highly doubt the priority queue has any performance advantage here, since we are dealing with a very small number of entries, while it makes the logic quite complicated.
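A minimal sketch of this timestamp scheme (illustrative names, not the merged code): each cached graph carries a last-use timestamp that is refreshed on every reference, and a purge pass is one loop over the map that drops entries older than a cutoff.

```cpp
#include <chrono>
#include <unordered_map>

struct cuda_graph_stub { };                           // stand-in for ggml_cuda_graph

struct timed_graph {
    cuda_graph_stub graph;
    std::chrono::steady_clock::time_point last_used;  // refreshed on each use
};

using graph_map = std::unordered_map<const void *, timed_graph>;

// Update the timestamp each time a graph is referenced.
void touch(timed_graph & g) {
    g.last_used = std::chrono::steady_clock::now();
}

// LRU purge: a single loop over the map, erasing entries unused for max_age.
void purge(graph_map & graphs, std::chrono::seconds max_age) {
    const auto now = std::chrono::steady_clock::now();
    for (auto it = graphs.begin(); it != graphs.end(); ) {
        if (now - it->second.last_used > max_age) {
            it = graphs.erase(it);  // erase returns the next valid iterator
        } else {
            ++it;
        }
    }
}
```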

@am17an (author):

With -sm tensor, the peak was 800 CUDA graphs on 4x 4090s running gpt-oss-120b, so I don't think we can loop over all entries on every use. We have to keep them sorted somehow.

Member:

We don't need to purge on each iteration. For example, we can purge only if X seconds have passed since the last purge, so even with 800 graphs or more in the container, it should not be a problem to loop over all of them from time to time.
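The amortization could look like this (a hypothetical sketch building on the purge() sketch above, with assumed interval and age values): remember when the last purge ran and skip the full scan until X seconds have elapsed.

```cpp
#include <chrono>

// Gate the O(n) purge so it runs at most once every purge_interval,
// regardless of how many graphs (800 or more) are in the container.
void maybe_purge(graph_map & graphs) {
    using clock = std::chrono::steady_clock;
    static constexpr std::chrono::seconds purge_interval{10}; // the "X seconds" (assumed value)
    static constexpr std::chrono::seconds max_age{60};        // assumed eviction age

    static clock::time_point last_purge = clock::now();
    const auto now = clock::now();
    if (now - last_purge < purge_interval) {
        return;                 // too soon: skip the full scan
    }
    last_purge = now;
    purge(graphs, max_age);     // full loop, but only every purge_interval
}
```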

@am17an (author):

Ok, I simplified this and did not see any performance degradation.

@ORippler (Collaborator) commented Apr 16, 2026

> I think we probably need some time-based LRU eviction logic. But let's merge this now as a stopgap.

I feel ring-buffers on their own are not going to handle more complex orchestration loops well (e.g. libmtmd), where a lot of "never initialized" ggml_cuda_graph objects will be generated (CUDA graphs are only used for decode).

There is a parallel LRU-based proposal at #21673 that addresses this (but is of course more complex); instead of being time-based, it assigns a VRAM budget.

@github-actions bot added the Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Apr 16, 2026
@am17an am17an changed the title from "CUDA: use a ring-buffer for cuda graphs" to "CUDA: use LRU based eviction for cuda graphs" on Apr 16, 2026
@am17an (author) commented Apr 16, 2026

@JohannesGaessler can you share the command for which you saw memory increasing? I can test with the current PR

@JohannesGaessler (Contributor):

@am17an just start the llama.cpp server with any model and incrementally fill up the context with generations.

@am17an am17an force-pushed the cuda_ring_buffer branch from bab7d29 to 958fd9f on April 17, 2026 04:49
@am17an (author) commented Apr 17, 2026

I tested with 4x 4090 GPUs: on master, memory steadily increases, while with this PR it stays constant after a point.

@am17an am17an force-pushed the cuda_ring_buffer branch from b156682 to 579972b on April 17, 2026 09:22
@am17an (author) commented Apr 17, 2026

@ggml-org/ggml-cuda can I get another approval?

@gaugarg-nv (Contributor):

Looks good to me.

@am17an am17an merged commit b94050e into ggml-org:master Apr 17, 2026
46 of 48 checks passed
@am17an am17an deleted the cuda_ring_buffer branch April 17, 2026 15:24
cnsiva pushed a commit to saas-home/llama.cpp that referenced this pull request Apr 17, 2026
* CUDA: use a ring-buffer for cuda graphs

* bump limit to 128

* use LRU eviction

* better naming

* do periodic clean-up
samuraieng pushed a commit to samuraieng/llama.cpp that referenced this pull request Apr 19, 2026
mengqin pushed a commit to mengqin/llama.cpp that referenced this pull request Apr 20, 2026
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Apr 21, 2026
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Apr 23, 2026

Labels

Nvidia GPU (issues specific to Nvidia GPUs), ggml (changes relating to the ggml tensor library for machine learning)


6 participants