CUDA: use LRU based eviction for cuda graphs #21611
Conversation
Force-pushed from 14d4c00 to 193b81e
ggerganov left a comment
I think we probably need some time-based LRU eviction logic. But let's merge this now as a stopgap.
```cpp
#define MATRIX_ROW_PADDING 512 // last row of quant. matrices is a multiple of this to avoid out-of-bounds memory accesses

#define GGML_CUDA_MAX_STREAMS 8
#define GGML_CUDA_MAX_GRAPHS 128
```
I think 128 might be too low.
The TP implementation breaks each decoder layer into 2 sub-graphs, so larger models will easily hit this limit.
How many layers do large models have? Is 256 a better limit?
I have seen up to 80 layers. So, yes, 256 would be a better limit.
GLM-4.7 notably has 92 layers: https://huggingface.co/zai-org/GLM-4.7/blob/main/config.json
Without NCCL the number of ggml graphs is currently even higher. However, these graphs only have a single node so using CUDA graphs may not be worthwhile in the first place.
Okay, implemented this. Since std::priority_queue doesn't support random access, we need to do a little bit of book-keeping on the side and keep some stale entries in the queue, but I think it should be okay.
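For context, a minimal sketch of what such lazy-deletion book-keeping can look like (GraphCache, touch and evict_lru are illustrative names, not the PR's actual code):

```cpp
#include <cstdint>
#include <queue>
#include <unordered_map>
#include <vector>

// Min-heap ordered by last-use time, with lazy deletion: entries whose
// timestamp is stale (the graph was used again after the item was pushed)
// are skipped at pop time instead of being removed eagerly.
struct HeapItem {
    uint64_t last_used;  // timestamp recorded when the item was pushed
    const void *key;     // graph key (e.g. a node pointer)
    bool operator>(const HeapItem &other) const { return last_used > other.last_used; }
};

struct GraphCache {
    std::unordered_map<const void *, uint64_t> last_used; // authoritative timestamps
    std::priority_queue<HeapItem, std::vector<HeapItem>, std::greater<HeapItem>> heap;
    uint64_t clock = 0;

    void touch(const void *key) {
        const uint64_t t = ++clock;
        last_used[key] = t;
        heap.push({t, key}); // older heap entries for this key are now stale
    }

    // Evict the least-recently-used live entry, discarding stale heap items.
    const void *evict_lru() {
        while (!heap.empty()) {
            const HeapItem item = heap.top();
            heap.pop();
            auto it = last_used.find(item.key);
            if (it != last_used.end() && it->second == item.last_used) {
                last_used.erase(it); // live entry: actually evict it
                return item.key;
            }
            // stale entry: drop it and keep looking
        }
        return nullptr;
    }
};
```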
I was thinking about a much simpler implementation:
- Add a timestamp to `ggml_cuda_graph`
- LRU purging is a loop over `cuda_graphs` that removes outdated entries
- Update the timestamp each time a graph is referenced

I highly doubt the priority queue has any performance advantage here since we are dealing with a very small number of entries, while it makes the logic quite complicated.
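A minimal sketch of this simpler scheme, assuming a hypothetical last_used field and a flat container (the real ggml_cuda_graph layout may differ):

```cpp
#include <cstdint>
#include <unordered_map>

// One timestamp per graph, updated on every use; purging is a plain loop
// over the container, with no auxiliary data structure.
struct cached_graph {
    uint64_t last_used = 0; // hypothetical timestamp field
    // ... captured CUDA graph state would live here ...
};

struct graph_cache {
    std::unordered_map<const void *, cached_graph> graphs;
    uint64_t clock = 0;

    cached_graph &get(const void *key) {
        cached_graph &g = graphs[key];
        g.last_used = ++clock; // update timestamp on each reference
        return g;
    }

    // Remove every entry not referenced within the last max_age ticks.
    void purge(uint64_t max_age) {
        for (auto it = graphs.begin(); it != graphs.end(); ) {
            if (clock - it->second.last_used > max_age) {
                it = graphs.erase(it);
            } else {
                ++it;
            }
        }
    }
};
```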
With -sm tensor, the peak was at 800 CUDA graphs with 4x 4090 running gpt-oss-120b, so I don't think we can loop over all entries? We have to keep them sorted somehow.
We don't need to purge on each iteration. For example, we can purge only if X seconds have passed since the last purge, so even with 800 graphs or more in the container, it should not be a problem to loop over all of them from time to time.
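That throttle could look roughly like this (a sketch only; the one-second interval and age threshold are assumptions, not values from the PR, and purge_all_older_than stands in for the loop from the sketch above):

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for the O(n) purge loop shown earlier.
void purge_all_older_than(uint64_t max_age);

// Rate-limit the purge: run it at most once per interval, so even with
// 800+ graphs the occasional full scan stays cheap.
void maybe_purge() {
    using clock = std::chrono::steady_clock;
    constexpr auto interval = std::chrono::seconds(1); // assumed value
    static clock::time_point last_purge = clock::now();

    const auto now = clock::now();
    if (now - last_purge >= interval) {
        purge_all_older_than(/*max_age=*/1000); // illustrative threshold
        last_purge = now;
    }
}
```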
Ok, I simplified this and did not see any performance degradation.
I feel ring buffers on their own are not going to handle more complex orchestration loops well (e.g. libmtmd), where a lot of "never initialized" graphs can accumulate. There is a parallel LRU-based proposal at #21673 that was proposed to address this (but is of course more complex). Instead of being time-based, it assigns a VRAM budget.
@JohannesGaessler can you share the command for which you saw memory increasing? I can test with the current PR
@am17an just start the llama.cpp server with any model and incrementally fill up the context with generations. |
Force-pushed from bab7d29 to 958fd9f
I tested with 4x 4090 GPUs: on master, memory steadily increases, while with this PR it is constant after a point.
Force-pushed from b156682 to 579972b
@ggml-org/ggml-cuda can I get another approval?

Looks good to me.
* CUDA: use a ring-buffer for cuda graphs
* bump limit to 128
* use LRU eviction
* better naming
* do periodic clean-up
Overview
Since #18934 introduced per-node graphs so that multiple splits can use CUDA graphs, there are cases where the node pointers in ggml_cgraph keep changing. Each change adds a new entry to the graph map, so the map grows without bound and leaks memory (e.g. #20315).

This PR fixes the memory leaks by evicting cached CUDA graphs with an LRU policy.
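A hypothetical reduction of the failure mode (not the actual ggml code):

```cpp
#include <unordered_map>

// The cache is keyed by a node pointer that changes between graph rebuilds,
// so new entries keep being added and nothing is ever evicted.
std::unordered_map<const void *, int /* captured graph handle */> cache;

void on_graph_eval(const void *first_node_ptr) {
    // A fresh pointer value means a fresh entry; without LRU eviction the
    // map grows without bound and the captured CUDA graphs leak memory.
    cache.emplace(first_node_ptr, /* capture placeholder */ 0);
}
```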