
ggml-cuda: flush legacy pool on OOM and retry #22155

Merged
IMbackK merged 2 commits into ggml-org:master from leonardHONG:cuda-pool-leg-oom-retry on Apr 20, 2026

Conversation

@leonardHONG
Contributor

This adds a conservative fallback for the legacy CUDA/HIP pool allocator.

On non-VMM setups, the legacy pool can end up holding cached free buffers that are individually too small for a new request, but still occupy enough VRAM to make the next allocation fail. In that case, this patch flushes the cached legacy-pool buffers and retries the allocation once before aborting.

The normal hit path is unchanged. This is intended as a narrow mitigation for legacy-pool OOMs, not a broader allocator redesign. I validated the retry path locally with a synthetic OOM injection on a legacy-pool build.

This is intended to mitigate the legacy-pool OOM behavior reported in #22075 and #22107.

Signed-off-by: 梁厚宏 <2695316095@qq.com>
@leonardHONG leonardHONG requested review from a team and IMbackK as code owners April 20, 2026 07:08
@JohannesGaessler
Contributor

This is not safe to do. If the memory is still in use it must not be freed.

@leonardHONG
Contributor Author

This is not safe to do. If the memory is still in use it must not be freed.

Thanks for the catch! I completely overlooked the async lifecycle issue here. I'll drop this unsafe logic and rework it tonight.

@IMbackK
Collaborator

IMbackK commented Apr 20, 2026

I... actually don't see it (how clear_pool can prune in-use chunks).

@JohannesGaessler
Contributor

The legacy CUDA buffer pool is essentially just a list of previously used temporary buffers. "Allocating" and "freeing" a buffer just means retrieving a buffer for use within a kernel. It does not mean that there no longer are any kernels queued that will attempt to use the buffer.

Since the buffer pool is per ggml_backend_cuda_context I think the current approach is safe if it is synchronized on the corresponding stream. Looking at the documentation for cudaMalloc and cudaFree again however, it seems those do an implicit device-wide synchronization when they're called. So even without any modifications I think this PR is safe after all.

Did you consider replacing the legacy buffer pool with cudaMallocAsync and cudaFreeAsync? The only reason we are not using those for CUDA is that the performance was worse vs. manually managed chunks of memory but I don't think I ever checked the performance for HIP.

@IMbackK
Collaborator

IMbackK commented Apr 20, 2026

Looking at the documentation for cudaMalloc and cudaFree again however, it seems those do an implicit device-wide synchronization when they're called. So even without any modifications I think this PR is safe after all.

Yeah, exactly. I was really confused there, thinking that this was perhaps not true for CUDA and only for HIP.

@IMbackK
Collaborator

IMbackK commented Apr 20, 2026

Did you consider replacing the legacy buffer pool with cudaMallocAsync and cudaFreeAsync? The only reason we are not using those for CUDA is that the performance was worse vs. manually managed chunks of memory but I don't think I ever checked the performance for HIP.

I did try this, it is slow. Hopefully they will fix the damn virtual memory support soon.

@gaugarg-nv
Contributor

gaugarg-nv commented Apr 20, 2026

I'd suggest if you go with this fix, don't rely on this behavior of cudaFree. It is better to add an explicit cudaDeviceSynchronize or consider using the async version of cudaMalloc and cudaFree as @JohannesGaessler suggested.

The CUDA documentation for cudaFree says: "For all other pointers, this API may perform implicit synchronization."
So, I won't rely on this.

Comment on lines +382 to +392
void clear_pool() {
    for (int i = 0; i < MAX_BUFFERS; ++i) {
        ggml_cuda_buffer & b = buffer_pool[i];
        if (b.ptr != nullptr) {
            CUDA_CHECK(cudaFree(b.ptr));
            pool_size -= b.size;
            b.ptr = nullptr;
            b.size = 0;
        }
    }
}
Contributor


To ensure consistency, please call clear_pool in the destructor.

Comment thread on ggml/src/ggml-cuda/ggml-cuda.cu (outdated)
    size_t look_ahead_size = (size_t) (1.05 * size);
    look_ahead_size = 256 * ((look_ahead_size + 255)/256);
    ggml_cuda_set_device(device);
#if defined(GGML_USE_MUSA)
Contributor


If at all possible, please just add the missing defines to the MUSA header if they're missing rather than to have per-vendor logic.

@IMbackK
Collaborator

IMbackK commented Apr 20, 2026

The CUDA documentation for cudaFree says: "For all other pointers, this API may perform implicit synchronization." So, I won't rely on this.

On HIP there is no "may", so you would be safe there; doing an explicit synchronization would not hurt, of course.

@leonardHONG
Contributor Author

Thanks everyone for the amazing clarifications! I really appreciate the guidance. I will apply all the suggested fixes (adding explicit sync, updating the destructor, and cleaning up the MUSA logic) and push an update tonight.

@JohannesGaessler
Contributor

The CUDA documentation for cudaFree says: "For all other pointers, this API may perform implicit synchronization."

Thank you for correcting me, it seems I suffered from confirmation bias and should have been more careful. The way I had remembered it the synchronization was guaranteed, that is what Gemini said when I asked "Does cudaFree imply a device synchronization?", and that is what I read when I quickly checked the CUDA documentation it linked. I agree that there should be an explicit synchronization to be safe.

@gaugarg-nv
Contributor

AFAIK, there is always a sync with cudaFree, but documentation doesn't guarantee it. So, it is better to make it explicit in the code. Here is the document I was referring to.

@github-actions github-actions Bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Apr 20, 2026
…up MUSA macros

Signed-off-by: 梁厚宏 <2695316095@qq.com>
@leonardHONG
Contributor Author

Thanks for the feedback — I pushed an update.

This revision:

  • reuses clear_pool() in the destructor
  • makes clear_pool() self-contained with device selection
  • adds an explicit synchronization before clearing cached legacy-pool buffers on the OOM retry path
  • removes the MUSA-specific branch by adding the corresponding error alias in the vendor header

I also reran test-backend-ops locally and it passes on my side.

TheTom added a commit to TheTom/llama-cpp-turboquant that referenced this pull request Apr 20, 2026
On HIP without VMM, the legacy pool retains these buffers at peak size, causing quantized KV to OOM before f16. ggml_cuda_direct_alloc<T>
uses raw hipMalloc/hipFree instead. HIP-only, complements ggml-org#22155.

Fixes ggml-org#22107 without performance degradation.
Tested: gfx1100, gfx1200, gfx1201.
@IMbackK IMbackK merged commit 9789512 into ggml-org:master Apr 20, 2026
50 of 52 checks passed
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Apr 21, 2026
* ggml-cuda: flush legacy pool on OOM and retry

Signed-off-by: 梁厚宏 <2695316095@qq.com>

* Address review comments: add explicit sync, update destructor, clean up MUSA macros

Signed-off-by: 梁厚宏 <2695316095@qq.com>

---------

Signed-off-by: 梁厚宏 <2695316095@qq.com>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Apr 23, 2026
* ggml-cuda: flush legacy pool on OOM and retry

Signed-off-by: 梁厚宏 <2695316095@qq.com>

* Address review comments: add explicit sync, update destructor, clean up MUSA macros

Signed-off-by: 梁厚宏 <2695316095@qq.com>

---------

Signed-off-by: 梁厚宏 <2695316095@qq.com>
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
* ggml-cuda: flush legacy pool on OOM and retry

Signed-off-by: 梁厚宏 <2695316095@qq.com>

* Address review comments: add explicit sync, update destructor, clean up MUSA macros

Signed-off-by: 梁厚宏 <2695316095@qq.com>

---------

Signed-off-by: 梁厚宏 <2695316095@qq.com>
jimbothigpen pushed a commit to jimbothigpen/frankenturbo2 that referenced this pull request May 2, 2026
* ggml-cuda: flush legacy pool on OOM and retry

Signed-off-by: 梁厚宏 <2695316095@qq.com>

* Address review comments: add explicit sync, update destructor, clean up MUSA macros

Signed-off-by: 梁厚宏 <2695316095@qq.com>

---------

Signed-off-by: 梁厚宏 <2695316095@qq.com>