Environment
- OS: Windows 7 SP1 (x86-64) + VxKex NEXT
- GPU: NVIDIA GeForce RTX 3060 (12 GB)
- Driver: 474.11
- CUDA: 11.4.2
- Compiler: Visual Studio 16 2019
- llama.cpp version: llama.cpp-b8833
- Relevant command-line flags: -b 2048 -ub 512 --flash-attn on
Problem Description
When running llama-server and sending a long-context prompt (~20k tokens) for prefill, GPU memory rises step-wise until it hits the physical limit and CUDA throws out of memory.
```
CUDA error: out of memory
ggml/src/ggml-cuda/ggml-cuda.cu: CUDA error
current device: 0, in function alloc
ggml_cuda_device_malloc(&ptr, look_ahead_size, device)
```
The same binary runs fine on Windows 10/11 with identical hardware. The issue is 100% reproducible only on Win7.
Call Chain Analysis
The memory leak is rooted in the interaction between long-context prefill, Flash Attention workspace allocation, and the Legacy CUDA memory pool. The complete call chain is as follows:
- llama_context::decode() receives a batch of tokens and enters a do-while loop. For long prompts, the batch is split into multiple ubatches (each up to --ubatch-size tokens, default 512), so a ~20k-token prompt prefills in roughly 40 steps (see the first sketch after this list).
- For each ubatch, process_ubatch() builds a computation graph (ggml_cgraph) containing the Flash Attention node (GGML_OP_FLASH_ATTN_EXT). During prefill, the KV cache length (n_kv) represented in this graph grows monotonically with each successive ubatch.
- When the scheduler executes the graph, ggml_backend_cuda_graph_compute() dispatches each node to ggml_cuda_compute_forward(). For Flash Attention, this routes to ggml_cuda_flash_attn_ext() and its specialized variants (vec, tile, mma, etc.).
- Inside the Flash Attention kernel launchers (e.g. flash_attn_ext_vec_f16, flash_attn_ext_tile_f16), temporary workspace buffers are allocated via the helper template ggml_cuda_pool_alloc<T> (a minimal stand-in for this pattern follows the list):
  - K_f16 — quantized K cache dequantized to FP16
  - V_f16 — quantized V cache dequantized to FP16
  - dst_tmp — partial attention results when using parallel reduction
  - dst_tmp_meta — metadata for the fixup kernels
- ggml_cuda_pool_alloc::alloc() calls ggml_cuda_pool::alloc() on the active stream's pool. On Win7, VMM is unavailable, so the backend instantiates ggml_cuda_pool_leg (Legacy Pool).
- The Legacy Pool's alloc() implements a best-fit strategy over a fixed-size array (256 slots). If no cached buffer is large enough, it performs a fresh cudaMalloc with a 5% lookahead padding. The free() method returns the buffer to the pool; if the pool is full, it calls cudaFree.
- During prefill, n_kv increases with every ubatch, so the Flash Attention workspace size requested from the pool grows monotonically. The Legacy Pool's best-fit strategy breaks down:
  - Old buffers (sized for smaller n_kv) can never satisfy the new, larger requests.
  - Yet they remain cached in the pool, holding physical VRAM pages.
  - Well before the 256-slot limit is reached, several gigabytes of "expired" buffers accumulate.
  - On Win7 with WDDM 1.x, even when the pool eventually overflows and calls cudaFree, physical reclamation is heavily delayed.
  - Result: physical VRAM is exhausted by cached-but-unreusable buffers, triggering OOM.
- During decode (ubatch.n_tokens == 1), the workspace size is stable (< 50 MiB), buffers are aggressively reused, and the Legacy Pool behaves correctly. This is why the issue only manifests during prefill.
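To put rough numbers on the accumulation, here is a back-of-the-envelope sketch (standalone C++). The 512-token ubatch size, the ~20k prompt, and the 5% lookahead come from this report; bytes_per_kv is an illustrative assumption standing in for the model- and kernel-dependent FP16 workspace size, and the no-reuse behavior is the one described above, not a measured trace:

```cpp
#include <cstdio>
#include <cstddef>

int main() {
    const size_t n_prompt     = 20480; // assumption: a "~20k token" prompt
    const size_t n_ubatch     = 512;   // --ubatch-size default, per this report
    const size_t bytes_per_kv = 4096;  // assumption: FA workspace bytes per KV position

    const size_t n_steps = (n_prompt + n_ubatch - 1) / n_ubatch; // 40 ubatches
    size_t held = 0; // bytes parked in the pool, too small for any later request

    for (size_t step = 1; step <= n_steps; ++step) {
        const size_t n_kv = step * n_ubatch;     // KV length seen by this ubatch
        const size_t req  = n_kv * bytes_per_kv; // workspace request grows with n_kv
        held += req + req / 20;                  // fresh cudaMalloc + 5% lookahead, never reused
    }
    printf("%zu ubatches, ~%.2f GiB stranded in the pool by the end of prefill\n",
           n_steps, held / (1024.0 * 1024.0 * 1024.0));
}
```

With these assumptions a single workspace kind strands about 1.7 GiB; with several workspace buffers per kernel (K_f16, V_f16, dst_tmp, dst_tmp_meta), the "several gigabytes" figure above is easily reached.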
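And here is a minimal host-side stand-in for the ggml_cuda_pool_alloc<T> / pool interaction from the call chain. This is a sketch of the pattern, not the real ggml code: host memory plays the role of VRAM, the lookup is simplified to first-fit where the real Legacy Pool does best-fit over 256 fixed slots, and the names pool_sketch and pool_alloc_sketch are invented for illustration:

```cpp
#include <cstddef>
#include <cstdio>
#include <new>
#include <utility>
#include <vector>

// Stand-in pool: "freeing" parks the buffer instead of releasing it.
struct pool_sketch {
    std::vector<std::pair<void *, size_t>> cached; // parked buffers, still holding memory

    void * alloc(size_t size) {
        // reuse a cached buffer that fits (simplified to first-fit here); on a
        // miss, fall back to a fresh allocation, as the real pool falls back
        // to cudaMalloc with lookahead padding
        for (auto it = cached.begin(); it != cached.end(); ++it) {
            if (it->second >= size) {
                void * p = it->first;
                cached.erase(it);
                return p;
            }
        }
        return ::operator new(size);
    }

    void free(void * ptr, size_t size) {
        cached.push_back({ptr, size}); // parked in the pool, never returned to the "driver"
    }
};

// Stand-in for ggml_cuda_pool_alloc<T>: borrow in the constructor, park back
// in the destructor.
template <typename T>
struct pool_alloc_sketch {
    pool_sketch & pool;
    T *    ptr   = nullptr;
    size_t bytes = 0;

    pool_alloc_sketch(pool_sketch & p, size_t n) : pool(p), bytes(n * sizeof(T)) {
        ptr = (T *) pool.alloc(bytes);
    }
    ~pool_alloc_sketch() {
        pool.free(ptr, bytes);
    }
};

int main() {
    pool_sketch pool;
    {
        pool_alloc_sketch<unsigned short> K_f16(pool, 512 * 128); // like a small FA workspace
    } // scope exit: the destructor parks the buffer, it is not actually released
    printf("buffers parked in pool: %zu\n", pool.cached.size()); // prints 1
}
```

If a later request is even one byte larger than every parked buffer, the loop in alloc() finds nothing and a brand-new allocation is made while the old ones stay parked: exactly the prefill pattern described above.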
Root Cause
On Win7, CUDA falls back to ggml_cuda_pool_leg (Legacy Pool) because VMM is unsupported.
During prefill, the Flash Attention workspace size monotonically increases with n_kv. The Legacy Pool's best-fit caching strategy fails in this scenario:
- Old large buffers can never satisfy the new, even larger requests
- Yet they remain cached in the pool, occupying physical VRAM pages
- The pool accumulates several GB of "expired" buffers before hitting the 256-slot limit
- WDDM 1.x delays physical reclamation after cudaFree, exacerbating the problem
- Result: physical VRAM is exhausted by cached-but-unreusable buffers, triggering OOM

The decode phase is unaffected because the workspace size is stable (< 50 MiB) and buffers are heavily reused.
Proposed Fix
Purge all CUDA stream pools at each ubatch boundary during prefill.
In llama_context::decode(), after each process_ubatch() completes, if ubatch.n_tokens > 1 (prefill), call ggml_backend_sched_synchronize() followed by a pool purge.
This clears all accumulated buffers before they can grow unbounded, while leaving the decode path completely untouched.
Why ggml_backend_sched_synchronize() is mandatory
Before purging the pool, we must ensure the GPU has finished all work submitted for the current ubatch. Flash Attention kernels hold raw pointers (K_f16.ptr, dst_tmp.ptr, etc.) obtained from the pool. If we destroy those buffers via cudaFree while kernels are still executing, the pointers become dangling, causing illegal memory access or GPU segfaults. synchronize() provides the necessary execution barrier.
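As an illustration of that ordering requirement, here is a hypothetical standalone CUDA program (not llama.cpp code): the kernel launch returns to the host immediately, so the host must not release the kernel's workspace until the stream has drained. cudaStreamSynchronize() plays the role here that ggml_backend_sched_synchronize() plays in the proposed fix:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float * ws, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        ws[i] *= 2.0f;
    }
}

int main() {
    const int n = 1 << 20;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float * ws = nullptr;
    cudaMalloc(&ws, n * sizeof(float));                // stands in for a pool workspace buffer
    scale<<<(n + 255) / 256, 256, 0, stream>>>(ws, n); // asynchronous: returns before the GPU finishes

    cudaStreamSynchronize(stream);                     // execution barrier: all work on ws is complete
    cudaFree(ws);                                      // now safe to release the buffer

    cudaStreamDestroy(stream);
    printf("done\n");
    return 0;
}
```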
Minimal code changes (against llama.cpp-b8833):
ggml/src/ggml-cuda/ggml-cuda.cu — add helper:
```cpp
void ggml_backend_cuda_purge_pools(ggml_backend_t backend) {
    if (!ggml_backend_is_cuda(backend)) {
        return;
    }
    ggml_backend_cuda_context * cuda_ctx = (ggml_backend_cuda_context *) backend->context;
    for (int is = 0; is < GGML_CUDA_MAX_STREAMS; ++is) {
        cuda_ctx->pools[cuda_ctx->device][is].reset(); // triggers destructor -> cudaFree all cached buffers
    }
}
```
src/llama-context.cpp — add declaration near top:
```cpp
extern void ggml_backend_cuda_purge_pools(ggml_backend_t backend);
```
Inside decode(), in the do-while loop after process_ubatch():
```cpp
if (ubatch.n_tokens > 1) { // prefill only; decode ubatches have n_tokens == 1
    ggml_backend_sched_synchronize(sched.get());
    for (auto & backend : backends) {
        ggml_backend_cuda_purge_pools(backend.get());
    }
}
```
Note on pool recreation: The ggml_backend_cuda_context::pool() accessor lazily creates a new pool via new_pool_for_device() when pools[device][stream] is null. After reset(), the next allocation on that stream will transparently instantiate a fresh pool. No manual re-initialization is required.
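A tiny sketch of that lazy-recreation pattern (stand-in types only; per the note above, the real accessor is ggml_backend_cuda_context::pool() calling new_pool_for_device()):

```cpp
#include <cstdio>
#include <memory>

struct pool_stub {
    pool_stub()  { printf("pool created\n"); }
    ~pool_stub() { printf("pool destroyed, cached buffers freed\n"); }
};

struct ctx_sketch {
    static const int MAX_STREAMS = 8; // stand-in for GGML_CUDA_MAX_STREAMS
    std::unique_ptr<pool_stub> pools[MAX_STREAMS];

    pool_stub & get_pool(int stream) {
        if (!pools[stream]) {
            pools[stream] = std::make_unique<pool_stub>(); // lazily created on first use
        }
        return *pools[stream];
    }

    void purge(int stream) {
        pools[stream].reset(); // destructor frees everything the pool cached
    }
};

int main() {
    ctx_sketch ctx;
    ctx.get_pool(0); // "pool created"
    ctx.purge(0);    // "pool destroyed, cached buffers freed"
    ctx.get_pool(0); // "pool created" again: no manual re-initialization needed
}
```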
Trade-off
- Prefill performance: Degraded slightly because each ubatch boundary now synchronizes the GPU.
- Decode performance: Completely unchanged (ubatch.n_tokens == 1, so the condition is skipped).
- VRAM stability: Prefill memory recovers per ubatch instead of accumulating to OOM.
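To visualize the VRAM-stability point under the same illustrative assumptions as the earlier back-of-the-envelope sketch (bytes_per_kv and the no-reuse behavior remain assumptions, not measurements), here is a toy comparison of pool residency with and without the per-ubatch purge:

```cpp
#include <cstdio>
#include <cstddef>

// Bytes parked in the pool at the end of prefill, with and without the purge.
static double stranded_gib(bool purge_each_ubatch) {
    const size_t n_ubatch = 512, bytes_per_kv = 4096, n_steps = 40; // same assumptions as above
    size_t held = 0;
    for (size_t step = 1; step <= n_steps; ++step) {
        const size_t req = step * n_ubatch * bytes_per_kv; // workspace grows with n_kv
        held += req + req / 20;                            // fresh alloc + 5% lookahead, parked after use
        if (purge_each_ubatch) {
            held = 0; // ggml_backend_cuda_purge_pools() drops everything at the boundary
        }
    }
    return held / (1024.0 * 1024.0 * 1024.0);
}

int main() {
    printf("parked after prefill, without fix: %.2f GiB\n", stranded_gib(false));
    printf("parked after prefill, with fix   : %.2f GiB\n", stranded_gib(true));
}
```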