Environment
- OS: Windows 7 SP1 (x86-64) + VxKex NEXT
- GPU: NVIDIA GeForce RTX 3060 (12 GB)
- Driver: 474.11
- CUDA: 11.4.2
- Compiler: Visual Studio 16 2019
- llama.cpp version: llama.cpp-b8833
- Relevant command-line flags: -b 2048 -ub 512 --flash-attn on
Problem Description
When running llama-server and sending a long-context prompt (~20k tokens) for prefill, GPU memory rises step-wise until it hits the physical limit and CUDA throws out of memory.
```
CUDA error: out of memory
ggml/src/ggml-cuda/ggml-cuda.cu: CUDA error
current device: 0, in function alloc
ggml_cuda_device_malloc(&ptr, look_ahead_size, device)
```
The same binary runs fine on Windows 10/11 with identical hardware. The issue is 100% reproducible only on Win7.
Call Chain Analysis
The memory leak is rooted in the interaction between long-context prefill, Flash Attention workspace allocation, and the Legacy CUDA memory pool. The complete call chain is as follows:
- llama_context::decode() receives a batch of tokens and enters a do-while loop. For long prompts, the batch is split into multiple ubatches (each up to --ubatch-size tokens, default 512), so a ~20k-token prompt prefills in roughly 40 steps (see the first sketch after this list).
- For each ubatch, process_ubatch() builds a computation graph (ggml_cgraph) containing the Flash Attention node (GGML_OP_FLASH_ATTN_EXT). During prefill, the KV cache length (n_kv) represented in this graph grows monotonically with each successive ubatch.
- When the scheduler executes the graph, ggml_backend_cuda_graph_compute() dispatches each node to ggml_cuda_compute_forward(). For Flash Attention, this routes to ggml_cuda_flash_attn_ext() and its specialized variants (vec, tile, mma, etc.).
- Inside the Flash Attention kernel launchers (e.g. flash_attn_ext_vec_f16, flash_attn_ext_tile_f16), temporary workspace buffers are allocated via the helper template ggml_cuda_pool_alloc<T> (a minimal stand-in for this pattern follows the list):
  - K_f16 — quantized K cache dequantized to FP16
  - V_f16 — quantized V cache dequantized to FP16
  - dst_tmp — partial attention results when using parallel reduction
  - dst_tmp_meta — metadata for the fixup kernels
- ggml_cuda_pool_alloc::alloc() calls ggml_cuda_pool::alloc() on the active stream's pool. On Win7, VMM is unavailable, so the backend instantiates ggml_cuda_pool_leg (Legacy Pool).
- The Legacy Pool's alloc() implements a best-fit strategy over a fixed-size array (256 slots). If no cached buffer is large enough, it performs a fresh cudaMalloc with a 5% lookahead padding. The free() method returns the buffer to the pool; if the pool is full, it calls cudaFree.
- During prefill, n_kv increases with every ubatch, so the Flash Attention workspace size requested from the pool grows monotonically. The Legacy Pool's best-fit strategy breaks down:
  - Old buffers (sized for smaller n_kv) can never satisfy the new, larger requests.
  - Yet they remain cached in the pool, holding physical VRAM pages.
  - Well before the 256-slot limit is reached, several gigabytes of "expired" buffers accumulate.
  - On Win7 with WDDM 1.x, even when the pool eventually overflows and calls cudaFree, physical reclamation is heavily delayed.
  - Result: physical VRAM is exhausted by cached-but-unreusable buffers, triggering OOM.
- During decode (ubatch.n_tokens == 1), the workspace size is stable (< 50 MiB), buffers are aggressively reused, and the Legacy Pool behaves correctly. This is why the issue only manifests during prefill.
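To put rough numbers on the accumulation, here is a back-of-the-envelope sketch (standalone C++). The 512-token ubatch size, the ~20k prompt, and the 5% lookahead come from this report; bytes_per_kv is an illustrative assumption standing in for the model- and kernel-dependent FP16 workspace size, and the no-reuse behavior is the one described above, not a measured trace:

```cpp
#include <cstdio>
#include <cstddef>

int main() {
    const size_t n_prompt     = 20480; // assumption: a "~20k token" prompt
    const size_t n_ubatch     = 512;   // --ubatch-size default, per this report
    const size_t bytes_per_kv = 4096;  // assumption: FA workspace bytes per KV position

    const size_t n_steps = (n_prompt + n_ubatch - 1) / n_ubatch; // 40 ubatches
    size_t held = 0; // bytes parked in the pool, too small for any later request

    for (size_t step = 1; step <= n_steps; ++step) {
        const size_t n_kv = step * n_ubatch;     // KV length seen by this ubatch
        const size_t req  = n_kv * bytes_per_kv; // workspace request grows with n_kv
        held += req + req / 20;                  // fresh cudaMalloc + 5% lookahead, never reused
    }
    printf("%zu ubatches, ~%.2f GiB stranded in the pool by the end of prefill\n",
           n_steps, held / (1024.0 * 1024.0 * 1024.0));
}
```

With these assumptions a single workspace kind strands about 1.7 GiB; with several workspace buffers per kernel (K_f16, V_f16, dst_tmp, dst_tmp_meta), the "several gigabytes" figure above is easily reached.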
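And here is a minimal host-side stand-in for the ggml_cuda_pool_alloc<T> / pool interaction from the call chain. This is a sketch of the pattern, not the real ggml code: host memory plays the role of VRAM, the lookup is simplified to first-fit where the real Legacy Pool does best-fit over 256 fixed slots, and the names pool_sketch and pool_alloc_sketch are invented for illustration:

```cpp
#include <cstddef>
#include <cstdio>
#include <new>
#include <utility>
#include <vector>

// Stand-in pool: "freeing" parks the buffer instead of releasing it.
struct pool_sketch {
    std::vector<std::pair<void *, size_t>> cached; // parked buffers, still holding memory

    void * alloc(size_t size) {
        // reuse a cached buffer that fits (simplified to first-fit here); on a
        // miss, fall back to a fresh allocation, as the real pool falls back
        // to cudaMalloc with lookahead padding
        for (auto it = cached.begin(); it != cached.end(); ++it) {
            if (it->second >= size) {
                void * p = it->first;
                cached.erase(it);
                return p;
            }
        }
        return ::operator new(size);
    }

    void free(void * ptr, size_t size) {
        cached.push_back({ptr, size}); // parked in the pool, never returned to the "driver"
    }
};

// Stand-in for ggml_cuda_pool_alloc<T>: borrow in the constructor, park back
// in the destructor.
template <typename T>
struct pool_alloc_sketch {
    pool_sketch & pool;
    T *    ptr   = nullptr;
    size_t bytes = 0;

    pool_alloc_sketch(pool_sketch & p, size_t n) : pool(p), bytes(n * sizeof(T)) {
        ptr = (T *) pool.alloc(bytes);
    }
    ~pool_alloc_sketch() {
        pool.free(ptr, bytes);
    }
};

int main() {
    pool_sketch pool;
    {
        pool_alloc_sketch<unsigned short> K_f16(pool, 512 * 128); // like a small FA workspace
    } // scope exit: the destructor parks the buffer, it is not actually released
    printf("buffers parked in pool: %zu\n", pool.cached.size()); // prints 1
}
```

If a later request is even one byte larger than every parked buffer, the loop in alloc() finds nothing and a brand-new allocation is made while the old ones stay parked: exactly the prefill pattern described above.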
Root Cause
On Win7, CUDA falls back to ggml_cuda_pool_leg (Legacy Pool) because VMM is unsupported.
During prefill, the Flash Attention workspace size monotonically increases with n_kv. The Legacy Pool's best-fit caching strategy fails in this scenario:
- Old large buffers can never satisfy the new, even larger requests
- Yet they remain cached in the pool, occupying physical VRAM pages
- The pool accumulates several GB of "expired" buffers before hitting the 256-slot limit
- WDDM 1.x delays physical reclamation after cudaFree, exacerbating the problem
- Result: physical VRAM is exhausted by cached-but-unreusable buffers, triggering OOM

The decode phase is unaffected because the workspace size is stable (< 50 MiB) and buffers are heavily reused.
Proposed Fix
Purge all CUDA stream pools at each ubatch boundary during prefill.
In llama_context::decode(), after each process_ubatch() completes, if ubatch.n_tokens > 1 (prefill), call ggml_backend_sched_synchronize() followed by a pool purge.
This clears all accumulated buffers before they can grow unbounded, while leaving the decode path completely untouched.
Why ggml_backend_sched_synchronize() is mandatory
Before purging the pool, we must ensure the GPU has finished all work submitted for the current ubatch. Flash Attention kernels hold raw pointers (K_f16.ptr, dst_tmp.ptr, etc.) obtained from the pool. If we destroy those buffers via cudaFree while kernels are still executing, the pointers become dangling, causing illegal memory access or GPU segfaults. synchronize() provides the necessary execution barrier.
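As an illustration of that ordering requirement, here is a hypothetical standalone CUDA program (not llama.cpp code): the kernel launch returns to the host immediately, so the host must not release the kernel's workspace until the stream has drained. cudaStreamSynchronize() plays the role here that ggml_backend_sched_synchronize() plays in the proposed fix:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float * ws, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        ws[i] *= 2.0f;
    }
}

int main() {
    const int n = 1 << 20;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float * ws = nullptr;
    cudaMalloc(&ws, n * sizeof(float));                // stands in for a pool workspace buffer
    scale<<<(n + 255) / 256, 256, 0, stream>>>(ws, n); // asynchronous: returns before the GPU finishes

    cudaStreamSynchronize(stream);                     // execution barrier: all work on ws is complete
    cudaFree(ws);                                      // now safe to release the buffer

    cudaStreamDestroy(stream);
    printf("done\n");
    return 0;
}
```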
Minimal code changes (against llama.cpp-b8833):
ggml/src/ggml-cuda/ggml-cuda.cu — add helper:
```cpp
void ggml_backend_cuda_purge_pools(ggml_backend_t backend) {
    if (!ggml_backend_is_cuda(backend)) {
        return;
    }
    ggml_backend_cuda_context * cuda_ctx = (ggml_backend_cuda_context *) backend->context;
    for (int is = 0; is < GGML_CUDA_MAX_STREAMS; ++is) {
        cuda_ctx->pools[cuda_ctx->device][is].reset(); // triggers destructor -> cudaFree all cached buffers
    }
}
```
src/llama-context.cpp — add declaration near top:
```cpp
extern void ggml_backend_cuda_purge_pools(ggml_backend_t backend);
```
Inside decode(), in the do-while loop after process_ubatch():
```cpp
if (ubatch.n_tokens > 1) { // prefill only; decode ubatches have n_tokens == 1
    ggml_backend_sched_synchronize(sched.get());
    for (auto & backend : backends) {
        ggml_backend_cuda_purge_pools(backend.get());
    }
}
```
Note on pool recreation: The ggml_backend_cuda_context::pool() accessor lazily creates a new pool via new_pool_for_device() when pools[device][stream] is null. After reset(), the next allocation on that stream will transparently instantiate a fresh pool. No manual re-initialization is required.
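A tiny sketch of that lazy-recreation pattern (stand-in types only; per the note above, the real accessor is ggml_backend_cuda_context::pool() calling new_pool_for_device()):

```cpp
#include <cstdio>
#include <memory>

struct pool_stub {
    pool_stub()  { printf("pool created\n"); }
    ~pool_stub() { printf("pool destroyed, cached buffers freed\n"); }
};

struct ctx_sketch {
    static const int MAX_STREAMS = 8; // stand-in for GGML_CUDA_MAX_STREAMS
    std::unique_ptr<pool_stub> pools[MAX_STREAMS];

    pool_stub & get_pool(int stream) {
        if (!pools[stream]) {
            pools[stream] = std::make_unique<pool_stub>(); // lazily created on first use
        }
        return *pools[stream];
    }

    void purge(int stream) {
        pools[stream].reset(); // destructor frees everything the pool cached
    }
};

int main() {
    ctx_sketch ctx;
    ctx.get_pool(0); // "pool created"
    ctx.purge(0);    // "pool destroyed, cached buffers freed"
    ctx.get_pool(0); // "pool created" again: no manual re-initialization needed
}
```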
Trade-off
- Prefill performance: Degraded slightly because each ubatch boundary now synchronizes the GPU.
- Decode performance: Completely unchanged (ubatch.n_tokens == 1, so the condition is skipped).
- VRAM stability: Prefill memory recovers per ubatch instead of accumulating to OOM.
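To visualize the VRAM-stability point under the same illustrative assumptions as the earlier back-of-the-envelope sketch (bytes_per_kv and the no-reuse behavior remain assumptions, not measurements), here is a toy comparison of pool residency with and without the per-ubatch purge:

```cpp
#include <cstdio>
#include <cstddef>

// Bytes parked in the pool at the end of prefill, with and without the purge.
static double stranded_gib(bool purge_each_ubatch) {
    const size_t n_ubatch = 512, bytes_per_kv = 4096, n_steps = 40; // same assumptions as above
    size_t held = 0;
    for (size_t step = 1; step <= n_steps; ++step) {
        const size_t req = step * n_ubatch * bytes_per_kv; // workspace grows with n_kv
        held += req + req / 20;                            // fresh alloc + 5% lookahead, parked after use
        if (purge_each_ubatch) {
            held = 0; // ggml_backend_cuda_purge_pools() drops everything at the boundary
        }
    }
    return held / (1024.0 * 1024.0 * 1024.0);
}

int main() {
    printf("parked after prefill, without fix: %.2f GiB\n", stranded_gib(false));
    printf("parked after prefill, with fix   : %.2f GiB\n", stranded_gib(true));
}
```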