fix: force VEC FA path for quantized KV on HIP/ROCm #90

TheTom merged 2 commits into feature/turboquant-kv-cache from fix/hip-force-vec-quantized-kv
Conversation
The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers proportional to full KV cache length for any quantized KV type. On ROCm/HIP these pool allocations persist at peak size, meaning the temp buffer VRAM exceeds the savings from KV compression. This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM before f16 at the same context length. Confirmed on gfx1100 (RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock llama.cpp q4_0/q8_0 (not TurboQuant-specific).

Fix: on HIP, force the VEC path for quantized KV when available (head_dim <= 256). VEC does inline dequant with no temp buffer.

Trade-off: prefill throughput may decrease (VEC processes queries sequentially). Decode is unaffected since VEC was already selected for single-token generation.

Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512) cannot use VEC and still routes through TILE. A bounded temp buffer in a separate compilation unit is the proper fix for those cases (see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).

Refs: ggml-org#20969 (community reports)

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations permanently. For quantized KV flash attention, the f16 dequant temp buffers stay allocated in the pool after use, consuming more VRAM than the KV compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM before f16 at equivalent context lengths on HIP/ROCm.

Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with hipFree (via RAII destructor) instead of the pool. Memory is released after the FA kernel completes via hipStreamSynchronize.

Compared to commit 1 (VEC force), this approach recovers prefill:
pp32768 q8_0/q8_0:   1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
Decode: unchanged across all configs

Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k). Can be eliminated in future with hipFreeAsync (stream-ordered free).

Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory. On HIP without VMM (gfx1100/gfx1200), the pool permanently consumes peak VRAM.

Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)

Refs: ggml-org#20969

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Will test this today on my 9070 XT using a non-Gemma model. Thanks! |
|
Validation on RX 9070 XT (gfx1201) - 16GB VRAM

Tested PR #90 using a much heavier 27B model (Qwen 3.5 27B IQ3_M) to aggressively stress-test the 16GB VRAM boundary at 65536 context. The patch works exactly as described.

Baseline (f16 KV Cache):

$ ./llama-bench -m Qwopus3.5-27B-IQ3_M.gguf -ngl 99 -fa 1 -ctk f16 -ctv f16 -p 512 -n 128 -d 0,32768,65536 -r 1

Patched (q8_0/turbo4 KV Cache):

$ ./llama-bench -m Qwopus3.5-27B-IQ3_M.gguf -ngl 99 -fa 1 -ctk q8_0 -ctv turbo4 -p 512 -n 128 -d 0,32768,65536 -r 1

The prefill speed overhead you mentioned is present (e.g., dropping from 1198 t/s to 1138 t/s at d0), but decode speeds (tg128) remained completely locked at 27 t/s. This is a massive stability win for RDNA hardware without VMM! Thanks for the patch!
|
This is exactly the validation I needed. f16 OOM at 65K, q8_0/turbo4 survives and computes. That's the fix working as designed on a real workload. Prefill overhead at d0 (1198 → 1138, -5%) matches what we saw on the 1.5B model. Decode locked at 27 t/s both configs. Clean result. Thank you for testing with the 27B model, that was the missing piece. Updating the PR with your results. |
Community Validation: OOM Fix Confirmed

apollo-mg tested on RX 9070 XT (gfx1201, 16GB) with Qwen3.5-27B IQ3_M (11.71 GiB) at 65536 context:
f16 crashes at 65K context. q8_0/turbo4 with the pool bypass survives and computes. Prefill overhead -5% at d0, decode unchanged. This confirms the OOM fix on a real workload (27B model on 16GB VRAM). Combined with the initial 1.5B model testing (functional correctness, prefill recovery from -65% to -5%), the fix is validated across two model sizes on gfx1201. Still welcome testing from gfx1100 (RX 7900 series) and gfx1200 (RX 9060 XT) owners. |
|
I ran your tests on an RX 7900 XTX (gfx1100, 24GB), and it looks like it works. f16 failed during the run below:

llama-bench -dev ROCm0 -m ~/Downloads/unsloth_Qwen3.5-27B-GGUF_Qwen3.5-27B-UD-Q5_K_XL.gguf -ngl 99 -fa 1 -ctk f16 -ctv f16 -p 512 -n 128 -d 0,32768,65536,98304,131072,163840,196608,229376,262144 -r 1
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 24560 MiB):
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
| model | size | params | backend | ngl | fa | dev | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------ | --------------: | -------------------: |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | 1 | ROCm0 | pp512 | 900.64 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | 1 | ROCm0 | tg128 | 22.60 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | 1 | ROCm0 | pp512 @ d32768 | 495.88 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | 1 | ROCm0 | tg128 @ d32768 | 21.30 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | 1 | ROCm0 | pp512 @ d65536 | 340.45 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | 1 | ROCm0 | tg128 @ d65536 | 20.22 ± 0.00 |
main: error: failed to create context with model '/home/caleb/Downloads/unsloth_Qwen3.5-27B-GGUF_Qwen3.5-27B-UD-Q5_K_XL.gguf'

llama-bench -dev ROCm0 -m ~/Downloads/unsloth_Qwen3.5-27B-GGUF_Qwen3.5-27B-UD-Q5_K_XL.gguf -ngl 99 -fa 1 -ctk q8_0 -ctv turbo4 -p 512 -n 128 -d 0,32768,65536,98304,131072,163840,196608,229376,262144 -r 1
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 24560 MiB):
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
| model | size | params | backend | ngl | type_k | type_v | fa | dev | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -----: | -: | ------------ | --------------: | -------------------: |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | q8_0 | turbo4 | 1 | ROCm0 | pp512 | 877.63 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | q8_0 | turbo4 | 1 | ROCm0 | tg128 | 23.01 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | q8_0 | turbo4 | 1 | ROCm0 | pp512 @ d32768 | 229.62 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | q8_0 | turbo4 | 1 | ROCm0 | tg128 @ d32768 | 20.49 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | q8_0 | turbo4 | 1 | ROCm0 | pp512 @ d65536 | 128.63 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | q8_0 | turbo4 | 1 | ROCm0 | tg128 @ d65536 | 18.48 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | q8_0 | turbo4 | 1 | ROCm0 | pp512 @ d98304 | 88.64 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | q8_0 | turbo4 | 1 | ROCm0 | tg128 @ d98304 | 16.53 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | q8_0 | turbo4 | 1 | ROCm0 | pp512 @ d131072 | 67.52 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | q8_0 | turbo4 | 1 | ROCm0 | tg128 @ d131072 | 15.10 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | q8_0 | turbo4 | 1 | ROCm0 | pp512 @ d163840 | 53.40 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | q8_0 | turbo4 | 1 | ROCm0 | tg128 @ d163840 | 13.84 ± 0.00 |
/home/caleb/Downloads/llama-cpp-turboquant/ggml/src/ggml-cuda/ggml-cuda.cu:99: ROCm error |
|
Quick question: can you confirm you're running both commits from the PR? The prefill regression at 32K (495 → 229 t/s) looks like commit 1 (VEC force) is dominating. If you have both commits, the pool bypass (commit 2) should let us relax the VEC force in a follow-up and recover that prefill. You can check with git log --oneline -3; it should show both commits.
|
Both are present.

$ git log --oneline -3
0757ff4ee (HEAD -> fix/hip-force-vec-quantized-kv, origin/fix/hip-force-vec-quantized-kv) fix(hip): bypass pool for FA f16 temp buffers to prevent OOM
8993d4fd7 fix: force VEC FA path for quantized KV on HIP/ROCm
107362298 (tag: feature-turboquant-kv-cache-b8961-1073622, origin/rebase/upstream-sync-april-2026) fix: add TURBO2_0 to flash_attn auto-enable check

I started with a fresh git clone of your repo and followed your exact fetch/checkout commands.
|
Always happy to lend the 9070 XT for testing on this project. This repo is part of my daily workflow at the moment so any improvements to it help me directly too. |
|
I reran the benchmarks across the pull request commit (0757ff4), only its first commit (8993d4f), and the commit just prior to you merging them (0198d58). The regression is on both 0757ff4 and 8993d4f, and their numbers look nearly identical. The regression is not in 0198d58, as expected.

0757ff4: full pull request

Prefill regression at 32K (494 -> 228 t/s), nearly identical to my original run, as expected.

llama-bench -dev ROCm0 -m ~/Downloads/unsloth_Qwen3.5-27B-GGUF_Qwen3.5-27B-UD-Q5_K_XL.gguf -ngl 99 -fa 1 -ctk f16 -ctv f16 -p 512 -n 128 -d 0,32768 -r 1
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 24560 MiB):
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
| model | size | params | backend | ngl | fa | dev | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------ | --------------: | -------------------: |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | 1 | ROCm0 | pp512 | 899.17 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | 1 | ROCm0 | tg128 | 22.83 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | 1 | ROCm0 | pp512 @ d32768 | 494.47 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | 1 | ROCm0 | tg128 @ d32768 | 21.54 ± 0.00 |
build: 0757ff4ee (8963)

llama-bench -dev ROCm0 -m ~/Downloads/unsloth_Qwen3.5-27B-GGUF_Qwen3.5-27B-UD-Q5_K_XL.gguf -ngl 99 -fa 1 -ctk q8_0 -ctv turbo4 -p 512 -n 128 -d 0,32768 -r 1
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 24560 MiB):
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
| model | size | params | backend | ngl | type_k | type_v | fa | dev | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -----: | -: | ------------ | --------------: | -------------------: |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | q8_0 | turbo4 | 1 | ROCm0 | pp512 | 875.56 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | q8_0 | turbo4 | 1 | ROCm0 | tg128 | 23.14 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | q8_0 | turbo4 | 1 | ROCm0 | pp512 @ d32768 | 228.75 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | q8_0 | turbo4 | 1 | ROCm0 | tg128 @ d32768 | 20.48 ± 0.00 |
build: 0757ff4ee (8963)

8993d4f: first commit only

Prefill regression at 32K (493 -> 228 t/s), consistent with the original run and rerun.

llama-bench -dev ROCm0 -m ~/Downloads/unsloth_Qwen3.5-27B-GGUF_Qwen3.5-27B-UD-Q5_K_XL.gguf -ngl 99 -fa 1 -ctk f16 -ctv f16 -p 512 -n 128 -d 0,32768 -r 1
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 24560 MiB):
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
| model | size | params | backend | ngl | fa | dev | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------ | --------------: | -------------------: |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | 1 | ROCm0 | pp512 | 896.75 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | 1 | ROCm0 | tg128 | 23.24 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | 1 | ROCm0 | pp512 @ d32768 | 493.08 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | 1 | ROCm0 | tg128 @ d32768 | 21.87 ± 0.00 |
build: 8993d4fd7 (8962)
llama-bench -dev ROCm0 -m ~/Downloads/unsloth_Qwen3.5-27B-GGUF_Qwen3.5-27B-UD-Q5_K_XL.gguf -ngl 99 -fa 1 -ctk q8_0 -ctv turbo4 -p 512 -n 128 -d 0,32768 -r 1
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 24560 MiB):
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
| model | size | params | backend | ngl | type_k | type_v | fa | dev | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -----: | -: | ------------ | --------------: | -------------------: |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | q8_0 | turbo4 | 1 | ROCm0 | pp512 | 874.45 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | q8_0 | turbo4 | 1 | ROCm0 | tg128 | 23.09 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | q8_0 | turbo4 | 1 | ROCm0 | pp512 @ d32768 | 228.96 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | q8_0 | turbo4 | 1 | ROCm0 | tg128 @ d32768 | 20.46 ± 0.00 |
build: 8993d4fd7 (8962)
0198d58: commit prior to merge

No regression prior to the merge at 32K (494 -> 493 t/s).

$ llama-bench -dev ROCm0 -m ~/Downloads/unsloth_Qwen3.5-27B-GGUF_Qwen3.5-27B-UD-Q5_K_XL.gguf -ngl 99 -fa 1 -ctk f16 -ctv f16 -p 512 -n 128 -d 0,32768 -r 1
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 24560 MiB):
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
| model | size | params | backend | ngl | fa | dev | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------ | --------------: | -------------------: |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | 1 | ROCm0 | pp512 | 901.46 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | 1 | ROCm0 | tg128 | 23.26 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | 1 | ROCm0 | pp512 @ d32768 | 494.54 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | 1 | ROCm0 | tg128 @ d32768 | 21.91 ± 0.00 |
build: 0198d5819 (8963)

$ llama-bench -dev ROCm0 -m ~/Downloads/unsloth_Qwen3.5-27B-GGUF_Qwen3.5-27B-UD-Q5_K_XL.gguf -ngl 99 -fa 1 -ctk q8_0 -ctv turbo4 -p 512 -n 128 -d 0,32768 -r 1
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 24560 MiB):
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
| model | size | params | backend | ngl | type_k | type_v | fa | dev | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -----: | -: | ------------ | --------------: | -------------------: |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | q8_0 | turbo4 | 1 | ROCm0 | pp512 | 901.21 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | q8_0 | turbo4 | 1 | ROCm0 | tg128 | 22.36 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | q8_0 | turbo4 | 1 | ROCm0 | pp512 @ d32768 | 493.28 ± 0.00 |
| qwen35 27B Q5_K - Medium | 18.78 GiB | 26.90 B | ROCm | 99 | q8_0 | turbo4 | 1 | ROCm0 | tg128 @ d32768 | 20.00 ± 0.00 |
build: 0198d5819 (8963) |
|
Heads up: the upstream PR for this fix (ggml-org#22094) was closed by another contributor who pointed to an alternative approach (ggml-org#22155). I tested ggml-org#22155 on the same hardware and it fixes the OOM but is 3x slower at depth (129 vs 389 t/s at d40000) because it flush-retries the entire pool on every FA call. If this fix matters to you, a comment on ggml-org#22094 requesting a reopen would help. The data is all in the thread. |
|
New upstream PR with cleaner code: ggml-org#22185. Refactored the inline struct into a reusable ggml_cuda_direct_alloc template in common.cuh that mirrors the pool_alloc interface. Same fix, better structure.
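For reference, a sketch of what such a helper could look like; the name ggml_cuda_direct_alloc comes from that PR, but the interface shown here is an assumption, modeled loosely on the existing pool_alloc wrapper:

```cpp
#include <hip/hip_runtime.h>
#include <cstddef>

// Assumed interface sketch, not the code from ggml-org#22185.
template <typename T>
struct ggml_cuda_direct_alloc {
    T * ptr = nullptr;

    ggml_cuda_direct_alloc() = default;
    explicit ggml_cuda_direct_alloc(size_t n) { alloc(n); }

    ~ggml_cuda_direct_alloc() {
        if (ptr != nullptr) {
            (void) hipFree(ptr);   // released to the driver immediately, never cached
        }
    }

    T * alloc(size_t n) {
        (void) hipMalloc((void **) &ptr, n * sizeof(T));
        return ptr;
    }

    T * get() { return ptr; }
};
```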
fix: force VEC FA path for quantized KV on HIP/ROCm
Summary
Two-commit fix for a ROCm VRAM leak that causes quantized KV types to OOM before f16 at the same context length. Scoped to #ifdef GGML_USE_HIP only. No impact on Metal or CUDA codepaths.

Commit 1: Force VEC FA path for quantized KV (fattn.cu)
Forces VEC flash attention kernel on HIP when KV types are quantized and VEC is available (head_dim <= 256). VEC does inline dequant with no temp buffer allocation. This eliminates the unbounded f16 temp buffer in TILE/MMA/WMMA paths.
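A minimal sketch of the shape of that override, assuming hypothetical names for the path enum and helper (the real logic lives in the fattn.cu dispatch):

```cpp
// Hypothetical sketch; not the actual fattn.cu code.
enum class fa_path { VEC, TILE, MMA, WMMA };

static fa_path choose_fa_path(bool k_quantized, bool v_quantized,
                              int head_dim, fa_path default_path) {
#ifdef GGML_USE_HIP
    // Quantized K/V on HIP would otherwise hit TILE/MMA/WMMA, which dequantize
    // the whole cache into an f16 temp buffer. VEC dequantizes inline instead,
    // so force it whenever it supports the head size.
    if ((k_quantized || v_quantized) && head_dim <= 256) {
        return fa_path::VEC;
    }
#endif
    return default_path;
}
```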
Trade-off: prefill regression of 15-69% (VEC processes queries sequentially). Decode unaffected.
Commit 2: Bypass memory pool for FA f16 temp buffers (fattn-common.cuh)
Replaces the pool-based f16 temp buffer allocators (K_f16, V_f16) with a RAII struct using raw hipMalloc/hipFree. The pool (ggml_cuda_pool_leg) retains peak-sized allocations permanently because free() stores buffers for reuse rather than releasing them. On HIP without VMM support, this means the f16 dequant buffer stays allocated after use, consuming more VRAM than the KV compression saves.

This commit recovers the prefill regression from commit 1 while maintaining correct VRAM behavior.
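A rough sketch of the idea: allocate with hipMalloc, synchronize the stream once the FA kernel has run, and let the destructor hand the memory straight back to the driver. The struct and function names here are illustrative, not the PR's actual code:

```cpp
#include <hip/hip_runtime.h>
#include <cstddef>

// Illustrative sketch only; the PR's struct lives in fattn-common.cuh.
struct direct_hip_buffer {
    void * ptr = nullptr;

    explicit direct_hip_buffer(size_t nbytes) {
        (void) hipMalloc(&ptr, nbytes);   // bypasses ggml_cuda_pool_leg entirely
    }
    ~direct_hip_buffer() {
        if (ptr != nullptr) {
            (void) hipFree(ptr);          // VRAM goes back to the driver, not a pool
        }
    }
    direct_hip_buffer(const direct_hip_buffer &) = delete;
    direct_hip_buffer & operator=(const direct_hip_buffer &) = delete;
};

// Usage pattern (run_fattn_kernel is hypothetical): the sync guarantees the
// kernel has finished before the destructors free the buffers.
//   direct_hip_buffer K_f16(nbytes_k), V_f16(nbytes_v);
//   run_fattn_kernel(..., K_f16.ptr, V_f16.ptr, stream);
//   hipStreamSynchronize(stream);
//   // K_f16 / V_f16 destructors run here -> hipFree
```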
Root Cause
ggml_cuda_pool_leg::free() stores freed buffers in buffer_pool[] for reuse and never calls cudaFree. On CUDA with VMM, the OS can reclaim unused virtual memory. On HIP without VMM (VMM: no on RX 7900 XT, RX 9060 XT, RX 9070 XT), the pool permanently consumes peak VRAM. The f16 temp buffers allocated during flash attention for quantized KV dequant grow proportional to full KV cache length and persist at peak size.
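A simplified paraphrase of that free path (the real ggml_cuda_pool_leg in ggml-cuda.cu has more bookkeeping; the slot count and names here are illustrative):

```cpp
#include <cstddef>

struct cached_buffer { void * ptr; size_t size; };
static cached_buffer buffer_pool[256] = {};   // illustrative slot count

static void pool_free(void * ptr, size_t size) {
    for (auto & b : buffer_pool) {
        if (b.ptr == nullptr) {
            // Buffer is parked for reuse rather than handed back to the driver,
            // so a one-off peak allocation keeps its VRAM for the process lifetime.
            b.ptr  = ptr;
            b.size = size;
            return;
        }
    }
    // Only if every slot were occupied would the buffer actually be released.
}
```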
Problem Reports

Prior art: stragulus (abf395d) implemented bounded temp buffers in fattn-common.cuh, but this caused a 30% decode regression on HIP/RDNA3 due to compiler codegen pollution affecting the VEC kernel. domvox reverted it (a13c3db12).

Test Results (RX 9070 XT, gfx1201, 16GB, Windows 11, HIP SDK 7.1)
Model: Qwen2.5-1.5B-Instruct Q8_0 (1.76 GiB)
Decode Performance (unchanged across all builds)
Prefill Performance
Pool bypass recovers prefill from -65% (VEC force) to -5% (hipStreamSync overhead).
Functional Verification
llama-cli with q8_0/q8_0 KV: correct output, coherent generation.

OOM Verification
Pending larger model testing (7B+ at 64K+). The 1.5B model does not stress 16GB VRAM enough to trigger OOM at 32K.
Known Limitations
One hipStreamSynchronize per FA call in commit 2; can be eliminated with hipFreeAsync (stream-ordered free, available since ROCm 5.4).

Future Improvements
Replace hipStreamSynchronize + hipFree with hipFreeAsync and set the HIP mempool release threshold to 0. This would make the fix fully async with zero pipeline stall.
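A sketch of what the stream-ordered version could look like, assuming ROCm >= 5.4 (function names and placement here are illustrative):

```cpp
#include <hip/hip_runtime.h>
#include <cstddef>
#include <cstdint>

// Allocate and free the f16 temp buffer in stream order: no explicit
// hipStreamSynchronize is needed before the kernel's memory is reclaimed.
static void * alloc_temp_async(size_t nbytes, hipStream_t stream) {
    void * ptr = nullptr;
    (void) hipMallocAsync(&ptr, nbytes, stream);
    return ptr;
}

static void free_temp_async(void * ptr, hipStream_t stream) {
    (void) hipFreeAsync(ptr, stream);
}

// Release threshold 0 tells the driver to return freed blocks to the OS at the
// next synchronization point instead of caching them in the mempool.
static void configure_mempool(int device) {
    hipMemPool_t pool = nullptr;
    (void) hipDeviceGetDefaultMemPool(&pool, device);
    uint64_t threshold = 0;
    (void) hipMemPoolSetAttribute(pool, hipMemPoolAttrReleaseThreshold, &threshold);
}
```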
Community Testing Request
If you have an AMD GPU (RX 7900 series, RX 9000 series) and can run a 7B+ model at 64K+ context, please test:
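For example, a pair of runs along the lines of those used earlier in this thread (the model path is a placeholder; substitute your own):

```
llama-bench -m <model.gguf> -ngl 99 -fa 1 -ctk f16 -ctv f16 -p 512 -n 128 -d 0,32768,65536 -r 1
llama-bench -m <model.gguf> -ngl 99 -fa 1 -ctk q8_0 -ctv turbo4 -p 512 -n 128 -d 0,32768,65536 -r 1
```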
Refs: ggml-org/llama.cpp#20969