server: fix prompt cache reuse for hybrid/recurrent models #21099
hanxiao wants to merge 1 commit into ggml-org:master
Conversation
Fix KV cache reuse for hybrid architectures (Qwen3.5, Falcon-H1, Jamba, etc.) where recurrent layers cause `pos_min` to always equal the sequence length, incorrectly triggering full prompt re-processing on every turn.

Four changes, all gated behind `llama_model_is_recurrent()`:

1. Set `pos_min_thold = 0` for recurrent models (SWA threshold not meaningful)
2. Checkpoint search uses position-matching semantics instead of the SWA window
3. The `do_reset` path preserves `n_past` instead of zeroing it
4. The checkpoint erasure log no longer references `n_swa`

Pure transformer paths unchanged.

Tested on Qwen3.5-35B-A3B with 250K context on NVIDIA L4:

- Cold prefill: 408s (240K tokens)
- Subsequent queries (same prefix, different suffix): 2.6s average
- 157x speedup, 8/8 needle-in-haystack retrievals correct

Fixes ggml-org#20225
Hi @hanxiao, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Would be extremely helpful to get this in; long context becomes a bit of a pain.
@nssatlantis Please provide a minimal reproducible example of what you call "a bit of a pain". This pull request claims to fix the already closed issue #20225. However, that issue describes a bug that does not actually exist, because in that case the cache was correctly invalidated by OpenCode mutating the prefix. So if there is still a bug, please open a new issue with an example with which the bug can be reproduced and against which a potential fix can be verified as working or not. Note also that some parts of the proposed code change in this pull request seem odd. For example, it removes still-relevant information from being logged rather than extending it with the newly introduced
From my quick perspective, it seemed like something that would definitely help the cache not invalidate itself; however, I haven't read through the code, just what it supposedly would fix. I also learned that the cache does work to a greater extent, and half of the cache invalidations were due to injecting dynamic text into the prompt at the top rather than the bottom. So in short, I hoped/thought this would help, but as is often the case... user error :)
Problem
Hybrid/recurrent models (Qwen3.5, Falcon-H1, Jamba, Nemotron, etc.) trigger full prompt re-processing at long context lengths, even when the prompt prefix is unchanged. This makes memory-store use cases (large static context + small varying query) impractical.
Observed behavior with upstream Docker image
Tested with `ghcr.io/ggml-org/llama.cpp:server-cuda` (latest, 2026-03-27 build, includes #16382 + #19045).

At 258K tokens, the server log shows all 32 checkpoints being erased and full re-processing:
The bug is scale-dependent: cache reuse works at shorter contexts but breaks at longer ones because the checkpoint position thresholds grow beyond any checkpoint's range.
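To make that concrete (illustrative arithmetic, using the 258K run above): with `n_swa == 0`, the threshold described in the next section is `pos_min_thold = max(0, 258000 - 0) = 258000`, and no stored checkpoint's range extends to position 258000, so the checkpoint search cannot succeed; at a few thousand tokens, the prompt-cache similarity fallback still finds a match first and hides the failure.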
Root cause
For recurrent layers, `llama_memory_seq_pos_min()` returns the full sequence length (not the SWA window position). The existing threshold logic `pos_min_thold = max(0, pos_next - n_swa)` evaluates to `pos_next` when `n_swa == 0` (Qwen3.5 has no SWA), so:

- `pos_min >= pos_min_thold` is always true → enters the checkpoint search
- `cur.pos_min < pos_min_thold` finds no match (the threshold is too large at long contexts)
- `do_reset` zeros `n_past` → full re-processing
- checkpoints with `pos_max > pos_next` get erased

At shorter contexts, the prompt cache similarity matching (`sim_best`) finds a usable cached prompt before the checkpoint logic triggers, masking the bug. At 250K+, this fallback fails.
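Here is a minimal, self-contained model of the failing path as described above. It is a sketch, not the actual `server.cpp` code: the `checkpoint` struct, the example positions, and the exact match condition (a checkpoint range covering the threshold) are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// simplified stand-in for the server's context checkpoints (assumed shape)
struct checkpoint {
    int pos_min;
    int pos_max;
};

int main() {
    const int n_swa    = 0;       // Qwen3.5: no sliding-window attention
    const int pos_next = 258000;  // next position after the cached prefix
    const int pos_min  = 258000;  // recurrent layers report the full sequence length

    // existing threshold logic: with n_swa == 0 this equals pos_next
    const int pos_min_thold = std::max(0, pos_next - n_swa);

    // checkpoints recorded earlier in the sequence (illustrative positions)
    const std::vector<checkpoint> checkpoints = {
        { 0, 65536 }, { 0, 131072 }, { 0, 250000 },
    };

    int  n_past   = pos_next;
    bool do_reset = true;

    if (pos_min >= pos_min_thold) { // always true when n_swa == 0
        for (const auto & cur : checkpoints) {
            // assumed match condition: the checkpoint range must cover the
            // threshold; at 258K no pos_max reaches pos_min_thold, so the
            // search finds nothing
            if (cur.pos_min < pos_min_thold && pos_min_thold <= cur.pos_max) {
                do_reset = false;
                break;
            }
        }
    }

    if (do_reset) {
        n_past = 0; // full re-processing of the 258K-token prompt
    }

    std::printf("n_past = %d\n", n_past); // prints 0: the cache is discarded
    return 0;
}
```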
Fix

Four changes, all gated behind `llama_model_is_recurrent()` (a composed sketch follows the list):

1. Set `pos_min_thold = 0` for recurrent models (the SWA threshold is not meaningful for recurrent state)
2. Checkpoint search uses position matching (`cur.pos_max <= n_past`) instead of the SWA window logic
3. The `do_reset` path preserves `n_past` instead of zeroing it (avoids discarding valid cache)
4. The checkpoint erasure log no longer references `n_swa`

Pure transformer / SWA paths are completely unchanged.
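A rough composition of the four changes, under the same assumptions as the previous sketch: `is_recurrent` stands in for `llama_model_is_recurrent(model)`, `checkpoint` and the transformer-side match condition are illustrative, and this is not the actual patch.

```cpp
#include <algorithm>
#include <vector>

struct checkpoint {
    int pos_min;
    int pos_max;
};

// returns the n_past to continue decoding from; an illustrative
// composition of the gated changes, not the real server code
int reuse_n_past(bool is_recurrent, int pos_next, int n_swa, int pos_min,
                 const std::vector<checkpoint> & checkpoints) {
    // change 1: the SWA-derived threshold is meaningless for recurrent
    // state, so it is forced to 0 for recurrent/hybrid models
    const int pos_min_thold = is_recurrent ? 0 : std::max(0, pos_next - n_swa);

    int n_past = pos_next;

    if (pos_min >= pos_min_thold) {
        // search newest-first so the most advanced usable checkpoint wins
        for (auto it = checkpoints.rbegin(); it != checkpoints.rend(); ++it) {
            const auto & cur = *it;
            // change 2: recurrent models match on position
            // (cur.pos_max <= n_past) instead of the SWA window
            const bool match = is_recurrent
                ? (cur.pos_max <= n_past)
                : (cur.pos_min < pos_min_thold && pos_min_thold <= cur.pos_max);
            if (match) {
                // resume from the checkpoint instead of re-processing
                return std::min(n_past, cur.pos_max);
            }
        }
        // change 3: the recurrent reset path preserves n_past; only the
        // transformer path falls back to full re-processing
        if (!is_recurrent) {
            n_past = 0;
        }
        // change 4 (not modeled here): the checkpoint-erasure log message
        // no longer references n_swa
    }

    return n_past;
}
```

The behavioral difference shows up in the no-match case: the recurrent path keeps `n_past`, so valid recurrent state is not discarded, while the pure transformer path behaves exactly as before.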
Testing
Tested on Qwen3.5-35B-A3B (30x GDN + 10x Gated Attention) with a 250K token context on NVIDIA L4 (24GB):

- Cold prefill: 408s (240K tokens)
- Subsequent queries (same prefix, different suffix): 2.6s average (157x speedup)
8 different queries were tested against the 250K static context (needle-in-a-haystack at various depths); all were answered correctly in 2.4-2.8s after the initial prefill.
Server log after fix (no reprocessing warning):
Also verified at 100K context: 115s cold → 1.7s warm (68x speedup).
Related issues
Fixes #20225
Related: #19858, #19690, #19794, #18497