
server: fix prompt cache reuse for hybrid/recurrent models#21099

Closed
hanxiao wants to merge 1 commit into ggml-org:master from hanxiao:fix/hybrid-recurrent-cache-reuse

Conversation


@hanxiao hanxiao commented Mar 28, 2026

Problem

Hybrid/recurrent models (Qwen3.5, Falcon-H1, Jamba, Nemotron, etc.) trigger full prompt re-processing at long context lengths, even when the prompt prefix is unchanged. This makes memory-store use cases (large static context + small varying query) impractical.

Observed behavior with upstream Docker image

Tested with ghcr.io/ggml-org/llama.cpp:server-cuda (latest, 2026-03-27 build, includes #16382 + #19045):

| Context size | Cold prefill | Cache reuse (Q2) | Speedup | Status |
|---|---|---|---|---|
| 23K tokens | 20.9s | 0.8s | 24.9x | ✅ Works |
| 112K tokens | 136.9s | 1.5s | 91.5x | ✅ Works |
| 258K tokens | 437.6s | 437.6s (full re-process) | 1x | Broken |

At 258K tokens, the server log shows all 32 checkpoints being erased and full re-processing:

slot update_slots: id 0 | task 259 | n_past = 11, pos_min = 258048, n_swa = 1
slot update_slots: id 0 | task 259 | Checking checkpoint with [258036, 258036] against 10...
...  (all 32 checkpoints checked, none match)
slot update_slots: id 0 | task 259 | erased invalidated context checkpoint (pos_min = 225279, ...)
slot update_slots: id 0 | task 259 | erased invalidated context checkpoint (pos_min = 227327, ...)
...  (all 32 checkpoints erased)
slot update_slots: id 0 | task 259 | n_tokens = 0, memory_seq_rm [0, end)

The bug is scale-dependent: cache reuse works at shorter contexts but breaks at longer ones because the checkpoint position thresholds grow beyond any checkpoint's range.


Root cause

For recurrent layers, llama_memory_seq_pos_min() returns the full sequence length (not the SWA window position). The existing threshold logic pos_min_thold = max(0, pos_next - n_swa) evaluates to pos_next when n_swa == 0 (Qwen3.5 has no SWA), so:

  1. pos_min >= pos_min_thold is always true → enters checkpoint search
  2. cur.pos_min < pos_min_thold finds no match (threshold too large at long contexts)
  3. do_reset zeros n_past → full re-processing
  4. All checkpoints with pos_max > pos_next get erased

At shorter contexts, the prompt cache similarity matching (sim_best) finds a usable cached prompt before the checkpoint logic triggers, masking the bug. At 250K+, this fallback fails.
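
As a rough illustration of why the threshold degenerates, here is a minimal arithmetic sketch using the values from the 258K log above; `pos_next` and `n_swa` follow the naming in this description rather than the exact server code:

```cpp
#include <algorithm>
#include <cstdio>

int main() {
    // Values taken from the 258K-token log excerpt above.
    const int pos_next = 258048; // position the server wants to resume from
    const int n_swa    = 0;      // Qwen3.5 has no sliding-window attention

    // Existing threshold logic: with n_swa == 0 this degenerates to pos_next itself.
    const int pos_min_thold = std::max(0, pos_next - n_swa); // == 258048

    // Recurrent layers report pos_min as the full sequence length, so the
    // checkpoint search always runs, yet no stored checkpoint (pos_min around
    // 225279..227327 in the log) passes the match described above, leading to
    // do_reset and full re-processing.
    std::printf("pos_min_thold = %d\n", pos_min_thold);
    return 0;
}
```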

Fix

Four changes, all gated behind llama_model_is_recurrent():

  1. pos_min_thold = 0 for recurrent models (SWA threshold not meaningful for recurrent state)
  2. Checkpoint search uses position-matching semantics (cur.pos_max <= n_past) instead of SWA window logic
  3. do_reset path preserves n_past instead of zeroing it (avoids discarding valid cache)
  4. Log cleanup: checkpoint erasure message no longer references n_swa

Pure transformer / SWA paths are completely unchanged.
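
For readers who want the shape of the change without opening the diff, here is a simplified, self-contained sketch of the gating idea; the struct and helper names are stand-ins for the server's internals, not the actual server.cpp code:

```cpp
#include <algorithm>
#include <cstdint>

using llama_pos = int32_t;

// Stand-in for the server's stored context-checkpoint metadata.
struct ctx_checkpoint {
    llama_pos pos_min;
    llama_pos pos_max;
};

// Change 1: the SWA-based threshold is not meaningful for recurrent state,
// so it is forced to 0; the transformer formula from the root-cause section
// is kept as-is.
static llama_pos compute_pos_min_thold(llama_pos pos_next, int32_t n_swa, bool is_recurrent) {
    return is_recurrent ? 0 : std::max<llama_pos>(0, pos_next - n_swa);
}

// Change 2: for recurrent models a checkpoint is reusable when it does not
// extend past the position we want to continue from (the SWA-window search
// for pure transformers is left untouched and elided here).
static bool checkpoint_usable_recurrent(const ctx_checkpoint & cur, llama_pos n_past) {
    return cur.pos_max <= n_past;
}

// Change 3: on the reset path, n_past is preserved instead of zeroed so the
// valid cached prefix is not discarded.
static llama_pos reset_n_past(llama_pos n_past, bool is_recurrent) {
    return is_recurrent ? n_past : 0;
}

int main() {
    // In the server, this flag would come from llama_model_is_recurrent(model).
    const bool is_recurrent = true;

    // pos_min taken from the log above; pos_max is a hypothetical value for illustration.
    const ctx_checkpoint cp = { /*pos_min=*/225279, /*pos_max=*/230000 };
    const llama_pos n_past  = 258048;

    const llama_pos thold = compute_pos_min_thold(/*pos_next=*/258048, /*n_swa=*/0, is_recurrent); // 0
    const bool reusable   = checkpoint_usable_recurrent(cp, n_past);                               // true
    const llama_pos kept  = reset_n_past(n_past, is_recurrent);                                    // 258048

    return (thold == 0 && reusable && kept == n_past) ? 0 : 1;
}
```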

Testing

Tested on Qwen3.5-35B-A3B (30x GDN + 10x Gated Attention) with 250K token context on NVIDIA L4 (24GB):

| Metric | Before (upstream) | After (this PR) |
|---|---|---|
| Cold prefill (240K tokens) | 408s | 408s |
| Subsequent query (same prefix) | 408s (full re-process) | 2.6s (cache reuse) |
| Speedup | 1x | 157x |
| NIAH retrieval accuracy | 8/8 | 8/8 |

8 different queries tested with 250K static context (needle-in-a-haystack at various depths), all correctly answered in 2.4-2.8s after initial prefill.

Server log after fix (no reprocessing warning):

slot update_slots: id 0 | task 79 | new prompt, n_ctx_slot = 262144
(prompt processing proceeds from cached position, only new tokens processed)

Also verified at 100K context: 115s cold → 1.7s warm (68x speedup).

Related issues

Fixes #20225
Related: #19858, #19690, #19794, #18497

Fix KV cache reuse for hybrid architectures (Qwen3.5, Falcon-H1, Jamba,
etc.) where recurrent layers cause pos_min to always equal the sequence
length, incorrectly triggering full prompt re-processing on every turn.

Four changes, all gated behind llama_model_is_recurrent():

1. Set pos_min_thold = 0 for recurrent models (SWA threshold not meaningful)
2. Checkpoint search uses position-matching semantics instead of SWA window
3. do_reset path preserves n_past instead of zeroing it
4. Checkpoint erasure log no longer references n_swa

Pure transformer paths unchanged.

Tested on Qwen3.5-35B-A3B with 250K context on NVIDIA L4:
- Cold prefill: 408s (240K tokens)
- Subsequent queries (same prefix, different suffix): 2.6s average
- 157x speedup, 8/8 needle-in-haystack retrieval correct

Fixes ggml-org#20225
@hanxiao hanxiao requested a review from a team as a code owner March 28, 2026 04:48
@ggml-gh-bot

ggml-gh-bot Bot commented Mar 28, 2026

Hi @hanxiao, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@ggerganov ggerganov closed this Mar 28, 2026
@nssatlantis

Would be extremely helpful to get in, long-context becomes a bit of a pain.

@howlger
Contributor

howlger commented Mar 31, 2026

> Would be extremely helpful to get in, long-context becomes a bit of a pain.

@nssatlantis Please provide a minimal reproducible example of what you call "a bit of a pain". This pull request claims to fix the already closed issue #20225. However, that issue describes a bug that does not actually exist, because in that case the cache is correctly invalidated by OpenCode mutating the prefix. So if there is still a bug, please open a new issue with an example that reproduces the bug and against which a potential fix can be verified as working or not.

Note also that some parts of the proposed code change in this pull request seem odd. For example, it removes still-relevant information from being logged, rather than extending it with the newly introduced is_recurrent flag.

@nssatlantis

> > Would be extremely helpful to get in, long-context becomes a bit of a pain.
>
> @nssatlantis Please provide a minimal reproducible example of what you call "a bit of a pain". This pull request claims to fix the already closed issue #20225. However, that issue describes a bug that does not actually exist, because in that case the cache is correctly invalidated by OpenCode mutating the prefix. So if there is still a bug, please open a new issue with an example that reproduces the bug and against which a potential fix can be verified as working or not.
>
> Note also that some parts of the proposed code change in this pull request seem odd. For example, it removes still-relevant information from being logged, rather than extending it with the newly introduced is_recurrent flag.

From my quick perspective, it seemed like something that would definitely help the cache not invalidate itself; however, I haven't read through the code, just what it supposedly would fix.

I also learned that the cache works to a greater extent than I assumed, and half of the cache invalidations were due to injecting dynamic text into the prompt at the top rather than the bottom.

So in short, hoped/thought this would help, but as is often the case... User error :)

