
server: fix prompt cache reuse for hybrid/recurrent models#21099

Closed
hanxiao wants to merge 1 commit into ggml-org:master from hanxiao:fix/hybrid-recurrent-cache-reuse

Conversation


@hanxiao hanxiao commented Mar 28, 2026

Problem

Hybrid/recurrent models (Qwen3.5, Falcon-H1, Jamba, Nemotron, etc.) trigger full prompt re-processing at long context lengths, even when the prompt prefix is unchanged. This makes memory-store use cases (large static context + small varying query) impractical.

Observed behavior with upstream Docker image

Tested with ghcr.io/ggml-org/llama.cpp:server-cuda (latest, 2026-03-27 build, includes #16382 + #19045):

| Context size | Cold prefill | Cache reuse (Q2) | Speedup | Status |
|---|---|---|---|---|
| 23K tokens | 20.9s | 0.8s | 24.9x | ✅ Works |
| 112K tokens | 136.9s | 1.5s | 91.5x | ✅ Works |
| 258K tokens | 437.6s | 437.6s (full re-process) | 1x | Broken |

At 258K tokens, the server log shows all 32 checkpoints being erased and full re-processing:

slot update_slots: id 0 | task 259 | n_past = 11, pos_min = 258048, n_swa = 1
slot update_slots: id 0 | task 259 | Checking checkpoint with [258036, 258036] against 10...
...  (all 32 checkpoints checked, none match)
slot update_slots: id 0 | task 259 | erased invalidated context checkpoint (pos_min = 225279, ...)
slot update_slots: id 0 | task 259 | erased invalidated context checkpoint (pos_min = 227327, ...)
...  (all 32 checkpoints erased)
slot update_slots: id 0 | task 259 | n_tokens = 0, memory_seq_rm [0, end)

The bug is scale-dependent: cache reuse works at shorter contexts but breaks at longer ones because the checkpoint position thresholds grow beyond any checkpoint's range.


Root cause

For recurrent layers, llama_memory_seq_pos_min() returns the full sequence length (not the SWA window position). The existing threshold logic pos_min_thold = max(0, pos_next - n_swa) evaluates to pos_next when n_swa == 0 (Qwen3.5 has no SWA), so:

  1. pos_min >= pos_min_thold is always true → enters checkpoint search
  2. cur.pos_min < pos_min_thold finds no match (threshold too large at long contexts)
  3. do_reset zeros n_past → full re-processing
  4. All checkpoints with pos_max > pos_next get erased

At shorter contexts, the prompt cache similarity matching (sim_best) finds a usable cached prompt before the checkpoint logic triggers, masking the bug. At 250K+, this fallback fails.
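
As a rough illustration of why the threshold degenerates, here is a minimal arithmetic sketch using the values from the 258K log above; `pos_next` and `n_swa` follow the naming in this description rather than the exact server code:

```cpp
#include <algorithm>
#include <cstdio>

int main() {
    // Values taken from the 258K-token log excerpt above.
    const int pos_next = 258048; // position the server wants to resume from
    const int n_swa    = 0;      // Qwen3.5 has no sliding-window attention

    // Existing threshold logic: with n_swa == 0 this degenerates to pos_next itself.
    const int pos_min_thold = std::max(0, pos_next - n_swa); // == 258048

    // Recurrent layers report pos_min as the full sequence length, so the
    // checkpoint search always runs, yet no stored checkpoint (pos_min around
    // 225279..227327 in the log) passes the match described above, leading to
    // do_reset and full re-processing.
    std::printf("pos_min_thold = %d\n", pos_min_thold);
    return 0;
}
```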

Fix

Four changes, all gated behind llama_model_is_recurrent():

  1. pos_min_thold = 0 for recurrent models (SWA threshold not meaningful for recurrent state)
  2. Checkpoint search uses position-matching semantics (cur.pos_max <= n_past) instead of SWA window logic
  3. do_reset path preserves n_past instead of zeroing it (avoids discarding valid cache)
  4. Log cleanup: checkpoint erasure message no longer references n_swa

Pure transformer / SWA paths are completely unchanged.
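
For readers who want the shape of the change without opening the diff, here is a simplified, self-contained sketch of the gating idea; the struct and helper names are stand-ins for the server's internals, not the actual server.cpp code:

```cpp
#include <algorithm>
#include <cstdint>

using llama_pos = int32_t;

// Stand-in for the server's stored context-checkpoint metadata.
struct ctx_checkpoint {
    llama_pos pos_min;
    llama_pos pos_max;
};

// Change 1: the SWA-based threshold is not meaningful for recurrent state,
// so it is forced to 0; the transformer formula from the root-cause section
// is kept as-is.
static llama_pos compute_pos_min_thold(llama_pos pos_next, int32_t n_swa, bool is_recurrent) {
    return is_recurrent ? 0 : std::max<llama_pos>(0, pos_next - n_swa);
}

// Change 2: for recurrent models a checkpoint is reusable when it does not
// extend past the position we want to continue from (the SWA-window search
// for pure transformers is left untouched and elided here).
static bool checkpoint_usable_recurrent(const ctx_checkpoint & cur, llama_pos n_past) {
    return cur.pos_max <= n_past;
}

// Change 3: on the reset path, n_past is preserved instead of zeroed so the
// valid cached prefix is not discarded.
static llama_pos reset_n_past(llama_pos n_past, bool is_recurrent) {
    return is_recurrent ? n_past : 0;
}

int main() {
    // In the server, this flag would come from llama_model_is_recurrent(model).
    const bool is_recurrent = true;

    // pos_min taken from the log above; pos_max is a hypothetical value for illustration.
    const ctx_checkpoint cp = { /*pos_min=*/225279, /*pos_max=*/230000 };
    const llama_pos n_past  = 258048;

    const llama_pos thold = compute_pos_min_thold(/*pos_next=*/258048, /*n_swa=*/0, is_recurrent); // 0
    const bool reusable   = checkpoint_usable_recurrent(cp, n_past);                               // true
    const llama_pos kept  = reset_n_past(n_past, is_recurrent);                                    // 258048

    return (thold == 0 && reusable && kept == n_past) ? 0 : 1;
}
```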

Testing

Tested on Qwen3.5-35B-A3B (30x GDN + 10x Gated Attention) with 250K token context on NVIDIA L4 (24GB):

| Metric | Before (upstream) | After (this PR) |
|---|---|---|
| Cold prefill (240K tokens) | 408s | 408s |
| Subsequent query (same prefix) | 408s (full re-process) | 2.6s (cache reuse) |
| Speedup | 1x | 157x |
| NIAH retrieval accuracy | 8/8 | 8/8 |

8 different queries tested with 250K static context (needle-in-a-haystack at various depths), all correctly answered in 2.4-2.8s after initial prefill.

Server log after fix (no reprocessing warning):

slot update_slots: id 0 | task 79 | new prompt, n_ctx_slot = 262144
(prompt processing proceeds from cached position, only new tokens processed)

Also verified at 100K context: 115s cold → 1.7s warm (68x speedup).

Related issues

Fixes #20225
Related: #19858, #19690, #19794, #18497

Fix KV cache reuse for hybrid architectures (Qwen3.5, Falcon-H1, Jamba,
etc.) where recurrent layers cause pos_min to always equal the sequence
length, incorrectly triggering full prompt re-processing on every turn.

Four changes, all gated behind llama_model_is_recurrent():

1. Set pos_min_thold = 0 for recurrent models (SWA threshold not meaningful)
2. Checkpoint search uses position-matching semantics instead of SWA window
3. do_reset path preserves n_past instead of zeroing it
4. Checkpoint erasure log no longer references n_swa

Pure transformer paths unchanged.

Tested on Qwen3.5-35B-A3B with 250K context on NVIDIA L4:
- Cold prefill: 408s (240K tokens)
- Subsequent queries (same prefix, different suffix): 2.6s average
- 157x speedup, 8/8 needle-in-haystack retrieval correct

Fixes ggml-org#20225
@hanxiao hanxiao requested a review from a team as a code owner March 28, 2026 04:48
@ggml-gh-bot

ggml-gh-bot Bot commented Mar 28, 2026

Hi @hanxiao, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@ggerganov ggerganov closed this Mar 28, 2026
@nssatlantis

Would be extremely helpful to get in, long-context becomes a bit of a pain.

@howlger
Contributor

howlger commented Mar 31, 2026

> Would be extremely helpful to get in, long-context becomes a bit of a pain.

@nssatlantis Please provide a minimal reproducible example of what you call "a bit of a pain". This pull request claims to fix the already closed issue #20225. However, that issue describes a bug that does not actually exist, because in that case the cache is correctly invalidated by OpenCode mutating the prefix. So if there is still a bug, please open a new issue with an example that reproduces the bug and against which a potential fix can be verified as working or not.

Note also that some parts of the proposed code change in this pull request seem odd. For example, it removes still-relevant information from being logged, rather than extending it with the newly introduced is_recurrent flag.

@nssatlantis

> > Would be extremely helpful to get in, long-context becomes a bit of a pain.
>
> @nssatlantis Please provide a minimal reproducible example of what you call "a bit of a pain". This pull request claims to fix the already closed issue #20225. However, that issue describes a bug that does not actually exist, because in that case the cache is correctly invalidated by OpenCode mutating the prefix. So if there is still a bug, please open a new issue with an example that reproduces the bug and against which a potential fix can be verified as working or not.
>
> Note also that some parts of the proposed code change in this pull request seem odd. For example, it removes still-relevant information from being logged, rather than extending it with the newly introduced is_recurrent flag.

From my quick perspective, it seemed like something that would definitely help the cache not invalidate itself; however, I haven't read through the code, just what it supposedly would fix.

I also learned that the cache works to a greater extent than I assumed, and half of the cache invalidations were due to injecting dynamic text into the prompt at the top rather than the bottom.

So in short, hoped/thought this would help, but as is often the case... User error :)

