
server: prevent destructive memory wipe on seq_rm failure for hybrid models #22534

Closed

BlisteringViola wants to merge 1 commit into ggml-org:master from BlisteringViola:fix/hybrid-recurrent-seq-rm-wipe

Conversation

@BlisteringViola

Problem

For hybrid/recurrent architectures (e.g. qwen35moe with Gated DeltaNet layers), llama_memory_seq_rm() always returns false for partial range removal — the recurrent state cannot be partially truncated. When this happens, the server unconditionally wipes ALL cached state:

if (!llama_memory_seq_rm(llama_get_memory(ctx), slot.id, p0, -1)) {
    slot.prompt_clear(true);
    slot.n_prompt_tokens_cache = 0;
}

This destroys any checkpoint that was just restored by the checkpoint system, making --ctx-checkpoints and --checkpoint-every-n-tokens completely useless for hybrid models. Every request forces full prompt re-processing from scratch, regardless of how much context is shared with the previous request.

Related issues: #18497, #19690, #19794, #19858, #19977, #20003, #20153, #20225, #21383, #21831
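
The failure is structural rather than a bug in seq_rm. A minimal illustrative sketch of the constraint (not the actual llama.cpp implementation; the type and helper names below are hypothetical): a recurrent layer keeps one rolling state per sequence instead of one entry per token position, so a request to drop only positions >= p0 cannot be honored without rewinding state that has already folded those positions in.

#include <cstdint>

// Hypothetical stand-in for a per-sequence recurrent state (e.g. DeltaNet/SSM buffers).
// There is one rolling state per sequence, not one entry per token position.
struct recurrent_seq_state {
    int32_t pos_max = -1; // highest position already folded into the state
    // ... opaque layer buffers ...
};

// Illustrative helper: removing a partial range [p0, p1) would require rewinding the
// rolling state to p0, which is impossible without a saved snapshot, so it fails.
static bool recurrent_seq_rm(recurrent_seq_state & st, int32_t p0, int32_t p1) {
    const bool whole_seq = (p0 <= 0) && (p1 < 0 || p1 > st.pos_max);
    if (!whole_seq) {
        return false; // matches the server call with p0 > 0, p1 = -1
    }
    st = recurrent_seq_state{}; // full removal: just reset the state
    return true;
}

This is why the only safe recoveries are a full wipe or a restore from a checkpoint taken at or before p0, which is what the fix below relies on.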

Fix

Add a guard: if this is a hybrid model (ctx_seq_rm_type == COMMON_CONTEXT_SEQ_RM_TYPE_FULL) and there are cached tokens from a successfully restored checkpoint (n_prompt_tokens_cache > 0), skip the destructive wipe. The checkpoint system already ensures state consistency.

if (!llama_memory_seq_rm(llama_get_memory(ctx), slot.id, p0, -1)) {
    if (slot.ctx_seq_rm_type == COMMON_CONTEXT_SEQ_RM_TYPE_FULL && slot.n_prompt_tokens_cache > 0) {
        SLT_INF(slot, "seq_rm failed (expected for hybrid) - keeping %d cached tokens from checkpoint\n",
            slot.n_prompt_tokens_cache);
    } else {
        SLT_WRN(slot, "failed to truncate tokens with position >= %d - clearing the memory\n", p0);
        slot.prompt_clear(true);
        slot.n_prompt_tokens_cache = 0;
    }
}

The existing checkpoint search predicate (cur.pos_min < pos_min_thold) already correctly handles hybrid models — it only restores checkpoints that fall within the common prefix between old and new prompts. When no valid checkpoint exists (e.g. completely unrelated prompts), the do_reset path fires and performs a clean full re-processing. No changes are needed to the checkpoint search logic.
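
For reference, a minimal sketch of that selection rule, assuming a hypothetical ctx_checkpoint record for whatever the server stores per checkpoint; only the cur.pos_min < pos_min_thold predicate is taken from the code under discussion.

#include <cstdint>
#include <vector>

// Hypothetical checkpoint record: a serialized sequence state plus the position range
// it covered when it was saved. Field names are assumptions, not server internals.
struct ctx_checkpoint {
    int32_t pos_min;
    int32_t pos_max;
    std::vector<uint8_t> data; // blob from the llama_state_seq_* API
};

// Pick the most recent checkpoint that still lies inside the common prefix of the old
// and new prompt (pos_min_thold). Returning nullptr corresponds to the do_reset path.
static const ctx_checkpoint * pick_checkpoint(const std::vector<ctx_checkpoint> & checkpoints,
                                              int32_t pos_min_thold) {
    const ctx_checkpoint * best = nullptr;
    for (const auto & cur : checkpoints) {
        if (cur.pos_min < pos_min_thold && (!best || cur.pos_min > best->pos_min)) {
            best = &cur;
        }
    }
    return best;
}

A restored checkpoint is what leaves n_prompt_tokens_cache > 0, which is exactly the condition the new guard keys on.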

How it works

Multi-turn conversation (same chat session):

  1. Turn N processes prompt, creates checkpoint at end of prompt processing (see the save sketch after this list)
  2. Turn N+1 arrives — shares Turn N's entire prompt as common prefix
  3. Checkpoint from Turn N is within the common prefix → restored
  4. seq_rm fails (expected for hybrid) → fix preserves the cached state
  5. Server processes only the new tokens from Turn N+1
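
A rough sketch of the checkpoint save in step 1, reusing the hypothetical ctx_checkpoint record from the earlier sketch and the public llama_state_seq_* calls from llama.h; the policy itself (how often to snapshot, how many snapshots to keep) is an assumption modeled on --checkpoint-every-n-tokens and --ctx-checkpoints.

#include <cstdint>
#include <vector>

#include "llama.h"

// Snapshot the sequence state once enough new tokens have been decoded since the last
// checkpoint, and evict the oldest snapshot when the per-slot limit is exceeded.
static void maybe_save_checkpoint(llama_context * ctx, llama_seq_id seq_id,
                                  int32_t pos_min, int32_t pos_max,
                                  int32_t every_n, size_t max_checkpoints,
                                  std::vector<ctx_checkpoint> & checkpoints) {
    const int32_t last = checkpoints.empty() ? -1 : checkpoints.back().pos_max;
    if (pos_max - last < every_n) {
        return; // too few new tokens since the previous checkpoint (cf. Turns 1-2 below)
    }

    ctx_checkpoint cp;
    cp.pos_min = pos_min;
    cp.pos_max = pos_max;
    cp.data.resize(llama_state_seq_get_size(ctx, seq_id));
    llama_state_seq_get_data(ctx, cp.data.data(), cp.data.size(), seq_id);
    checkpoints.push_back(std::move(cp));

    if (checkpoints.size() > max_checkpoints) {
        checkpoints.erase(checkpoints.begin()); // drop the oldest snapshot
    }
}

The restore in step 3 would presumably go through the matching llama_state_seq_set_data() call before decoding only the new suffix.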

Independent requests (different prompts):

  1. New prompt shares only a small prefix with cached prompt (e.g. system prompt)
  2. No checkpoint exists within the short common prefix → do_reset fires (sketched after this list)
  3. pos_next = 0, n_past = 0 → full reprocessing from clean state
  4. No checkpoint to preserve, so the fix's guard (n_prompt_tokens_cache > 0) falls through to the original wipe behavior
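
Sketched as control flow, using only the names from the description above (do_reset, pos_next, n_past, n_prompt_tokens_cache); this mirrors the described behavior rather than quoting the actual server code.

if (do_reset) {
    // full-sequence removal succeeds even for hybrid models, since no partial
    // truncation of the recurrent state is required
    llama_memory_seq_rm(llama_get_memory(ctx), slot.id, -1, -1);

    slot.n_prompt_tokens_cache = 0; // nothing cached, so the new guard cannot trigger
    pos_next = 0;                   // decoding restarts from position 0
    n_past   = 0;                   // no reusable prefix
}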

Affected architectures

Any model where llama_memory_seq_rm() returns false for partial range removal:

  • qwen35moe (Qwen3.5/3.6 MoE with Gated DeltaNet)
  • qwen3next (Qwen3-Next)
  • jamba
  • falcon-h1

Pure transformer models (qwen3moe, llama, gemma, etc.) are unaffected — their seq_rm succeeds normally and the new code path is never reached.
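
For illustration, one plausible way ctx_seq_rm_type could be derived per slot: llama_model_is_recurrent() is a real llama.h call, but whether it covers every hybrid architecture listed above, and how the server actually populates the field, are assumptions here. The enum below merely mirrors the COMMON_CONTEXT_SEQ_RM_TYPE_FULL value referenced in the fix.

#include "llama.h"

// Mirror of the value referenced in the fix; the actual definition lives in the
// server/common code and may differ.
enum common_context_seq_rm_type {
    COMMON_CONTEXT_SEQ_RM_TYPE_PARTIAL, // per-token KV cache: any position range can be removed
    COMMON_CONTEXT_SEQ_RM_TYPE_FULL,    // rolling recurrent state: only full removal succeeds
};

// Hypothetical classification helper: recurrent (and hybrid-recurrent) models cannot
// honor partial seq_rm, pure transformers can.
static common_context_seq_rm_type classify_seq_rm(const llama_model * model) {
    return llama_model_is_recurrent(model) ? COMMON_CONTEXT_SEQ_RM_TYPE_FULL
                                           : COMMON_CONTEXT_SEQ_RM_TYPE_PARTIAL;
}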

Test results

Tested with Qwen3.6-35B-A3B (qwen35moe, Q8_0, CPU inference) with --ctx-checkpoints 256 --checkpoint-every-n-tokens 512:

5-turn multi-turn conversation:

  Turn   Prompt tokens   Cached   Cache rate
  1      57              0        0%
  2      192             0        0%
  3      305             188      61.6%
  4      443             301      67.9%
  5      603             439      72.8%

All facts established in earlier turns were recalled correctly. Turn 2 shows 0% because no checkpoint exists yet from Turn 1's short prompt (< checkpoint-every-n-tokens).

Independent single-turn requests (4 unrelated prompts): All returned HTTP 200 with correct responses. No M-RoPE errors, no decode failures. Each request cleanly triggered do_reset → full reprocessing.

Requirements

Requires --ctx-checkpoints N and --checkpoint-every-n-tokens M to have any effect; without checkpoints, there is nothing to preserve.

…models

When llama_memory_seq_rm() fails for hybrid/recurrent architectures
(e.g. qwen35moe with Gated DeltaNet layers), the server currently
wipes ALL cached state and forces full prompt re-processing from
scratch. This destroys any checkpoint that was just restored, making
the checkpoint system useless for these models.

The fix adds a guard: if this is a hybrid model (ctx_seq_rm_type ==
COMMON_CONTEXT_SEQ_RM_TYPE_FULL) and there are cached tokens from a
restored checkpoint, skip the destructive wipe. The checkpoint system
already ensures state consistency — the wipe is counterproductive.

The existing checkpoint search predicate (cur.pos_min < pos_min_thold)
correctly handles hybrid models by only restoring checkpoints that
fall within the common prefix between old and new prompts. When no
valid checkpoint exists (e.g. completely unrelated prompts), the
do_reset path fires and performs a clean full re-processing.

Affected architectures: qwen35moe (Qwen3.5/3.6 MoE), qwen3next,
jamba, falcon-h1 — any model where llama_memory_seq_rm() returns
false for partial range removal. Pure transformer models (qwen3moe,
llama, etc.) are unaffected as their seq_rm succeeds normally.

Requires --ctx-checkpoints N and --checkpoint-every-n-tokens M to
have any effect (without checkpoints, there is nothing to preserve).

Tested with Qwen3.6-35B-A3B (qwen35moe) over 5-turn conversations:
  Turn 1: 0% cached (cold start)
  Turn 2: 0% cached (no checkpoint from turn 1 yet)
  Turn 3: 61.6% cached (188/305)
  Turn 4: 67.9% cached (301/443)
  Turn 5: 72.8% cached (439/603)
All facts recalled correctly across turns. Independent single-turn
requests with unrelated prompts also work correctly (clean do_reset,
no errors).

ggml-gh-bot Bot commented Apr 30, 2026

Hi @BlisteringViola, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
