
server: prevent destructive memory wipe on seq_rm failure for hybrid models #22534

Closed

BlisteringViola wants to merge 1 commit into ggml-org:master from BlisteringViola:fix/hybrid-recurrent-seq-rm-wipe

Conversation

@BlisteringViola

Problem

For hybrid/recurrent architectures (e.g. qwen35moe with Gated DeltaNet layers), llama_memory_seq_rm() always returns false for partial range removal — the recurrent state cannot be partially truncated. When this happens, the server unconditionally wipes ALL cached state:

if (!llama_memory_seq_rm(llama_get_memory(ctx), slot.id, p0, -1)) {
    slot.prompt_clear(true);
    slot.n_prompt_tokens_cache = 0;
}

This destroys any checkpoint that was just restored by the checkpoint system, making --ctx-checkpoints and --checkpoint-every-n-tokens completely useless for hybrid models. Every request forces full prompt re-processing from scratch, regardless of how much context is shared with the previous request.

Related issues: #18497, #19690, #19794, #19858, #19977, #20003, #20153, #20225, #21383, #21831
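
The failure is structural rather than a bug in seq_rm. A minimal illustrative sketch of the constraint (not the actual llama.cpp implementation; the type and helper names below are hypothetical): a recurrent layer keeps one rolling state per sequence instead of one entry per token position, so a request to drop only positions >= p0 cannot be honored without rewinding state that has already folded those positions in.

#include <cstdint>

// Hypothetical stand-in for a per-sequence recurrent state (e.g. DeltaNet/SSM buffers).
// There is one rolling state per sequence, not one entry per token position.
struct recurrent_seq_state {
    int32_t pos_max = -1; // highest position already folded into the state
    // ... opaque layer buffers ...
};

// Illustrative helper: removing a partial range [p0, p1) would require rewinding the
// rolling state to p0, which is impossible without a saved snapshot, so it fails.
static bool recurrent_seq_rm(recurrent_seq_state & st, int32_t p0, int32_t p1) {
    const bool whole_seq = (p0 <= 0) && (p1 < 0 || p1 > st.pos_max);
    if (!whole_seq) {
        return false; // matches the server call with p0 > 0, p1 = -1
    }
    st = recurrent_seq_state{}; // full removal: just reset the state
    return true;
}

This is why the only safe recoveries are a full wipe or a restore from a checkpoint taken at or before p0, which is what the fix below relies on.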

Fix

Add a guard: if this is a hybrid model (ctx_seq_rm_type == COMMON_CONTEXT_SEQ_RM_TYPE_FULL) and there are cached tokens from a successfully restored checkpoint (n_prompt_tokens_cache > 0), skip the destructive wipe. The checkpoint system already ensures state consistency.

if (!llama_memory_seq_rm(llama_get_memory(ctx), slot.id, p0, -1)) {
    if (slot.ctx_seq_rm_type == COMMON_CONTEXT_SEQ_RM_TYPE_FULL && slot.n_prompt_tokens_cache > 0) {
        SLT_INF(slot, "seq_rm failed (expected for hybrid) - keeping %d cached tokens from checkpoint\n",
            slot.n_prompt_tokens_cache);
    } else {
        SLT_WRN(slot, "failed to truncate tokens with position >= %d - clearing the memory\n", p0);
        slot.prompt_clear(true);
        slot.n_prompt_tokens_cache = 0;
    }
}

The existing checkpoint search predicate (cur.pos_min < pos_min_thold) already correctly handles hybrid models — it only restores checkpoints that fall within the common prefix between old and new prompts. When no valid checkpoint exists (e.g. completely unrelated prompts), the do_reset path fires and performs a clean full re-processing. No changes are needed to the checkpoint search logic.
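
For reference, a minimal sketch of that selection rule, assuming a hypothetical ctx_checkpoint record for whatever the server stores per checkpoint; only the cur.pos_min < pos_min_thold predicate is taken from the code under discussion.

#include <cstdint>
#include <vector>

// Hypothetical checkpoint record: a serialized sequence state plus the position range
// it covered when it was saved. Field names are assumptions, not server internals.
struct ctx_checkpoint {
    int32_t pos_min;
    int32_t pos_max;
    std::vector<uint8_t> data; // blob from the llama_state_seq_* API
};

// Pick the most recent checkpoint that still lies inside the common prefix of the old
// and new prompt (pos_min_thold). Returning nullptr corresponds to the do_reset path.
static const ctx_checkpoint * pick_checkpoint(const std::vector<ctx_checkpoint> & checkpoints,
                                              int32_t pos_min_thold) {
    const ctx_checkpoint * best = nullptr;
    for (const auto & cur : checkpoints) {
        if (cur.pos_min < pos_min_thold && (!best || cur.pos_min > best->pos_min)) {
            best = &cur;
        }
    }
    return best;
}

A restored checkpoint is what leaves n_prompt_tokens_cache > 0, which is exactly the condition the new guard keys on.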

How it works

Multi-turn conversation (same chat session):

  1. Turn N processes prompt, creates checkpoint at end of prompt processing (see the save sketch after this list)
  2. Turn N+1 arrives — shares Turn N's entire prompt as common prefix
  3. Checkpoint from Turn N is within the common prefix → restored
  4. seq_rm fails (expected for hybrid) → fix preserves the cached state
  5. Server processes only the new tokens from Turn N+1
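
A rough sketch of the checkpoint save in step 1, reusing the hypothetical ctx_checkpoint record from the earlier sketch and the public llama_state_seq_* calls from llama.h; the policy itself (how often to snapshot, how many snapshots to keep) is an assumption modeled on --checkpoint-every-n-tokens and --ctx-checkpoints.

#include <cstdint>
#include <vector>

#include "llama.h"

// Snapshot the sequence state once enough new tokens have been decoded since the last
// checkpoint, and evict the oldest snapshot when the per-slot limit is exceeded.
static void maybe_save_checkpoint(llama_context * ctx, llama_seq_id seq_id,
                                  int32_t pos_min, int32_t pos_max,
                                  int32_t every_n, size_t max_checkpoints,
                                  std::vector<ctx_checkpoint> & checkpoints) {
    const int32_t last = checkpoints.empty() ? -1 : checkpoints.back().pos_max;
    if (pos_max - last < every_n) {
        return; // too few new tokens since the previous checkpoint (cf. Turns 1-2 below)
    }

    ctx_checkpoint cp;
    cp.pos_min = pos_min;
    cp.pos_max = pos_max;
    cp.data.resize(llama_state_seq_get_size(ctx, seq_id));
    llama_state_seq_get_data(ctx, cp.data.data(), cp.data.size(), seq_id);
    checkpoints.push_back(std::move(cp));

    if (checkpoints.size() > max_checkpoints) {
        checkpoints.erase(checkpoints.begin()); // drop the oldest snapshot
    }
}

The restore in step 3 would presumably go through the matching llama_state_seq_set_data() call before decoding only the new suffix.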

Independent requests (different prompts):

  1. New prompt shares only a small prefix with cached prompt (e.g. system prompt)
  2. No checkpoint exists within the short common prefix → do_reset fires (sketched after this list)
  3. pos_next = 0, n_past = 0 → full reprocessing from clean state
  4. No checkpoint to preserve, so the fix's guard (n_prompt_tokens_cache > 0) falls through to the original wipe behavior
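
Sketched as control flow, using only the names from the description above (do_reset, pos_next, n_past, n_prompt_tokens_cache); this mirrors the described behavior rather than quoting the actual server code.

if (do_reset) {
    // full-sequence removal succeeds even for hybrid models, since no partial
    // truncation of the recurrent state is required
    llama_memory_seq_rm(llama_get_memory(ctx), slot.id, -1, -1);

    slot.n_prompt_tokens_cache = 0; // nothing cached, so the new guard cannot trigger
    pos_next = 0;                   // decoding restarts from position 0
    n_past   = 0;                   // no reusable prefix
}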

Affected architectures

Any model where llama_memory_seq_rm() returns false for partial range removal:

  • qwen35moe (Qwen3.5/3.6 MoE with Gated DeltaNet)
  • qwen3next (Qwen3-Next)
  • jamba
  • falcon-h1

Pure transformer models (qwen3moe, llama, gemma, etc.) are unaffected — their seq_rm succeeds normally and the new code path is never reached.
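
For illustration, one plausible way ctx_seq_rm_type could be derived per slot: llama_model_is_recurrent() is a real llama.h call, but whether it covers every hybrid architecture listed above, and how the server actually populates the field, are assumptions here. The enum below merely mirrors the COMMON_CONTEXT_SEQ_RM_TYPE_FULL value referenced in the fix.

#include "llama.h"

// Mirror of the value referenced in the fix; the actual definition lives in the
// server/common code and may differ.
enum common_context_seq_rm_type {
    COMMON_CONTEXT_SEQ_RM_TYPE_PARTIAL, // per-token KV cache: any position range can be removed
    COMMON_CONTEXT_SEQ_RM_TYPE_FULL,    // rolling recurrent state: only full removal succeeds
};

// Hypothetical classification helper: recurrent (and hybrid-recurrent) models cannot
// honor partial seq_rm, pure transformers can.
static common_context_seq_rm_type classify_seq_rm(const llama_model * model) {
    return llama_model_is_recurrent(model) ? COMMON_CONTEXT_SEQ_RM_TYPE_FULL
                                           : COMMON_CONTEXT_SEQ_RM_TYPE_PARTIAL;
}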

Test results

Tested with Qwen3.6-35B-A3B (qwen35moe, Q8_0, CPU inference) with --ctx-checkpoints 256 --checkpoint-every-n-tokens 512:

5-turn multi-turn conversation:

  Turn   Prompt tokens   Cached   Cache rate
  1      57              0        0%
  2      192             0        0%
  3      305             188      61.6%
  4      443             301      67.9%
  5      603             439      72.8%

All facts established in earlier turns were recalled correctly. Turn 2 shows 0% because no checkpoint exists yet from Turn 1's short prompt (< checkpoint-every-n-tokens).

Independent single-turn requests (4 unrelated prompts): All returned HTTP 200 with correct responses. No M-RoPE errors, no decode failures. Each request cleanly triggered do_reset → full reprocessing.

Requirements

Requires --ctx-checkpoints N and --checkpoint-every-n-tokens M to have any effect; without checkpoints, there is nothing to preserve.

…models

When llama_memory_seq_rm() fails for hybrid/recurrent architectures
(e.g. qwen35moe with Gated DeltaNet layers), the server currently
wipes ALL cached state and forces full prompt re-processing from
scratch. This destroys any checkpoint that was just restored, making
the checkpoint system useless for these models.

The fix adds a guard: if this is a hybrid model (ctx_seq_rm_type ==
COMMON_CONTEXT_SEQ_RM_TYPE_FULL) and there are cached tokens from a
restored checkpoint, skip the destructive wipe. The checkpoint system
already ensures state consistency — the wipe is counterproductive.

The existing checkpoint search predicate (cur.pos_min < pos_min_thold)
correctly handles hybrid models by only restoring checkpoints that
fall within the common prefix between old and new prompts. When no
valid checkpoint exists (e.g. completely unrelated prompts), the
do_reset path fires and performs a clean full re-processing.

Affected architectures: qwen35moe (Qwen3.5/3.6 MoE), qwen3next,
jamba, falcon-h1 — any model where llama_memory_seq_rm() returns
false for partial range removal. Pure transformer models (qwen3moe,
llama, etc.) are unaffected as their seq_rm succeeds normally.

Requires --ctx-checkpoints N and --checkpoint-every-n-tokens M to
have any effect (without checkpoints, there is nothing to preserve).

Tested with Qwen3.6-35B-A3B (qwen35moe) over 5-turn conversations:
  Turn 1: 0% cached (cold start)
  Turn 2: 0% cached (no checkpoint from turn 1 yet)
  Turn 3: 61.6% cached (188/305)
  Turn 4: 67.9% cached (301/443)
  Turn 5: 72.8% cached (439/603)
All facts recalled correctly across turns. Independent single-turn
requests with unrelated prompts also work correctly (clean do_reset,
no errors).

ggml-gh-bot Bot commented Apr 30, 2026

Hi @BlisteringViola, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
