server: prevent destructive memory wipe on seq_rm failure for hybrid models#22534
Closed
BlisteringViola wants to merge 1 commit into ggml-org:master from
Conversation
…models

When llama_memory_seq_rm() fails for hybrid/recurrent architectures (e.g. qwen35moe with Gated DeltaNet layers), the server currently wipes ALL cached state and forces full prompt re-processing from scratch. This destroys any checkpoint that was just restored, making the checkpoint system useless for these models.

The fix adds a guard: if this is a hybrid model (ctx_seq_rm_type == COMMON_CONTEXT_SEQ_RM_TYPE_FULL) and there are cached tokens from a restored checkpoint, skip the destructive wipe. The checkpoint system already ensures state consistency — the wipe is counterproductive.

The existing checkpoint search predicate (cur.pos_min < pos_min_thold) correctly handles hybrid models by only restoring checkpoints that fall within the common prefix between the old and new prompts. When no valid checkpoint exists (e.g. completely unrelated prompts), the do_reset path fires and performs a clean full re-processing.

Affected architectures: qwen35moe (Qwen3.5/3.6 MoE), qwen3next, jamba, falcon-h1 — any model where llama_memory_seq_rm() returns false for partial range removal. Pure transformer models (qwen3moe, llama, etc.) are unaffected as their seq_rm succeeds normally.

Requires --ctx-checkpoints N and --checkpoint-every-n-tokens M to have any effect (without checkpoints, there is nothing to preserve).

Tested with Qwen3.6-35B-A3B (qwen35moe) over 5-turn conversations:

Turn 1: 0% cached (cold start)
Turn 2: 0% cached (no checkpoint from turn 1 yet)
Turn 3: 61.6% cached (188/305)
Turn 4: 67.9% cached (301/443)
Turn 5: 72.8% cached (439/603)

All facts recalled correctly across turns. Independent single-turn requests with unrelated prompts also work correctly (clean do_reset, no errors).
Hi @BlisteringViola, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Problem
For hybrid/recurrent architectures (e.g. `qwen35moe` with Gated DeltaNet layers), `llama_memory_seq_rm()` always returns `false` for partial range removal — the recurrent state cannot be partially truncated. When this happens, the server unconditionally wipes ALL cached state. This destroys any checkpoint that was just restored by the checkpoint system, making `--ctx-checkpoints` and `--checkpoint-every-n-tokens` completely useless for hybrid models. Every request forces full prompt re-processing from scratch, regardless of how much context is shared with the previous request.

Related issues: #18497, #19690, #19794, #19858, #19977, #20003, #20153, #20225, #21383, #21831
Fix
Add a guard: if this is a hybrid model (`ctx_seq_rm_type == COMMON_CONTEXT_SEQ_RM_TYPE_FULL`) and there are cached tokens from a successfully restored checkpoint (`n_prompt_tokens_cache > 0`), skip the destructive wipe. The checkpoint system already ensures state consistency.

The existing checkpoint search predicate (`cur.pos_min < pos_min_thold`) already correctly handles hybrid models — it only restores checkpoints that fall within the common prefix between the old and new prompts. When no valid checkpoint exists (e.g. completely unrelated prompts), the `do_reset` path fires and performs a clean full re-processing. No changes are needed to the checkpoint search logic.

How it works
Multi-turn conversation (same chat session):

- `seq_rm` fails (expected for hybrid) → fix preserves the cached state

Independent requests (different prompts):

- `do_reset` fires
- `pos_next = 0, n_past = 0` → full reprocessing from clean state
- no restored checkpoint (`n_prompt_tokens_cache > 0` not met) → falls through to the original wipe behavior

Affected architectures
Any model where `llama_memory_seq_rm()` returns `false` for partial range removal:

- `qwen35moe` (Qwen3.5/3.6 MoE with Gated DeltaNet)
- `qwen3next` (Qwen3-Next)
- `jamba`
- `falcon-h1`

Pure transformer models (`qwen3moe`, `llama`, `gemma`, etc.) are unaffected — their `seq_rm` succeeds normally and the new code path is never reached.

Test results
Tested with Qwen3.6-35B-A3B (`qwen35moe`, Q8_0, CPU inference) with `--ctx-checkpoints 256 --checkpoint-every-n-tokens 512`.

5-turn multi-turn conversation:

| Turn | Cached | Tokens    |
|------|--------|-----------|
| 1    | 0%     | cold start |
| 2    | 0%     | no checkpoint yet |
| 3    | 61.6%  | 188/305   |
| 4    | 67.9%  | 301/443   |
| 5    | 72.8%  | 439/603   |

All facts established in earlier turns recalled correctly. Turn 2 shows 0% because no checkpoint exists yet from Turn 1's short prompt (< `checkpoint-every-n-tokens`).

Independent single-turn requests (4 unrelated prompts): All returned HTTP 200 with correct responses. No M-RoPE errors, no decode failures. Each request cleanly triggered `do_reset` → full reprocessing.

Requirements
`--ctx-checkpoints N` (N > 0) and `--checkpoint-every-n-tokens M` must be set. Without checkpoints, there is nothing to preserve.
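For example, a launch matching the PR's test setup might look like the following (the model path is a placeholder; the flag values 256 and 512 are the ones used in the tests above):

```shell
# Placeholder model path; substitute your own GGUF file.
# --ctx-checkpoints / --checkpoint-every-n-tokens enable the
# checkpoint system that this fix preserves on seq_rm failure.
llama-server \
  -m ./model.gguf \
  --ctx-checkpoints 256 \
  --checkpoint-every-n-tokens 512
```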