From b9989525bbd0a75355751b5a45d1bcc07b21b877 Mon Sep 17 00:00:00 2001
From: roshi
Date: Thu, 30 Apr 2026 12:17:44 +1000
Subject: [PATCH] server: prevent destructive memory wipe on seq_rm failure
 for hybrid models
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When llama_memory_seq_rm() fails for hybrid/recurrent architectures
(e.g. qwen35moe with Gated DeltaNet layers), the server currently wipes
ALL cached state and forces full prompt re-processing from scratch.
This destroys any checkpoint that was just restored, making the
checkpoint system useless for these models.

The fix adds a guard: if this is a hybrid model (ctx_seq_rm_type ==
COMMON_CONTEXT_SEQ_RM_TYPE_FULL) and there are cached tokens from a
restored checkpoint, skip the destructive wipe. The checkpoint system
already ensures state consistency; the wipe is counterproductive.

The existing checkpoint search predicate (cur.pos_min < pos_min_thold)
correctly handles hybrid models by only restoring checkpoints that fall
within the common prefix between the old and new prompts. When no valid
checkpoint exists (e.g. completely unrelated prompts), the do_reset
path fires and performs a clean full re-processing.

Affected architectures: qwen35moe (Qwen3.5/3.6 MoE), qwen3next, jamba,
falcon-h1; in general, any model where llama_memory_seq_rm() returns
false for partial range removal. Pure transformer models (qwen3moe,
llama, etc.) are unaffected, as their seq_rm succeeds normally.

Requires --ctx-checkpoints N and --checkpoint-every-n-tokens M to have
any effect (without checkpoints, there is nothing to preserve).

Tested with Qwen3.6-35B-A3B (qwen35moe) over 5-turn conversations:

  Turn 1:  0%   cached (cold start)
  Turn 2:  0%   cached (no checkpoint from turn 1 yet)
  Turn 3: 61.6% cached (188/305)
  Turn 4: 67.9% cached (301/443)
  Turn 5: 72.8% cached (439/603)

All facts recalled correctly across turns. Independent single-turn
requests with unrelated prompts also work correctly (clean do_reset,
no errors).
---
 tools/server/server-context.cpp | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
index e3822225bdb..e44d72b75ac 100644
--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
@@ -2559,12 +2559,17 @@ struct server_context_impl {
         SLT_INF(slot, "n_tokens = %d, memory_seq_rm [%d, end)\n", slot.prompt.n_tokens(), p0);
 
         if (!llama_memory_seq_rm(llama_get_memory(ctx), slot.id, p0, -1)) {
-            SLT_WRN(slot, "failed to truncate tokens with position >= %d - clearing the memory\n", p0);
-
-            slot.prompt_clear(true);
-
-            // there is no common part left
-            slot.n_prompt_tokens_cache = 0;
+            if (slot.ctx_seq_rm_type == COMMON_CONTEXT_SEQ_RM_TYPE_FULL && slot.n_prompt_tokens_cache > 0) {
+                // hybrid/recurrent: partial seq_rm always fails, but checkpoint restored valid state
+                SLT_INF(slot, "seq_rm failed (expected for hybrid) - keeping %d cached tokens from checkpoint\n", slot.n_prompt_tokens_cache);
+            } else {
+                SLT_WRN(slot, "failed to truncate tokens with position >= %d - clearing the memory\n", p0);
+
+                slot.prompt_clear(true);
+
+                // there is no common part left
+                slot.n_prompt_tokens_cache = 0;
+            }
         }
 
         // If using an alora, there may be uncached tokens that come
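
Usage sketch: the guard only matters when checkpointing is enabled via
the two flags named in the commit message. A hypothetical invocation
(model path and flag values are illustrative, not taken from this
patch):

  llama-server -m Qwen3.6-35B-A3B-Q4_K_M.gguf \
      --ctx-checkpoints 8 --checkpoint-every-n-tokens 256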