From b9989525bbd0a75355751b5a45d1bcc07b21b877 Mon Sep 17 00:00:00 2001
From: roshi
Date: Thu, 30 Apr 2026 12:17:44 +1000
Subject: [PATCH] server: prevent destructive memory wipe on seq_rm failure
 for hybrid models
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When llama_memory_seq_rm() fails for hybrid/recurrent architectures
(e.g. qwen35moe with Gated DeltaNet layers), the server currently wipes
ALL cached state and forces full prompt re-processing from scratch.
This destroys any checkpoint that was just restored, making the
checkpoint system useless for these models.

The fix adds a guard: if this is a hybrid model (ctx_seq_rm_type ==
COMMON_CONTEXT_SEQ_RM_TYPE_FULL) and there are cached tokens from a
restored checkpoint, skip the destructive wipe. The checkpoint system
already ensures state consistency; the wipe is counterproductive.

The existing checkpoint search predicate (cur.pos_min < pos_min_thold)
correctly handles hybrid models by only restoring checkpoints that fall
within the common prefix between the old and new prompts. When no valid
checkpoint exists (e.g. completely unrelated prompts), the do_reset
path fires and performs a clean full re-processing.

Affected architectures: qwen35moe (Qwen3.5/3.6 MoE), qwen3next, jamba,
falcon-h1; in general, any model where llama_memory_seq_rm() returns
false for partial range removal. Pure transformer models (qwen3moe,
llama, etc.) are unaffected, as their seq_rm succeeds normally.

Requires --ctx-checkpoints N and --checkpoint-every-n-tokens M to have
any effect (without checkpoints, there is nothing to preserve).

Tested with Qwen3.6-35B-A3B (qwen35moe) over 5-turn conversations:

  Turn 1:  0%   cached (cold start)
  Turn 2:  0%   cached (no checkpoint from turn 1 yet)
  Turn 3: 61.6% cached (188/305)
  Turn 4: 67.9% cached (301/443)
  Turn 5: 72.8% cached (439/603)

All facts recalled correctly across turns. Independent single-turn
requests with unrelated prompts also work correctly (clean do_reset,
no errors).
---
 tools/server/server-context.cpp | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
index e3822225bdb..e44d72b75ac 100644
--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
@@ -2559,12 +2559,17 @@ struct server_context_impl {
         SLT_INF(slot, "n_tokens = %d, memory_seq_rm [%d, end)\n", slot.prompt.n_tokens(), p0);
 
         if (!llama_memory_seq_rm(llama_get_memory(ctx), slot.id, p0, -1)) {
-            SLT_WRN(slot, "failed to truncate tokens with position >= %d - clearing the memory\n", p0);
-
-            slot.prompt_clear(true);
-
-            // there is no common part left
-            slot.n_prompt_tokens_cache = 0;
+            if (slot.ctx_seq_rm_type == COMMON_CONTEXT_SEQ_RM_TYPE_FULL && slot.n_prompt_tokens_cache > 0) {
+                // hybrid/recurrent: partial seq_rm always fails, but checkpoint restored valid state
+                SLT_INF(slot, "seq_rm failed (expected for hybrid) - keeping %d cached tokens from checkpoint\n", slot.n_prompt_tokens_cache);
+            } else {
+                SLT_WRN(slot, "failed to truncate tokens with position >= %d - clearing the memory\n", p0);
+
+                slot.prompt_clear(true);
+
+                // there is no common part left
+                slot.n_prompt_tokens_cache = 0;
+            }
         }
 
         // If using an alora, there may be uncached tokens that come
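
Usage sketch: the guard only matters when checkpointing is enabled via
the two flags named in the commit message. A hypothetical invocation
(model path and flag values are illustrative, not taken from this
patch):

  llama-server -m Qwen3.6-35B-A3B-Q4_K_M.gguf \
      --ctx-checkpoints 8 --checkpoint-every-n-tokens 256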