
server: Fix multimodal context checkpointing for hybrid/recurrent models#19747

Closed
timkhronos wants to merge 13 commits into ggml-org:master from timkhronos:Checkpoints_With_Vision

Conversation

@timkhronos
Contributor

This PR enables context checkpointing to work with multimodal inputs on hybrid/recurrent architectures (e.g. Qwen3.5).

Previously, checkpointing was hard-disabled when an mmproj was loaded, causing full prompt reprocessing on every turn.
The changes in this PR allow context checkpointing to function normally, with proper handling of processed images in the KV cache.

Tested extensively on Qwen3.5, and context checkpointing now works as expected with multimodal contexts.

This closes #19690.

Comment on lines +2436 to +2440
const llama_pos checkpoint_pos = std::max(it->pos_min, it->pos_max);
llama_memory_seq_rm(llama_get_memory(ctx), slot.id, checkpoint_pos, -1);

slot.prompt_clear(true);
const size_t checkpoint_size = it->data.size();
const size_t n = llama_state_seq_set_data_ext(ctx, it->data.data(), checkpoint_size, slot.id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY);
Member

Could we avoid the changes to libllama by simply calling llama_memory_seq_rm() after restoring the partial (i.e. recurrent in this case) state?

Contributor Author

@timkhronos Feb 20, 2026

I believe the changes to find_slot() in llama-memory-recurrent.cpp are unavoidable, as the issue is at checkpoint creation, not recovery. Reordering the restore/seq_rm in the server would not help because the recurrent cell's position is already wrong in the checkpoint data.

Without the fix, when we try to find last_pos, it always reads the position from the temporal plane, which is the same for all processed image tokens and therefore lower than the real last position.

This later causes M-RoPE constraint violations when we try to restore, and the checkpoint also records the recurrent cell's position incorrectly. The KV cache is properly truncated to the last image token, and the recurrent cell's state (already modified by the processed image) is saved properly as well, but the checkpoint records the cell's position as just before the image, even though the state has already been modified by the image.
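
To make this concrete, here is a toy illustration (standalone C++ with made-up numbers and a made-up plane layout; not llama.cpp code) of why reading only the temporal plane understates the last position of an M-RoPE image:

#include <algorithm>
#include <cstdint>
#include <cstdio>

using llama_pos = int32_t;

int main() {
    // Toy batch: 2 text tokens followed by a 2x2 image (4 KV entries).
    // Plane 0 stands in for the temporal plane; planes 1 and 2 for the
    // height/width planes. All four image entries share temporal pos 2,
    // while the spatial planes extend up to 3.
    const int n_tokens = 6;
    const llama_pos pos[3][6] = {
        { 0, 1, 2, 2, 2, 2 }, // temporal
        { 0, 1, 2, 2, 3, 3 }, // height
        { 0, 1, 2, 3, 2, 3 }, // width
    };

    // Pre-fix behaviour described above: read only the temporal plane.
    const llama_pos last_temporal = pos[0][n_tokens - 1]; // = 2

    // Behaviour this PR argues for: take the max over every plane.
    llama_pos last_any = 0;
    for (int p = 0; p < 3; ++p) {
        for (int t = 0; t < n_tokens; ++t) {
            last_any = std::max(last_any, pos[p][t]);
        }
    }

    // Prints 2 vs 3: the temporal-only read loses the image's true extent,
    // so a checkpoint position recorded from it lands before the end of the image.
    printf("temporal-only last pos = %d, max over planes = %d\n", last_temporal, last_any);
    return 0;
}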

Contributor

Can you give an example of such a prompt?

In reality, an image should never end up at the last token position, because an image is always followed by end-of-image and/or end-of-turn tokens.

If we want to guard against the problematic case where an image is at the end of the prompt (without any text tokens after it), the simple fix is to delete the image.

In other words, we can simplify this whole logic by calling llama_memory_seq_rm() on the image plus one token before it.

Contributor Author

@timkhronos Feb 23, 2026

Could you please clarify which part of the implementation you have a problem with?

I don't quite see how we could achieve a valid fix without modifying the checkpointing logic. It was written for text-only checkpoints and doesn't take into account that multiple M-RoPE compressed tokens can share the same temporal value.

In my previous message, my point wasn't that the last token of a prompt could be an image token, but rather that without changing the logic in memory-recurrent.cpp the checkpoints wouldn't properly track the number of M-RoPE compressed image tokens, causing the KV cache and the recurrent cell to desync after a checkpoint restore. As a band-aid we could truncate the image, though that might still not be a good idea: checkpoints are currently created after prompt processing, i.e. after the recurrent cell has been modified by the image. If we load the checkpoint, truncate the image from the KV cache, and make it reprocess the image, the recurrent cell will have been modified by the image twice. And without the modification, if we ever create a checkpoint after an image has been processed, the checkpoint will be created at the wrong position.

Member

as the issue is at checkpoint creation, not recovery.

I think we have to add logic to not do checkpoints in the middle of an image. This is important because some vision models use non-causal attention and this requires all image tokens for a given image to be processed in a single ubatch (so that each image token can "see" every other image token).

IIUC your proposed solution implicitly assumes causal attention for the image tokens. Although this seems to work for Qwen3.5, it is not completely generic as described in the paragraph above.

Unless I am missing something, if we impose the restriction to not create the checkpoints in the middle of an image, then we won't need the extra changes to libllama. Do you agree?

Contributor

@ngxson Feb 23, 2026

without the modification, if we ever create a checkpoint after an image has been processed, the checkpoint will be created at the wrong position.

Please correct me if I misunderstand: what you are describing is the M-RoPE case where one image takes the same temporal index for all its tokens, for example an image at t=2:

0 1 2 2 2 2 2 4

So what you are saying is that from the perspective of the recurrent layer, it sees:

0 1 2 3 4 5 6 7 ...

Essentially a linearly increasing index, correct?

If that's the case, then can we reuse the same positional tracking system between the 2?

As a band-aid we could truncate the image, though that might still not be a good idea: checkpoints are currently created after prompt processing, i.e. after the recurrent cell has been modified by the image. If we load the checkpoint, truncate the image from the KV cache, and make it reprocess the image, the recurrent cell will have been modified by the image twice.

We should not track the amount of tokens. Instead, we should track the position index.

If we track the position index, the whole image can be viewed as one big blob; it is not allowed to have a checkpoint of a half-image, as @ggerganov explained above.

To mitigate the case where an image spans multiple batches and the user can potentially stop processing mid-way, we can always create one checkpoint right before we process an image.

Contributor Author

@ggerganov You're right that we could add a guard against mid-image checkpoints, even though, as far as I understand, checkpoints are currently only created after prompt processing is done, so there should never be a chance to create a checkpoint mid-image. It would be a problem if that ever changed, so we can add that check.

However, that alone wouldn't fix the issue I previously brought up. The bug is in how the checkpoint position is recorded, and it happens even when the checkpoint is created well after the image has been fully processed, regardless of how many text tokens we have processed afterwards.

Here's a concrete example with some imaginary values:

  • We have 6000 text tokens (positions 0–5999) in a conversation
  • An image gets processed, producing ~1500 vision tokens that get M-RoPE compressed to let's say 60 KV entries (positions 6000–6059)
  • A few hundred text tokens follow after the image (positions 6060–6259)
  • Prompt processing finishes, and we create a checkpoint

The original code in memory-recurrent.cpp records the position of the checkpoint by looking at the temporal value of the last token. For text tokens this works fine, as each text token increments the temporal value by 1. But for M-RoPE image tokens, all tokens from a single image share the same temporal value. So in the example we are at position 6259, yet if the code tries to find the position of the last token, the image block at 6000–6059 reports a temporal value of ~6000 for all 60 entries, causing the checkpoint to be created at position 6200 (the image's temporal position of 6000 plus 200 text tokens; the 60 image tokens share one temporal position and are effectively counted as a single token).

This means the checkpoint records a position that is too low. The recurrent cell's state has been updated by the full image plus 200 text tokens, but the checkpoint metadata says it only covers up to ~6200. On restore, this desync between the recurrent state and the KV cache causes M-RoPE constraint violations plus, in theory, a drift of the recurrent cache of (number of images in context) * (image tokens per image - 1) * (number of checkpoint restores) tokens. The drift gets more severe the more times we restore the checkpoint.

The code needs to look at the max(width, height) values from the M-RoPE position planes too, in order to find the actual span of the image; otherwise the image, regardless of how many tokens it contains, always counts as just 1 token, causing the drift. If video vision is ever implemented, the way I have it set up here should still work correctly, since we look at temporal + width + height.

The above is why I believe the changes to libllama are needed. Without them, images would always report as only a single token to the checkpointing logic, causing increasingly severe drift as the number of restores and images grows.

I agree we should add a mid-image checkpoint guard for safety, even though, unless I am misunderstanding something, the checkpoint-creation logic currently cannot run before prompt processing is fully finished, which prevents that specific scenario. Regardless, if someone were ever to add mid-prompt-processing checkpointing they could run into that issue; I think it would be best addressed in that PR, but I'd be happy to add the guard here if you believe it's best.

But the core problem that requires the libllama changes is that post-image checkpoints record the wrong position, because the temporal plane doesn't reflect the true extent of M-RoPE compressed tokens.

Contributor

The original code in memory-recurrent.cpp records the position of the checkpoint by looking at the temporal value of the last token. For text tokens this works fine, as each text token increments the temporal value by 1. But for M-RoPE image tokens, all tokens from a single image share the same temporal value. So in the example we are at position 6259, yet if the code tries to find the position of the last token, the image block at 6000–6059 reports a temporal value of ~6000 for all 60 entries, causing the checkpoint to be created at position 6200.

I'm pretty sure this problem can be better understood from my question above:

If that's the case, then can we reuse the same positional tracking system between the 2?

Also note that we already somewhat have this logic in the mask construction of M-RoPE:

// M-RoPE causal mask
if (is_2d) {
    if (p0 == p1) {
        const auto & p0_ext = cells.ext_get(j);
        if (p0_ext.is_2d_gt(p1_x, p1_y)) {
            goto skip;
        }
    }
}

So I don't think the current code is acceptable as-is, especially because it assumes the next position is max(x, y). This calculation must not be inside libllama.

Comment on lines +609 to +615
if (ubatch.n_pos > 1 && ubatch.embd != nullptr) {
    for (uint32_t p = 0; p < ubatch.n_pos; ++p) {
        for (uint32_t t = 0; t < n_seq_tokens; ++t) {
            last_pos = std::max(last_pos, ubatch.pos[p * ubatch.n_tokens + i + t]);
        }
    }
}
Contributor

This must not be inside libllama. If you look into the server code, server_tokens::pos_next() already handles this in a more generic way.

Contributor Author

The reason I believe this needs to be in libllama is that find_slot() is where the recurrent cell's position gets written. This happens internally during ubatch processing, and I don't see where the server would get a chance to intercept or correct the value before it gets stored.

Specifically, find_slot() reads ubatch.pos[i + n_seq_tokens - 1] to determine last_pos, which then gets written to the cell. For M-RoPE ubatches, with the current implementation, that value comes from the temporal plane and is too low. server_tokens::pos_next() tracks the next position to assign, but it doesn't fix the fact that the cell already recorded the last position during find_slot().

I'm open to handling this in another way, but I personally can't see how to fix it without modifying memory-recurrent.cpp: as far as I could see, by the time the server could do anything about it, the wrong value has already been stored, so we need to fix it where the value gets stored.

Contributor

@ngxson Feb 23, 2026

So basically what you said is that llama_pos last_pos = ubatch.pos[i + n_seq_tokens - 1]; will return the incorrect position index if the last token in ubatch is an image token.

But in reality, what would such an input look like? Are you talking about the case where the ubatch contains part of the image, or the case where the image is at the end of the prompt?

Contributor Author

The problem isn't specifically about the last token being an image token, or about images being at the end of the prompt. It happens for any ubatch that contains M-RoPE image tokens, regardless of where they fall in the conversation.

When find_slot() processes a ubatch containing image tokens, it sets the cell's last_pos from the temporal plane. For an image with 60 M-RoPE compressed KV entries at positions 6000–6059, all 60 entries share a temporal value of 6000. So after processing that ubatch, the cell records last_pos = 6000 instead of 6059.

Then the next ubatch with text tokens gets processed. The cell thinks it's at position 6000, but the KV cache knows the image occupied slots up to 6059. This drift persists through every subsequent checkpoint save/restore.
Basically let's say:

  1. text tokens 0–5999 -> cell records last_pos = 5999
  2. image tokens, 60 entries, real positions are 6000–6059, but the temporal position for all image tokens is = 6000 thus the cell incorrectly records last_pos as 6000 when it should be 6059
  3. text tokens 6060–6259 -> cell records last_pos as 6200, when it should be 6259

The drift of 59 tokens is always present from step 2 onward. It's not an edge case; it happens every time an image is processed. I think my longer reply above to @ggerganov might help illustrate my concern better.
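
A trivial sanity check of the arithmetic in that example (toy numbers taken from this thread; the formula is the one given in the longer reply above):

#include <cstdio>

int main() {
    const int image_kv_entries = 60; // M-RoPE compressed KV entries for one image
    const int temporal_steps   = 1;  // what the temporal plane advances by for the whole image
    const int drift_per_image  = image_kv_entries - temporal_steps; // = 59

    // Per the claim above, after n_restores checkpoint restores with n_images
    // images in context, the recurrent cell would lag the KV cache by roughly:
    const int n_images   = 1;
    const int n_restores = 3;
    printf("drift = %d tokens\n", n_images * drift_per_image * n_restores); // 177
    return 0;
}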

Contributor

@ngxson Feb 23, 2026

ubatch.pos[i + n_seq_tokens - 1] explained in human language is: "get the (temporal) position of the last token in ubatch"

Now, if the "last token in ubatch" is a text token, there is nothing wrong with this logic, because it will always be set to the correct position (pos = 6259 in your case) before it is even added to the batch.

Which other cases do you think, that the logic above can be wrong?

Contributor

@ngxson Feb 23, 2026

  1. text tokens 0–5999 -> cell records last_pos = 5999
  2. image tokens, 60 entries, real positions are 6000–6059, but the temporal position for all image tokens is = 6000 thus the cell incorrectly records last_pos as 6000 when it should be 6059
  3. text tokens 6060–6259 -> cell records last_pos as 6200, when it should be 6259

In other words: the moment you decode another text token, because its position is set by server_tokens, its pos will be set to 6259 + 1 = 6260

llama_pos pos = server_tokens.pos_next(); // returns 6259 + 1 = 6260
common_batch_add(batch, text_token, pos, ...);
llama_decode(batch); // now, last_pos will be updated to 6260

So, suddenly, everything will be in sync again?

Contributor

@ngxson Feb 23, 2026

The cell thinks it's at position 6000, but the KV cache knows the image occupied slots up to 6059.

Also, an important note: the KV cache does NOT know that the image occupied positions up to 6059; it only knows that 6059 cells are used.

Here is the code where cell.pos is updated in KV cache:

cells.pos_set(idx, ubatch.pos[i]);

Contributor Author

@timkhronos Feb 23, 2026

What you say makes sense in theory, but in practice it fails. I've just tested a build with the find_slot() changes in memory-recurrent.cpp reverted but all other changes intact. The issue reproduces immediately, always prompting a full reprocess once an image is present.

Here are a few relevant log excerpts from a build without the find_slot() modification, with everything else from my PR intact. After an image is processed and a checkpoint is created:

created context checkpoint 1 of 8 (pos_min = 11667, pos_max = 11667)
find_slot: non-consecutive token position 11707 after 11667 for sequence 3 with 10 new tokens

The checkpoint is created after prompt processing completes but before generation begins. At that moment, cell.pos is still 11667, the wrong value from the temporal plane. The actual position should be 11707 (the gap of 40 comes from the compressed image tokens being counted as 1).

The "self-correction" from token decoding comes too late; by that point the checkpoint has already been initialized with pos_min = 11667, pos_max = 11667.

On the next conversation turn, we try to restore this checkpoint:
restored context checkpoint (pos_min = 11667, pos_max = 11667)
memory_seq_rm [10986, end)
failed to recover recurrent state - clearing the memory

seq_rm fails because the position data is wrong, and we fall back to a full prompt reprocess from scratch, which is exactly the behavior from issue #19690.

With the find_slot() fix restored, the checkpoint records the correct position and restore works as expected.

Contributor

After an image is processed and a checkpoint is created

Unless I missed something: checkpoint is only created upon SLOT_STATE_DONE_PROMPT, so that means your prompt ends with an image, not a text token.

From what you confirmed, pos_max = 11667 is the cell.pos value which you assume to be wrong in the recurrent case. However, as I explained above, adding one more text token will correct cell.pos (the text token must be inside the prompt, not generated text). Please verify this.

Contributor Author

I have tested this in normal SillyTavern conversations (text before the image, text after the image, nothing unusual). Checkpoint restore still fails without the fix to memory-recurrent.cpp. The text tokens after the image do not appear to correct the position tracking for the M-RoPE compressed image tokens.

Logically, find_slot() was written assuming each token increments the position by 1. M-RoPE image tokens break that assumption: many tokens share a temporal position but occupy distinct spatial positions. I believe this should be accounted for at checkpoint-creation time, not bridged over later. Bridging over it would be a workaround that would probably end up fairly fragile if someone later changes the checkpointing or the tracking logic, and in this case it doesn't seem to work anyway.

llama_pos pos_next() const;
const mtmd::input_chunk_ptr & find_chunk(size_t idx) const;

size_t tokens_up_to_pos(llama_pos max_pos) const;
Contributor

instead of adding this as a dedicated function, you just need to extend pos_next() to have an optional arg: pos_next(llama_pos i_pos_start = -1)

Contributor Author

Makes sense, I can fold the code from tokens_up_to_pos() into pos_next(). I'll use the scheme you suggested, but the naming could end up a bit confusing: pos_next(-1) would mean the next position, while pos_next(6259) would mean the number of tokens up to position 6259. These are pretty different operations.

Contributor

Before you do so, I would suggest reflecting one more time on my point about not using absolute token counts at all.

Most of the logic here suggests to me that the conversion between token count <--> position index is redundant, as we can simply use the position index.

The n_past_new calculation serves 2 purposes: filling out slot.n_prompt_tokens_cache = n_past_new and calling keep_first(), which I proved to be wrong in another comment; n_prompt_tokens_cache can be changed to n_prompt_pos_cache.

Contributor Author

@timkhronos Feb 23, 2026

I think refactoring the server to track positions natively instead of token counts is a worthwhile idea, but that would touch a lot of code beyond this PR. This PR is trying to add support for context checkpointing with recurrent models in multimodal contexts.

Regarding keep_first(n_past_new): n_past_new is already a token count, not a position index. The tokens_up_to_pos() call converts the checkpoint position into the corresponding token count, specifically to handle the M-RoPE case where token count >= position index.

So the conversion is necessary here to make keep_first() work correctly with its existing API. It could be modified to take a position instead of a count, but since I didn't add n_prompt_tokens_cache and keep_first(), and they are already used throughout the server codebase with their current behaviour, that would probably require reworking large swaths of the codebase.

This would be a larger-scale refactor, and I think it's beyond the scope of this PR.

SLT_WRN(slot, "recovered recurrent state from checkpoint (pos_min = %d, pos_max = %d, n_tokens = %d), n_past: %d -> %d\n",
it->pos_min, it->pos_max, it->n_tokens_cached, slot.prompt.n_tokens(), n_past_new);

slot.prompt.tokens.keep_first(n_past_new);
Contributor

This seems to be wrong: keep_first takes the absolute number of tokens (i.e. the KV slot count), not a position index.

If you do keep_first(n_past_new) here, it will end up removing more tokens than needed, because in the M-RoPE case the number of tokens >= the position index.

Instead, the cleaner way is to convert everything to the position index; we should no longer rely on absolute token counts.

Contributor Author

n_past_new is already a token count here, not a position. It comes from tokens_up_to_pos(checkpoint_pos), which converts the position index into the absolute number of tokens (correctly accounting for M-RoPE compressed image tokens). So keep_first(n_past_new) should be correct.

That said, I understand the confusion since the variable name could be better. I can rename it to something like n_tokens_keep to make the intent more obvious.

Contributor

hmm right, I was indeed confused.

To make it clear, it's probably better to have 2 different keep_first variants:

  • keep_first(size_t) takes the number of absolute tokens
  • keep_first_n_pos(llama_pos) takes the position index

By having specific types, I hope that any misuse of these 2 can be detected by the compiler.

Contributor Author

@timkhronos Feb 23, 2026

That's fine, I'll add the new keep_first_n_pos(), which will handle the position-to-token-count conversion internally and then call keep_first().
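
A rough, hypothetical sketch of what that could look like (the container, field names, and the conversion are illustrative assumptions, not the final server code):

#include <cstddef>
#include <cstdint>
#include <vector>

using llama_pos = int32_t; // as in llama.h

// Hypothetical illustration of the two variants discussed above.
struct token_window {
    std::vector<llama_pos> pos; // position index of each stored token

    // keep the first n tokens (absolute token count / KV-slot count)
    void keep_first(size_t n) {
        if (n < pos.size()) {
            pos.resize(n);
        }
    }

    // count how many tokens fall at or before a position index;
    // with M-RoPE several image tokens can share one temporal position,
    // so this count can exceed the position index itself
    size_t tokens_up_to_pos(llama_pos max_pos) const {
        size_t n = 0;
        while (n < pos.size() && pos[n] <= max_pos) {
            ++n;
        }
        return n;
    }

    // position-based variant: converts internally, then reuses keep_first()
    void keep_first_n_pos(llama_pos p) {
        keep_first(tokens_up_to_pos(p));
    }
};

Because the two entry points take different types (size_t vs llama_pos), mixing up a token count and a position index is more likely to be caught at the call site, which is the point of the suggestion above.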

@ggerganov
Member

@timkhronos Could you confirm that #19849 works correctly? I tried to simplify the approach here and need some feedback on whether that implementation is correct.

@timkhronos
Contributor Author

timkhronos commented Feb 24, 2026

@ggerganov The implementation in #19849 seems to work fine in the same test cases I tried for my implementation. However, I have found an issue during testing:

If we have a conversation (Ta = assistant text turn, Tu = user text turn, Iu = user image turn):

  1. Tu
  2. Ta
  3. Tu
  4. Ta
  5. Iu
  6. Ta
  7. Tu
  8. Ta
  9. Tu
    9.5. After 9, create a context checkpoint
  10. Ta (assistant generates reply)

If we delete turns 10 and 9 and swipe on reply 8, or if we delete turns 10, 9, 8 and send 7 to generate, the recurrent state incorrectly gets restored. The checkpoint should have become invalid since the prompt has moved to a point (7) before it; instead, the checkpoint is still restored, even though its recurrent cell state was computed after processing tokens that no longer exist.

The checkpoint at pos_min=pos_max=8440 was created after processing all 11052 tokens, so the recurrent state reflects all of them. After deletion (prompt is now only 10607 tokens), restoring this checkpoint contaminates the recurrent state with influence from deleted messages.

We'd need to also store the total prompt length or max position at checkpoint creation time, and invalidate any checkpoint where that exceeds the current context.
Logs attached showing the behavior when this happens.

slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 10540, batch.n_tokens = 396, progress = 0.953674
find_slot: non-consecutive token position 8440 after 8021 for sequence 3 with 396 new tokens
find_slot: non-consecutive token position 8440 after 8021 for sequence 3 with 396 new tokens
slot update_slots: id 3 | task 0 | n_tokens = 10540, memory_seq_rm [8441, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 11052, batch.n_tokens = 512, progress = 1.000000
slot update_slots: id 3 | task 0 | prompt done, n_tokens = 11052, batch.n_tokens = 512
slot init_sampler: id 3 | task 0 | init sampler, took 0.65 ms, tokens: text = 8876, total = 11052
slot update_slots: id 3 | task 0 | created context checkpoint 1 of 8 (pos_min = 8440, pos_max = 8440, size = 186.329 MiB)
slot print_timing: id 3 | task 0 |
prompt eval time = 34453.93 ms / 11052 tokens ( 3.12 ms per token, 320.78 tokens per second)
eval time = 43217.56 ms / 438 tokens ( 98.67 ms per token, 10.13 tokens per second)
total time = 77671.49 ms / 11490 tokens
slot release: id 3 | task 0 | stop processing: n_tokens = 11489, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv params_from_: Chat format: peg-constructed
slot get_availabl: id 3 | task -1 | selected slot by LCP similarity, sim_best = 1.000 (> 0.100 thold), f_keep = 0.923
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> top-k -> min-p -> ?temp-ext -> adaptive-p
slot launch_slot_: id 3 | task 445 | processing task, is_child = 0
slot update_slots: id 3 | task 445 | new prompt, n_ctx_slot = 80128, n_keep = 0, task.n_tokens = 10607
slot update_slots: id 3 | task 445 | n_past = 10603, slot.prompt.tokens.size() = 11489, seq_id = 3, pos_min = 9389, n_swa = 1
slot update_slots: id 3 | task 445 | restored context checkpoint (pos_min = 8440, pos_max = 8440, size = 186.329 MiB)
slot update_slots: id 3 | task 445 | n_tokens = 10541, memory_seq_rm [8442, end)
slot update_slots: id 3 | task 445 | prompt processing progress, n_tokens = 10607, batch.n_tokens = 66, progress = 1.000000
slot update_slots: id 3 | task 445 | prompt done, n_tokens = 10607, batch.n_tokens = 66
slot init_sampler: id 3 | task 445 | init sampler, took 0.59 ms, tokens: text = 8431, total = 10607
find_slot: non-consecutive token position 8507 after 8440 for sequence 3 with 66 new tokens
find_slot: non-consecutive token position 8507 after 8440 for sequence 3 with 66 new tokens
slot print_timing: id 3 | task 445 |
prompt eval time = 1805.72 ms / 66 tokens ( 27.36 ms per token, 36.55 tokens per second)
eval time = 35591.93 ms / 359 tokens ( 99.14 ms per token, 10.09 tokens per second)
total time = 37397.65 ms / 425 tokens

@ebfio

ebfio commented Feb 24, 2026

@timkhronos I've been running this branch since yesterday and it works great so far. But today I've hit a regression. During an agentic workflow, this has happened:

llamacpp-qwen-3.5  | srv  update_slots: no tokens to decode
llamacpp-qwen-3.5  | slot update_slots: id  0 | task 153526 | n_tokens = 60646, memory_seq_rm [58613, end)
llamacpp-qwen-3.5  | slot update_slots: id  0 | task 153526 | prompt processing progress, n_tokens = 60646, batch.n_tokens = 0, progress = 1.002380
llamacpp-qwen-3.5  | srv  update_slots: no tokens to decode
llamacpp-qwen-3.5  | slot update_slots: id  0 | task 153526 | n_tokens = 60646, memory_seq_rm [58613, end)
llamacpp-qwen-3.5  | slot update_slots: id  0 | task 153526 | prompt processing progress, n_tokens = 60646, batch.n_tokens = 0, progress = 1.002380
llamacpp-qwen-3.5  | srv  update_slots: no tokens to decode
[... the same three lines repeat continuously until the server is killed ...]

And it got into a loop. I had to kill llama.cpp, and after restarting it came back just fine.

@ggerganov
Member

ggerganov commented Feb 24, 2026

The checkpoint at pos_min=pos_max=8440 was created after processing all 11052 tokens, so the recurrent state reflects all of them. After deletion (prompt is now only 10607 tokens), restoring this checkpoint contaminates the recurrent state with influence from deleted messages.

I am not sure that is correct, because the checkpoint is created before processing the last batch. So these logs:

slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 10540, batch.n_tokens = 396, progress = 0.953674
find_slot: non-consecutive token position 8440 after 8021 for sequence 3 with 396 new tokens
find_slot: non-consecutive token position 8440 after 8021 for sequence 3 with 396 new tokens
slot update_slots: id 3 | task 0 | n_tokens = 10540, memory_seq_rm [8441, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 11052, batch.n_tokens = 512, progress = 1.000000
slot update_slots: id 3 | task 0 | prompt done, n_tokens = 11052, batch.n_tokens = 512
slot init_sampler: id 3 | task 0 | init sampler, took 0.65 ms, tokens: text = 8876, total = 11052
slot update_slots: id 3 | task 0 | created context checkpoint 1 of 8 (pos_min = 8440, pos_max = 8440, size = 186.329 MiB)
slot print_timing: id 3 | task 0 |
prompt eval time = 34453.93 ms / 11052 tokens ( 3.12 ms per token, 320.78 tokens per second)

They are a bit misleading. It means that the checkpoint was created before processing the last batch of 512 tokens (i.e. before actually calling llama_decode() to "embed" them into the memory). I.e. there are 11052 - 512 = 10540 tokens in the checkpoint. So resuming with n_past = 10603 from that checkpoint is OK.

It is intentionally done like this (see #16440). We want to store the checkpoint slightly before the full prompt is processed, specifically to allow regenerations or small user-message modifications when a recurrent state is involved.

I'll add some changes to improve the logs in this regard.

Can you confirm?

@timkhronos
Contributor Author

timkhronos commented Feb 25, 2026

@ggerganov After testing I believe you are correct; I was indeed getting tripped up by the logging. Your implementation behaves as expected given that we checkpoint before the last batch, and my extra check would only cause undue reprocessing, with no benefit as far as I can see.

Some smaller Qwen MoEs have been released recently, sharing the same vision capabilities and hybrid recurrent attention as the 397B one, in case you want to validate the PR personally.

@ggerganov
Member

Some smaller Qwen MoEs have been released recently, sharing the same vision capabilities and hybrid recurrent attention as the 397B one, in case you want to validate the PR personally.

Yes, it's much easier now that we have these models. I am running an evaluation and so far it seems to work OK.

@ggerganov
Member

Superseded by #19849

@ggerganov closed this Mar 8, 2026


Successfully merging this pull request may close these issues.

Eval bug: qwen35moe always forces a full prompt reprocess after each message, 'failed to truncate'
