server: batch checkpoints to support kvcache context truncation #19970

Closed
aagit wants to merge 1 commit into ggml-org:master from aagit:batch_checkpoints

Conversation

Contributor

@aagit aagit commented Feb 28, 2026

Currently, checkpoints are only created once the entire kvcache has been computed. This renders the kvcache an append-only cache.

This change creates checkpoints at every token batch. If the context is truncated, the kvcache remains useful, just as it does with models that do not require checkpoints.
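
The idea, in a rough Python sketch (all names here are illustrative, not llama.cpp's actual implementation): snapshot the KV cache after every decoded batch, keep a bounded number of snapshots, and on a truncated or edited context resume from the newest snapshot that still fits within the shared prefix.

```python
from collections import deque

# Illustrative sketch only; these names are hypothetical, not llama.cpp's API.
N_BATCH = 2048          # tokens decoded per batch
MAX_CHECKPOINTS = 64    # analogous in spirit to --ctx-checkpoints

class FakeKVCache:
    """Stand-in for the KV cache: only tracks how many positions are filled."""
    def __init__(self):
        self.n_past = 0
    def decode(self, batch):
        self.n_past += len(batch)
    def snapshot(self):
        return self.n_past            # a real snapshot would copy KV state

def process_prompt(tokens, kv, checkpoints):
    """Decode batch by batch, snapshotting after every batch (this PR's
    behavior) instead of only once after the whole prompt (old behavior)."""
    for start in range(0, len(tokens), N_BATCH):
        kv.decode(tokens[start:start + N_BATCH])
        if len(checkpoints) == MAX_CHECKPOINTS:
            checkpoints.popleft()     # evict the oldest snapshot
        checkpoints.append(kv.snapshot())

def best_checkpoint(shared_prefix_len, checkpoints):
    """Newest checkpoint that does not extend past the shared prefix, i.e.
    how much of the truncated context can be reused without reprocessing."""
    usable = [pos for pos in checkpoints if pos <= shared_prefix_len]
    return max(usable) if usable else 0

checkpoints = deque()
process_prompt(list(range(250_000)), FakeKVCache(), checkpoints)  # long prompt
print(best_checkpoint(225_000, checkpoints))  # reusable prefix after an edit
```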

Reproducer simulating a coding workflow that frequently edits files in context:

```python
import requests
import time
import json

url = "http://localhost:8811/v1/chat/completions"

repeated_line = "This is a repeated line. "
payload_long = {
    "messages": [{"role": "user", "content": repeated_line * 10000}],
    "max_tokens": 1
}

payload_short = {
    "messages": [{"role": "user", "content": repeated_line * 9000}],
    "max_tokens": 1
}

start_time = time.time()
response_long = requests.post(url, json=payload_long)
time_long = time.time() - start_time
if response_long.status_code != 200:
    raise RuntimeError(f"Request failed with status {response_long.status_code}")

start_time = time.time()
response_short = requests.post(url, json=payload_short)
time_short = time.time() - start_time
if response_short.status_code != 200:
    raise RuntimeError(f"Request failed with status {response_short.status_code}")

print(f"kvcache preload: {time_long:.4f} seconds")
print(f"truncated kvcache lookup: {time_short:.4f} seconds")
```

Results with `llama.cpp --ctx-checkpoints 128`:

Devstral-Small-2-24B-Instruct-2512 (stock):
kvcache preload: 133.6303 seconds
truncated kvcache lookup: 0.1259 seconds

Qwen3.5-35B-A3B nothink (stock):
kvcache preload: 43.1127 seconds
truncated kvcache lookup: 37.4520 seconds

Qwen3.5-35B-A3B nothink (with batch checkpoints):
kvcache preload: 43.4365 seconds
truncated kvcache lookup: 0.9374 seconds

Performance boost for razor coding rg-edit [1] workflows with context LRU management and optional preloading [2]: ~39x with ~52k context.

[1] https://gitlab.com/aarcange/ripgrep-edit
[2] karthink/gptel#1147

ahti commented Mar 1, 2026

This resolved my issue where using Qwen3.5 in OpenCode would regularly need to reprocess the whole prompt, leading to very long pauses. Running this branch, I see successful checkpoint restoration happening in the log and the long pauses are gone. Thanks :)

Contributor

whoreson commented Mar 1, 2026

#17428

#neverhappening


tlaerm commented Mar 1, 2026

Can confirm that this solved the prompt reprocessing issue for me too with qwen35moe. I used the `--ctx-checkpoints 64` flag. Thanks

burakaydinofficial added a commit to burakaydinofficial/llama.cpp that referenced this pull request Mar 7, 2026
…processing

Resolves conflict in server-context.cpp by combining PR ggml-org#19970's approach
(checkpoint creation on every batch) with master's checkpoint_every_nt gating.
burakaydinofficial added a commit to burakaydinofficial/llama.cpp that referenced this pull request Mar 7, 2026
Match PR ggml-org#19970 behavior: checkpoint creation runs on every batch
unconditionally, gated only by pos_min/pos_max/gap conditions.
The checkpoint_every_nt gating in the else branch was preventing
intermediate checkpoints during recovery reprocessing.
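
To make the gating difference described in these commit messages concrete, here is a rough Python illustration (my own reading; the names and thresholds are hypothetical, not llama.cpp's actual code): the PR attempts a checkpoint after every batch, gated only by position/gap conditions, while an additional every-N-tokens gate can suppress the intermediate checkpoints.

```python
# Hypothetical illustration of the two gating policies; not llama.cpp code.
def should_checkpoint(pos_max, last_checkpoint_pos,
                      min_gap=64, every_n_tokens=None):
    """Decide whether to snapshot the KV cache after the current batch."""
    if pos_max < 0:
        return False                          # nothing usable in the cache yet
    if pos_max - last_checkpoint_pos < min_gap:
        return False                          # too close to the previous checkpoint
    if every_n_tokens is not None:
        # Extra gate (checkpoint_every_nt style): only checkpoint when an
        # every-N-token boundary has been crossed, which skips some batches.
        if pos_max // every_n_tokens == last_checkpoint_pos // every_n_tokens:
            return False
    return True

# Per-batch gating (this PR's behavior): gap/position conditions only.
print(should_checkpoint(4096, 2048))                        # True
# With the additional every-N gate, the same batch is skipped.
print(should_checkpoint(4096, 2048, every_n_tokens=8192))   # False
```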
@SurealCereal

> This resolved my issue where using Qwen3.5 in OpenCode would regularly need to reprocess the whole prompt, leading to very long pauses. Running this branch, I see successful checkpoint restoration happening in the log and the long pauses are gone. Thanks :)

Using `LLAMA_ARG_CTX_CHECKPOINTS: 64` definitely helped, but the entire context still sometimes gets reprocessed when I manually prompt something in OpenCode with Unsloth Qwen3.5-122B-A10B-GGUF_UD-Q4_K_XL on the llama.cpp `full-cuda` build b8399.

Contributor Author

aagit commented Mar 21, 2026

With the current upstream you don't need this PR anymore; you can use `--checkpoint-every-n-tokens 2048 --ctx-checkpoints 64` instead, or nothing at all if you can tolerate an extra 6k tokens having to be recomputed after every truncation.
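
For reference, a minimal launch sketch under those settings (the model path and port are placeholders, and it assumes a llama-server build recent enough to have both flags, as stated above):

```python
import subprocess

# Placeholder model path and port; adjust for your setup.
subprocess.run([
    "llama-server",
    "-m", "model.gguf",
    "--port", "8811",
    "--ctx-checkpoints", "64",               # maximum number of checkpoints kept
    "--checkpoint-every-n-tokens", "2048",   # checkpoint interval in tokens
])
```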

@aagit aagit closed this Mar 21, 2026