server: batch checkpoints to support kvcache context truncation #19970

Closed
aagit wants to merge 1 commit into ggml-org:master from aagit:batch_checkpoints

Conversation

Contributor

@aagit aagit commented Feb 28, 2026

Currently, checkpoints are only created once the entire kvcache has been computed. This renders the kvcache an append-only cache.

This change creates checkpoints at every token batch. If the context is truncated, the kvcache remains useful, just as it does with models that do not require checkpoints.
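
The idea, in a rough Python sketch (all names here are illustrative, not llama.cpp's actual implementation): snapshot the KV cache after every decoded batch, keep a bounded number of snapshots, and on a truncated or edited context resume from the newest snapshot that still fits within the shared prefix.

```python
from collections import deque

# Illustrative sketch only; these names are hypothetical, not llama.cpp's API.
N_BATCH = 2048          # tokens decoded per batch
MAX_CHECKPOINTS = 64    # analogous in spirit to --ctx-checkpoints

class FakeKVCache:
    """Stand-in for the KV cache: only tracks how many positions are filled."""
    def __init__(self):
        self.n_past = 0
    def decode(self, batch):
        self.n_past += len(batch)
    def snapshot(self):
        return self.n_past            # a real snapshot would copy KV state

def process_prompt(tokens, kv, checkpoints):
    """Decode batch by batch, snapshotting after every batch (this PR's
    behavior) instead of only once after the whole prompt (old behavior)."""
    for start in range(0, len(tokens), N_BATCH):
        kv.decode(tokens[start:start + N_BATCH])
        if len(checkpoints) == MAX_CHECKPOINTS:
            checkpoints.popleft()     # evict the oldest snapshot
        checkpoints.append(kv.snapshot())

def best_checkpoint(shared_prefix_len, checkpoints):
    """Newest checkpoint that does not extend past the shared prefix, i.e.
    how much of the truncated context can be reused without reprocessing."""
    usable = [pos for pos in checkpoints if pos <= shared_prefix_len]
    return max(usable) if usable else 0

checkpoints = deque()
process_prompt(list(range(250_000)), FakeKVCache(), checkpoints)  # long prompt
print(best_checkpoint(225_000, checkpoints))  # reusable prefix after an edit
```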

Reproducer simulating a coding workflow that frequently edits files in context:

```python
import requests
import time
import json

url = "http://localhost:8811/v1/chat/completions"

repeated_line = "This is a repeated line. "
payload_long = {
    "messages": [{"role": "user", "content": repeated_line * 10000}],
    "max_tokens": 1
}

payload_short = {
    "messages": [{"role": "user", "content": repeated_line * 9000}],
    "max_tokens": 1
}

start_time = time.time()
response_long = requests.post(url, json=payload_long)
time_long = time.time() - start_time
if response_long.status_code != 200:
    raise RuntimeError(f"Request failed with status {response_long.status_code}")

start_time = time.time()
response_short = requests.post(url, json=payload_short)
time_short = time.time() - start_time
if response_short.status_code != 200:
    raise RuntimeError(f"Request failed with status {response_short.status_code}")

print(f"kvcache preload: {time_long:.4f} seconds")
print(f"truncated kvcache lookup: {time_short:.4f} seconds")
```

Results with `llama.cpp --ctx-checkpoints 128`:

Devstral-Small-2-24B-Instruct-2512 (stock):
kvcache preload: 133.6303 seconds
truncated kvcache lookup: 0.1259 seconds

Qwen3.5-35B-A3B nothink (stock):
kvcache preload: 43.1127 seconds
truncated kvcache lookup: 37.4520 seconds

Qwen3.5-35B-A3B nothink (with batch checkpoints):
kvcache preload: 43.4365 seconds
truncated kvcache lookup: 0.9374 seconds

Performance boost for razor coding rg-edit [1] workflows with context LRU management and optional preloading [2]: ~39x with ~52k context.

[1] https://gitlab.com/aarcange/ripgrep-edit
[2] karthink/gptel#1147

ahti commented Mar 1, 2026

This resolved my issue where using Qwen3.5 in OpenCode would regularly need to reprocess the whole prompt, leading to very long pauses. Running this branch, I see successful checkpoint restoration happening in the log and the long pauses are gone. Thanks :)

Contributor

whoreson commented Mar 1, 2026

#17428

#neverhappening


tlaerm commented Mar 1, 2026

Can confirm that this solved the prompt reprocessing issue for me too with qwen35moe. I used the `--ctx-checkpoints 64` flag. Thanks

burakaydinofficial added a commit to burakaydinofficial/llama.cpp that referenced this pull request Mar 7, 2026
…processing

Resolves conflict in server-context.cpp by combining PR ggml-org#19970's approach
(checkpoint creation on every batch) with master's checkpoint_every_nt gating.
burakaydinofficial added a commit to burakaydinofficial/llama.cpp that referenced this pull request Mar 7, 2026
Match PR ggml-org#19970 behavior: checkpoint creation runs on every batch
unconditionally, gated only by pos_min/pos_max/gap conditions.
The checkpoint_every_nt gating in the else branch was preventing
intermediate checkpoints during recovery reprocessing.
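
To make the gating difference described in these commit messages concrete, here is a rough Python illustration (my own reading; the names and thresholds are hypothetical, not llama.cpp's actual code): the PR attempts a checkpoint after every batch, gated only by position/gap conditions, while an additional every-N-tokens gate can suppress the intermediate checkpoints.

```python
# Hypothetical illustration of the two gating policies; not llama.cpp code.
def should_checkpoint(pos_max, last_checkpoint_pos,
                      min_gap=64, every_n_tokens=None):
    """Decide whether to snapshot the KV cache after the current batch."""
    if pos_max < 0:
        return False                          # nothing usable in the cache yet
    if pos_max - last_checkpoint_pos < min_gap:
        return False                          # too close to the previous checkpoint
    if every_n_tokens is not None:
        # Extra gate (checkpoint_every_nt style): only checkpoint when an
        # every-N-token boundary has been crossed, which skips some batches.
        if pos_max // every_n_tokens == last_checkpoint_pos // every_n_tokens:
            return False
    return True

# Per-batch gating (this PR's behavior): gap/position conditions only.
print(should_checkpoint(4096, 2048))                        # True
# With the additional every-N gate, the same batch is skipped.
print(should_checkpoint(4096, 2048, every_n_tokens=8192))   # False
```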
@SurealCereal

> This resolved my issue where using Qwen3.5 in OpenCode would regularly need to reprocess the whole prompt, leading to very long pauses. Running this branch, I see successful checkpoint restoration happening in the log and the long pauses are gone. Thanks :)

Using `LLAMA_ARG_CTX_CHECKPOINTS: 64` definitely helped, but the entire context still sometimes gets reprocessed when I manually prompt something in OpenCode with Unsloth Qwen3.5-122B-A10B-GGUF_UD-Q4_K_XL on the llama.cpp `full-cuda` build b8399.

Contributor Author

aagit commented Mar 21, 2026

With the current upstream you don't need this PR anymore; you can use `--checkpoint-every-n-tokens 2048 --ctx-checkpoints 64` instead, or nothing at all if you can tolerate an extra 6k tokens having to be recomputed after every truncation.
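
For reference, a minimal launch sketch under those settings (the model path and port are placeholders, and it assumes a llama-server build recent enough to have both flags, as stated above):

```python
import subprocess

# Placeholder model path and port; adjust for your setup.
subprocess.run([
    "llama-server",
    "-m", "model.gguf",
    "--port", "8811",
    "--ctx-checkpoints", "64",               # maximum number of checkpoints kept
    "--checkpoint-every-n-tokens", "2048",   # checkpoint interval in tokens
])
```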

@aagit aagit closed this Mar 21, 2026