server: batch checkpoints to support kvcache context truncation #19970
aagit wants to merge 1 commit into ggml-org:master
Currently, checkpoints are only created once the entire kvcache has
been computed. This effectively makes the kvcache append-only: if the
context is later truncated, there is no checkpoint to restore from and
the prompt has to be reprocessed.
This change creates a checkpoint at every token batch. If the context
is truncated, the kvcache remains useful, as it does with models that
do not require checkpoints.
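The effect can be illustrated with a toy model. This is only a sketch, not the server code: the token counts, the batch size of 2048, and the helper name `tokens_to_recompute` are assumptions chosen for illustration. It compares a single checkpoint taken at the end of the prompt with checkpoints taken at every batch, and counts how many tokens would have to be re-evaluated after the context is truncated.

```python
from bisect import bisect_right


def tokens_to_recompute(checkpoints, n_keep):
    """Tokens to re-evaluate to rebuild a prefix of n_keep tokens.

    checkpoints: sorted token positions at which a state snapshot exists.
    Only a checkpoint at or before n_keep can be restored.
    """
    i = bisect_right(checkpoints, n_keep)
    restore_at = checkpoints[i - 1] if i > 0 else 0
    return n_keep - restore_at


# Hypothetical sizes, not measured values.
n_long, n_short, n_batch = 60_000, 54_000, 2_048

# One checkpoint, taken after the whole prompt was computed.
stock = [n_long]
# One checkpoint per batch.
batched = list(range(n_batch, n_long + 1, n_batch))

print(tokens_to_recompute(stock, n_short))    # 54000: everything is recomputed
print(tokens_to_recompute(batched, n_short))  # 752: only the tail after position 53248
```

In this model the single end-of-prompt checkpoint lies past the truncation point, so it cannot be used and the whole shortened prompt is recomputed, whereas per-batch checkpoints bound the recomputation by the batch size.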
Reproducer simulating a coding workflow frequently editing files in context:
```python
import time

import requests

# OpenAI-compatible chat endpoint of the local server under test.
url = "http://localhost:8811/v1/chat/completions"
repeated_line = "This is a repeated line. "

# Long prompt: fills the kvcache.
payload_long = {
    "messages": [{"role": "user", "content": repeated_line * 10000}],
    "max_tokens": 1,
}
# Shorter prompt sharing a prefix with the long one: simulates a truncated context.
payload_short = {
    "messages": [{"role": "user", "content": repeated_line * 9000}],
    "max_tokens": 1,
}

# First request: the whole prompt has to be processed.
start_time = time.time()
response_long = requests.post(url, json=payload_long)
time_long = time.time() - start_time
if response_long.status_code != 200:
    raise RuntimeError(f"Request failed with status {response_long.status_code}")

# Second request: fast only if the cached prefix can be reused.
start_time = time.time()
response_short = requests.post(url, json=payload_short)
time_short = time.time() - start_time
if response_short.status_code != 200:
    raise RuntimeError(f"Request failed with status {response_short.status_code}")

print(f"kvcache preload: {time_long:.4f} seconds")
print(f"truncated kvcache lookup: {time_short:.4f} seconds")
```
Results with `llama.cpp --ctx-checkpoints 128`.
Devstral-Small-2-24B-Instruct-2512 (stock):
kvcache preload: 133.6303 seconds
truncated kvcache lookup: 0.1259 seconds
Qwen3.5-35B-A3B nothink (stock):
kvcache preload: 43.1127 seconds
truncated kvcache lookup: 37.4520 seconds
Qwen3.5-35B-A3B nothink (with batch checkpoints):
kvcache preload: 43.4365 seconds
truncated kvcache lookup: 0.9374 seconds
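As a quick check on these numbers, the following sketch computes how expensive the truncated lookup is relative to the preload for each run, and the speedup of the truncated lookup for Qwen3.5 with batch checkpoints compared to stock. The timings are copied from the results above; the variable names are only illustrative.

```python
# (preload seconds, truncated lookup seconds), copied from the results above.
timings = {
    "Devstral-Small-2-24B-Instruct-2512 (stock)": (133.6303, 0.1259),
    "Qwen3.5-35B-A3B nothink (stock)": (43.1127, 37.4520),
    "Qwen3.5-35B-A3B nothink (with batch checkpoints)": (43.4365, 0.9374),
}

for name, (preload, lookup) in timings.items():
    # A small percentage means most of the cached prefix was reused.
    print(f"{name}: truncated lookup is {100 * lookup / preload:.1f}% of preload")

stock_lookup = timings["Qwen3.5-35B-A3B nothink (stock)"][1]
batched_lookup = timings["Qwen3.5-35B-A3B nothink (with batch checkpoints)"][1]
speedup = stock_lookup / batched_lookup
print(f"Qwen3.5 truncated lookup speedup: {speedup:.1f}x")
```

For this reproducer the truncated lookup on Qwen3.5 comes out roughly 40 times faster with batch checkpoints, while stock Qwen3.5 spends most of the preload time again on the shortened prompt.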
Performance boost for razor coding rg-edit [1] workflows with context
LRU management and optional preloading [2]: ~39x with ~52k context.
[1] https://gitlab.com/aarcange/ripgrep-edit
[2] karthink/gptel#1147
This resolved my issue where using Qwen3.5 in OpenCode would regularly need to reprocess the whole prompt, leading to very long pauses. Running this branch, I see successful checkpoint restoration happening in the log and the long pauses are gone. Thanks :)
#neverhappening
Can confirm that this solved the prompt reprocessing issue for me too with qwen35moe. I used
…processing Resolves conflict in server-context.cpp by combining PR ggml-org#19970's approach (checkpoint creation on every batch) with master's checkpoint_every_nt gating.
Match PR ggml-org#19970 behavior: checkpoint creation runs on every batch unconditionally, gated only by pos_min/pos_max/gap conditions. The checkpoint_every_nt gating in the else branch was preventing intermediate checkpoints during recovery reprocessing.
Using
With the current upstream you don't need this PR anymore and you can use `--checkpoint-every-n-tokens 2048 --ctx-checkpoints 64` instead. Or nothing if you can tolerate an extra 6k tokens having to be recomputed after every truncation.
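A rough way to reason about these two values is sketched below. It rests on assumptions that are not confirmed by this thread: that a checkpoint is kept at every multiple of the interval, and that only the most recent checkpoints up to the limit are retained. The function names are hypothetical.

```python
def worst_case_recompute(every_n_tokens):
    # Assumption: checkpoints sit at multiples of every_n_tokens, so a
    # truncation just before the next boundary loses at most this many tokens.
    return every_n_tokens - 1


def covered_span(every_n_tokens, max_checkpoints):
    # Assumption: only the newest max_checkpoints checkpoints are kept, so
    # they cover this many trailing tokens of context.
    return every_n_tokens * max_checkpoints


print(worst_case_recompute(2048))  # 2047
print(covered_span(2048, 64))      # 131072
```

Under these assumptions, a smaller interval lowers the recomputation after a truncation, and the checkpoint limit determines how far back in the context a truncation can still be served from a checkpoint.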