Name and Version
version: 8070 (cc45f2a)
built with GNU 13.3.0 for Linux x86_64
Operating systems
Linux
GGML backends
CUDA
Hardware
Ryzen 9 9950X3D + RTX 5090
Models
The issue presents with several quants of Qwen3.5 397B.
Problem description & steps to reproduce
Every generation forces a full prompt reprocessing, regardless of the frontend (llama.cpp webui, SillyTavern, or others) and regardless of whether the generation is a new reply or a swipe. See the reproduction sketch after the launch arguments below.
launch arguments:
./llama-server -m model/Qwen3.5-397B-A17B-UD-Q5_K_XL-00001-of-00007.gguf -ngl 999 --threads 16 --threads-batch 16 --batch-size 2048 -ub 2048 -ot "blk.(0|1|2).ffn_.*=CUDA0" -ot "blk..*_exps.*=CPU" --ctx-size 96000 --port 15000 --chat-template-kwargs '{"enable_thinking": false}' --mmproj Desktop/model/AI/LLama.cpp/mmproj-F32.gguf --no-mmap
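
A minimal way to observe this, as a sketch: with the server running on port 15000 as launched above, send the same /completion request twice and compare timings.prompt_n (the number of prompt tokens actually processed). With a working prompt cache the second request should process close to zero prompt tokens; here it reprocesses all of them. The prompt text and n_predict value below are arbitrary, anything long enough works:

```python
# Reproduction sketch: assumes llama-server is listening on port 15000.
import json
import urllib.request

URL = "http://127.0.0.1:15000/completion"
payload = {
    "prompt": "Once upon a time, " * 200,  # arbitrary long-ish prompt
    "n_predict": 16,
    "cache_prompt": True,  # the default, stated explicitly
}

for i in range(2):
    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        timings = json.loads(resp.read())["timings"]
    # prompt_n = prompt tokens (re)processed for this request
    print(f"request {i + 1}: prompt_n = {timings['prompt_n']}")
```

This is the same behaviour the frontends above trigger through the chat endpoints; /completion just makes the token counts easy to read.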
First Bad Commit
It has been present since the commit that added qwen3.5moe support.
Relevant log output
slot get_availabl: id 3 | task -1 | selected slot by LCP similarity, sim_best = 1.000 (> 0.100 thold), f_keep = 0.838
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> logit-bias -> top-k -> min-p -> temp-ext -> adaptive-p
slot launch_slot_: id 3 | task 427 | processing task, is_child = 0
slot update_slots: id 3 | task 427 | new prompt, n_ctx_slot = 96000, n_keep = 0, task.n_tokens = 2189
slot update_slots: id 3 | task 427 | need to evaluate at least 1 token for each active slot (n_past = 2189, task.n_tokens() = 2189)
slot update_slots: id 3 | task 427 | n_past was set to 2188
slot update_slots: id 3 | task 427 | n_tokens = 2188, memory_seq_rm [2188, end)
slot update_slots: id 3 | task 427 | failed to truncate tokens with position >= 2188 - clearing the memory
slot prompt_clear: id 3 | task 427 | clearing prompt with 2188 tokens
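
The "failed to truncate tokens" / "clearing prompt" pair above is the mechanism behind the full reprocessing. To quantify the clears from a captured log, a quick sketch (assuming the server's output was redirected to server.log; the filename is illustrative):

```python
# Count cache-clear events in a captured server log.
from collections import Counter

counts = Counter()
with open("server.log") as f:
    for line in f:
        if "failed to truncate tokens" in line:
            counts["truncate failures"] += 1
        if "clearing prompt" in line:
            counts["prompt clears"] += 1

# One pair per generation matches the behaviour described above.
print(dict(counts))
```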