Name and Version
llama-server
version: 8124 (3571565)
built with GNU 14.2.0 for Linux x86_64
Operating systems
Linux
GGML backends
CUDA, CPU
Hardware
Dual 5th gen Xeon with 768 GB DDR5 + RTX Pro 6000
Models
Kimi K2.5 (q8_0 + q4_0 mix)
Minimax M2.5 (q8_0 + q4_k + q4_k + q5_k mix)
Of note, I could NOT reproduce this on Qwen 3.5 (q8_0 + q4_k + q4_k + q5_k mix)
Perhaps there is some issue common to the chat parsers for Kimi K2.5 and Minimax M2.5 but not Qwen 3.5?
Problem description & steps to reproduce
Alrighty, this is a weird one.
I noticed that when using Kimi K2.5 via llama-server's chat completions endpoint, the final " is always omitted whenever the response ends with a ". However, if I manually format the prompt (by rendering the Jinja template) and send it as a raw request to /v1/completions, this doesn't happen. @ddh0 also encountered this issue on Minimax M2.5.
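A guess at the mechanism (purely hypothetical — I have not verified this against llama.cpp's actual chat-parser code): if the chat-side parser withholds any trailing text that could be the prefix of an upcoming format marker (e.g. a tool-call opener or JSON argument string), and never flushes that held-back buffer when generation stops, a trailing " would vanish exactly as observed. A minimal sketch of that failure mode, with made-up marker strings:

```python
# Hypothetical sketch of a stream parser that withholds a suffix which
# might be the prefix of an upcoming marker, and forgets to flush it at
# end of generation. NOT llama.cpp code; markers are illustrative only.
MARKERS = ['"arguments":', '<|tool_call_begin|>']

def could_be_marker_prefix(suffix: str) -> bool:
    return any(m.startswith(suffix) for m in MARKERS)

def parse_stream(chunks):
    emitted, held = "", ""
    for chunk in chunks:
        held += chunk
        # Emit everything except the longest tail that might start a marker.
        cut = len(held)
        for i in range(len(held)):
            if could_be_marker_prefix(held[i:]):
                cut = i
                break
        emitted += held[:cut]
        held = held[cut:]
    # BUG: a correct parser would flush `held` here; this one drops it.
    return emitted

print(parse_stream(['"', 'test', '"']))  # prints "test  (closing quote lost)
```

Here the final chunk " matches the prefix of a marker, gets held back, and is silently discarded at end of stream — the same symptom as the chat completions response.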
To reproduce, hit the endpoint with a curl request:
curl http://0.0.0.0:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer AINT_GOT_NO_API_KEY" \
-d '{
"messages": [
{
"role": "user",
"content": "Write the word test wrapped in quotes."
}
],
"top_k": 1,
"chat_template_kwargs": {"thinking": false}
}'
which returns:
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"\"test"}}],"created":1771735211,"model":"moonshotai/Kimi-K2.5","system_fingerprint":"b8124-35715657c","object":"chat.completion","usage":{"completion_tokens":4,"prompt_tokens":17,"total_tokens":21},"id":"chatcmpl-CESqRF1bX2EM29fGt6aHtIBBU9g3zNQD","timings":{"cache_n":16,"prompt_n":1,"prompt_ms":122.831,"prompt_per_token_ms":122.831,"prompt_per_second":8.141267269663196,"predicted_n":4,"predicted_ms":220.697,"predicted_per_token_ms":55.17425,"predicted_per_second":18.124396797419084}}
where the returned string is "test — the closing quote is missing.
If we instead manually format the prompt into a string:
curl http://0.0.0.0:8080/v1/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer AINT_GOT_NO_API_KEY" \
-d '{
"prompt": "<|im_user|>user<|im_middle|>Write the word test wrapped in quotes.<|im_end|><|im_assistant|>assistant<|im_middle|><think></think>",
"top_k": 1
}'
there's no problem with the trailing quote:
{"choices":[{"text":"\"test\"","index":0,"logprobs":null,"finish_reason":"stop"}],"created":1771735231,"model":"moonshotai/Kimi-K2.5","system_fingerprint":"b8124-35715657c","object":"text_completion","usage":{"completion_tokens":4,"prompt_tokens":17,"total_tokens":21},"id":"chatcmpl-jrXhbDqotkrbzsMGXMul1X5tAmpUNGlm","timings":{"cache_n":16,"prompt_n":1,"prompt_ms":79.296,"prompt_per_token_ms":79.296,"prompt_per_second":12.61097659402744,"predicted_n":4,"predicted_ms":191.489,"predicted_per_token_ms":47.87225,"predicted_per_second":20.88892834575354}}
where the returned string is "test".
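The discrepancy can be checked mechanically by parsing the two JSON responses above and comparing the returned strings (a minimal sketch; the payloads are trimmed copies of the responses pasted above):

```python
import json

# Trimmed from the /v1/chat/completions response above.
chat_resp = json.loads(
    r'''{"choices":[{"finish_reason":"stop","index":0,
         "message":{"role":"assistant","content":"\"test"}}]}''')
# Trimmed from the /v1/completions response above.
raw_resp = json.loads(
    r'''{"choices":[{"text":"\"test\"","index":0,
         "logprobs":null,"finish_reason":"stop"}]}''')

chat_text = chat_resp["choices"][0]["message"]["content"]
raw_text = raw_resp["choices"][0]["text"]

print(repr(chat_text))  # '"test'   <- trailing quote missing
print(repr(raw_text))   # '"test"'  <- intact
```

The chat completions content is exactly the raw completion minus its final quote.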
For Minimax M2.5, the curl requests would be:
curl http://0.0.0.0:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer AINT_GOT_NO_API_KEY" \
-d '{
"messages": [
{
"role": "user",
"content": "Write the word test wrapped in quotes."
}
],
"top_k": 1
}'
and
curl http://0.0.0.0:8080/v1/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer AINT_GOT_NO_API_KEY" \
-d '{
"prompt": "]~!b[]~b]system\nYou are a helpful assistant. Your name is MiniMax-M2.5 and is built by MiniMax.[e~[\n]~b]user\nWrite the word test wrapped in quotes.[e~[\n]~b]ai\n<think>\n",
"top_k": 1
}'
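As a stopgap, the raw /v1/completions route demonstrated for Kimi K2.5 above can be wrapped in a small helper that renders the prompt string itself (a sketch; the marker strings are copied from the working curl example, and the server URL is the one used throughout this report):

```python
import json
import urllib.request

def kimi_raw_request(user_msg: str,
                     url: str = "http://0.0.0.0:8080/v1/completions"):
    # Prompt format copied from the working /v1/completions example above.
    prompt = ("<|im_user|>user<|im_middle|>" + user_msg +
              "<|im_end|><|im_assistant|>assistant<|im_middle|>"
              "<think></think>")
    body = json.dumps({"prompt": prompt, "top_k": 1}).encode()
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})

req = kimi_raw_request("Write the word test wrapped in quotes.")
# Send with urllib.request.urlopen(req) against a running llama-server.
```

This bypasses the chat parser entirely, so the trailing quote survives — useful for confirming the bug is in chat-side post-processing rather than in the model's sampling.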
First Bad Commit
No response
Relevant log output
curl http://0.0.0.0:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer AINT_GOT_NO_API_KEY" \
-d '{
"messages": [
{
"role": "user",
"content": "Write the word test wrapped in quotes."
}
],
"top_k": 1,
"chat_template_kwargs": {"thinking": false}
}'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"\"test"}}],"created":1771735211,"model":"moonshotai/Kimi-K2.5","system_fingerprint":"b8124-35715657c","object":"chat.completion","usage":{"completion_tokens":4,"prompt_tokens":17,"total_tokens":21},"id":"chatcmpl-CESqRF1bX2EM29fGt6aHtIBBU9g3zNQD","timings":{"cache_n":16,"prompt_n":1,"prompt_ms":122.831,"prompt_per_token_ms":122.831,"prompt_per_second":8.141267269663196,"predicted_n":4,"predicted_ms":220.697,"predicted_per_token_ms":55.17425,"predicted_per_second":18.124396797419084}}
curl http://0.0.0.0:8080/v1/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer AINT_GOT_NO_API_KEY" \
-d '{
"prompt": "<|im_user|>user<|im_middle|>Write the word test wrapped in quotes.<|im_end|><|im_assistant|>assistant<|im_middle|><think></think>",
"top_k": 1
}'
{"choices":[{"text":"\"test\"","index":0,"logprobs":null,"finish_reason":"stop"}],"created":1771735231,"model":"moonshotai/Kimi-K2.5","system_fingerprint":"b8124-35715657c","object":"text_completion","usage":{"completion_tokens":4,"prompt_tokens":17,"total_tokens":21},"id":"chatcmpl-jrXhbDqotkrbzsMGXMul1X5tAmpUNGlm","timings":{"cache_n":16,"prompt_n":1,"prompt_ms":79.296,"prompt_per_token_ms":79.296,"prompt_per_second":12.61097659402744,"predicted_n":4,"predicted_ms":191.489,"predicted_per_token_ms":47.87225,"predicted_per_second":20.88892834575354}}