server: add --reasoning-budget 0 to disable thinking (incl. qwen3 w/ enable_thinking:false) #13771
ochafik merged 13 commits into ggml-org:master
|
yes this can be useful, I thought about it in #13272 , which is part of my idea about implementing the thinking budget. just to be less confused between |
|
Consider adding Granite's |
@CISC I hadn't seen that one, thanks for bringing this up! Strong case for support through @ngxson's #13272, the request param could override the flag then, or something. |
Title changed from "server: add --reasoning-format=disabled to disable thinking (incl. qwen3 w/ enable_thinking:false)" to "server: add --reasoning-format=nothink to disable thinking (incl. qwen3 w/ enable_thinking:false)"
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
"controls whether thought tags are allowed and/or extracted from the response, and in which format they're returned; one of:\n"
"- none: leaves thoughts unparsed in `message.content`\n"
"- deepseek: puts thoughts in `message.reasoning_content` (except in streaming mode, which behaves as `none`)\n"
"- nothink: prevents generation of thoughts (forcibly closing thoughts tag or setting template-specific variables such as `enable_thinking: false` for Qwen3)\n"
doesn't feel worth adding a separate flag at this stage, wdyt?
Tbh, I think we should still split this into a separate flag. "Format" implies it only formats the response without changing the behavior, but here nothink changes the generation behavior.
I think it's ok to just add a flag called --reasoning-budget and only support either -1 (unlimited budget) or 0 (no think) for now
Title changed from "server: add --reasoning-format=nothink to disable thinking (incl. qwen3 w/ enable_thinking:false)" to "server: add --reasoning-budget to disable thinking (incl. qwen3 w/ enable_thinking:false)"
Title changed from "server: add --reasoning-budget to disable thinking (incl. qwen3 w/ enable_thinking:false)" to "server: add --reasoning-budget 0 to disable thinking (incl. qwen3 w/ enable_thinking:false)"
|
@ngxson & @ochafik I have a question regarding the usage. Simply adding
llama-server `
--model 'D:\AI\LLM\gguf\Qwen3-30B-A3B\Qwen3-30B-A3B.IQ3_XXS.gguf' `
--alias 'Qwen3-30B-A3B.IQ3_XXS.gguf' `
--ctx-size 8192 `
--threads 16 `
--n-gpu-layers 99 `
--reasoning-budget 0 `
--flash-attn
This request:
curl.exe http://127.0.0.1:8080/v1/chat/completions `
--silent `
--header "Content-Type: application/json" `
--data '{
\"model\": \"Qwen3-30B-A3B.IQ3_XXS.gguf\",
\"messages\": [
{
\"role\": \"user\",
\"content\": \"How are you?\"
}
],
\"temperature\": 0.6,
\"max_tokens\": 1024
}'
Returns the following:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "<think>\nOkay, the user asked, \"How are you?\" I need to respond appropriately. Since I'm an AI, I don't have feelings, but I should keep the response friendly and helpful. Maybe say something like, \"I'm just a bunch of code, but I'm doing great! How can I assist you today?\" That's positive and shifts the focus back to the user. Let me make sure it's concise and friendly. Yep, that works.\n</think>\n\nI'm just a bunch of code, but I'm doing great! How can I assist you today? 😊"
}
}
],
"created": 1748251147,
"model": "Qwen3-30B-A3B.IQ3_XXS.gguf",
"system_fingerprint": "b5490-fef693dc",
"object": "chat.completion",
"usage": {
"completion_tokens": 121,
"prompt_tokens": 12,
"total_tokens": 133
},
"id": "chatcmpl-Ihg3Q1yUsY6rFGKOnOXr6hbRtTR42v2e",
"timings": {
"prompt_n": 12,
"prompt_ms": 69.177,
"prompt_per_token_ms": 5.76475,
"prompt_per_second": 173.46806019341687,
"predicted_n": 121,
"predicted_ms": 893.017,
"predicted_per_token_ms": 7.3803057851239675,
"predicted_per_second": 135.49574084255954
}
}
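To check quickly whether the flag took effect, one option is to scan the returned `message.content` for a thinking block. The following is a minimal sketch, not part of llama.cpp: `has_think_block` is a hypothetical helper, and the sample body is abridged from the response above (the thought text is shortened).

```python
import json
import re

# Matches a <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def has_think_block(response_body: str) -> bool:
    """Return True if any choice's message content contains a <think> block."""
    data = json.loads(response_body)
    return any(
        THINK_BLOCK.search(choice["message"].get("content") or "")
        for choice in data.get("choices", [])
    )


# Abridged from the response above (hypothetical shortening of the thought text).
sample = json.dumps({
    "choices": [{
        "finish_reason": "stop",
        "index": 0,
        "message": {
            "role": "assistant",
            "content": "<think>\nOkay, the user asked...\n</think>\n\n"
                       "I'm just a bunch of code, but I'm doing great!",
        },
    }]
})

print(has_think_block(sample))
```

A response that still contains the block, as in the report above, suggests the server did not apply the thinking-disabled template path.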
|
@countzero You need to start the server with |
|
@kth8 Thank you for the hint. That indeed works now:
llama-server `
--model 'D:\AI\LLM\gguf\Qwen3-30B-A3B\Qwen3-30B-A3B.IQ3_XXS.gguf' `
--alias 'Qwen3-30B-A3B.IQ3_XXS.gguf' `
--ctx-size 8192 `
--threads 16 `
--n-gpu-layers 99 `
--reasoning-budget 0 `
--jinja `
--flash-attn
@ngxson & @ochafik As a developer I would like to use the Suggestion: Activate |
|
Please take a look: #13877 |
|
I am not able to get reasoning-budget to work |
|
@jacekpoplawski you didn't run with |
|
does it work for you with --jinja? |
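Since the same omission comes up twice in this thread, here is a small sketch of a launcher helper that always pairs the two flags. It assumes, based only on the working command earlier in the thread, that `--reasoning-budget 0` needs `--jinja` to take effect. `build_server_args` is a hypothetical name, and the accepted values (-1 or 0) follow the review suggestion above rather than a checked reading of the merged code.

```python
from typing import List


def build_server_args(model_path: str,
                      reasoning_budget: int = 0,
                      ctx_size: int = 8192) -> List[str]:
    """Build a llama-server argument list that keeps --jinja and
    --reasoning-budget together (hypothetical helper)."""
    # Per the review discussion: -1 means unlimited budget, 0 means no thinking.
    if reasoning_budget not in (-1, 0):
        raise ValueError("reasoning_budget must be -1 (unlimited) or 0 (no thinking)")
    return [
        "llama-server",
        "--model", model_path,
        "--ctx-size", str(ctx_size),
        "--jinja",  # assumed to be required for the budget flag to apply
        "--reasoning-budget", str(reasoning_budget),
    ]


print(" ".join(build_server_args("Qwen3-30B-A3B.IQ3_XXS.gguf")))
```

The resulting list can be passed to `subprocess.run` as-is, which avoids shell quoting issues with Windows paths.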
… w/ enable_thinking:false) (ggml-org#13771) --------- Co-authored-by: ochafik <ochafik@google.com> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

This allows disabling thinking for all supported thinking models (QwQ, DeepSeek R1 distills, Qwen3, Command R7B) when the flag --reasoning-budget 0 is set. For Qwen3, "enable_thinking": false is passed as an extra template context variable (similar to #13196, "Support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client", which will still be very useful in general).
For per-request behaviour, see #13272 (discussion on upcoming reasoning budget request param) and #13196 (support passing generic kvs).
cc/ @matteoserva
cc/ @ngxson Not sure about the slight alteration of the semantics of the CLI flag (updated docs + inline help), but doesn't feel worth adding a separate flag at this stage, wdyt?
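For readers skimming the thread, the two mechanisms named in the help text (setting a template variable such as `enable_thinking: false`, or forcibly closing the thoughts tag) can be illustrated roughly as follows. This is a conceptual sketch only and not the llama.cpp implementation: `template_context` and `close_open_think_tag` are hypothetical names, and the `<think>` / `</think>` tags are taken from the Qwen3 output shown earlier.

```python
from typing import Any, Dict, Optional


def template_context(reasoning_budget: int,
                     extra: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Return the extra variables handed to the chat template (hypothetical)."""
    ctx = dict(extra or {})
    if reasoning_budget == 0:
        # Template-specific switch, e.g. for Qwen3.
        ctx["enable_thinking"] = False
    return ctx


def close_open_think_tag(prompt: str,
                         open_tag: str = "<think>",
                         close_tag: str = "</think>") -> str:
    """If the rendered prompt ends with an auto-opened thinking tag,
    append the closing tag so the model starts directly on the answer."""
    if prompt.rstrip().endswith(open_tag):
        return prompt + close_tag
    return prompt


print(template_context(0))
print(repr(close_open_think_tag("assistant\n<think>\n")))
```

The first function covers templates that expose a switch; the second covers templates that always open a thinking tag at the end of the prompt.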