`server`: add `--reasoning-budget 0` to disable thinking (incl. qwen3 w/ enable_thinking:false) by ochafik · Pull Request #13771 · ggml-org/llama.cpp

ochafik · 2025-05-25T08:03:32Z

This allows disabling thinking for all supported thinking models (QwQ, DeepSeek R1 distills, Qwen3, Command R7B) when the flag --reasoning-budget 0 is set:

- For Qwen3, it sets "enable_thinking": false as an extra template context variable (similar to #13196, "Support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client", which will still be very useful in general).
- For models that append an open thinking tag, it forcibly closes it.
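
The two strategies described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function names, the `<think>` / `</think>` tag strings, and the trailing newlines are all assumptions chosen for the example.

```python
# Illustrative sketch only; names and tag strings are assumptions, not llama.cpp code.


def extra_template_context(template_supports_enable_thinking: bool) -> dict:
    """Variables to pass to the chat template when thinking is disabled."""
    if template_supports_enable_thinking:
        # Qwen3-style templates read this variable and skip the thinking block.
        return {"enable_thinking": False}
    return {}


def close_open_think_tag(prompt: str,
                         open_tag: str = "<think>",
                         close_tag: str = "</think>") -> str:
    """If the rendered prompt ends with an open thinking tag, close it immediately."""
    if prompt.rstrip().endswith(open_tag):
        # The model then sees an already-finished (empty) thought and answers directly.
        return prompt + close_tag + "\n\n"
    return prompt
```

The first helper covers templates that expose a switch; the second covers templates that unconditionally append an open tag to the generation prompt.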

For per-request behaviour, see #13272 (discussion on upcoming reasoning budget request param) and #13196 (support passing generic kvs).

cc/ @matteoserva
cc/ @ngxson Not sure about the slight alteration of the semantics of the CLI flag (updated docs + inline help), but doesn't feel worth adding a separate flag at this stage, wdyt?

ngxson · 2025-05-25T08:50:18Z

Yes, this can be useful. I thought about it in #13272, which is part of my idea about implementing the thinking budget.

Just to avoid confusion between none and disabled, I think it's better to call this flag nothink instead. In the future, we may also want to add a hidden mode, which still allows the model to generate thoughts but hides them from the response.

CISC · 2025-05-25T08:58:27Z

Consider adding Granite's thinking option in its chat template, which changes the system prompt. Basically the inverse of Qwen3's option.

ochafik · 2025-05-25T09:22:11Z

> Consider adding Granite's thinking option in its chat template, which changes the system prompt. Basically the inverse of Qwen3's option.

@CISC I hadn't seen that one, thanks for bringing this up! Strong case for support through @ngxson's #13272, the request param could override the flag then, or something.

ngxson · 2025-05-25T12:11:39Z

+        "controls whether thought tags are allowed and/or extracted from the response, and in which format they're returned; one of:\n"
+        "- none: leaves thoughts unparsed in `message.content`\n"
+        "- deepseek: puts thoughts in `message.reasoning_content` (except in streaming mode, which behaves as `none`)\n"
+        "- nothink: prevents generation of thoughts (forcibly closing thoughts tag or setting template-specific variables such as `enable_thinking: false` for Qwen3)\n"
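
For readers unfamiliar with the difference between the `none` and `deepseek` formats in the help text above, here is a rough sketch of the response shaping it describes. It assumes thoughts are wrapped in `<think>…</think>` at the start of the output; the function name and regex are hypothetical and do not reflect the actual parser in common/chat.cpp.

```python
import re

# Assumption: the thought block, if present, opens the raw model output.
_THINK_RE = re.compile(r"^\s*<think>(.*?)</think>\s*", re.DOTALL)


def format_message(raw: str, reasoning_format: str) -> dict:
    """Shape an assistant message according to the requested reasoning format."""
    if reasoning_format == "deepseek":
        match = _THINK_RE.match(raw)
        if match:
            # Thoughts go to reasoning_content; the remainder stays in content.
            return {
                "role": "assistant",
                "reasoning_content": match.group(1).strip(),
                "content": raw[match.end():],
            }
    # "none": leave everything, thoughts included, in content.
    return {"role": "assistant", "content": raw}
```

Both formats only change how an already generated response is presented; neither stops the model from producing thoughts in the first place.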


> doesn't feel worth adding a separate flag at this stage, wdyt?

Tbh I think we should still separate it into another flag. The format flag is meant to only format the response, not change the behavior, but here nothink changes the generation behavior.

I think it's ok to just add a flag called --reasoning-budget and only support either -1 (unlimited budget) or 0 (no think) for now

countzero · 2025-05-26T09:21:15Z

@ngxson & @ochafik I have a question regarding the usage. Simply adding --reasoning-budget 0 does not stop Qwen3 from outputting <think> tags and reasoning before answering. Am I missing something?

llama-server `
    --model 'D:\AI\LLM\gguf\Qwen3-30B-A3B\Qwen3-30B-A3B.IQ3_XXS.gguf' `
    --alias 'Qwen3-30B-A3B.IQ3_XXS.gguf' `
    --ctx-size 8192 `
    --threads 16 `
    --n-gpu-layers 99 `
    --reasoning-budget 0 `
    --flash-attn

This request:

curl.exe http://127.0.0.1:8080/v1/chat/completions `
    --silent `
    --header "Content-Type: application/json" `
    --data '{
        \"model\": \"Qwen3-30B-A3B.IQ3_XXS.gguf\",
        \"messages\": [
            {
                \"role\": \"user\",
                \"content\": \"How are you?\"
            }
        ],
        \"temperature\": 0.6,
        \"max_tokens\": 1024
    }'

Returns the following:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "<think>\nOkay, the user asked, \"How are you?\" I need to respond appropriately. Since I'm an AI, I don't have feelings, but I should keep the response friendly and helpful. Maybe say something like, \"I'm just a bunch of code, but I'm doing great! How can I assist you today?\" That's positive and shifts the focus back to the user. Let me make sure it's concise and friendly. Yep, that works.\n</think>\n\nI'm just a bunch of code, but I'm doing great! How can I assist you today? 😊"
      }
    }
  ],
  "created": 1748251147,
  "model": "Qwen3-30B-A3B.IQ3_XXS.gguf",
  "system_fingerprint": "b5490-fef693dc",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 121,
    "prompt_tokens": 12,
    "total_tokens": 133
  },
  "id": "chatcmpl-Ihg3Q1yUsY6rFGKOnOXr6hbRtTR42v2e",
  "timings": {
    "prompt_n": 12,
    "prompt_ms": 69.177,
    "prompt_per_token_ms": 5.76475,
    "prompt_per_second": 173.46806019341687,
    "predicted_n": 121,
    "predicted_ms": 893.017,
    "predicted_per_token_ms": 7.3803057851239675,
    "predicted_per_second": 135.49574084255954
  }
}
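
A quick way to check a response like the one above for leaked thinking is to inspect the parsed JSON. The helper below is a small sketch: the function name is made up for this example, and it assumes thoughts appear either as a literal `<think>` tag in `message.content` or in a `message.reasoning_content` field.

```python
def contains_thinking(response: dict) -> bool:
    """Return True if any choice in a chat completion response carries thoughts."""
    for choice in response.get("choices", []):
        message = choice.get("message", {})
        # Thoughts extracted into a separate field (deepseek-style format).
        if message.get("reasoning_content"):
            return True
        # Thoughts left inline in the content.
        if "<think>" in (message.get("content") or ""):
            return True
    return False
```

Applied to the response above, this returns `True`, since `content` starts with a `<think>` block.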

kth8 · 2025-05-26T09:23:41Z

@countzero You need to start the server with --jinja in addition to --reasoning-budget 0

countzero · 2025-05-26T09:42:18Z

@kth8 Thank you for the hint. That indeed works now:

llama-server `
    --model 'D:\AI\LLM\gguf\Qwen3-30B-A3B\Qwen3-30B-A3B.IQ3_XXS.gguf' `
    --alias 'Qwen3-30B-A3B.IQ3_XXS.gguf' `
    --ctx-size 8192 `
    --threads 16 `
    --n-gpu-layers 99 `
    --reasoning-budget 0 `
    --jinja `
    --flash-attn

@ngxson & @ochafik As a developer I would like to use the --reasoning-budget argument without having to know about the --jinja flag, so that I can simply use what I read in the usage documentation directly.

Suggestion: Activate --jinja automatically if --reasoning-budget needs it. I think a similar mechanism is already implemented for other flags.
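
As a toy illustration of this suggestion (one flag implying another), the sketch below uses Python's argparse with a hypothetical `toy-server` parser. It says nothing about how llama.cpp parses its arguments; it only shows the idea of switching `--jinja` on when `--reasoning-budget 0` is given.

```python
import argparse


def parse_args(argv: list) -> argparse.Namespace:
    """Parse a toy command line where --reasoning-budget 0 implies --jinja."""
    parser = argparse.ArgumentParser(prog="toy-server")  # hypothetical program
    parser.add_argument("--jinja", action="store_true")
    # Only -1 (unlimited) and 0 (no thinking) are accepted in this sketch.
    parser.add_argument("--reasoning-budget", type=int, choices=[-1, 0], default=-1)
    args = parser.parse_args(argv)
    if args.reasoning_budget == 0:
        # Disabling thinking relies on the chat template, so enable it implicitly.
        args.jinja = True
    return args
```

With this shape, `parse_args(["--reasoning-budget", "0"])` yields `jinja=True` without the user having to know about the second flag.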

characharm · 2025-05-29T00:03:11Z

Please take a look: #13877

jacekpoplawski · 2025-06-01T01:34:48Z

I am not able to get reasoning-budget to work

CUDA_VISIBLE_DEVICES=0,1 llama-cli --reasoning-budget 0 -ngl 99 -fa -ctv q8_0 -ctk q8_0 -m /mnt/models3/Qwen_Qwen3-4B-Q8_0.gguf -p "what is the capital of Poland?" 2>/dev/null
user
what is the capital of Poland?
assistant
<think>
Okay, the user is asking for the capital of Poland. I know that Poland is a country in Europe, and I remember that its capital is Warsaw. But wait, I should make sure I'm not mixing it up with another city. Let me think. Poland's capital is indeed Warsaw. I think that's correct. But maybe I should double-check. Let me recall some facts. Poland has several cities, like Kraków, Wrocław, and Gdańsk, but the capital is Warsaw. Yes, I'm pretty sure that's right. I think Warsaw is the capital. So the answer should be Warsaw. But wait, maybe the user is testing if I know that Warsaw is the capital. I should confirm. Let me think of other countries. For example, the capital of France is Paris, Germany is Berlin, and so on. Poland's capital is Warsaw. Yeah, that's right. I don't think there's any confusion here. So the answer is Warsaw.
</think>

The capital of Poland is **Warsaw**. It is the largest city in the country and serves as its political, cultural, and economic center.

>

CUDA_VISIBLE_DEVICES=0,1 llama-cli --reasoning-budget 0 -ngl 99 -fa -ctv q8_0 -ctk q8_0 -m /mnt/models3/Qwen_QwQ-32B-Q8_0.gguf -p "what is the capital of Poland?" 2>/dev/null
user
what is the capital of Poland?
assistant
<think>
Okay, the user is asking for the capital of Poland. Let me think... I remember that Warsaw is the capital, but I should make sure I'm not confusing it with another city. Let me verify. Poland is a country in Central Europe. The capital cities can sometimes be tricky because some countries have capitals that aren't their most famous cities, like Brazil's capital is Brasília, not Rio de Janeiro. But in Poland's case, Warsaw is definitely the largest city and the capital.

Wait, maybe I should recall some historical context. Warsaw was the site of the Warsaw Uprising during World War II, which I think was the largest single military effort by any European resistance movement during the war. That reinforces that it's an important city. Also, the Palace of Culture and Science is a landmark there, which is a gift from the Soviet Union. Yeah, so that's in Warsaw.

I don't think there's been any recent changes to the capital. Poland has been independent since the end of Communist rule, but the capital hasn't changed in modern times. Kraków is another major city in Poland, but it's not the capital. Sometimes people might confuse them because Kraków was the historical capital before Warsaw became the capital in 1596. So historically, there was a shift, but currently, Warsaw is definitely the capital.

Let me think if there's any other possible confusion. Maybe some might think about Gdańsk because of the Solidarity movement, but that's a different city on the coast. So no, the answer should be Warsaw. I can't think of any recent news where the capital would have changed. Therefore, I'm confident that the capital of Poland is Warsaw.
</think>

The capital of Poland is **Warsaw** (Warszawa in Polish). It has been the country's political, cultural, and economic center since the 16th century. Warsaw is known for its rich history, including its role in World War II and its subsequent reconstruction. Key landmarks include the Royal Castle, Old Town (Stare Miasto), and the Palace of Culture and Science.

kth8 · 2025-06-01T01:43:52Z

@jacekpoplawski you didn't run with --jinja as mentioned in previous comments

jacekpoplawski · 2025-06-01T02:13:56Z

does it work for you with --jinja?
UPDATE: it works, but only with server, not with cli

jacek@AI-SuperComputer:~$ CUDA_VISIBLE_DEVICES=0,1 llama-cli --jinja --reasoning-budget 0 -ngl 99 -fa -ctv q8_0 -ctk q8_0 -m /mnt/models3/Qwen_Qwen3-4B-Q8_0.gguf -p "what is the capital of Poland?" 2>/dev/null
user
what is the capital of Poland?
assistant
<think>
Okay, the user is asking for the capital of Poland. I need to make sure I give the correct answer. First, I recall that Poland is a country in Central Europe. The capital is a city that's well-known for its history and culture. I think it's Warsaw. But wait, I should double-check that to be sure.

Let me think. Poland's major cities include Warsaw, Kraków, Wrocław, and others. But the capital is usually the most important city, especially in terms of government and politics. I remember that Warsaw was the capital during the interwar period, and even after World War II, it remained the capital. There was a period when the capital was moved to Kraków, but that was during the time when the country was under German occupation. However, after the war, Warsaw was restored as the capital. So yes, Warsaw is the capital of Poland. I should confirm that there's no other city that's more commonly referred to as the capital. Maybe some people confuse it with other cities, but I'm pretty sure it's Warsaw. The official name is Warszawa. So the answer is Warsaw, and the official name is Warszawa. I need to present that clearly.
</think>

The capital of Poland is **Warsaw** (in Polish: **Warszawa**). It is the country's political, cultural, and economic center, home to the Polish parliament (Sejm) and government institutions. Warsaw has a rich history, including being the capital during the Polish–Lithuanian Commonwealth and after World War II, when it was rebuilt as the heart of a unified Poland.

jacekpoplawski · 2025-06-01T02:19:52Z

server works "kind of", but I think this is a problem with QwQ itself

command was:
CUDA_VISIBLE_DEVICES=0,1 llama-server --reasoning-budget 0 --jinja -ngl 99 -fa -ctv q8_0 -ctk q8_0 -m /mnt/models3/Qwen_QwQ-32B-Q8_0.gguf --host 0.0.0.0

ochafik added 3 commits May 25, 2025 08:10

server: fix/test add_generation_prompt

8a25f79

tools: enable hermes2/qwen chat logic even w/o tools

43b5626

server: add --reasoning-format=disabled to disable thinking (incl. qw…

b457f89

…en3 w/ enable_thinking:false)

github-actions Bot added the testing, examples, python, and server labels May 25, 2025

Update README.md

df25e6b

ochafik force-pushed the enable-thinking branch from 1b05d5c to df25e6b Compare May 25, 2025 08:08

ochafik mentioned this pull request May 25, 2025

Support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client #13196

Merged

Add models/templates/Qwen-Qwen3-0.6B.jinja

b6eb0a5

update --reasoning-format={disabled -> nothink} as suggested

cdea6a9

fix command r7b's nothink w/ official template

473c01e

ochafik changed the title ~~server: add --reasoning-format=disabled to disable thinking (incl. qwen3 w/ enable_thinking:false)~~ server: add --reasoning-format=nothink to disable thinking (incl. qwen3 w/ enable_thinking:false) May 25, 2025

ochafik marked this pull request as ready for review May 25, 2025 09:44

ochafik requested a review from ngxson as a code owner May 25, 2025 09:44

ngxson approved these changes May 25, 2025

Comment thread common/arg.cpp Outdated

Comment thread common/chat.cpp Outdated

ochafik and others added 4 commits May 25, 2025 11:57

rewrite docs as list as suggested

6b9efe7

Update common/chat.cpp

355b38c

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

Merge branch 'enable-thinking' of github.com:ochafik/llama.cpp into e…

fe6022f

…nable-thinking

const char* return for chat enum name helpers

8547fcc

ngxson reviewed May 25, 2025

ochafik added 2 commits May 25, 2025 18:52

switch to --reasoning-budget flag

9cdeebe

Merge branch 'fix-gen-prompt' into enable-thinking

9162380

ngxson approved these changes May 25, 2025

ngxson changed the title ~~server: add --reasoning-format=nothink to disable thinking (incl. qwen3 w/ enable_thinking:false)~~ server: add --reasoning-budget to disable thinking (incl. qwen3 w/ enable_thinking:false) May 25, 2025

ngxson changed the title ~~server: add --reasoning-budget to disable thinking (incl. qwen3 w/ enable_thinking:false)~~ server: add --reasoning-budget 0 to disable thinking (incl. qwen3 w/ enable_thinking:false) May 25, 2025

ochafik merged commit e121edc into ggml-org:master May 25, 2025
48 checks passed

n9Mtq4 mentioned this pull request May 29, 2025

Eval bug: std::runtime_error Invalid diff: #13876

Closed

mostlygeek mentioned this pull request May 31, 2025

aider-qwq-coder example bug mostlygeek/llama-swap#151

Closed

ericcurtin mentioned this pull request Jun 12, 2025

Reasoning flag containers/ramalama#1509

Closed

firecoperana mentioned this pull request Aug 23, 2025

Tool calls support from mainline ikawrakow/ik_llama.cpp#723

Merged

4 tasks

createthis mentioned this pull request Aug 24, 2025

Eval bug: Thinking model with thinking disabled cannot use /apply-template with final assistant turn #15401

Closed

7 participants

ochafik commented May 25, 2025 •

edited

Loading

ochafik commented May 25, 2025 •

edited

Loading

jacekpoplawski commented Jun 1, 2025 •

edited

Loading