common : re-arm reasoning budget after DONE on new <think>#22323
common : re-arm reasoning budget after DONE on new <think>#22323pwilkin merged 1 commit intoggml-org:masterfrom
Conversation
DONE state absorbs all tokens including a new start tag, causing any think blocks after the first to run unbudgeted. Observed on unsloth/Qwen3.6-27B-GGUF which interleaves multiple <think> blocks per response. Fixed by advancing start_matcher in DONE branch and re-arming to COUNTING with a fresh budget on match. Adds regression test (test-reasoning-budget: test 6).
This doesn't mean the model produces multiple thinking blocks within a single generation. It means thinking blocks from previous generations aren't stripped from the context on later turns. Those are two different things. Also, the template itself doesn't handle |
|
As @aldehir said. From my tests on Qwen3.5/3.6, using too low of a reasoning budget would cause the model to spill reasoning over to content. In extreme cases, it would do what you're describing as "opening a second thinking block", but that isn't a legitimate construction. |
You are right, after reading it more carefully this is about thinking across turns. However preservation does indirectly put prior turn
The captured logs show both The first block ended with Before After NB. Happy to share the complete server logs (or re-run with a different budget/prompt) if it would help |
Ah, good catch. Yes, this is apparent with |
|
Thanks for the clear pointer, @aldehir, after reading around more I can see what you mean. The I'm very new to this codebase, so I missed that entirely and went for the state-machine fix instead. Would you be okay with me submitting that as a follow-up PR? Let me know if you'd rather take it yourself, i'm happy to stand down. |
Yes, this is the issue. However, there's another subtle problem for templates that include an opening I am exploring adding assistant prefill that requires the issue above fixed as well. For now, I think we can accept your PR as is. |
|
Thanks @aldehir that is indeed a tricky edge case I hadn't considered. Happy to leave the prefill problem to you. Let me know if there's anything useful I can do while this PR works through review. |
aldehir
left a comment
There was a problem hiding this comment.
@BruceJillis thank you for the PR and diagnosing the bug.
cc @pwilkin
|
Sorry, yeah, let's use this as a workaround for now and fix it properly as a followup. |
|
@BruceJillis: With every turn there are more of those "deactivated / re-activated" messages. I think it is not doing any harm besides spamming my log, but nevertheless something that probably should get fixed. As this PR is already merged, should I create a new issue for that? |
|
@tha80 Thanks for reporting! This is a side effect of my fix.. the log lines were useful for validating the fix so I never considered they'd become noisy in long conversations. There's an architectural fix in the works (#22488) that will supersede this workaround. I'll check whether it resolves this too, and if not I'll push a follow-up PR asap. |
…22323) DONE state absorbs all tokens including a new start tag, causing any think blocks after the first to run unbudgeted. Observed on unsloth/Qwen3.6-27B-GGUF which interleaves multiple <think> blocks per response. Fixed by advancing start_matcher in DONE branch and re-arming to COUNTING with a fresh budget on match. Adds regression test (test-reasoning-budget: test 6).
…22323) DONE state absorbs all tokens including a new start tag, causing any think blocks after the first to run unbudgeted. Observed on unsloth/Qwen3.6-27B-GGUF which interleaves multiple <think> blocks per response. Fixed by advancing start_matcher in DONE branch and re-arming to COUNTING with a fresh budget on match. Adds regression test (test-reasoning-budget: test 6).
Overview
DONE state in reasoning budget state machine absorbs start tags, causing any block after the first to run unbudgeted. This makes it so the reasoning budget is a no-op for multi-block thinking models. Using the Qwen3.6-27B model with the recommended settings causes this issue to appear [1]. The fix is to re-arm in DONE on a match and transition to COUNTING with a fresh budget. I've added a regression test in test-reasoning-budget to test for this new behavior and all 6 tests pass.
Additional information
Reproducible using:
unsloth/Qwen3.6-27B-GGUF, server flags:--reasoning-budget 128 --reasoning-format deepseek --jinja, base commit: master at15fa3c493(b8920)Request body
{ "model": "kimi-q36-27b", "messages": [ { "role": "user", "content": "Get the current time, then tell me what it is." }, { "role": "assistant", "content": "<think>The user wants the time. I should call get_time to retrieve it.</think>", "tool_calls": [ { "id": "call_1", "type": "function", "function": { "name": "get_time", "arguments": "{}" } } ] }, { "role": "tool", "tool_call_id": "call_1", "content": "12:00 PM UTC" } ], "tools": [ { "type": "function", "function": { "name": "get_time", "description": "Get the current time.", "parameters": { "type": "object", "properties": {} } } } ], "temperature": 0.3, "max_tokens": 512 }Before
The second
<think>(the assistant prefill at end of prompt) hit DONE state and was silently absorbed. The subsequent generation ran without budget enforcement.After (this PR)
The new
re-activated on new start tagline is the fix incommon_reasoning_budget_accept: when a start tag is accepted in DONE, the sampler transitions back to COUNTING with a fresh budget. The downstreamdeactivated (natural end)confirms the second block was tracked to completion.Related work includes: #21594 handles cross-generation reset and #21141 adds a conclusion state. Both are independent of this fix which targets the case of multiple blocks in the same request.
Requirements