Send reasoning content back to the model across turns via the reasoning_content API field#21036
Conversation
Preserve assistant reasoning across turns by extracting it from internal tags and sending it as a separate reasoning_content field in the API payload. The server and Jinja templates handle native formatting (e.g. <think> tags for Qwen, GLM, DeepSeek...). Adds "Exclude reasoning from context" toggle in Settings > Developer (off by default, so reasoning is preserved). Includes unit tests.
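To make the change concrete, here is a minimal sketch of what a request body looks like once prior reasoning is carried in its own field. The message texts are invented for illustration; only the `reasoning_content` field name and the overall shape come from this PR.

```python
# Illustrative sketch (not the actual webui code): an assistant turn whose
# prior reasoning travels in a separate reasoning_content field instead of
# being stripped or left inline in content.
messages = [
    {"role": "user", "content": "Why is the sky blue?"},
    {
        "role": "assistant",
        # Carries the previous turn's chain-of-thought; the server's Jinja
        # template decides how (or whether) to render it into the prompt.
        "reasoning_content": "Rayleigh scattering favors shorter wavelengths...",
        "content": "Because shorter (blue) wavelengths scatter more strongly.",
    },
    {"role": "user", "content": "And at sunset?"},
]

payload = {"messages": messages, "stream": True}
```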
ngxson left a comment:
Not sure if this will be supported by many models (some also require explicitly setting a model-specific kwarg like exclude_thinking=False), but if you're confident in this, we can give it a try
Absolutely, my translator (LLM) ate the most important part: this will rarely be useful, but it will work when needed. The API includes a symmetric field for returning reasoning_content in the context, and the Jinja template handles rejecting it if necessary; this is the case for most models.
May I ask: I recall that most models previously did not recommend including the reasoning content in the context. Will this modification break that behavior, or does the model's Jinja template prevent it, so there's no need to worry?
Absolutely, the Jinja template filters and cleans the CoT text for the vast majority of models, and if it ever causes a problem on a particular model, simply check the box in Settings > Developer to test that case. Furthermore, everything can be overridden in various ways on the backend: CLI flags, presets.ini for router mode, etc. (you can even point at a different, modified Jinja template).
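As a rough mental model of how a template "filters" past reasoning, here is a plain-Python mimic of the keep-or-drop decision. The function name, flag, and message texts are illustrative only; real chat templates express this in Jinja and each model's template differs.

```python
def render_history(messages, keep_reasoning=False):
    """Mimic, in plain Python, how a chat template might drop or keep
    reasoning_content from past assistant turns. Illustrative only."""
    parts = []
    for msg in messages:
        if (msg["role"] == "assistant" and keep_reasoning
                and msg.get("reasoning_content")):
            # Re-wrap prior reasoning in the model's native tags.
            parts.append(f"<think>{msg['reasoning_content']}</think>")
        parts.append(msg["content"])
    return "\n".join(parts)

history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!",
     "reasoning_content": "Greeting detected."},
]
print(render_history(history))                       # reasoning dropped
print(render_history(history, keep_reasoning=True))  # reasoning kept
```

Either way the client can always send `reasoning_content`; the template side decides what actually lands in the rendered prompt.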
…ng_content API field (ggml-org#21036)

* webui: send reasoning_content back to model in context

  Preserve assistant reasoning across turns by extracting it from internal tags and sending it as a separate reasoning_content field in the API payload. The server and Jinja templates handle native formatting (e.g. <think> tags for Qwen, GLM, DeepSeek...). Adds "Exclude reasoning from context" toggle in Settings > Developer (off by default, so reasoning is preserved). Includes unit tests.

* webui: add syncable parameter for excludeReasoningFromContext

* chore: update webui build output
Overview
Send reasoning content back to the model across turns via the reasoning_content API field instead of stripping it.
Currently the WebUI strips all reasoning from previous assistant messages before sending them to /v1/chat/completions. This means models like GLM-4.7-Flash, DeepSeek-R1, QwQ and others that support multi-turn chain-of-thought lose their own reasoning history on every new turn.
The server already supports reasoning_content as a first-class input field: common_chat_msgs_parse_oaicompat parses it, to_json_oaicompat serializes it, and Jinja templates consume it natively (e.g. GLM maps it to <think> blocks via its clear_thinking flag). The WebUI also already stores reasoning inline in content, wrapped in internal tags. The only missing piece was extracting it and sending it back as a proper API field.
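The extraction step described above can be sketched as follows. The `<think>` tag is used here as a stand-in for the webui's internal marker, and the helper name is hypothetical; the actual implementation lives in the webui's TypeScript code.

```python
import re

# Hypothetical marker: the webui's actual internal tag format may differ.
REASONING_RE = re.compile(r"<think>(.*?)</think>\s*", re.DOTALL)

def split_reasoning(content):
    """Split inline reasoning out of message content.

    Returns (clean_content, reasoning_or_None) so the caller can put the
    reasoning into a separate reasoning_content API field.
    """
    m = REASONING_RE.search(content)
    if not m:
        return content, None
    clean = REASONING_RE.sub("", content, count=1).strip()
    return clean, m.group(1).strip()

content, reasoning = split_reasoning("<think>Check the units first.</think> 42 km")
```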
Changes:
- Extract reasoning from the internal tags stored inline in content and send it back as a separate reasoning_content field in the API payload.
- Add an "Exclude reasoning from context" toggle in Settings > Developer (off by default, so reasoning is preserved).
- Add a syncable parameter for excludeReasoningFromContext.
- Update the webui build output.
- Add unit tests.
Tested live with MoE-GLM-4.7-Flash-30B-A3B: verified the payload in the DevTools Network tab across all three toggle states (default on, toggled off at runtime, re-enabled at runtime) without a page reload.
Additional information
Closes #19449
Related: PR #18994 (server-side reasoning input support)
Note: for GLM-4.7-Flash to actually preserve reasoning in the rendered prompt and not just receive it, the template also needs clear_thinking: false via chat_template_kwargs. That is a separate concern outside this PR scope.
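For completeness, such a request could look like the sketch below. Only the chat_template_kwargs field and the clear_thinking flag come from this PR discussion; the rest of the body is illustrative.

```python
# Illustrative request body: chat_template_kwargs passes model-specific
# options through to the Jinja renderer. Sketch only; field values beyond
# clear_thinking are invented.
payload = {
    "messages": [{"role": "user", "content": "Continue the proof."}],
    "chat_template_kwargs": {
        # Ask the GLM template to keep prior reasoning in the rendered prompt.
        "clear_thinking": False,
    },
}
```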
Requirements
AI usage disclosure: YES. Claude Opus 4.6 Extended, run inside a disposable local container without any privilege/write access, was used for code audit/generation; all changes were reviewed, tested, and committed manually.