
Send reasoning content back to the model across turns via the reasoning_content API field #21036

Merged
ServeurpersoCom merged 3 commits into ggml-org:master from ServeurpersoCom:webui/preserve-reasoning-in-context on Mar 27, 2026

Conversation

@ServeurpersoCom
Contributor

Overview

Send reasoning content back to the model across turns via the reasoning_content API field instead of stripping it.

Currently, the WebUI strips all reasoning from previous assistant messages before sending them to /v1/chat/completions. As a result, models like GLM-4.7-Flash, DeepSeek-R1, QwQ, and others that support multi-turn chain-of-thought lose their own reasoning history on every new turn.

The server already supports reasoning_content as a first-class input field: common_chat_msgs_parse_oaicompat parses it, to_json_oaicompat serializes it, and Jinja templates consume it natively (e.g. GLM maps it to <think> blocks via its clear_thinking flag). The WebUI already stores reasoning inline in content, wrapped in internal tags. The only missing piece was extracting it and sending it back as a proper API field.
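
For illustration, here is roughly what a request body looks like once a previous assistant turn carries its reasoning back as a first-class field (a minimal sketch; the model name and message contents are hypothetical, only the message shape matters):

```ts
// Minimal sketch of a /v1/chat/completions request body where a previous
// assistant turn sends its reasoning back as a separate field.
// Model name and message contents are hypothetical; only the shape matters.
const body = {
  model: "glm-4.7-flash",
  messages: [
    { role: "user", content: "What is 17 * 24?" },
    {
      role: "assistant",
      content: "17 * 24 = 408.",
      // Sent back as-is; the Jinja template decides how (and whether)
      // to render it, e.g. inside <think> blocks.
      reasoning_content: "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
    },
    { role: "user", content: "Now divide that by 6." },
  ],
};
```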

Changes:

  • Extract reasoning from internal tags and send it as a separate reasoning_content field in the API payload, so no internal tags leak into the request (see the sketch after this list)
  • Add "Exclude reasoning from context" toggle in Settings > Developer, unchecked by default so reasoning is preserved
  • Add corresponding syncable parameter so server admins can pre-configure the default
  • Add 12 unit tests covering extraction, stripping, and the conditional mapping logic
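
A minimal sketch of the extraction step from the first bullet, assuming hypothetical internal tag markers (the real markers and helper names in the WebUI differ; this only illustrates the mapping logic):

```ts
// Hypothetical stand-ins for the WebUI's internal reasoning tag markers.
const REASONING_OPEN = "<|reasoning|>";
const REASONING_CLOSE = "<|/reasoning|>";

interface ApiAssistantMessage {
  role: "assistant";
  content: string;
  reasoning_content?: string;
}

// Map a stored assistant message to the API payload: strip the internal
// tags from content and, unless the toggle excludes it, emit the wrapped
// span as a separate reasoning_content field.
function toApiMessage(stored: string, excludeReasoning: boolean): ApiAssistantMessage {
  const start = stored.indexOf(REASONING_OPEN);
  const end = stored.indexOf(REASONING_CLOSE);
  if (start === -1 || end === -1 || end < start) {
    return { role: "assistant", content: stored };
  }
  const reasoning = stored.slice(start + REASONING_OPEN.length, end).trim();
  // Remove the whole wrapped span so no internal tags leak into the request.
  const content = (stored.slice(0, start) + stored.slice(end + REASONING_CLOSE.length)).trim();
  return excludeReasoning || reasoning.length === 0
    ? { role: "assistant", content }
    : { role: "assistant", content, reasoning_content: reasoning };
}
```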

Tested live with MoE-GLM-4.7-Flash-30B-A3B: verified the payload in the DevTools Network tab across all three toggle states (default on, toggled off at runtime, re-enabled at runtime) without a page reload.

Additional information

Closes #19449

Related: PR #18994 (server-side reasoning input support)

Note: for GLM-4.7-Flash to actually preserve reasoning in the rendered prompt, and not just receive it, the template also needs clear_thinking: false via chat_template_kwargs. That is a separate concern, outside the scope of this PR; a minimal example follows.
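
For reference, passing the flag looks roughly like this (a minimal sketch; chat_template_kwargs and clear_thinking are as described above, while the endpoint and message contents are illustrative):

```ts
// Minimal sketch: ask the server to keep prior reasoning in the rendered
// prompt by passing clear_thinking: false through chat_template_kwargs.
// Endpoint URL and message contents are illustrative.
await fetch("http://localhost:8080/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    messages: [{ role: "user", content: "Hello" }],
    chat_template_kwargs: { clear_thinking: false },
  }),
});
```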

Requirements

  • I have read and agree with the contributing guidelines
    AI usage disclosure: YES. Claude Opus 4.6 Extended was used inside a disposable local container for code audit/generation, with no privilege/write access; all changes were reviewed, tested, and committed manually.

Preserve assistant reasoning across turns by extracting it from
internal tags and sending it as a separate reasoning_content field
in the API payload. The server and Jinja templates handle native
formatting (e.g. <think> tags for Qwen, GLM, DeepSeek...).

Adds "Exclude reasoning from context" toggle in Settings > Developer
(off by default, so reasoning is preserved). Includes unit tests.
@ServeurpersoCom requested a review from a team as a code owner on March 26, 2026 at 17:42
Contributor

@ngxson left a comment

Not sure if this will be supported by many models (some also require explicitly setting a model-specific kwarg like exclude_thinking=False), but if you're confident with this, we can give it a try

@ServeurpersoCom
Contributor Author

Absolutely; my translator (an LLM) ate the most important part: this will rarely be useful, but it will work when needed.

The API includes a symmetric field for returning reasoning_content in the context, and the Jinja template handles rejecting it if necessary, which is the case for most models.

@ServeurpersoCom ServeurpersoCom merged commit d0fa2c9 into ggml-org:master Mar 27, 2026
6 checks passed
@ZUIcat

ZUIcat commented Mar 28, 2026

May I ask: I recall that, previously, most models did not recommend including the reasoning content from earlier responses in subsequent turns. Will this modification break that behavior, or does the model's Jinja template prevent it, so there's no need to worry?

@ServeurpersoCom
Contributor Author

Absolutely: the Jinja template "filters" and cleans the CoT text for the vast majority of models, and if it ever causes a problem on a particular model, simply check the box in Settings > Developer to test that case. Furthermore, I'm certain that everything can be overridden in various ways on the backend: CLI flags, presets.ini for router mode, etc. (even externalizing a modified Jinja template).

slartibardfast pushed a commit to slartibardfast/llama.cpp that referenced this pull request Apr 12, 2026
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026

Development

Successfully merging this pull request may close these issues.

Feature Request: Setting to preserve Reasoning Content in WebUI

4 participants