
server : support preserving reasoning_content in assistant message#18994

Merged
pwilkin merged 5 commits into ggml-org:master from ngxson:xsn/reasoning_content_input
Jan 22, 2026

Conversation

@ngxson
Contributor

@ngxson ngxson commented Jan 21, 2026

Ref: #18936 (comment)

Changes included in this PR

  • use json_fwd in chat.h to avoid the template trick
  • deduplicate code between common_chat_msgs_to_json_oaicompat and common_chat_msg::to_json_oaicompat()
  • force clear_thinking = false for GLM 4.7 if it is not specified
  • report supports_preserve_reasoning in the server /props response

(Web UI support is TBD)
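The deduplication bullet above can be sketched as follows. This is a minimal illustration, not the real llama.cpp code: the struct, field handling, and hand-rolled JSON strings are hypothetical stand-ins (the real implementation uses nlohmann::json), but the shape matches the idea of having the list serializer delegate to the per-message serializer.

```cpp
#include <string>
#include <vector>

// hypothetical stand-in for common_chat_msg; JSON handling simplified
struct chat_msg {
    std::string role;
    std::string content;
    std::string reasoning_content;

    // per-message serializer, mirroring common_chat_msg::to_json_oaicompat()
    std::string to_json() const {
        std::string j = "{\"role\":\"" + role + "\",\"content\":\"" + content + "\"";
        if (!reasoning_content.empty()) {
            j += ",\"reasoning_content\":\"" + reasoning_content + "\"";
        }
        return j + "}";
    }
};

// the list serializer delegates to the per-message one instead of
// duplicating the field handling; this is the deduplication described above
std::string msgs_to_json(const std::vector<chat_msg> & msgs) {
    std::string out = "[";
    for (size_t i = 0; i < msgs.size(); i++) {
        if (i > 0) {
            out += ",";
        }
        out += msgs[i].to_json();
    }
    return out + "]";
}
```

With this split, any new per-message field (such as reasoning_content) only needs to be handled in one place.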

Changes in API

The /chat/completions API now accepts an optional reasoning_content field on assistant messages:

{
  "messages": [
    {
      "content": "Hello, world!",
      "role": "user"
    },
    {
      "content": "Hey there!",
      "role": "assistant",
      "reasoning_content": "This is my reasoning."
    },
    {
      "content": "Hello, world!",
      "role": "user"
    }
  ],
  "stream": false,
  "max_tokens": 64
}

If the template supports it, the reasoning is put back into the formatted prompt (tested with GLM 4.7):

[gMASK]<sop><|user|>Hello, world!<|assistant|><think>This is my reasoning.</think>Hey there!<|user|>Hello, world!<|assistant|><think>

Otherwise, it will be ignored.

To find out whether the template supports it, the /props endpoint indicates:

{
  "chat_template_caps": {
    ...
    "supports_preserve_reasoning": true,
    ...
  }
}
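A client can use this capability flag to decide whether to include reasoning_content in the history it sends back. The sketch below is a hypothetical client-side helper (not part of this PR); messages are modeled as flat string maps for brevity.

```cpp
#include <map>
#include <string>
#include <vector>

using msg_t = std::map<std::string, std::string>;

// hypothetical helper: before calling /chat/completions, drop
// reasoning_content from the history when the server's /props response
// reports supports_preserve_reasoning == false
std::vector<msg_t> prepare_messages(std::vector<msg_t> msgs, bool supports_preserve_reasoning) {
    if (!supports_preserve_reasoning) {
        for (auto & m : msgs) {
            m.erase("reasoning_content"); // the template would ignore it anyway
        }
    }
    return msgs;
}
```

Stripping the field is optional, since an unsupported template ignores it, but doing so keeps request payloads smaller.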

Member

@pwilkin pwilkin left a comment


Just as a general note: I am not a fan of splitting reasoning handling into enable_reasoning, clear_thinking, and the passive supports_preserve_reasoning; it is a bit messy. I don't have a clear idea of how to handle this yet, but I think we should (a) detect whether the model supports reasoning, (b) enable reasoning by default if it does, (c) pass reasoning traces if the template supports it, and (d) accept explicit overrides. I'm not sure whether the explicit overrides should be handled as flags or simply allowed through template_kwargs.

Comment thread common/chat.cpp
#include "log.h"
#include "regex-partial.h"

// #include <minja/chat-template.hpp>
Member


Should just remove those at this point, we're not going back to Minja.

Comment thread common/chat.cpp Outdated
}
// std::vector<common_chat_msg> common_chat_msgs_parse_oaicompat(const std::string & messages) {
// return common_chat_msgs_parse_oaicompat(json::parse(messages));
// }
Member


Likewise, I'd just remove this. The code files are littered with comments like this that are left and then never removed.

Comment thread common/chat.cpp Outdated
}
// std::vector<common_chat_tool> common_chat_tools_parse_oaicompat(const std::string & tools) {
// return common_chat_tools_parse_oaicompat(json::parse(tools));
// }
Member


Ditto.

Comment thread common/chat.h
// TODO @ngxson : no known chat templates support reasoning_content in content parts yet
// this can be useful for models with interleaved thinking (like Kimi-K2)
// if you see any templates explicitly support this, please ping me
// std::string reasoning_content;
Member


I guess you could argue that GPT-OSS does, but don't know if anyone properly supports that.

Comment thread common/jinja/caps.cpp
Comment on lines +251 to +255
{
{"role", "assistant"},
{"content", "Assistant message"},
{"reasoning_content", "Reasoning content"}
},
Contributor


Might need a couple more capability checks: for thinking at the message level and for "type": "thinking" in content parts, used by gpt-oss and Ministral 3 respectively.

The current logic for these models transforms reasoning_content to their expected field at init.
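The kind of capability probe discussed in this thread can be sketched as below. The helper name is hypothetical; in the real code (common/jinja/caps.cpp) the probe renders a small conversation whose assistant message carries reasoning_content, then checks whether it survives into the formatted prompt. Here `render` stands in for applying the chat template.

```cpp
#include <functional>
#include <string>

// hypothetical probe: render a conversation whose assistant message
// carries a known reasoning_content marker, and report whether the
// marker appears in the rendered prompt
bool detect_preserve_reasoning(const std::function<std::string(const std::string &)> & render) {
    const std::string marker = "Reasoning content";
    return render(marker).find(marker) != std::string::npos;
}
```

A template that emits the reasoning (e.g. wrapped in `<think>…</think>`) makes the probe return true; a template that silently drops the field makes it return false.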

Contributor Author


for gpt-oss, it seems like reasoning is only allowed to be added if add_generation_prompt = false, so it's not usable in the llama.cpp use case, I think:

{%- elif loop.last and not add_generation_prompt %}
    {#- Only render the CoT if the final turn is an assistant turn and add_generation_prompt is false #}
    {#- This is a situation that should only occur in training, never in inference. #}
    {%- if "thinking" in message %}
        {{- "<|start|>assistant<|channel|>analysis<|message|>" + message.thinking + "<|end|>" }}
    {%- endif %}

Contributor


Line 293:

            {%- elif message.thinking and not future_final_message.found %}
                {{- "<|start|>assistant<|channel|>analysis<|message|>" + message.thinking + "<|end|>" }}
            {%- endif %}

@ngxson
Contributor Author

ngxson commented Jan 21, 2026

Just as a general notion: I am not a fan of splitting reasoning handling into "enable_reasoning", "clear_thinking" and the passive "supports_preserve_reasoning"

@pwilkin I'm not splitting them; they are indeed different notions:

  • enable_reasoning: I think you mean enable_thinking. This flag adds a trailing </think> to the formatted chat; it does not overlap with supports_preserve_reasoning (one is user-controlled, the other is read-only). For example, I can enable thinking for earlier messages in the conversation, then for the next message put back the reasoning_content while disabling enable_thinking, which forces the model to read the reasoning from the earlier message.
  • supports_preserve_reasoning: as explained above. This is NOT a flag you can enable or disable; it simply indicates whether putting reasoning_content back into the history is accepted by the template.
  • clear_thinking: not a llama.cpp notion; it is mentioned here only because the GLM 4.7 template has it. Other models may use other names for this.

(a) detect whether model supports reasoning (b) enable reasoning by default if it does (c) pass reasoning traces if the template supports it (d) accept explicit overrides

  • (a) Hmm, could you point me to the code where we detect whether a model supports reasoning?
  • (b) Don't we already enable reasoning by default if the model supports it?
  • (c) Do you mean parsing reasoning traces (enable_thinking) or preserving reasoning traces in the history (supports_preserve_reasoning)?
  • (d) I think that's what this PR is made to do.

Edit: I think this PR already provides the four points (a, b, c, d) that you brought up.

@pwilkin
Member

pwilkin commented Jan 21, 2026

@ngxson yeah, you're right. I was somehow confused that we're already passing the reasoning_content to the template.

@aldehir
Contributor

aldehir commented Jan 21, 2026

The API supports it, but the WebUI does not. I assume this is setting up the foundation to add first-class support in the WebUI.

By support, I mean it'll pass the reasoning in the message objects fed to the template.

@pwilkin pwilkin merged commit 51fa458 into ggml-org:master Jan 22, 2026
78 checks passed
ronaldmannak pushed a commit to PicoMLX/llama.cpp that referenced this pull request Jan 24, 2026
…gml-org#18994)

* support reasoning_content input

* report template caps to webui

* add docs

* rm commented code
shaofeiqi pushed a commit to qualcomm/llama.cpp that referenced this pull request Feb 6, 2026
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026
… and new jinja template engine (ggml-org#1369)

---------

Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>

common : add nemotron 3 parsing (ggml-org#18077)

common : add parser for ministral/mistral large 3/devstral 2 (ggml-org#17713)

common : default content to an empty string (ggml-org#18485)

chat: make tool description and parameters optional per OpenAI spec (ggml-org#18478)

Per the OpenAI API specification, both 'description' and 'parameters'
fields in tool function definitions are optional. Previously, the parser
would throw an exception if these fields were missing.

Attempts to fix ggml-org#17667

common : implement new jinja template engine (ggml-org#18462)
---------

Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

jinja: correct member access rule (ggml-org#18905)

jinja : fix lexing of float literals with sign (ggml-org#18901)

jinja : add missing tojson filter for bool (ggml-org#18900)

jinja : attribute support for join, map and sort (ggml-org#18883)

jinja : fix object item order (and properly implement dictsort) (ggml-org#18904)

tests : add test-jinja -py option for cross-checking (ggml-org#18906)

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

ci : run test-jinja -py on high perf [no ci] (ggml-org#18916)

jinja : fix undefined keys and attributes and int/float as bool (ggml-org#18924)

jinja: support none|string (ggml-org#18995)

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

jinja : implement mixed type object keys (ggml-org#18955)

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

jinja : undefined should be treated as sequence/iterable (return string/array) by filters/tests (ggml-org#19147)

`tojson` is not a supported `undefined` filter

keep it DRY and fix some types

jinja : do not pass empty tools and add some none filters (ggml-org#19176)

jinja : add unordered_map include to value.h [no ci] (ggml-org#19205)

jinja : add missing 'in' test to template engine (ggml-org#19004) (ggml-org#19239)

The jinja template parser was missing the 'in' test from
global_builtins(), causing templates using reject("in", ...),
select("in", ...), or 'x is in(y)' to fail with
"selectattr: unknown test 'in'".

This broke tool-calling for Qwen3-Coder and any other model
whose chat template uses the 'in' test.

Added test_is_in supporting array, string, and object containment
checks, mirroring the existing 'in' operator logic in runtime.cpp.

Includes test cases for all three containment types plus
reject/select filter usage.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Sid Mohan <sidmohan0@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

Add Jinja support for "indent" string filter (ggml-org#19529)

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

add vendor

refactor chat

server : support preserving reasoning_content in assistant message (ggml-org#18994)

chat : fix translategemma crash on common_chat_format_example (ggml-org#19019)

chat: fix language input for translategemma (ggml-org#19052)

Co-authored-by: Aldehir Rojas <hello@alde.dev>

---------

Co-authored-by: Aldehir Rojas <hello@alde.dev>

chat: fix case where template accepts type content only (ggml-org#19419)

mtmd : chat : Fix extra \n between text and media marker (ggml-org#19595)

Thanks to @tugot17 for detecting and reporting the issue.

For vision models (e.g. LFM2.5-VL-1.6B and Qwen/Qwen3-VL-4B-Instruct) `llama-mtmd-cli` produces identical output to HF implementation.

However `llama-server` doesn't. I traced it down to extra newline
inserted after `<__media__>`.

This happens in `to_json_oaicompat`, that treats media markers as text
and joins all parts with `\n` separator.

PR introduces new type `media_marker` and uses it for media markers.
Extra logic is added to prevent insertion of newlines before and after
media markers.

With this change number of input tokens is identical to HF
implementation and as a result the output is also identical.

I explored other ways to address the issue
* remove completely `\n` between text parts in `to_json_oaicompat`
* merge text messages in server-common.cpp before sending them to `to_json_oaicompat`

Please propose alternative ways of fixing this issue.

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>

common : merge qwen3-coder and nemotron nano 3 parsers (ggml-org#19765)

common : fix improper trimming in XML parser on complete message (ggml-org#19805)

Co-authored-by: Jules LEIDELINGER <11395311+julio75012@users.noreply.github.com>

jinja: correct stats for tojson and string filters (ggml-org#19785)

jinja : correct default size for string slices (ggml-org#19913)

common : handle unicode during partial json parsing (ggml-org#16526)

common : fix json schema with '\' in literals (ggml-org#17307)

add back qwen_coder_xml and mirothinker

Co-authored-by: Aldehir Rojas <hello@alde.dev>

Labels

examples, jinja parser (Issues related to the jinja parser), server, testing (Everything test related)


3 participants