common : merge qwen3-coder and nemotron nano 3 parsers #19765

Merged

pwilkin merged 2 commits into ggml-org:master from aldehir:migrate-qwen3-coder on Feb 20, 2026

Conversation

@aldehir (Contributor) commented Feb 20, 2026

Users are experiencing several issues with Qwen3-Coder-Next. Until #18675 is merged, this PR serves as a stop-gap: it replaces the existing Qwen3-Coder parsing with the Nemotron Nano 3 PEG parsing variant already present in the codebase.

This PR also adds parallel tool calling and fixes JSON schema support.

fixes #19382
fixes #19430
fixes #19304
supersedes #19503 and #19753

@pwilkin (Member) left a comment

Can you please add some tests on schema parameters, especially an array (such as a "todolist" tool)?
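A hypothetical sketch of the kind of test being requested here, in Python rather than the project's C++ test suite: a "todolist" tool whose single parameter is an array, rendered in the Qwen3-Coder XML-style tool-call format. The tool name, parameter name, and payload are invented for illustration; the toy regex parser below only mimics what the real PEG parser must handle.

```python
import json
import re

# Invented example output in the Qwen3-Coder tool-call format, with an
# array-valued parameter (the case pwilkin asks to cover with tests).
raw = """<tool_call>
<function=todo_write>
<parameter=todos>
[{"task": "write tests", "done": false}, {"task": "merge PR", "done": true}]
</parameter>
</function>
</tool_call>"""

def parse_tool_call(text: str) -> dict:
    """Toy parser: extract the function name and each parameter block."""
    name = re.search(r"<function=([^>]+)>", text).group(1)
    params = {}
    for key, value in re.findall(r"<parameter=([^>]+)>\n(.*?)\n</parameter>",
                                 text, re.S):
        params[key] = value
    return {"name": name, "parameters": params}

call = parse_tool_call(raw)
# The array parameter must survive as valid JSON after extraction.
todos = json.loads(call["parameters"]["todos"])
assert call["name"] == "todo_write"
assert todos[0]["task"] == "write tests"
```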

@pwilkin (Member) commented Feb 20, 2026

I'll try to test it today with OpenCode to see how it works.

@bfroemel

Can we also add the fix for #19513 ?

@pwilkin, I believe it was just a matter of removing the trim_trailing_space() filters from reasoning (L30) and content (L34) here?

https://github.com/aldehir/llama.cpp/blob/e811d2263fc10d1b1d9ed036708d6c9b6f798a7f/common/chat-peg-parser.cpp#L25-L36

@aldehir (Contributor, Author) commented Feb 20, 2026

Can we also add the fix for #19513 ?

As I understand it, the model generates the following tokens:

llama.cpp > curl http://localhost:8080/tokenize -d '{"content": "call:\n\n<tool_call>", "parse_special": true, "with_pieces": true}'
{"tokens":[{"id":6659,"piece":"call"},{"id":1447,"piece":":\n\n"},{"id":151657,"piece":"<tool_call>"}]}

But when we parse it, we strip \n (the template always adds a \n before <tool_call>), and we feed it back as:

llama.cpp > curl http://localhost:8080/tokenize -d '{"content": "call:\n<tool_call>", "parse_special": true, "with_pieces": true}'
{"tokens":[{"id":6659,"piece":"call"},{"id":510,"piece":":\n"},{"id":151657,"piece":"<tool_call>"}]}

Which causes issues. This makes sense to me; it's a bit annoying because ultimately it depends on the model and template. In this case, we need to strip exactly one \n, because the chat template adds it back.
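A minimal string-level sketch of the round trip described above (no real tokenizer involved; the content string is taken from the /tokenize examples): the model emits content ending in ":\n\n" followed by <tool_call>, the parser strips all trailing newlines from the content, and the chat template later re-inserts only a single "\n" before <tool_call>.

```python
# What the model actually generated.
generated = "call:\n\n<tool_call>"

# Parser side: split off the tool-call marker, strip trailing newlines.
content = generated.removesuffix("<tool_call>")  # "call:\n\n"
stripped = content.rstrip("\n")                  # "call:" -- both newlines gone

# Template side: re-add a single "\n" before the tool call.
rendered = stripped + "\n" + "<tool_call>"

# The round-tripped text no longer matches what the model emitted.
assert rendered == "call:\n<tool_call>"
assert rendered != generated
```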

@pwilkin (Member) commented Feb 20, 2026

@aldehir Oh, this is a fun one :)

This causes issues because in the tokenizer, :\n\n is a token, separate from ":". And apparently that's a trick used to train the model's tool calling: if the previous token was :\n\n, the next token is more likely to be the tool call start.

When you pass this back as ":\n", it is no longer the same token. If enough of those bad historical calls accumulate, the model eventually gets confused and stops making tool calls, because it was never trained to start a tool call after that token.
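The token-boundary effect described above can be sketched with a toy longest-prefix-match tokenizer whose vocabulary, like the Qwen tokenizer, contains ":\n\n" as a single piece. The vocabulary here is invented and tiny; the real token IDs appear in the /tokenize output earlier in the thread.

```python
# Toy vocabulary: ":\n\n" is its own token, distinct from ":\n" and ":".
vocab = ["call", ":\n\n", ":\n", ":", "\n", "<tool_call>"]

def tokenize(text: str) -> list[str]:
    """Greedy longest-prefix-match tokenization over the toy vocabulary."""
    tokens = []
    while text:
        piece = max((v for v in vocab if text.startswith(v)), key=len)
        tokens.append(piece)
        text = text[len(piece):]
    return tokens

# The model's original output tokenizes with the trained ":\n\n" token...
assert tokenize("call:\n\n<tool_call>") == ["call", ":\n\n", "<tool_call>"]
# ...but after stripping one newline, a *different* token appears instead.
assert tokenize("call:\n<tool_call>") == ["call", ":\n", "<tool_call>"]
```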

@aldehir (Contributor, Author) commented Feb 20, 2026

This causes issues because in the tokenizer, :\n\n is a token. A separate one from ":". And apparently that's a trick used to train the model toolcalling - if the previous token was the ":\n\n", the next token is more likely to be the tool call start.

Upon further inspection, I am struggling to see the issue.

The chat template trims the message content and surrounds it with newlines:

        {%- if message.content is defined and message.content is string and message.content | trim | length > 0 %}
            {{- '\n' + message.content | trim + '\n' }}
        {%- endif %}

https://huggingface.co/Qwen/Qwen3-Coder-Next/blob/main/chat_template.jinja

Then, it prepends a \n to tool calls:

            {{- '\n<tool_call>\n<function=' + tool_call.name + '>\n' }}

Assuming the end of the content is :\n\n, even if we stripped trailing whitespace, the template will add back the \n\n. Due to the trimming in the template, it actually doesn't matter if we strip or not.
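The trimming argument above can be simulated in a few lines of Python mirroring the quoted jinja logic (a simplified sketch, not the actual template): since the template trims the content and re-adds the surrounding newlines itself, the rendered prompt is the same whether or not the parser stripped trailing whitespace.

```python
def render(content: str, has_tool_call: bool) -> str:
    """Simplified mirror of the quoted jinja: trim content, wrap in \n,
    then prepend \n to the tool call block."""
    out = ""
    if content.strip():
        out += "\n" + content.strip() + "\n"
    if has_tool_call:
        out += "\n<tool_call>\n"
    return out

# Content with trailing "\n\n" intact vs. stripped: identical rendering.
assert render("call:\n\n", True) == render("call:", True)
assert render("call:", True) == "\ncall:\n\n<tool_call>\n"
```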

@bfroemel

@aldehir Are you sure that the template adds the newline again?

I believe the tool call preamble (as it occurs with codex use, potentially related to the Responses API) is rendered as a "normal" assistant message:

        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>\n' }}

Anyway, I can confirm that it is currently an issue with this PR; @pwilkin's fix (#18675 (comment)) led to the apparently expected format and the issue was gone.

@bfroemel

oh, remembering - I fell for the same "trap" :) #18675 (comment) (marked it as off-topic, contains the actually rendered prompt missing the required newlines)

@aldehir (Contributor, Author) commented Feb 20, 2026

oh, remembering - I fell for the same "trap" :) #18675 (comment) (marked it as off-topic, contains the actually rendered prompt missing the required newlines)

I see. It's a bit weird that the preamble would be its own message. This means the client needs to perform another request to trigger a tool call, which doesn't quite align with the idea that retaining :\n\n yields uninterrupted sessions. Unless the Responses API is incorrectly formatting the input to the template.

@aldehir (Contributor, Author) commented Feb 20, 2026

Sure enough, the Responses API is the culprit:
https://github.com/ggml-org/llama.cpp/blob/b908baf1825b1a89afef87b09e22c32af2ca6548/tools/server/server-common.cpp#L1217C1-L1241C50

Rather than generate a single message with content + tool_call, it generates two messages. I think this needs to be fixed instead, because feeding two separate messages will undoubtedly differ from how the model was trained.

@bfroemel commented Feb 20, 2026

Here, just for reference, is a snippet from a Responses API request as generated by codex (this example is missing the two newlines at the end of the tool preamble message). Input array elements:

    {
      "type": "message",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "Let me search for the main agent loop or turn handling:"
        }
      ]
    },
    {
      "type": "function_call",
      "name": "exec_command",
      "arguments": "{\"cmd\":\"rg --files /home/b/work/codex-new/codex-rs/core/src | xargs -I{} rg -l \\\"turn\\\\|loop\\\" {} 2>/dev/null | head -20\",\"justification\":\"Find files related to turn/loop\"}",
      "call_id": "fc_o1wk8CXzBhlun0u8esiYzJjZuz58M5Iz"
    },
    {
      "type": "function_call_output",
      "call_id": "fc_o1wk8CXzBhlun0u8esiYzJjZuz58M5Iz",
      "output": "Chunk ID: 7054eb\nWall time: 0.4461 seconds\nProcess exited with code 0\nOriginal token count: 0\nOutput:\n"
    },
    {
      "type": "function_call",
      "name": "exec_command",
      "arguments": "{\"cmd\":\"find /home/b/work/codex-new/codex-rs/core -type f -name \\\"*.rs\\\" | head -50\",\"justification\":\"List core module files\"}",
      "call_id": "fc_VP2ElL5yP9Y2knTm3gTh79VGOmn8s39e"
    },

I'm not seeing a case where content + tool call is currently possible with https://github.com/openingnow/llama.cpp/blob/292f6908cdc6abb5c38581e34fa141973e5aba82/tools/server/server-common.cpp#L1072

I believe that in the Responses API there is no way for a type "message" element to also contain "function_call" elements. https://developers.openai.com/api/reference/resources/responses#(resource)%20responses%20%3E%20(model)%20response_input_item%20%3E%20(schema)

@aldehir (Contributor, Author) commented Feb 20, 2026

Yes, I believe this is called a "leaky abstraction" 😊.

We should add a post-processing step in the conversion that merges consecutive assistant messages. That seems like the most correct solution.
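A minimal sketch of the proposed post-processing step, assuming chat-completions-style message dicts (field names and the example messages are illustrative; this is not the actual server-common.cpp change): fold consecutive assistant messages into one message carrying both content and tool_calls.

```python
def merge_assistant_messages(messages: list[dict]) -> list[dict]:
    """Merge runs of consecutive assistant messages into a single message."""
    merged = []
    for msg in messages:
        if (merged and msg["role"] == "assistant"
                and merged[-1]["role"] == "assistant"):
            prev = merged[-1]
            # Concatenate non-empty content onto the previous message.
            if msg.get("content"):
                prev["content"] = (prev.get("content") or "") + msg["content"]
            # Accumulate tool calls on the previous message.
            prev.setdefault("tool_calls", []).extend(msg.get("tool_calls", []))
        else:
            merged.append(dict(msg))
    return merged

# Two assistant messages (preamble, then tool call) collapse into one.
msgs = [
    {"role": "assistant", "content": "Let me search for the main loop:"},
    {"role": "assistant", "content": "", "tool_calls": [{"name": "exec_command"}]},
    {"role": "tool", "content": "..."},
]
out = merge_assistant_messages(msgs)
assert len(out) == 2
assert out[0]["content"] == "Let me search for the main loop:"
assert out[0]["tool_calls"] == [{"name": "exec_command"}]
```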

@pwilkin (Member) commented Feb 20, 2026

Aight, merging this one and we'll do a follow-up on the Responses problem.

@pwilkin pwilkin merged commit 94b0200 into ggml-org:master Feb 20, 2026
78 checks passed
@aldehir (Contributor, Author) commented Feb 20, 2026

@bfroemel expect to see a PR for the Responses API here shortly. If you don't mind, I'd like to enlist your services to help test it out.

@bfroemel commented Feb 20, 2026

Hmm, so the model generates content output + tool call(s), everything in a single generation, i.e. a single response. codex (as a Responses API client) processes that response and builds the conversation in the input array for the next request along with the tool call responses. There it ends up as several elements (output_text message, tool calls, tool responses).

IMO that's fine and wouldn't immediately require post-processing (trying to fold multiple Responses input elements into single chat-completions elements), as long as it ends up as the expected prompt for the model (I looked at the rendered prompt; it's IMO indistinguishable). But of course I'm fine with any other solution that might be cleaner or more maintainable, especially if @pwilkin's fix could be problematic for other models.

/edit: adding a rendered prompt snippet (with the required newlines)

<|im_start|>assistant
Let me look at the parse_turn_item function:

<|im_end|>
<|im_start|>assistant
<tool_call>
<function=exec_command>
<parameter=cmd>
grep -n "fn parse_turn_item" /home/b/work/codex-new/codex-rs/core/src/*.rs
</parameter>
<parameter=justification>
Find parse_turn_item function
</parameter>
</function>
</tool_call><|im_end|>
<|im_start|>user
<tool_response>
Chunk ID: d74fd7
Wall time: 0.0510 seconds
Process exited with code 0
Original token count: 32
Output:
/home/b/work/codex-new/codex-rs/core/src/event_mapping.rs:96:pub fn parse_turn_item(item: &ResponseItem) -> Option<TurnItem> {

</tool_response><|im_end|>
<|im_start|>assistant
<tool_call>
<function=exec_command>
<parameter=cmd>
sed -n '96,200p' /home/b/work/codex-new/codex-rs/core/src/event_mapping.rs
</parameter>
<parameter=justification>
Read parse_turn_item function
</parameter>
</function>
</tool_call><|im_end|>

@aldehir (Contributor, Author) commented Feb 20, 2026

This is what I expect:

<|im_start|>assistant
Let me look at the parse_turn_item function:

<tool_call>
<function=exec_command>
<parameter=cmd>

The additional <|im_end|>...<|im_start|> will likely cause the model to learn in-context to not generate a tool call in a single generation.

ikawrakow pushed a commit to ikawrakow/ik_llama.cpp that referenced this pull request Feb 23, 2026
* Fix tool call for Qwen3.5

Loosely based on mainline changes from:
* ggml-org/llama.cpp#19635
* ggml-org/llama.cpp#19765

Also need to change the grammar to allow the model to make multiple
tool calls in a row. This was likely broken for Qwen3 Coder prior to
this commit.

* Fix the grammar for the subsequent parameters after the first one
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
* common : migrate qwen3-coder to PEG parsing variant

* cont : add JSON parameter test
abc-nix pushed a commit to abc-nix/ik_llama.cpp that referenced this pull request Feb 26, 2026
* Fix tool call for Qwen3.5
abc-nix pushed a commit to abc-nix/ik_llama.cpp that referenced this pull request Feb 26, 2026

Fix tool call for Qwen3.5 (ikawrakow#1300)

* Fix tool call for Qwen3.5

Loosely based on mainline changes from:
* ggml-org/llama.cpp#19635
* ggml-org/llama.cpp#19765

Also need to change the grammar to allow the model to make multiple
tool calls in a row. This was likely broken for Qwen3 Coder prior to
this commit.

* Fix the grammar for the subsequent parameters after the first one
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 2, 2026
* common : migrate qwen3-coder to PEG parsing variant

* cont : add JSON parameter test
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Mar 3, 2026
* common : migrate qwen3-coder to PEG parsing variant

* cont : add JSON parameter test
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
* common : migrate qwen3-coder to PEG parsing variant

* cont : add JSON parameter test
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026
… and new jinja template engine (ggml-org#1369)

common : merge qwen3-coder and nemotron nano 3 parsers (ggml-org#19765)

Co-authored-by: Aldehir Rojas <hello@alde.dev>
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
* common : migrate qwen3-coder to PEG parsing variant

* cont : add JSON parameter test

Labels

testing Everything test related

Projects

None yet

3 participants