$ ./llama.cpp/build/bin/llama-server \
--model /home/alvis/Workspace/MachineLearning/Data/LinkedData/Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
--alias "Qwen3-Coder-Next" \
--ctx-size 16384 \
--seed 3407 \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--top-k 40 \
--host 0.0.0.0 \
--port 8001 \
--jinja
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 8233 (c5a778891) with GNU 11.4.0 for Linux x86_64
system info: n_threads = 6, n_threads_batch = 6, total_threads = 20
system_info: n_threads = 6 (n_threads_batch = 6) / 20 | CUDA : ARCHS = 750,800,860,890,1200,1210 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
init: using 19 threads for HTTP server
start: binding port with default address family
main: loading model
...
main: server is listening on http://0.0.0.0:8001
main: starting the main loop...
# Then triggered the API request:
$ curl -s http://127.0.0.1:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3-Coder-Next",
"messages": [{"role": "user", "content": "What is 1+1? Use the add tool."}],
"tools": [{"type": "function", "function": {"name": "add", "description": "Add two numbers", "parameters": {"type": "object", "properties": {"a": {"type": "string"}, "b": {"type": "string"}}, "required": ["a", "b"]}}}],
"tool_choice": "auto"
}' | python3 -m json.tool
# Server output:
srv update_slots: all slots are idle
srv params_from_: Chat format: peg-native
slot get_availabl: id 3 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 3 | task 0 | processing task, is_child = 0
slot update_slots: id 3 | task 0 | new prompt, n_ctx_slot = 16384, n_keep = 0, task.n_tokens = 288
slot init_sampler: id 3 | task 0 | init sampler, took 0.04 ms, tokens: text = 288, total = 288
slot update_slots: id 3 | task 0 | prompt processing done, n_tokens = 288, batch.n_tokens = 288
slot release: id 3 | task 0 | stop processing: n_tokens = 311, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
# Received JSON response violating the string arguments spec:
{
"choices": [
{
"finish_reason": "tool_calls",
"index": 0,
"message": {
"role": "assistant",
"content": "",
"tool_calls": [
{
"type": "function",
"function": {
"name": "add",
"arguments": {
"a": "1",
"b": "1"
}
},
"id": "VR8m59fStegbYHZWeoJlj4nI0j9hhTXt"
}
]
}
}
],
...
}
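To check the violation programmatically without the SDK in the way, a small script (mine, not part of the original report) can assert the OpenAI contract directly on the raw HTTP response; it assumes the server and tool schema from the transcript above:

# spec_check.py - sketch; asserts the contract on the raw response,
# bypassing the openai SDK's Pydantic models entirely.
import json

import requests

payload = {
    "model": "Qwen3-Coder-Next",
    "messages": [{"role": "user", "content": "What is 1+1? Use the add tool."}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "add",
            "description": "Add two numbers",
            "parameters": {
                "type": "object",
                "properties": {"a": {"type": "string"}, "b": {"type": "string"}},
                "required": ["a", "b"],
            },
        },
    }],
    "tool_choice": "auto",
}

r = requests.post("http://127.0.0.1:8001/v1/chat/completions", json=payload)
args = r.json()["choices"][0]["message"]["tool_calls"][0]["function"]["arguments"]

# Per the OpenAI API reference, arguments must be a JSON-encoded string.
assert isinstance(args, str), f"spec violation: arguments is a {type(args).__name__}"
json.loads(args)  # and that string must round-trip as JSON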
Name and Version
build: 8233 (c5a7788) with GNU 11.4.0 for Linux x86_64
Operating systems
Linux
GGML backends
CUDA
Hardware
NVIDIA GeForce RTX 5090
Models
Qwen3-Coder-Next-UD-Q4_K_XL.gguf
https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF
Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf
https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF
Problem description & steps to reproduce
Dear mods, I am trying to run a quantized model in llama.cpp following the instructions. However, since the recent Autoparser refactoring PR (#18675), llama-server returns the arguments field in tool_calls as a parsed JSON object rather than a JSON string. This breaks strict OpenAI API compatibility: according to the OpenAI API Reference, tool_calls[].function.arguments must be a string containing JSON, not a parsed object. Because of this change, the official openai Python SDK (which uses Pydantic for strict type checking) crashes with a TypeError when attempting to process tool calls.
What I tried
Neither --jinja nor --chat-template chatml restored OpenAI API compatibility, so the *claw frameworks (openclaw, nanoclaw, zeroclaw, ironclaw) fail to work as expected.
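Until this is fixed upstream, the only mitigation I can suggest is a client-side shim. The sketch below is my code (not llama.cpp's), assuming you can intercept the raw response dict before a strict client validates it; it re-serializes any object-valued arguments back into a string:

import json

def normalize_tool_calls(response: dict) -> dict:
    """Re-serialize object-valued tool-call arguments into JSON strings."""
    for choice in response.get("choices", []):
        for tc in choice.get("message", {}).get("tool_calls") or []:
            args = tc.get("function", {}).get("arguments")
            if isinstance(args, (dict, list)):
                # The OpenAI spec requires a string; dump the parsed object back.
                tc["function"]["arguments"] = json.dumps(args)
    return response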
To Reproduce
1. Start llama-server with a tool-capable model (e.g., Qwen3-Coder-Next) and the Jinja template enabled (see the full command at the top of this report).
2. Send a curl request to the /v1/chat/completions endpoint with tools provided (see the curl invocation above).
3. Observe the tool_calls block in the raw JSON response. It shows

       "arguments": { "a": "1", "b": "1" }

   instead of the expected OpenAI-compatible format:

       "arguments": "{\"a\": \"1\", \"b\": \"1\"}"

4. If you run the official openai Python SDK (v2.21.0), it immediately crashes upon receiving the tool call response; a sketch of this is shown below.
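The script below is my sketch of that last step rather than part of the original report; it assumes the server from the transcript above is still listening on 127.0.0.1:8001 and that any placeholder API key is accepted:

# sdk_repro.py - sketch, not from the original transcript
import json

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8001/v1", api_key="sk-placeholder")

resp = client.chat.completions.create(
    model="Qwen3-Coder-Next",
    messages=[{"role": "user", "content": "What is 1+1? Use the add tool."}],
    tools=[{
        "type": "function",
        "function": {
            "name": "add",
            "description": "Add two numbers",
            "parameters": {
                "type": "object",
                "properties": {"a": {"type": "string"}, "b": {"type": "string"}},
                "required": ["a", "b"],
            },
        },
    }],
    tool_choice="auto",
)

tool_call = resp.choices[0].message.tool_calls[0]
# The SDK types function.arguments as str. Depending on the SDK version, the
# object returned by build 8233 either fails Pydantic validation above or
# blows up here with:
# TypeError: the JSON object must be str, bytes or bytearray, not dict
print(json.loads(tool_call.function.arguments))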
Root Cause
I traced this back to the massive parser refactoring in PR #18675 (commit 566059a26b0ce8faec4ea053605719d399c64cc5). In common/chat.cpp around line 132, the arguments field is explicitly parsed into a JSON object:

    {"type", "function"},
    {"function", {
        {"name", tool_call.name},
        {"arguments", json::parse(tool_call.arguments)}, // <-- This causes the issue
    }},

It should output the raw serialized JSON string instead of parsing it.
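To make the contrast concrete, here is a tiny Python illustration (mine, not llama.cpp code) of why only the string form survives a client's round-trip:

import json

raw = '{"a": "1", "b": "1"}'                 # text as emitted by the model
compliant = {"arguments": raw}               # string, per the OpenAI spec
current   = {"arguments": json.loads(raw)}   # object, as build 8233 returns

json.loads(compliant["arguments"])  # fine
json.loads(current["arguments"])    # TypeError: ... not dict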
First Bad Commit
566059a
(From PR #18675: Autoparser - complete refactoring of parser architecture)
Relevant log output