Eval bug: llama-server tool_calls returns arguments as JSON object instead of string, breaking OpenAI compatibility #20198

@alvis233

Name and Version

build: 8233 (c5a7788) with GNU 11.4.0 for Linux x86_64

Operating systems

Linux

GGML backends

CUDA

Hardware

NVIDIA GeForce RTX 5090

Models

Problem description & steps to reproduce

Dear mods, I am running a quantized model in llama.cpp by following the instructions. Since the recent Autoparser refactoring PR (#18675), llama-server returns the arguments field in tool_calls as a parsed JSON object rather than a JSON string.

This breaks strict OpenAI API compatibility. According to the OpenAI API Reference, tool_calls[].function.arguments must be a string containing JSON, not a parsed object. Because of this change, the official openai Python SDK (which uses Pydantic for strict type validation) crashes with a TypeError when it processes tool calls.

What I tried
Neither --jinja nor --chat-template chatml makes the response OpenAI-compatible, so the *claw frameworks (openclaw, nanoclaw, zeroclaw, ironclaw) fail to work as expected.

To Reproduce

  1. Start llama-server with a tool-capable model (e.g., Qwen3-Coder-Next) and Jinja template enabled:

    ./llama-server --model Qwen3-Coder-Next-UD-Q4_K_XL.gguf --jinja --port 8001
  2. Send a curl request to the /v1/chat/completions endpoint with tools provided:

    curl -s http://127.0.0.1:8001/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "Qwen3-Coder-Next",
        "messages": [{"role": "user", "content": "What is 1+1? Use the add tool."}],
        "tools": [{"type": "function", "function": {"name": "add", "description": "Add two numbers", "parameters": {"type": "object", "properties": {"a": {"type": "string"}, "b": {"type": "string"}}, "required": ["a", "b"]}}}],
        "tool_choice": "auto"
      }'
  3. Observe the tool_calls block in the raw JSON response. It shows:

    "tool_calls": [
      {
        "type": "function",
        "function": {
          "name": "add",
          "arguments": {
            "a": "1",
            "b": "1"
          }
        },
        "id": "..."
      }
    ]

    Instead of the expected OpenAI-compatible format:

    "arguments": "{\"a\": \"1\", \"b\": \"1\"}"
  4. If you run the official openai Python SDK (v2.21.0), it immediately crashes upon receiving the tool call response:

    TypeError: the JSON object must be str, bytes or bytearray, not dict
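
The error in step 4 can be reproduced without the SDK or a running server. This is a minimal sketch, assuming the client decodes arguments with json.loads as OpenAI-style clients commonly do; decode_arguments is a hypothetical helper written for this illustration, not part of the openai package.

```python
import json


def decode_arguments(tool_call: dict) -> dict:
    """Decode function.arguments as a client expecting a JSON string would.

    Hypothetical helper for illustration only.
    """
    return json.loads(tool_call["function"]["arguments"])


# Shape described by the OpenAI API reference: arguments is a JSON string.
compliant = {
    "type": "function",
    "function": {"name": "add", "arguments": "{\"a\": \"1\", \"b\": \"1\"}"},
}

# Shape observed from llama-server in this report: arguments is an object.
observed = {
    "type": "function",
    "function": {"name": "add", "arguments": {"a": "1", "b": "1"}},
}

print(decode_arguments(compliant))

try:
    decode_arguments(observed)
except TypeError as exc:
    # json.loads rejects a dict, which matches the message in step 4.
    print(f"TypeError: {exc}")
```

The first call succeeds because the string holds valid JSON; the second fails before any schema validation happens, since json.loads only accepts str, bytes or bytearray.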

Root Cause
I traced this back to the massive parser refactoring in PR #18675 (commit 566059a26b0ce8faec4ea053605719d399c64cc5).
In common/chat.cpp around line 132, the arguments field is explicitly parsed into a JSON object:

{"type", "function"},
{"function", {
    {"name", tool_call.name},
    {"arguments", json::parse(tool_call.arguments)}, // <-- This causes the issue
}},

It should output the raw serialized JSON string instead of parsing it.
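
Until the server is fixed, a client can re-serialize the field before handing the message to strict tooling. The following is a sketch of such a stopgap, not the proposed server-side fix; normalize_tool_calls is a hypothetical name, and the message layout is assumed to match the response shown in this report.

```python
import json
from typing import Any


def normalize_tool_calls(message: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of an assistant message whose tool-call arguments are JSON strings.

    Hypothetical client-side workaround; the input message is not modified.
    """
    fixed = dict(message)
    calls = []
    for call in message.get("tool_calls") or []:
        function = dict(call.get("function", {}))
        # Only touch arguments that are present and not already a string.
        if "arguments" in function and not isinstance(function["arguments"], str):
            function["arguments"] = json.dumps(function["arguments"])
        calls.append({**call, "function": function})
    if calls:
        fixed["tool_calls"] = calls
    return fixed


# Assistant message shaped like the one returned in this report.
message = {
    "role": "assistant",
    "content": "",
    "tool_calls": [
        {
            "type": "function",
            "function": {"name": "add", "arguments": {"a": "1", "b": "1"}},
            "id": "example-id",  # placeholder id for the illustration
        }
    ],
}

normalized = normalize_tool_calls(message)
print(normalized["tool_calls"][0]["function"]["arguments"])
```

Arguments that are already strings pass through unchanged, so the helper stays harmless once the server returns the OpenAI-compatible format again.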

First Bad Commit

566059a
(From PR #18675: Autoparser - complete refactoring of parser architecture)

Relevant log output

$ ./llama.cpp/build/bin/llama-server \
    --model /home/alvis/Workspace/MachineLearning/Data/LinkedData/Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
    --alias "Qwen3-Coder-Next" \
    --ctx-size 16384 \
    --seed 3407 \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.01 \
    --top-k 40 \
    --host 0.0.0.0 \
    --port 8001 \
    --jinja

ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 8233 (c5a778891) with GNU 11.4.0 for Linux x86_64
system info: n_threads = 6, n_threads_batch = 6, total_threads = 20

system_info: n_threads = 6 (n_threads_batch = 6) / 20 | CUDA : ARCHS = 750,800,860,890,1200,1210 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

init: using 19 threads for HTTP server
start: binding port with default address family
main: loading model
...
main: server is listening on http://0.0.0.0:8001
main: starting the main loop...

# Then triggered the API request:
$ curl -s http://127.0.0.1:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-Coder-Next",
    "messages": [{"role": "user", "content": "What is 1+1? Use the add tool."}],
    "tools": [{"type": "function", "function": {"name": "add", "description": "Add two numbers", "parameters": {"type": "object", "properties": {"a": {"type": "string"}, "b": {"type": "string"}}, "required": ["a", "b"]}}}],
    "tool_choice": "auto"
  }' | python3 -m json.tool

# Server output:
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-native
slot get_availabl: id  3 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist 
slot launch_slot_: id  3 | task 0 | processing task, is_child = 0
slot update_slots: id  3 | task 0 | new prompt, n_ctx_slot = 16384, n_keep = 0, task.n_tokens = 288
slot init_sampler: id  3 | task 0 | init sampler, took 0.04 ms, tokens: text = 288, total = 288
slot update_slots: id  3 | task 0 | prompt processing done, n_tokens = 288, batch.n_tokens = 288
slot      release: id  3 | task 0 | stop processing: n_tokens = 311, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200

# Received JSON response violating the string arguments spec:
{
    "choices": [
        {
            "finish_reason": "tool_calls",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "",
                "tool_calls": [
                    {
                        "type": "function",
                        "function": {
                            "name": "add",
                            "arguments": {
                                "a": "1",
                                "b": "1"
                            }
                        },
                        "id": "VR8m59fStegbYHZWeoJlj4nI0j9hhTXt"
                    }
                ]
            }
        }
    ],
    ...
}
