From cb46713e6108d32827e782c213ada11beebf430f Mon Sep 17 00:00:00 2001 From: stevhliu Date: Tue, 21 Apr 2026 10:25:03 -0700 Subject: [PATCH 1/2] docs --- docs/source/en/serve-cli/serving.md | 103 +++++++++++++++++++++++++--- 1 file changed, 92 insertions(+), 11 deletions(-) diff --git a/docs/source/en/serve-cli/serving.md b/docs/source/en/serve-cli/serving.md index a6d3dbcc6238..404ae13f37c0 100644 --- a/docs/source/en/serve-cli/serving.md +++ b/docs/source/en/serve-cli/serving.md @@ -455,7 +455,7 @@ data: {"id":"f47ac10b-58cc-4372-a567-0e02b2c3d479","choices":[{"delta":{"content ### Audio-based completions -Multimodal models like [Gemma 4](https://huggingface.co/google/gemma-4-E2B-it) and [Qwen2.5-Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-3B) accept audio input using the OpenAI `input_audio` content type. The audio must be base64-encoded and the format (`mp3` or `wav`) must be specified. +Multimodal models like [Gemma 4](https://huggingface.co/google/gemma-4-E2B-it) and [Qwen2.5-Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-3B) accept audio input through the OpenAI `input_audio` content type. Base64-encode the audio and specify the format (`mp3` or `wav`). @@ -694,7 +694,7 @@ data: {"id":"cb997e1d-98b9-414a-be89-1880288610ef","choices":[{"delta":{"content > [!WARNING] > The `audio_url` content type is an extension not part of the OpenAI standard and may change in future versions. -As a convenience, audio can also be passed by URL using the `audio_url` content type, avoiding the need for base64 encoding. +You can also pass audio by URL with the `audio_url` content type to skip base64 encoding. ```python completion = client.chat.completions.create( @@ -716,7 +716,7 @@ completion = client.chat.completions.create( > [!WARNING] > The `video_url` content type is an extension not part of the OpenAI standard and may change in future versions. -Video input is supported using the `video_url` content type. If the model supports audio (e.g. Gemma 4, Qwen2.5-Omni), the audio track is automatically extracted from the video and processed alongside the visual frames. +Use the `video_url` content type for video input. If the model supports audio (e.g. Gemma 4, Qwen2.5-Omni), the server extracts the audio track from the video and processes it with the visual frames. > [!TIP] > Video processing requires [torchcodec](https://github.com/pytorch/torchcodec). Install it with `pip install torchcodec`. @@ -933,7 +933,7 @@ data: {"id":"cb997e1d-98b9-414a-be89-1880288610ef","choices":[{"delta":{"content -### Multi-turn conversations +### Multi-turn conversations[[completions]] To have a multi-turn conversation, include the full conversation history in the `messages` list with alternating `user` and `assistant` roles. Like all OpenAI-compatible servers, the API is stateless, so every request must contain the complete conversation history. @@ -953,7 +953,7 @@ completion = client.chat.completions.create( print(completion.choices[0].message.content) ``` -The follow-up question "How many people live there?" relies on the prior context, and the model answers about Paris accordingly. +The follow-up question "How many people live there?" relies on the prior context, so the model answers about Paris. ``` As of 2021, the population of Paris is approximately 2.2 million people. @@ -1385,7 +1385,7 @@ data: {"content_index":0,"delta":"This ","item_id":"msg_a1b2c3d4","output_index" > [!WARNING] > The `audio_url` content type is an extension not part of the OpenAI standard and may change in future versions. 
-As a convenience, audio can also be passed by URL using the `audio_url` content type, avoiding the need for base64 encoding. +You can also pass audio by URL with the `audio_url` content type to skip base64 encoding. ```python response = client.responses.create( @@ -1540,7 +1540,7 @@ data: {"content_index":0,"delta":"Based ","item_id":"msg_b2c3d4e5","output_index -### Multi-turn conversations +### Multi-turn conversations[[responses]] For multi-turn conversations, pass a list of messages with `role` keys in the `input` field. Like all OpenAI-compatible servers, the API is stateless, so every request must contain the complete conversation history. @@ -1562,7 +1562,7 @@ response = client.responses.create( print(response.output[0].content[0].text) ``` -The follow-up question "How many people live there?" relies on the prior context, and the model answers about Paris accordingly. +The follow-up question "How many people live there?" relies on the prior context, so the model answers about Paris. ``` As of 2021, Paris has a population of approximately 2.8 million people. @@ -1653,7 +1653,7 @@ The stream ends with exactly one terminal event, `ready` (success) or `error` (f ## Timeout -`transformers serve` supports different requests by different models. Each model loads on demand and stays in GPU memory. Models unload automatically after 300 seconds of inactivity to free up GPU memory. Set `--model-timeout` to a different value in seconds, or `-1` to disable unloading entirely. +`transformers serve` handles requests for any model. Each model loads on demand and stays in GPU memory. Models unload automatically after 300 seconds of inactivity to free GPU memory. Set `--model-timeout` to a different value in seconds, or `-1` to disable unloading. ```shell transformers serve --model-timeout 400 @@ -1661,7 +1661,7 @@ transformers serve --model-timeout 400 ### Loading examples -See the example responses below for a freshly downloaded model, a model loaded from your local cache (skips the download stage), and a model that already exists in memory. +The examples below show responses for a freshly downloaded model, a model loaded from your local cache (skips the download stage), and a model already in memory. @@ -1703,7 +1703,7 @@ data: {"status": "ready", "model": "org/model@main", "cached": true} The `transformers serve` server supports OpenAI-style function calling. Models trained for tool-use generate structured function calls that your application executes. > [!NOTE] -> Tool calling is currently limited to the Qwen model family. +> Tool calling works with any model whose tokenizer declares tool call tokens. Qwen and Gemma 4 work out of the box. Define tools as a list of function specifications following the OpenAI format. @@ -1765,6 +1765,87 @@ for event in response: print(event) ``` +### Multi-turn tool calling + +After the model returns a tool call, execute the function locally, then send the result back in a follow-up request to get the model's final answer. The pattern differs slightly between the two APIs. + + + + +Pass the tool result as a `role: "tool"` message with the matching `tool_call_id`. 
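+
+The assistant message from the first request carries the fields you echo back: `tool_calls[0].id` becomes the `tool_call_id`, and `function.arguments` holds the call arguments as a JSON string. The snippets below also serialize the tool result with `json.dumps`, so import `json` first. A minimal sketch of what comes back (field values are illustrative, not real server output):
+
+```py
+import json
+
+# Shape of response.choices[0].message.tool_calls[0] in the OpenAI Python client:
+#   tool_call.id                  -> "call_ab12"   (echo this back as `tool_call_id`)
+#   tool_call.function.name       -> "get_weather"
+#   tool_call.function.arguments  -> '{"location": "San Francisco"}'  (json.loads before use)
+```
+
+The full round trip: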
+ +```py +from openai import OpenAI + +client = OpenAI(base_url="http://localhost:8000/v1", api_key="") + +# Model returns a tool call +messages = [{"role": "user", "content": "What's the weather like in San Francisco?"}] +response = client.chat.completions.create( + model="Qwen/Qwen2.5-7B-Instruct", + messages=messages, + tools=tools, +) +assistant_message = response.choices[0].message + +# Execute the tool locally +tool_call = assistant_message.tool_calls[0] +result = {"temperature": 22, "condition": "sunny"} # your actual function call here + +# Send the tool result back +messages.append(assistant_message) +messages.append({ + "role": "tool", + "tool_call_id": tool_call.id, + "content": json.dumps(result), +}) +final_response = client.chat.completions.create( + model="Qwen/Qwen2.5-7B-Instruct", + messages=messages, + tools=tools, +) +print(final_response.choices[0].message.content) +``` + + + + +Pass the tool result as a `function_call_output` item in the `input` list of the follow-up request. + +```py +from openai import OpenAI + +client = OpenAI(base_url="http://localhost:8000/v1", api_key="") + +user_message = {"role": "user", "content": "What's the weather like in San Francisco?"} +response = client.responses.create( + model="Qwen/Qwen2.5-7B-Instruct", + instructions="You are a helpful weather assistant. Use the get_weather tool to answer questions.", + input=[user_message], + tools=tools, + stream=False, +) +tool_call = next(item for item in response.output if item.type == "function_call") + +result = {"temperature": 22, "condition": "sunny"} + +final_response = client.responses.create( + model="Qwen/Qwen2.5-7B-Instruct", + instructions="You are a helpful weather assistant. Use the get_weather tool to answer questions.", + input=[ + user_message, + tool_call.model_dump(exclude_none=True), + {"type": "function_call_output", "call_id": tool_call.call_id, "output": json.dumps(result)}, + ], + tools=tools, + stream=False, +) +print(final_response.output_text) +``` + + + + ## Port forwarding Port forwarding lets you serve models from a remote server. Make sure you have SSH access to the server, then run this command on your local machine. From f41487061da1b0a63bda86c0d12cb86788d6d02d Mon Sep 17 00:00:00 2001 From: stevhliu Date: Wed, 22 Apr 2026 09:29:35 -0700 Subject: [PATCH 2/2] feedback --- docs/source/en/serve-cli/serving.md | 16 ++++------------ 1 file changed, 4 insertions(+), 12 deletions(-) diff --git a/docs/source/en/serve-cli/serving.md b/docs/source/en/serve-cli/serving.md index 404ae13f37c0..ca45a1ede7a5 100644 --- a/docs/source/en/serve-cli/serving.md +++ b/docs/source/en/serve-cli/serving.md @@ -1703,7 +1703,7 @@ data: {"status": "ready", "model": "org/model@main", "cached": true} The `transformers serve` server supports OpenAI-style function calling. Models trained for tool-use generate structured function calls that your application executes. > [!NOTE] -> Tool calling works with any model whose tokenizer declares tool call tokens. Qwen and Gemma 4 work out of the box. +> Tool calling works with any model whose tokenizer declares tool call tokens. Qwen and Gemma 4 work out of the box. Open an [issue](https://github.com/huggingface/transformers/issues/new/choose) to request support for a specific model. Define tools as a list of function specifications following the OpenAI format. 
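
The `tools` definition itself falls between the hunks shown here, so it doesn't appear in this diff. For reference, a minimal sketch of a `tools` list in the OpenAI function format (the `get_weather` spec below is illustrative, not the page's exact definition):

```py
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name, e.g. San Francisco"}
                },
                "required": ["location"],
            },
        },
    }
]
```

The hunks below also drop each example's `from openai import OpenAI` and `client = OpenAI(...)` setup, since the page already creates `client` further up.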
@@ -1767,7 +1767,9 @@ for event in response: ### Multi-turn tool calling -After the model returns a tool call, execute the function locally, then send the result back in a follow-up request to get the model's final answer. The pattern differs slightly between the two APIs. +After the model returns a tool call, execute the function locally, then send the result back in a follow-up request to get the model's final answer. The pattern differs slightly between the two APIs. See the [OpenAI function calling guide](https://developers.openai.com/api/docs/guides/function-calling?api-mode=chat) for the full spec. + +The examples below reuse the `tools` list defined above. @@ -1775,10 +1777,6 @@ After the model returns a tool call, execute the function locally, then send the Pass the tool result as a `role: "tool"` message with the matching `tool_call_id`. ```py -from openai import OpenAI - -client = OpenAI(base_url="http://localhost:8000/v1", api_key="") - # Model returns a tool call messages = [{"role": "user", "content": "What's the weather like in San Francisco?"}] response = client.chat.completions.create( @@ -1813,14 +1811,9 @@ print(final_response.choices[0].message.content) Pass the tool result as a `function_call_output` item in the `input` list of the follow-up request. ```py -from openai import OpenAI - -client = OpenAI(base_url="http://localhost:8000/v1", api_key="") - user_message = {"role": "user", "content": "What's the weather like in San Francisco?"} response = client.responses.create( model="Qwen/Qwen2.5-7B-Instruct", - instructions="You are a helpful weather assistant. Use the get_weather tool to answer questions.", input=[user_message], tools=tools, stream=False, @@ -1831,7 +1824,6 @@ result = {"temperature": 22, "condition": "sunny"} final_response = client.responses.create( model="Qwen/Qwen2.5-7B-Instruct", - instructions="You are a helpful weather assistant. Use the get_weather tool to answer questions.", input=[ user_message, tool_call.model_dump(exclude_none=True),