diff --git a/docs/source/en/serve-cli/serving.md b/docs/source/en/serve-cli/serving.md
index a6d3dbcc6238..ca45a1ede7a5 100644
--- a/docs/source/en/serve-cli/serving.md
+++ b/docs/source/en/serve-cli/serving.md
@@ -455,7 +455,7 @@ data: {"id":"f47ac10b-58cc-4372-a567-0e02b2c3d479","choices":[{"delta":{"content
### Audio-based completions
-Multimodal models like [Gemma 3n](https://huggingface.co/google/gemma-3n-E2B-it) and [Qwen2.5-Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-3B) accept audio input using the OpenAI `input_audio` content type. The audio must be base64-encoded and the format (`mp3` or `wav`) must be specified.
+Multimodal models like [Gemma 3n](https://huggingface.co/google/gemma-3n-E2B-it) and [Qwen2.5-Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-3B) accept audio input through the OpenAI `input_audio` content type. Base64-encode the audio and specify the format (`mp3` or `wav`).
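+
+As a minimal sketch, the request below reads a local `audio.wav` file and reuses the `client` configured earlier; the file name and prompt are placeholders.
+
+```python
+import base64
+
+# Base64-encode a local audio file (placeholder path)
+with open("audio.wav", "rb") as f:
+    audio_b64 = base64.b64encode(f.read()).decode("utf-8")
+
+completion = client.chat.completions.create(
+    model="Qwen/Qwen2.5-Omni-3B",
+    messages=[
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": "What is being said in this audio?"},
+                {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
+            ],
+        }
+    ],
+)
+print(completion.choices[0].message.content)
+```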
@@ -694,7 +694,7 @@ data: {"id":"cb997e1d-98b9-414a-be89-1880288610ef","choices":[{"delta":{"content
> [!WARNING]
> The `audio_url` content type is an extension not part of the OpenAI standard and may change in future versions.
-As a convenience, audio can also be passed by URL using the `audio_url` content type, avoiding the need for base64 encoding.
+You can also pass audio by URL with the `audio_url` content type to skip base64 encoding.
```python
completion = client.chat.completions.create(
@@ -716,7 +716,7 @@ completion = client.chat.completions.create(
> [!WARNING]
> The `video_url` content type is an extension not part of the OpenAI standard and may change in future versions.
-Video input is supported using the `video_url` content type. If the model supports audio (e.g. Gemma 3n, Qwen2.5-Omni), the audio track is automatically extracted from the video and processed alongside the visual frames.
+Use the `video_url` content type for video input. If the model supports audio (e.g. Gemma 3n, Qwen2.5-Omni), the server extracts the audio track from the video and processes it with the visual frames.
> [!TIP]
> Video processing requires [torchcodec](https://github.com/pytorch/torchcodec). Install it with `pip install torchcodec`.
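+
+As a minimal sketch, assuming a publicly reachable video URL and the `client` from earlier; the payload shape is assumed to mirror the `audio_url` extension, so adjust it if your server version expects a different schema.
+
+```python
+completion = client.chat.completions.create(
+    model="Qwen/Qwen2.5-Omni-3B",
+    messages=[
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": "Describe what happens in this video."},
+                # placeholder URL; shape assumed to mirror the audio_url extension
+                {"type": "video_url", "video_url": {"url": "https://example.com/video.mp4"}},
+            ],
+        }
+    ],
+)
+print(completion.choices[0].message.content)
+```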
@@ -933,7 +933,7 @@ data: {"id":"cb997e1d-98b9-414a-be89-1880288610ef","choices":[{"delta":{"content
-### Multi-turn conversations
+### Multi-turn conversations[[completions]]
To have a multi-turn conversation, include the full conversation history in the `messages` list with alternating `user` and `assistant` roles. As with all OpenAI-compatible servers, the API is stateless, so every request must contain the complete conversation history.
@@ -953,7 +953,7 @@ completion = client.chat.completions.create(
print(completion.choices[0].message.content)
```
-The follow-up question "How many people live there?" relies on the prior context, and the model answers about Paris accordingly.
+The follow-up question "How many people live there?" relies on the prior context, so the model answers about Paris.
```
As of 2021, the population of Paris is approximately 2.2 million people.
@@ -1385,7 +1385,7 @@ data: {"content_index":0,"delta":"This ","item_id":"msg_a1b2c3d4","output_index"
> [!WARNING]
> The `audio_url` content type is an extension not part of the OpenAI standard and may change in future versions.
-As a convenience, audio can also be passed by URL using the `audio_url` content type, avoiding the need for base64 encoding.
+You can also pass audio by URL with the `audio_url` content type to skip base64 encoding.
```python
response = client.responses.create(
@@ -1540,7 +1540,7 @@ data: {"content_index":0,"delta":"Based ","item_id":"msg_b2c3d4e5","output_index
-### Multi-turn conversations
+### Multi-turn conversations[[responses]]
For multi-turn conversations, pass a list of messages with `role` keys in the `input` field. As with all OpenAI-compatible servers, the API is stateless, so every request must contain the complete conversation history.
@@ -1562,7 +1562,7 @@ response = client.responses.create(
print(response.output[0].content[0].text)
```
-The follow-up question "How many people live there?" relies on the prior context, and the model answers about Paris accordingly.
+The follow-up question "How many people live there?" relies on the prior context, so the model answers about Paris.
```
As of 2021, Paris has a population of approximately 2.1 million people.
@@ -1653,7 +1653,7 @@ The stream ends with exactly one terminal event, `ready` (success) or `error` (f
## Timeout
-`transformers serve` supports different requests by different models. Each model loads on demand and stays in GPU memory. Models unload automatically after 300 seconds of inactivity to free up GPU memory. Set `--model-timeout` to a different value in seconds, or `-1` to disable unloading entirely.
+`transformers serve` handles requests for any model. Each model loads on demand and stays in GPU memory. Models unload automatically after 300 seconds of inactivity to free GPU memory. Set `--model-timeout` to a different value in seconds, or `-1` to disable unloading.
```shell
transformers serve --model-timeout 400
@@ -1661,7 +1661,7 @@ transformers serve --model-timeout 400
### Loading examples
-See the example responses below for a freshly downloaded model, a model loaded from your local cache (skips the download stage), and a model that already exists in memory.
+The examples below show responses for a freshly downloaded model, a model loaded from your local cache (skips the download stage), and a model already in memory.
@@ -1703,7 +1703,7 @@ data: {"status": "ready", "model": "org/model@main", "cached": true}
The `transformers serve` server supports OpenAI-style function calling. Models trained for tool-use generate structured function calls that your application executes.
> [!NOTE]
-> Tool calling is currently limited to the Qwen model family.
+> Tool calling works with any model whose tokenizer declares tool call tokens. Qwen and Gemma 3 work out of the box. Open an [issue](https://github.com/huggingface/transformers/issues/new/choose) to request support for a specific model.
Define tools as a list of function specifications following the OpenAI format.
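+
+As a minimal sketch, here is a single hypothetical `get_weather` function in the OpenAI chat format; the multi-turn examples below reuse this list.
+
+```py
+tools = [
+    {
+        "type": "function",
+        "function": {
+            "name": "get_weather",  # hypothetical function, for illustration only
+            "description": "Get the current weather for a location.",
+            "parameters": {
+                "type": "object",
+                "properties": {
+                    "location": {"type": "string", "description": "City name, e.g. San Francisco"},
+                },
+                "required": ["location"],
+            },
+        },
+    }
+]
+```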
@@ -1765,6 +1765,79 @@ for event in response:
print(event)
```
+### Multi-turn tool calling
+
+After the model returns a tool call, execute the function locally, then send the result back in a follow-up request to get the model's final answer. The pattern differs slightly between the two APIs. See the [OpenAI function calling guide](https://platform.openai.com/docs/guides/function-calling?api-mode=chat) for the full spec.
+
+The examples below reuse the `tools` list defined above.
+
+<hfoptions id="multi-turn-tool-calling">
+<hfoption id="chat completions">
+
+Pass the tool result as a `role: "tool"` message with the matching `tool_call_id`.
+
+```py
+import json
+
+# Model returns a tool call
+messages = [{"role": "user", "content": "What's the weather like in San Francisco?"}]
+response = client.chat.completions.create(
+ model="Qwen/Qwen2.5-7B-Instruct",
+ messages=messages,
+ tools=tools,
+)
+assistant_message = response.choices[0].message
+
+# Execute the tool locally
+tool_call = assistant_message.tool_calls[0]
+result = {"temperature": 22, "condition": "sunny"} # your actual function call here
+
+# Send the tool result back
+messages.append(assistant_message)
+messages.append({
+ "role": "tool",
+ "tool_call_id": tool_call.id,
+ "content": json.dumps(result),
+})
+final_response = client.chat.completions.create(
+ model="Qwen/Qwen2.5-7B-Instruct",
+ messages=messages,
+ tools=tools,
+)
+print(final_response.choices[0].message.content)
+```
+
+</hfoption>
+<hfoption id="responses">
+
+Pass the tool result as a `function_call_output` item in the `input` list of the follow-up request.
+
+```py
+user_message = {"role": "user", "content": "What's the weather like in San Francisco?"}
+response = client.responses.create(
+ model="Qwen/Qwen2.5-7B-Instruct",
+ input=[user_message],
+ tools=tools,
+ stream=False,
+)
+tool_call = next(item for item in response.output if item.type == "function_call")
+
+result = {"temperature": 22, "condition": "sunny"}
+
+final_response = client.responses.create(
+ model="Qwen/Qwen2.5-7B-Instruct",
+ input=[
+ user_message,
+ tool_call.model_dump(exclude_none=True),
+ {"type": "function_call_output", "call_id": tool_call.call_id, "output": json.dumps(result)},
+ ],
+ tools=tools,
+ stream=False,
+)
+print(final_response.output_text)
+```
+
+</hfoption>
+</hfoptions>
+
## Port forwarding
Port forwarding lets you serve models from a remote server. Make sure you have SSH access to the server, then run this command on your local machine.
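+
+For example, assuming the server listens on the default port 8000 on the remote machine; the hostname is a placeholder.
+
+```shell
+# Forward local port 8000 to the remote server's port 8000
+ssh -L 8000:localhost:8000 user@remote-server
+```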