From cb46713e6108d32827e782c213ada11beebf430f Mon Sep 17 00:00:00 2001 From: stevhliu Date: Tue, 21 Apr 2026 10:25:03 -0700 Subject: [PATCH 1/2] docs --- docs/source/en/serve-cli/serving.md | 103 +++++++++++++++++++++++++--- 1 file changed, 92 insertions(+), 11 deletions(-) diff --git a/docs/source/en/serve-cli/serving.md b/docs/source/en/serve-cli/serving.md index a6d3dbcc6238..404ae13f37c0 100644 --- a/docs/source/en/serve-cli/serving.md +++ b/docs/source/en/serve-cli/serving.md @@ -455,7 +455,7 @@ data: {"id":"f47ac10b-58cc-4372-a567-0e02b2c3d479","choices":[{"delta":{"content ### Audio-based completions -Multimodal models like [Gemma 4](https://huggingface.co/google/gemma-4-E2B-it) and [Qwen2.5-Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-3B) accept audio input using the OpenAI `input_audio` content type. The audio must be base64-encoded and the format (`mp3` or `wav`) must be specified. +Multimodal models like [Gemma 4](https://huggingface.co/google/gemma-4-E2B-it) and [Qwen2.5-Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-3B) accept audio input through the OpenAI `input_audio` content type. Base64-encode the audio and specify the format (`mp3` or `wav`). @@ -694,7 +694,7 @@ data: {"id":"cb997e1d-98b9-414a-be89-1880288610ef","choices":[{"delta":{"content > [!WARNING] > The `audio_url` content type is an extension not part of the OpenAI standard and may change in future versions. -As a convenience, audio can also be passed by URL using the `audio_url` content type, avoiding the need for base64 encoding. +You can also pass audio by URL with the `audio_url` content type to skip base64 encoding. ```python completion = client.chat.completions.create( @@ -716,7 +716,7 @@ completion = client.chat.completions.create( > [!WARNING] > The `video_url` content type is an extension not part of the OpenAI standard and may change in future versions. -Video input is supported using the `video_url` content type. If the model supports audio (e.g. Gemma 4, Qwen2.5-Omni), the audio track is automatically extracted from the video and processed alongside the visual frames. +Use the `video_url` content type for video input. If the model supports audio (e.g. Gemma 4, Qwen2.5-Omni), the server extracts the audio track from the video and processes it with the visual frames. > [!TIP] > Video processing requires [torchcodec](https://github.com/pytorch/torchcodec). Install it with `pip install torchcodec`. @@ -933,7 +933,7 @@ data: {"id":"cb997e1d-98b9-414a-be89-1880288610ef","choices":[{"delta":{"content -### Multi-turn conversations +### Multi-turn conversations[[completions]] To have a multi-turn conversation, include the full conversation history in the `messages` list with alternating `user` and `assistant` roles. Like all OpenAI-compatible servers, the API is stateless, so every request must contain the complete conversation history. @@ -953,7 +953,7 @@ completion = client.chat.completions.create( print(completion.choices[0].message.content) ``` -The follow-up question "How many people live there?" relies on the prior context, and the model answers about Paris accordingly. +The follow-up question "How many people live there?" relies on the prior context, so the model answers about Paris. ``` As of 2021, the population of Paris is approximately 2.2 million people. @@ -1385,7 +1385,7 @@ data: {"content_index":0,"delta":"This ","item_id":"msg_a1b2c3d4","output_index" > [!WARNING] > The `audio_url` content type is an extension not part of the OpenAI standard and may change in future versions. 
-As a convenience, audio can also be passed by URL using the `audio_url` content type, avoiding the need for base64 encoding. +You can also pass audio by URL with the `audio_url` content type to skip base64 encoding. ```python response = client.responses.create( @@ -1540,7 +1540,7 @@ data: {"content_index":0,"delta":"Based ","item_id":"msg_b2c3d4e5","output_index -### Multi-turn conversations +### Multi-turn conversations[[responses]] For multi-turn conversations, pass a list of messages with `role` keys in the `input` field. Like all OpenAI-compatible servers, the API is stateless, so every request must contain the complete conversation history. @@ -1562,7 +1562,7 @@ response = client.responses.create( print(response.output[0].content[0].text) ``` -The follow-up question "How many people live there?" relies on the prior context, and the model answers about Paris accordingly. +The follow-up question "How many people live there?" relies on the prior context, so the model answers about Paris. ``` As of 2021, Paris has a population of approximately 2.8 million people. @@ -1653,7 +1653,7 @@ The stream ends with exactly one terminal event, `ready` (success) or `error` (f ## Timeout -`transformers serve` supports different requests by different models. Each model loads on demand and stays in GPU memory. Models unload automatically after 300 seconds of inactivity to free up GPU memory. Set `--model-timeout` to a different value in seconds, or `-1` to disable unloading entirely. +`transformers serve` handles requests for any model. Each model loads on demand and stays in GPU memory. Models unload automatically after 300 seconds of inactivity to free GPU memory. Set `--model-timeout` to a different value in seconds, or `-1` to disable unloading. ```shell transformers serve --model-timeout 400 @@ -1661,7 +1661,7 @@ transformers serve --model-timeout 400 ### Loading examples -See the example responses below for a freshly downloaded model, a model loaded from your local cache (skips the download stage), and a model that already exists in memory. +The examples below show responses for a freshly downloaded model, a model loaded from your local cache (skips the download stage), and a model already in memory. @@ -1703,7 +1703,7 @@ data: {"status": "ready", "model": "org/model@main", "cached": true} The `transformers serve` server supports OpenAI-style function calling. Models trained for tool-use generate structured function calls that your application executes. > [!NOTE] -> Tool calling is currently limited to the Qwen model family. +> Tool calling works with any model whose tokenizer declares tool call tokens. Qwen and Gemma 4 work out of the box. Define tools as a list of function specifications following the OpenAI format. @@ -1765,6 +1765,87 @@ for event in response: print(event) ``` +### Multi-turn tool calling + +After the model returns a tool call, execute the function locally, then send the result back in a follow-up request to get the model's final answer. The pattern differs slightly between the two APIs. + + + + +Pass the tool result as a `role: "tool"` message with the matching `tool_call_id`. 
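+
+The assistant message from the first request carries the fields you echo back: `tool_calls[0].id` becomes the `tool_call_id`, and `function.arguments` holds the call arguments as a JSON string. The snippets below also serialize the tool result with `json.dumps`, so import `json` first. A minimal sketch of what comes back (field values are illustrative, not real server output):
+
+```py
+import json
+
+# Shape of response.choices[0].message.tool_calls[0] in the OpenAI Python client:
+#   tool_call.id                  -> "call_ab12"   (echo this back as `tool_call_id`)
+#   tool_call.function.name       -> "get_weather"
+#   tool_call.function.arguments  -> '{"location": "San Francisco"}'  (json.loads before use)
+```
+
+The full round trip: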
+ +```py +from openai import OpenAI + +client = OpenAI(base_url="http://localhost:8000/v1", api_key="") + +# Model returns a tool call +messages = [{"role": "user", "content": "What's the weather like in San Francisco?"}] +response = client.chat.completions.create( + model="Qwen/Qwen2.5-7B-Instruct", + messages=messages, + tools=tools, +) +assistant_message = response.choices[0].message + +# Execute the tool locally +tool_call = assistant_message.tool_calls[0] +result = {"temperature": 22, "condition": "sunny"} # your actual function call here + +# Send the tool result back +messages.append(assistant_message) +messages.append({ + "role": "tool", + "tool_call_id": tool_call.id, + "content": json.dumps(result), +}) +final_response = client.chat.completions.create( + model="Qwen/Qwen2.5-7B-Instruct", + messages=messages, + tools=tools, +) +print(final_response.choices[0].message.content) +``` + + + + +Pass the tool result as a `function_call_output` item in the `input` list of the follow-up request. + +```py +from openai import OpenAI + +client = OpenAI(base_url="http://localhost:8000/v1", api_key="") + +user_message = {"role": "user", "content": "What's the weather like in San Francisco?"} +response = client.responses.create( + model="Qwen/Qwen2.5-7B-Instruct", + instructions="You are a helpful weather assistant. Use the get_weather tool to answer questions.", + input=[user_message], + tools=tools, + stream=False, +) +tool_call = next(item for item in response.output if item.type == "function_call") + +result = {"temperature": 22, "condition": "sunny"} + +final_response = client.responses.create( + model="Qwen/Qwen2.5-7B-Instruct", + instructions="You are a helpful weather assistant. Use the get_weather tool to answer questions.", + input=[ + user_message, + tool_call.model_dump(exclude_none=True), + {"type": "function_call_output", "call_id": tool_call.call_id, "output": json.dumps(result)}, + ], + tools=tools, + stream=False, +) +print(final_response.output_text) +``` + + + + ## Port forwarding Port forwarding lets you serve models from a remote server. Make sure you have SSH access to the server, then run this command on your local machine. From f41487061da1b0a63bda86c0d12cb86788d6d02d Mon Sep 17 00:00:00 2001 From: stevhliu Date: Wed, 22 Apr 2026 09:29:35 -0700 Subject: [PATCH 2/2] feedback --- docs/source/en/serve-cli/serving.md | 16 ++++------------ 1 file changed, 4 insertions(+), 12 deletions(-) diff --git a/docs/source/en/serve-cli/serving.md b/docs/source/en/serve-cli/serving.md index 404ae13f37c0..ca45a1ede7a5 100644 --- a/docs/source/en/serve-cli/serving.md +++ b/docs/source/en/serve-cli/serving.md @@ -1703,7 +1703,7 @@ data: {"status": "ready", "model": "org/model@main", "cached": true} The `transformers serve` server supports OpenAI-style function calling. Models trained for tool-use generate structured function calls that your application executes. > [!NOTE] -> Tool calling works with any model whose tokenizer declares tool call tokens. Qwen and Gemma 4 work out of the box. +> Tool calling works with any model whose tokenizer declares tool call tokens. Qwen and Gemma 4 work out of the box. Open an [issue](https://github.com/huggingface/transformers/issues/new/choose) to request support for a specific model. Define tools as a list of function specifications following the OpenAI format. 
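
The `tools` definition itself falls between the hunks shown here, so it doesn't appear in this diff. For reference, a minimal sketch of a `tools` list in the OpenAI function format (the `get_weather` spec below is illustrative, not the page's exact definition):

```py
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name, e.g. San Francisco"}
                },
                "required": ["location"],
            },
        },
    }
]
```

The hunks below also drop each example's `from openai import OpenAI` and `client = OpenAI(...)` setup, since the page already creates `client` further up.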
@@ -1767,7 +1767,9 @@ for event in response: ### Multi-turn tool calling -After the model returns a tool call, execute the function locally, then send the result back in a follow-up request to get the model's final answer. The pattern differs slightly between the two APIs. +After the model returns a tool call, execute the function locally, then send the result back in a follow-up request to get the model's final answer. The pattern differs slightly between the two APIs. See the [OpenAI function calling guide](https://developers.openai.com/api/docs/guides/function-calling?api-mode=chat) for the full spec. + +The examples below reuse the `tools` list defined above. @@ -1775,10 +1777,6 @@ After the model returns a tool call, execute the function locally, then send the Pass the tool result as a `role: "tool"` message with the matching `tool_call_id`. ```py -from openai import OpenAI - -client = OpenAI(base_url="http://localhost:8000/v1", api_key="") - # Model returns a tool call messages = [{"role": "user", "content": "What's the weather like in San Francisco?"}] response = client.chat.completions.create( @@ -1813,14 +1811,9 @@ print(final_response.choices[0].message.content) Pass the tool result as a `function_call_output` item in the `input` list of the follow-up request. ```py -from openai import OpenAI - -client = OpenAI(base_url="http://localhost:8000/v1", api_key="") - user_message = {"role": "user", "content": "What's the weather like in San Francisco?"} response = client.responses.create( model="Qwen/Qwen2.5-7B-Instruct", - instructions="You are a helpful weather assistant. Use the get_weather tool to answer questions.", input=[user_message], tools=tools, stream=False, @@ -1831,7 +1824,6 @@ result = {"temperature": 22, "condition": "sunny"} final_response = client.responses.create( model="Qwen/Qwen2.5-7B-Instruct", - instructions="You are a helpful weather assistant. Use the get_weather tool to answer questions.", input=[ user_message, tool_call.model_dump(exclude_none=True),