### Audio-based completions

Multimodal models like [Gemma 4](https://huggingface.co/google/gemma-4-E2B-it) and [Qwen2.5-Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-3B) accept audio input through the OpenAI `input_audio` content type. Base64-encode the audio and specify the format (`mp3` or `wav`).

<hfoptions id="audio-completions">
<hfoption id="huggingface_hub">
> [!WARNING]
> The `audio_url` content type is an extension that is not part of the OpenAI standard and may change in future versions.

You can also pass audio by URL with the `audio_url` content type to skip base64 encoding.

```python
completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-Omni-3B",  # illustrative model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this audio."},
                # Non-standard extension; the URL shape mirrors the OpenAI `image_url` content type
                {"type": "audio_url", "audio_url": {"url": "https://example.com/sample.wav"}},
            ],
        }
    ],
)
print(completion.choices[0].message.content)
```

> [!WARNING]
> The `video_url` content type is an extension that is not part of the OpenAI standard and may change in future versions.

Use the `video_url` content type for video input. If the model supports audio (e.g. Gemma 4, Qwen2.5-Omni), the server extracts the audio track from the video and processes it with the visual frames.

> [!TIP]
> Video processing requires [torchcodec](https://github.com/pytorch/torchcodec). Install it with `pip install torchcodec`.
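
A request sketch (the model and video URL are illustrative; the payload assumes `video_url` follows the same shape as the standard `image_url` content type):

```python
completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-Omni-3B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what happens in this video."},
                {"type": "video_url", "video_url": {"url": "https://example.com/clip.mp4"}},
            ],
        }
    ],
)
print(completion.choices[0].message.content)
```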

### Multi-turn conversations[[completions]]

To have a multi-turn conversation, include the full conversation history in the `messages` list with alternating `user` and `assistant` roles. Like all OpenAI-compatible servers, the API is stateless, so every request must contain the complete conversation history.

```python
completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
        {"role": "user", "content": "How many people live there?"},
    ],
)
print(completion.choices[0].message.content)
```

The follow-up question "How many people live there?" relies on the prior context, so the model answers about Paris.

```
As of 2021, the population of Paris is approximately 2.2 million people.
Expand Down Expand Up @@ -1385,7 +1385,7 @@ data: {"content_index":0,"delta":"This ","item_id":"msg_a1b2c3d4","output_index"
> [!WARNING]
> The `audio_url` content type is an extension that is not part of the OpenAI standard and may change in future versions.

You can also pass audio by URL with the `audio_url` content type to skip base64 encoding.

```python
response = client.responses.create(
    model="Qwen/Qwen2.5-Omni-3B",  # illustrative model
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "Transcribe this audio."},
                # Non-standard extension; shape assumed to mirror the chat completions form
                {"type": "audio_url", "audio_url": {"url": "https://example.com/sample.wav"}},
            ],
        }
    ],
)
print(response.output[0].content[0].text)
```

### Multi-turn conversations[[responses]]

For multi-turn conversations, pass a list of messages with `role` keys in the `input` field. Like all OpenAI-compatible servers, the API is stateless, so every request must contain the complete conversation history.

```python
response = client.responses.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    input=[
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
        {"role": "user", "content": "How many people live there?"},
    ],
)
print(response.output[0].content[0].text)
```

The follow-up question "How many people live there?" relies on the prior context, so the model answers about Paris.

```
As of 2021, Paris has a population of approximately 2.8 million people.
Expand Down Expand Up @@ -1653,15 +1653,15 @@ The stream ends with exactly one terminal event, `ready` (success) or `error` (f

## Timeout

`transformers serve` handles requests for any model. Each model loads on demand and stays in GPU memory. Models unload automatically after 300 seconds of inactivity to free GPU memory. Set `--model-timeout` to a different value in seconds, or `-1` to disable unloading.

```shell
transformers serve --model-timeout 400
```
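
Or keep every loaded model resident by disabling unloading:

```shell
transformers serve --model-timeout -1
```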

### Loading examples

The examples below show responses for a freshly downloaded model, a model loaded from your local cache (skips the download stage), and a model already in memory.

<hfoptions id="load-model-examples">
<hfoption id="fresh load">
Expand Down Expand Up @@ -1703,7 +1703,7 @@ data: {"status": "ready", "model": "org/model@main", "cached": true}
The `transformers serve` server supports OpenAI-style function calling. Models trained for tool use generate structured function calls that your application executes.

> [!NOTE]
> Tool calling works with any model whose tokenizer declares tool call tokens. Qwen and Gemma 4 work out of the box. Open an [issue](https://github.com/huggingface/transformers/issues/new/choose) to request support for a specific model.

Define tools as a list of function specifications following the OpenAI format.
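
For example, a single `get_weather` tool could be declared as follows (the function name and schema are illustrative, chosen to match the weather examples below):

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name, e.g. San Francisco",
                    }
                },
                "required": ["location"],
            },
        },
    }
]
```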

With `stream=True`, the server emits tool-call events as they are generated (a sketch, assuming the Responses API and the `tools` list above):

```python
response = client.responses.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    input=[{"role": "user", "content": "What's the weather like in San Francisco?"}],
    tools=tools,
    stream=True,
)
for event in response:
    print(event)
```

### Multi-turn tool calling

After the model returns a tool call, execute the function locally, then send the result back in a follow-up request to get the model's final answer. The pattern differs slightly between the two APIs. See the [OpenAI function calling guide](https://developers.openai.com/api/docs/guides/function-calling?api-mode=chat) for the full spec.

The examples below reuse the `tools` list defined above.

<hfoptions id="multi-turn-tool-calling">
<hfoption id="v1/chat/completions">

Pass the tool result as a `role: "tool"` message with the matching `tool_call_id`.

```py
import json

# Model returns a tool call
messages = [{"role": "user", "content": "What's the weather like in San Francisco?"}]
response = client.chat.completions.create(
model="Qwen/Qwen2.5-7B-Instruct",
messages=messages,
tools=tools,
)
assistant_message = response.choices[0].message

# Execute the tool locally
tool_call = assistant_message.tool_calls[0]
result = {"temperature": 22, "condition": "sunny"} # your actual function call here

# Send the tool result back
messages.append(assistant_message)
messages.append({
"role": "tool",
"tool_call_id": tool_call.id,
"content": json.dumps(result),
})
final_response = client.chat.completions.create(
model="Qwen/Qwen2.5-7B-Instruct",
messages=messages,
tools=tools,
)
print(final_response.choices[0].message.content)
```

</hfoption>
<hfoption id="v1/responses">

Pass the tool result as a `function_call_output` item in the `input` list of the follow-up request.

```py
import json

user_message = {"role": "user", "content": "What's the weather like in San Francisco?"}
response = client.responses.create(
model="Qwen/Qwen2.5-7B-Instruct",
input=[user_message],
tools=tools,
stream=False,
)
tool_call = next(item for item in response.output if item.type == "function_call")

result = {"temperature": 22, "condition": "sunny"}

final_response = client.responses.create(
model="Qwen/Qwen2.5-7B-Instruct",
input=[
user_message,
tool_call.model_dump(exclude_none=True),
{"type": "function_call_output", "call_id": tool_call.call_id, "output": json.dumps(result)},
],
tools=tools,
stream=False,
)
print(final_response.output_text)
```

</hfoption>
</hfoptions>

## Port forwarding

Port forwarding lets you serve models from a remote server. Make sure you have SSH access to the server, then run this command on your local machine.
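
A sketch of the forwarding command, assuming the server listens on its default port 8000 on the remote machine:

```shell
# Forward local port 8000 to the server's port 8000 on the remote host
ssh -L 8000:localhost:8000 user@remote-server
```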