
Multimodal serve support #45220

Merged
SunMarc merged 19 commits into main from audio-video-serve on Apr 15, 2026

Conversation

@SunMarc
Member

@SunMarc SunMarc commented Apr 3, 2026

What does this PR do?

This PR adds transformers serve compatibility for multimodal models such as Qwen Omni or Gemma 4. We add audio support to both the Chat Completions and Responses APIs through input_audio: the client needs to base64-encode the audio and send it as an input_audio content part.

For video, the OpenAI API doesn't natively support video_url as a content type, so we extended it to accept one. For simplicity, we also allow passing a URL for audio through audio_url.
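
For example, a minimal client-side sketch with the Chat Completions API could look like this (assuming a transformers serve instance already running on localhost:8000 and a local mp3 file; the commented-out parts show the audio_url / video_url extensions described above):

```python
import base64

from openai import OpenAI

# Hedged sketch: assumes a `transformers serve` instance is already listening on
# localhost:8000 and that "sample.mp3" exists on disk.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

with open("sample.mp3", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-Omni-3B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this audio."},
                {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "mp3"}},
                # audio_url / video_url are the extensions described above:
                # {"type": "audio_url", "audio_url": {"url": "https://.../audio.mp3"}},
                # {"type": "video_url", "video_url": {"url": "https://.../video.mp4"}},
            ],
        }
    ],
    max_tokens=200,
)
print(completion.choices[0].message.content)
```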

Results (tested with google/gemma-4-E2B-it and Qwen/Qwen2.5-Omni-3B)

import base64
import socket
import time

import httpx
from openai import OpenAI

from transformers.cli.serve import Serve

AUDIO_URL = "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama_first_45_secs.mp3"
VIDEO_URL = "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/concert.mp4"
MODEL = "google/gemma-4-E2B-it"

# Qwen Omni
# MODEL = "Qwen/Qwen2.5-Omni-3B"


def find_free_port():
    with socket.socket() as s:
        s.bind(("", 0))
        return s.getsockname()[1]


def start_serve():
    port = find_free_port()
    serve = Serve(port=port, non_blocking=True)
    for _ in range(30):
        try:
            if httpx.get(f"http://localhost:{port}/health", timeout=2).status_code == 200:
                return serve, port
        except Exception:
            pass
        time.sleep(1)
    raise RuntimeError("Server did not start in time")


serve, port = start_serve()
client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="unused")

audio_bytes = httpx.get(AUDIO_URL, follow_redirects=True).content
audio_b64 = base64.b64encode(audio_bytes).decode()

print("=== Audio via responses API ===")
resp = client.responses.create(
    model=MODEL,
    input=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this audio."},
                {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "mp3"}},
            ],
        }
    ],
    stream=False,
    max_output_tokens=200,
)
print(resp.output[0].content[0].text)
print()

# --- Video with audio (responses API) ---
print("=== Video via responses API ===")
resp = client.responses.create(
    model=MODEL,
    input=[
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": VIDEO_URL}},
                {"type": "text", "text": "Transcribe the lyrics of the song being played in this video."},
            ],
        }
    ],
    stream=False,
    max_output_tokens=500,
)
print(resp.output[0].content[0].text)
print()

serve.kill_server()
print("Done!")
Audio via responses API
This week, I traveled to Chicago to deliver my final farewell address to the nation, following in the tradition of
presidents before me. It was an opportunity to say thank you. Whether we've seen eye-to-eye or rarely agreed at
all, my conversations with you, the American people, in living rooms and schools, at farms and on factory floors,
at diners, and on distant military outposts, all these conversations are what have kept me honest.

Video via responses API
(Song lyrics)

I don't care how straight
From neck to chest
We're in the same predicament
Another one wantin' is in the storm alone
I'm the one down below this
You don't wanna be my
I never thought you'd say
Of this nice sad place you've been
I don't want it my face
But I don't wanna die

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@SunMarc SunMarc requested a review from LysandreJik April 3, 2026 14:57
Comment thread src/transformers/cli/serving/chat_completion.py Outdated
return_tensors=None if use_cb else "pt",
return_dict=True,
tokenize=True,
load_audio_from_video=modality == Modality.MULTIMODAL and has_video,
Member Author

@SunMarc SunMarc Apr 3, 2026

Managed to use it, but torchcodec is required for that. Otherwise, it falls back to the other lib and fails. Also, torchcodec + ffmpeg was a bit of a pain to install correctly.
Maybe we should force the user to install torchcodec when using this, no? cc @eustlb

Contributor

Indeed, load_audio only works with torchcodec, as video containers are not supported by librosa. Agree we need to raise a clear error; let me raise a PR for that.

Member

note that this will probably error out when a video is passed and it is silent; iirc old torchcodec would complain that no audio is found

Also, the whole load_audio_from_video is mostly a heuristic for models other than qwen-omni, no guarantee on performance

Member Author

I feel like it would be better if we could delegate this to the audio processing, e.g. have a try/catch around the loading.
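
Something like the following rough sketch, purely illustrative (the helper name and placement are hypothetical, not code from this PR):

```python
# Hypothetical sketch of that delegation: let the audio-loading path itself fail
# fast with an actionable message instead of checking torchcodec availability in
# the serving layer.
def _load_audio_track_from_video(path):
    try:
        import torchcodec  # required: librosa cannot read video containers
    except ImportError as exc:
        raise ImportError(
            "Extracting audio from video requires `torchcodec`. "
            "Install it with: `pip install torchcodec`."
        ) from exc
    ...  # decode the audio track with torchcodec here
```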

@SunMarc SunMarc requested a review from eustlb April 3, 2026 15:13
Contributor

@eustlb eustlb left a comment

@SunMarc what would you think of having an ALM modality, and to differentiate:
ALM: audio + text
VLM: vision + text
MULTIMODAL: audio + vision + text
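
For illustration only, the proposed split could be expressed as an enum like the following (a hypothetical sketch, not the Modality enum actually shipped in this PR):

```python
from enum import Enum


# Hypothetical sketch of the proposed modality split; names and values are
# illustrative only and do not reflect the enum merged in this PR.
class Modality(str, Enum):
    LLM = "llm"                # text only
    VLM = "vlm"                # vision + text
    ALM = "alm"                # audio + text
    MULTIMODAL = "multimodal"  # audio + vision + text
```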

Comment on lines +122 to +125
if load_audio_from_video and not is_torchcodec_available():
    raise ValueError(
        "Extracting audio from video requires `torchcodec`. Install it with: `pip install torchcodec`."
    )
Contributor

I guess this can be removed, better to locate the error in load_audio

Comment on lines +914 to +916
All modalities extract text. VLM additionally handles ``image_url`` and ``video_url``.
MULTIMODAL handles all of the above plus ``input_audio`` and ``audio_url``.
For LLMs, the content parts are collapsed into a plain text string.
Contributor

My issue with this is that ALMs are seen as a sub-category of omni. While that is the case for gemma4, for other models we can use the ALM and VLM capabilities separately and together. This makes even more sense knowing that audio + vision is an emergent capability: the model has not been trained on both.

Member

yes, the separation line now looks weird with audio LLMs. Actually I was planning to keep a single MULTIMODAL key for all types and combinations. Though first it needs to be aligned with Lucain and the Hub.

@SunMarc
Member Author

SunMarc commented Apr 3, 2026

@SunMarc what would you think of having an ALM modality, and to differentiate:
ALM: audio + text
VLM: vision + text
MULTIMODAL: audio + vision + text

Maybe we should have a MODEL_FOR_AUDIO_TEXT_MAPPING_NAMES mapping? I created MULTIMODAL as we have this MODEL_FOR_MULTIMODAL_LM_MAPPING_NAMES. We have MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES, but this is more for encoder-decoder style models, so we can't really use that, no? cc @eustlb
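
For illustration, such a mapping would presumably mirror the existing *_MAPPING_NAMES pattern, something like the following hypothetical sketch (the mapping name and the single entry are illustrative, not part of this PR):

```python
from collections import OrderedDict

# Hypothetical sketch mirroring the existing *_MAPPING_NAMES pattern; name and
# entry chosen for illustration only.
MODEL_FOR_AUDIO_TEXT_MAPPING_NAMES = OrderedDict(
    [
        ("qwen2_audio", "Qwen2AudioForConditionalGeneration"),
        # ... other audio + text models would be registered here
    ]
)
```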

@SunMarc
Member Author

SunMarc commented Apr 14, 2026

run-slow: serve

@SunMarc
Member Author

SunMarc commented Apr 14, 2026

run-slow: cli

Member

@LysandreJik LysandreJik left a comment

Ok, looks good! Implementation is nice.

Let's please add some docs which go over the new features, each with examples of how to use the feature/new arguments

Comment thread src/transformers/cli/serving/utils.py Outdated
Comment on lines +120 to +123
# Default to 32 frames for video (Gemma 4 default); some processors load all frames otherwise
chat_template_kwargs = {}
if has_video:
    chat_template_kwargs["num_frames"] = 32
Member

(nit) should we apply this to only gemma 4 then? but maybe easier to do down the road when adding support for other video models

Member Author

gemma 4 has this default of 32 frames, which is okay since it comes from their official implementation. For now, I was thinking about hardcoding this to 32 because otherwise, with qwen omni, you get OOM very quickly even if the video is only 10s, for example. But we should definitely improve that and probably set the fps instead, or another default. cc @zucchini-nlp

Member

yep, if video models are supported when serving, there needs to be a default sampling arg or users need to pass it explicitly. In inference, users are usually encouraged to pass a value of their own, because most video processor classes don't have a default
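
For context, a minimal sketch of that inference-side pattern (illustrative only; it assumes the Qwen2.5-Omni processor and that it accepts a num_frames kwarg, mirroring the value hardcoded in this PR):

```python
from transformers import AutoProcessor

# Hedged sketch: pass an explicit frame count when applying the chat template,
# instead of relying on a processor default. Kwarg support varies by processor.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-Omni-3B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "url": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/concert.mp4"},
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    num_frames=32,  # explicit sampling value, mirroring the serve-side default above
)
```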

Member Author

let's fix this in a follow-up PR then. Let's keep this for now, so that it runs smoothly for all kinds of models

@github-actions
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["cli"]
quantizations: []

@github-actions
Contributor

CI Results

Workflow Run ⚙️

Commit Info

| Context | Commit | Description |
| --- | --- | --- |
| RUN | d35cca92 | workflow commit (merge commit) |
| PR | e85c9c1f | branch commit (from PR) |
| main | 27fbb514 | base commit (on main) |

✅ No failing test specific to this PR 🎉 👏 !

Member

@zucchini-nlp zucchini-nlp left a comment

Adding my 5 cents 😄


Comment on lines +951 to +956
if "base64" in url:
image_data = re.sub("^data:image/.+;base64,", "", url)
image = Image.open(BytesIO(base64.b64decode(image_data)))
file = tempfile.NamedTemporaryFile(suffix=".png", delete=False)
image.save(file.name)
url = file.name
Member

we can decode images from base64 in processing, or do we not call the processor's loading?

# Try to load as base64
try:
    b64 = base64.decodebytes(image.encode())
    image = PIL.Image.open(BytesIO(b64))

Member Author

I'll update that in a follow-up PR. It was there from the start, but it looks like we can indeed simplify this a bit.
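
For reference, the kind of simplification being discussed could look roughly like this (a sketch only, based on the two snippets quoted above, not the actual follow-up change):

```python
import base64
import re
from io import BytesIO

from PIL import Image


# Rough sketch of the suggested simplification: decode a data URL straight to a
# PIL image instead of writing a temporary PNG and handing a file path back to
# the processor. The helper name is hypothetical.
def load_image_from_url_or_data_url(url: str):
    if url.startswith("data:image"):
        image_data = re.sub(r"^data:image/.+;base64,", "", url)
        return Image.open(BytesIO(base64.b64decode(image_data)))
    return url  # plain URL: let the processor's own image loading handle it
```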

@SunMarc
Member Author

SunMarc commented Apr 15, 2026

@bot /style

@github-actions
Contributor

github-actions Bot commented Apr 15, 2026

Style fix runs successfully without any file modified.

@SunMarc SunMarc added this pull request to the merge queue Apr 15, 2026
Merged via the queue into main with commit fd45a42 Apr 15, 2026
18 checks passed
@SunMarc SunMarc deleted the audio-video-serve branch April 15, 2026 14:13