Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```python
return_tensors=None if use_cb else "pt",
return_dict=True,
tokenize=True,
load_audio_from_video=modality == Modality.MULTIMODAL and has_video,
```
I managed to use it, but torchcodec is required for that. Otherwise, it falls back to the other lib and fails. Also, torchcodec + ffmpeg was a bit of a pain to install correctly.
Should we maybe force the user to install torchcodec when using this? cc @eustlb
Indeed, load_audio only works with torchcodec, as video containers are not supported by librosa. Agree we need to raise a clear error; let me open a PR for that.
Note that this will probably error out when a video is passed and it is silent; iirc old torchcodec would complain that no audio is found.
Also, the whole load_audio_from_video is mostly a heuristic for models except for qwen-omni, so there's no guarantee on performance.
I feel like it's better if we can delegate this to the audio processing, e.g. have a try/catch when trying to load the model.
```python
if load_audio_from_video and not is_torchcodec_available():
    raise ValueError(
        "Extracting audio from video requires `torchcodec`. Install it with: `pip install torchcodec`."
    )
```
I guess this can be removed; better to locate the error in load_audio.
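Moving the check into the audio loading path, as suggested, could look like the sketch below. The helper names mirror patterns used elsewhere in transformers but are assumptions here, not the actual implementation.

```python
# Sketch (hypothetical helpers): locating the torchcodec check inside the
# audio loading path, so every caller gets the same clear error.
import importlib.util


def is_torchcodec_available() -> bool:
    # Availability check in the style of transformers.utils; the real
    # implementation may differ.
    return importlib.util.find_spec("torchcodec") is not None


def load_audio_from_video(path: str):
    # Raising here, instead of in the serving layer, keeps the error next
    # to the code that actually needs torchcodec.
    if not is_torchcodec_available():
        raise ImportError(
            "Extracting audio from video requires `torchcodec`. "
            "Install it with: `pip install torchcodec`."
        )
    ...  # decode audio frames with torchcodec
```

With this in place, the explicit check in the serving code can be dropped.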
```
All modalities extract text. VLM additionally handles ``image_url`` and ``video_url``.
MULTIMODAL handles all of the above plus ``input_audio`` and ``audio_url``.
For LLMs, the content parts are collapsed into a plain text string.
```
My issue with this is that ALMs are treated as a sub-category of omni. That is the case for gemma4, but for other models we can use the ALM and VLM capabilities separately, and together. This makes even more sense knowing that audio + vision is an emergent capability: the model has not been trained on both.
Yes, the separation line now looks weird with audio LLMs. Actually, I was planning to keep a single MULTIMODAL key for all types and combinations, though first it needs to be aligned with Lucain and the hub.
Maybe we should have a

run-slow: serve

run-slow: cli
LysandreJik left a comment
Ok, looks good! Implementation is nice.
Let's please add some docs which go over the new features, each with examples of how to use the feature/new arguments
```python
# Default to 32 frames for video (Gemma 4 default); some processors load all frames otherwise
chat_template_kwargs = {}
if has_video:
    chat_template_kwargs["num_frames"] = 32
```
(nit) should we apply this to only gemma 4 then? but maybe easier to do down the road when adding support for other video models
gemma 4 has this default of 32 frames, which is okay since it comes from their official implementation. For now, I was thinking about hardcoding this to 32 because otherwise, with qwen omni, you get OOM very quickly even if the video is only 10s, for example. But we should definitely improve that and probably set the fps instead, or another default. cc @zucchini-nlp
Yep, if video models are supported when serving, there needs to be a default sampling arg, or users need to pass it explicitly. For inference, users are usually encouraged to pass a value of their own, because most video processor classes don't have a default.
Let's fix this in a follow-up PR then, and keep this for now so that it runs smoothly for all kinds of models.
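One way the follow-up could go, per the discussion above: derive the frame count from a target fps and cap it, instead of hardcoding 32. The function name and defaults below are illustrative, not part of the transformers API.

```python
# Sketch of an fps-based default with a cap, replacing the hardcoded 32.
def default_num_frames(duration_s: float, target_fps: float = 1.0, max_frames: int = 32) -> int:
    """Sample at `target_fps`, but never exceed `max_frames` (avoids OOM on long clips)."""
    return max(1, min(max_frames, int(duration_s * target_fps)))
```

For a 10 s clip at 1 fps this yields 10 frames; a 2-minute clip is still capped at 32.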
This comment contains models: ["cli"]
```python
if "base64" in url:
    image_data = re.sub("^data:image/.+;base64,", "", url)
    image = Image.open(BytesIO(base64.b64decode(image_data)))
    file = tempfile.NamedTemporaryFile(suffix=".png", delete=False)
    image.save(file.name)
    url = file.name
```
We can decode images from base64 in processing, or not call the processor's loading?
transformers/src/transformers/image_utils.py
Lines 493 to 496 in 5b565a5
I'll update that in a follow-up PR. It was there from the start, but it looks like we can indeed simplify this a bit.
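The simplification hinted at here could decode the data URL entirely in memory, skipping the temp file round-trip. A minimal sketch, assuming Pillow is installed; the helper name is made up for illustration.

```python
# Sketch: decode a base64 data URL straight into a PIL image, with no
# temporary file on disk.
import base64
import re
from io import BytesIO

from PIL import Image


def image_from_data_url(url: str) -> Image.Image:
    # Strip the "data:image/...;base64," prefix, then decode the payload.
    payload = re.sub(r"^data:image/.+;base64,", "", url)
    return Image.open(BytesIO(base64.b64decode(payload)))
```

The resulting `Image` can be handed to the processor directly instead of a file path.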
@bot /style

Style fix runs successfully without any files modified.
What does this PR do?
This PR adds `transformers serve` compatibility to multimodal models like qwen omni or gemma 4. We add support for audio with chat completion and responses through `input_audio`: the client needs to base64-encode the audio and send it as `input_audio`. For video, the OpenAI API doesn't natively support `video_url` as a content type, so we extended it so that we can still play with it. For simplicity, we also allow passing a URL for audio through `audio_url`.
Results (tested with `google/gemma-4-E2B-it` and `Qwen/Qwen2.5-Omni-3B`)