smolvlm video processing by pcuenca · Pull Request #39006 · huggingface/transformers

pcuenca · 2025-06-24T14:56:13Z

There's a bug in SmolVLM2 video processing (but keep reading, there's more): the list of frames that makes up the prompt is malformed. While debugging transformers v4.52.4, this appeared to be because return_row_col_info was removed from the kwargs, possibly in #38105.

However, this fix only works when applied on top of v4.52.4, not on main. On main, the chat template goes through a new code path and generation is wrong both before and after the fix. In addition, main seems to decode every frame of the video at full resolution: I got a tensor with shape (559, 730, 1920, 3) here. This is not the case in v4.52.4, where I get 9 frames for the same video, already downscaled.
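To put the full-resolution decode in perspective, here is a rough back-of-the-envelope sketch. It assumes the decoded frames are stored as uint8 (one byte per value), which is not stated above, and it uses a hypothetical helper, uniform_frame_indices, to show what picking 9 evenly spaced frames out of 559 could look like. This is an illustration only, not the sampling logic transformers actually uses.

```python
from math import prod

# Shape reported above for main: (frames, height, width, channels).
full_shape = (559, 730, 1920, 3)

# Assumption: one byte per value (uint8 frames).
bytes_full = prod(full_shape)
gib_full = bytes_full / 2**30


def uniform_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Hypothetical helper: evenly spaced indices in [0, total_frames - 1]."""
    num_frames = min(num_frames, total_frames)
    if num_frames <= 1:
        return [0]
    last = total_frames - 1
    # Integer arithmetic keeps the result deterministic (no float rounding).
    return [i * last // (num_frames - 1) for i in range(num_frames)]


indices = uniform_frame_indices(559, 9)
print(f"{bytes_full:,} bytes (~{gib_full:.2f} GiB) if every frame is kept")
print(f"example of 9 evenly spaced frame indices: {indices}")
```

Under that uint8 assumption, holding all 559 frames at full resolution takes a bit over 2 GiB before any resizing, whereas a 9-frame sample is about 1/62 of that.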

cc @zucchini-nlp, happy to take a deeper look if you have any hints on how to proceed.

Reported in Blaizzy/mlx-vlm#388
Processing works in #37291, but it looks out of date with main.

HuggingFaceDocBuilderDev · 2025-06-24T15:09:21Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

zucchini-nlp · 2025-06-25T08:40:59Z

Thanks for reporting! Do you mean that generation on the main branch is currently incorrect? That's bad, I thought the slow tests were passing 😢
Yes, if that's the case, can you dig further so we can get a fix in before the release? I will be back on my laptop tonight and will also take a look, thanks

pcuenca · 2025-07-01T10:47:34Z

Repro (transformers @ 20901f1d68):

import torch
from transformers import AutoProcessor, SmolVLMForConditionalGeneration

model_id = "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"
video_path = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/assisted-generation/gif_1_1080p.mov"

dtype = torch.bfloat16
processor = AutoProcessor.from_pretrained(model_id)
model = SmolVLMForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=dtype,
    device_map="cuda:0",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": video_path},
            {"type": "text", "text": "Describe this video in detail"}
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(device=model.device, dtype=dtype)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=100)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=False)
print(generated_texts[0])

Output:

Token indices sequence length is longer than the specified maximum sequence length for this model (43092 > 8192). Running this sequence through the model will result in indexing errors
<|im_start|>User: You are provided the following series of five hundred and fifty-nine frames from a 0:00:23 [H:MM:SS] video.

Frame from 00:00:<fake_token_around_image><global-img><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><fake_token_around_image>
Frame from 00:00:<fake_token_around_image><global-img><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><fake_token_around_image>

... etc

Frame from 00:23:<fake_token_around_image><global-img><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><fake_token_around_image>

Describe this video in detail<end_of_utterance>
Assistant: Source Source Source Chain Source Chain #0
 Ch Source Source Source to  0    Ch #0
_ #0 0 0 Ch #0 <row_1_col_1>
 # # Bon # Bon # Bon Bon Bon # # # Bon # Bon # # # #0    Source Source Source Source Source Source Source Source Source Source Source Source Source Source Source Source Source Source Source Source Source Source Source Source Source Source Source Source Source Source Source Source to<row_1_col_1><row_1_col_1><row_1_col_1> 1<row_1_col_1>

Notes:

559 frames selected (!)
Bad generations (the inputs are malformed)
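
A small guard like the following could make the oversized prompt fail fast instead of producing garbage. This is only a sketch: assert_fits_context is a hypothetical helper, and the default limit of 8192 is taken from the tokenizer warning in the output above rather than read from the model config.

```python
def assert_fits_context(num_tokens: int, max_length: int = 8192) -> None:
    """Hypothetical guard: raise if the prompt exceeds the context length."""
    if num_tokens > max_length:
        raise ValueError(
            f"Prompt has {num_tokens} tokens, which exceeds the "
            f"maximum sequence length of {max_length}."
        )


# The prompt length from the warning above would be rejected.
try:
    assert_fits_context(43092)
    rejected = False
except ValueError as err:
    rejected = True
    print(err)
```

In the repro, this would be called with inputs["input_ids"].shape[-1] right before model.generate.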

transformers @ v4.51.3:

<|im_start|>User: You are provided the following series of nine frames from a 0:00:09 [H:MM:SS] video.

... [frames skipped]

Assistant: The video presents a graphical representation of a large language model, specifically the "Largest Language Model" (LLM), which is designed to process and generate text. The model is divided into four layers, each with a specific function. The top layer, labeled "Languages," consists of 1000 neurons, each with a weight of 0.00000000000000000000000000

transformers @ v4.52.4:

<|im_start|>User: You are provided the following series of nine frames from a 0:00:09 [H:MM:SS] video.

Frame from 00:00:<fake_token_around_image><global-img><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><fake_token_around_image>
Frame from 00:01:00 [H:MM:SS]

Frame from 00:01:00 [H:MM:SS]

Frame from 00:01:00 [H:MM:SS]

Frame from 00:01:00 [H:MM:SS]

Frame from 00:01:00 [H:MM:SS]

Frame from 00:01:

(truncated there)
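
The malformed prompt above (frame headers with no image placeholders after the first frame) can be detected from the decoded text alone. Below is a minimal sketch; the marker strings are copied from the output above, and frame_image_token_counts and frames_without_images are hypothetical helpers written for this comment, not part of transformers.

```python
IMAGE_TOKEN = "<image>"
FRAME_MARKER = "Frame from "


def frame_image_token_counts(prompt: str) -> list[int]:
    """Count <image> placeholders following each 'Frame from' marker."""
    # Everything before the first marker is the prompt preamble; skip it.
    chunks = prompt.split(FRAME_MARKER)[1:]
    return [chunk.count(IMAGE_TOKEN) for chunk in chunks]


def frames_without_images(prompt: str) -> list[int]:
    """Return positions of frame headers that carry no <image> placeholder."""
    counts = frame_image_token_counts(prompt)
    return [i for i, n in enumerate(counts) if n == 0]


# Shortened stand-ins for the two kinds of prompt shown above.
well_formed = (
    "Frame from 00:00:<fake_token_around_image><global-img><image><image>"
    "<fake_token_around_image>\n"
    "Frame from 00:01:<fake_token_around_image><global-img><image><image>"
    "<fake_token_around_image>\n"
)
malformed = (
    "Frame from 00:00:<fake_token_around_image><global-img><image><image>"
    "<fake_token_around_image>\n"
    "Frame from 00:01:00 [H:MM:SS]\n\n"
    "Frame from 00:01:00 [H:MM:SS]\n"
)

print(frames_without_images(well_formed))
print(frames_without_images(malformed))
```

Running it on processor.batch_decode output (with skip_special_tokens=False, as in the repro) should give an empty list for a well-formed prompt.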

transformers @ v4.52.4 with 0a1ae31ee6e01ebb0618e324da8a381d8abf4152 cherry-picked on top:

Assistant: The video presents a graphical representation of a large language model, specifically the "Largest Language Model" (LLM), which is designed to process and generate text. The model is divided into four layers, each with a specific function. The top layer, labeled "Languages," consists of 1000 neurons, each with a weight of 0.00000000000000000000000000

(Same as v4.51.3)

pcuenca · 2025-07-01T12:30:24Z

Closing, superseded by #39147.

smolvlm video processing: use rows and cols

0a1ae31

zucchini-nlp mentioned this pull request Jul 1, 2025

[smolvlm] fix video inference #39147

Merged

pcuenca closed this Jul 1, 2025

pcuenca wants to merge 1 commit into main from smolvlm-video-processing
