Add support for including in-memory videos (not just files/urls) in apply_chat_template by akibjawad · Pull Request #39494 · huggingface/transformers

akibjawad · 2025-07-18T03:14:35Z

What does this PR do?

Fixes #36560, This PR allows inclusion of in-memory video objects, as dictionary of frames and metadata, in the chat template.

Previously:
Chat template accepted only file-paths or urls in the chat_template. If user (a developer using transformers library) collected videos from a continuous stream or any input devices, user had to store the video in a file and provide file path in chat messages.

Now (after this PR):
Users can collect video frames from streams or devices, provide metadata (describing fps), and directly pass those in the chat_template as a dictionary object. It frees the user from saving the video in files, and increases efficiency by reducing extra IO operation to reload the video again from files.

Notes:
Additionally, this PR also fixes hardcoded values used for testing (in assertions) apply_chat_template_videos for models like internvl, qwen2_vl, qwen2_5_vl, qwen2_5_omni.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). No
Did you read the contributor guideline,
Pull Request section? Yes
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case.
Yes. Issue link: Allow video objects (np array etc.) in apply_chat_template (not just paths or urls) #36560
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?
- in tests/test_processing_common.py file: I added a new type of video input which will be included in the chat messages while testing functionality of apply_chat_template.
- Added a new test with batchsize 3 for testing in-memory video objects in chat_template. Additionally updated hardcoded assertion (video_len check) for testing with increased batch_size in 4 models:
  - tests/models/internvl/test_processor_internvl.py
  - tests/models/qwen2_vl/test_processor_qwen2_vl.py
  - tests/models/qwen2_5_vl/test_processor_qwen2_5_vl.py
  - tests/models/qwen2_5_omni/test_processor_qwen2_5_omni.py
  - tests/models/smolvlm/test_processor_smolvlm.py (skip testing smolvlm with list of frames)

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Specifically mentioning @zucchini-nlp for review. Feel free to tag other members/contributors who may be interested to review this PR.

Rocketknight1 · 2025-07-18T12:31:03Z

cc @zucchini-nlp

akibjawad · 2025-07-30T05:05:08Z

requesting review @zucchini-nlp @Rocketknight1 @ArthurZucker @FredrikNoren

zucchini-nlp

Thanks a lot for the PR @akibjawad ! I feel like this is increasing LOC unnecessarily and could be done with less changes from our side. What if we update load_video to early exit when an array is found instead of trying to decode

The only constraint would be that users have to be consistent with video type within one conversation. So if one started using decoded frames in convo, they have to use decoded frames format for subsequent videos in the chat. Otherwise handling video_metadata can become hard

akibjawad · 2025-07-31T00:02:01Z

Thanks a lot for the PR @akibjawad ! I feel like this is increasing LOC unnecessarily and could be done with less changes from our side. What if we update load_video to early exit when an array is found instead of trying to decode

The only constraint would be that users have to be consistent with video type within one conversation. So if one started using decoded frames in convo, they have to use decoded frames format for subsequent videos in the chat. Otherwise handling video_metadata can become hard

@zucchini-nlp, thank you very much for reviewing. I do agree with your notion to keep the library lean. In fact, my initial implementation was exactly what you mentioned. Later I changed the design, because load_video() is meant to collect video_frames and metadata from a source (such as file, url). When a user provides video frames (ndarray/tensor) in a conversation, user already loaded the video some way (either from a file, livestream, screen record, camera devices, or randomly generated etc.) and user do not want to save the frames to a file. While collecting frames, user might also collect metadata. That is why I kept option for both frames & metadata. Additionally, without any metadata, frame_sampling with fps is not possible. As you mentioned earlier, user must be consistent and cannot use fps parameter for sampling while using decoded_frames or video as a list of image file names. Because in those cases metadata will be none. Although we can provide a default metadata for consistent sampling.

To accept video frames as array, do we actually need to modify load_video() function? Because, if we are returning early from the function, we can simply detect video type is an array and collect the video frames from the if else block in apply_chat_template function of the processor, saving an extra step of calling the function. However load_video() is an utility function and I assume it is used in many parts of the code-base, If we include array handling in the load_video() function, it will be useful for other places also. Although load_video() has other parameters (fps, num_frames) for sampling. From the current implementation, it looks like sampling is done at the video processor class, not at the apply_chat_template phase.

I have noticed you have been working on video_processor for a long time and you have better idea about the complete pipeline and future of this ever changing code. Let me know, which solution would you prefer.

zucchini-nlp · 2025-07-31T11:16:46Z

To accept video frames as array, do we actually need to modify load_video() function?

Yes, it currently has a bug because it checks for isinstance(video, array) later and thus fails. We just need to change the order of conditional checks

The user is still free to pass metadata as kwargs to apply_chat_template and it should be picked up, I am doing another update here #39600 and prob that will fix it. I am currently stuck on different task but will continue on video decoding soon

akibjawad · 2025-07-31T20:57:19Z

@zucchini-nlp Thank you for the clarification, I updated the code of load_video() and handled decoded frames same as handling a list of image file names. I kept everything else (metadata handling) same so that this changes will not create too much conflict with your PR (#39600). Please review again and let me know if I need to change anything else.

zucchini-nlp

Thanks for iterating on this, a few comments and if tests are passing, let's merge

akibjawad · 2025-08-01T11:35:55Z

@zucchini-nlp Thank you very much for the review. I addressed your reviews in the most recent commit. As a maintainer, you need to initiate some github workflows for complete testing. Let me know, if there is any remaining issues with current implementation.

…adata, in chat template

…tadata) in the chat template

… number of tests cases.

…tests

…ation about including video object in chat template.

github-actions · 2025-08-02T12:28:26Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: internvl, qwen2_5_omni, qwen2_5_vl, qwen2_vl, smolvlm

zucchini-nlp

Thanks for iterating! LGTM

HuggingFaceDocBuilderDev · 2025-08-04T09:32:27Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

…pply_chat_template (huggingface#39494) * added code for handling video object ,as dictionary of frames and metadata, in chat template * added new test where videos are passed as objects (dict of frames, metadata) in the chat template * modified hardcoded video_len check that does not match with increased number of tests cases. * Modify hardcoded video_len check that fails with increased number of tests * update documentation of multi-modal chat templating with extra information about including video object in chat template. * add array handling in load_video() * temporary test video inlcuded * skip testing smolvlm with videos that are list of frames * update documentation & make fixup * Address review comments

akibjawad marked this pull request as draft July 18, 2025 03:38

akibjawad force-pushed the video_object_in_apply_chat_template branch 2 times, most recently from 7e0880b to a3d74ed Compare July 27, 2025 17:37

akibjawad force-pushed the video_object_in_apply_chat_template branch 4 times, most recently from dacb18e to c203e9f Compare July 29, 2025 00:09

akibjawad changed the title ~~[WIP] Add support for including video object in apply_chat_template function~~ [WIP] Add support for including in-memory videos (not just files/urls) in apply_chat_template Jul 29, 2025

akibjawad marked this pull request as ready for review July 29, 2025 01:12

github-actions Bot requested review from ArthurZucker and Rocketknight1 July 29, 2025 01:13

akibjawad changed the title ~~[WIP] Add support for including in-memory videos (not just files/urls) in apply_chat_template~~ Add support for including in-memory videos (not just files/urls) in apply_chat_template Jul 29, 2025

akibjawad force-pushed the video_object_in_apply_chat_template branch 6 times, most recently from f2855b0 to c7142ea Compare July 30, 2025 04:53

zucchini-nlp reviewed Jul 30, 2025

View reviewed changes

akibjawad force-pushed the video_object_in_apply_chat_template branch from c7142ea to b4b905e Compare July 30, 2025 19:46

akibjawad force-pushed the video_object_in_apply_chat_template branch 2 times, most recently from 03e400b to ef6ce1f Compare July 31, 2025 19:37

akibjawad force-pushed the video_object_in_apply_chat_template branch from 377acc9 to cf3df35 Compare July 31, 2025 21:02

zucchini-nlp reviewed Aug 1, 2025

View reviewed changes

Comment thread src/transformers/video_utils.py Outdated

Comment thread src/transformers/video_utils.py Outdated

Comment thread tests/models/qwen2_5_omni/test_processor_qwen2_5_omni.py Outdated

akibjawad force-pushed the video_object_in_apply_chat_template branch from cf3df35 to 08612aa Compare August 1, 2025 10:39

akibjawad added 10 commits August 2, 2025 05:27

added code for handling video object ,as dictionary of frames and met…

289376e

…adata, in chat template

added new test where videos are passed as objects (dict of frames, me…

cec5f3c

…tadata) in the chat template

modified hardcoded video_len check that does not match with increased…

d699398

… number of tests cases.

Modify hardcoded video_len check that fails with increased number of …

6895aaf

…tests

update documentation of multi-modal chat templating with extra inform…

f79d0d8

…ation about including video object in chat template.

add array handling in load_video()

e9f95d3

temporary test video inlcuded

410c812

skip testing smolvlm with videos that are list of frames

09318d8

update documentation & make fixup

8719706

Address review comments

db06f5b

akibjawad force-pushed the video_object_in_apply_chat_template branch from 7429b30 to db06f5b Compare August 2, 2025 12:27

zucchini-nlp approved these changes Aug 4, 2025

View reviewed changes

zucchini-nlp merged commit 2a9febd into huggingface:main Aug 4, 2025
25 checks passed

Conversation

akibjawad commented Jul 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Before submitting

Who can review?

Uh oh!

Rocketknight1 commented Jul 18, 2025

Uh oh!

akibjawad commented Jul 30, 2025

Uh oh!

zucchini-nlp left a comment

Choose a reason for hiding this comment

Uh oh!

akibjawad commented Jul 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zucchini-nlp commented Jul 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

akibjawad commented Jul 31, 2025

Uh oh!

zucchini-nlp left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

akibjawad commented Aug 1, 2025

Uh oh!

github-actions Bot commented Aug 2, 2025

Uh oh!

zucchini-nlp left a comment

Choose a reason for hiding this comment

Uh oh!

HuggingFaceDocBuilderDev commented Aug 4, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

akibjawad commented Jul 18, 2025 •

edited

Loading

akibjawad commented Jul 31, 2025 •

edited

Loading

zucchini-nlp commented Jul 31, 2025 •

edited

Loading