Add support for including in-memory videos (not just files/urls) in apply_chat_template#39494
Conversation
7e0880b to
a3d74ed
Compare
dacb18e to
c203e9f
Compare
f2855b0 to
c7142ea
Compare
|
requesting review @zucchini-nlp @Rocketknight1 @ArthurZucker @FredrikNoren |
zucchini-nlp
left a comment
There was a problem hiding this comment.
Thanks a lot for the PR @akibjawad ! I feel like this is increasing LOC unnecessarily and could be done with less changes from our side. What if we update load_video to early exit when an array is found instead of trying to decode
The only constraint would be that users have to be consistent with video type within one conversation. So if one started using decoded frames in convo, they have to use decoded frames format for subsequent videos in the chat. Otherwise handling video_metadata can become hard
c7142ea to
b4b905e
Compare
@zucchini-nlp, thank you very much for reviewing. I do agree with your notion to keep the library lean. In fact, my initial implementation was exactly what you mentioned. Later I changed the design, because To accept video frames as array, do we actually need to modify I have noticed you have been working on video_processor for a long time and you have better idea about the complete pipeline and future of this ever changing code. Let me know, which solution would you prefer. |
Yes, it currently has a bug because it checks for The user is still free to pass metadata as kwargs to |
03e400b to
ef6ce1f
Compare
|
@zucchini-nlp Thank you for the clarification, I updated the code of |
377acc9 to
cf3df35
Compare
zucchini-nlp
left a comment
There was a problem hiding this comment.
Thanks for iterating on this, a few comments and if tests are passing, let's merge
cf3df35 to
08612aa
Compare
|
@zucchini-nlp Thank you very much for the review. I addressed your reviews in the most recent commit. As a maintainer, you need to initiate some github workflows for complete testing. Let me know, if there is any remaining issues with current implementation. |
…adata, in chat template
…tadata) in the chat template
… number of tests cases.
…ation about including video object in chat template.
7429b30 to
db06f5b
Compare
|
[For maintainers] Suggested jobs to run (before merge) run-slow: internvl, qwen2_5_omni, qwen2_5_vl, qwen2_vl, smolvlm |
zucchini-nlp
left a comment
There was a problem hiding this comment.
Thanks for iterating! LGTM
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
…pply_chat_template (huggingface#39494) * added code for handling video object ,as dictionary of frames and metadata, in chat template * added new test where videos are passed as objects (dict of frames, metadata) in the chat template * modified hardcoded video_len check that does not match with increased number of tests cases. * Modify hardcoded video_len check that fails with increased number of tests * update documentation of multi-modal chat templating with extra information about including video object in chat template. * add array handling in load_video() * temporary test video inlcuded * skip testing smolvlm with videos that are list of frames * update documentation & make fixup * Address review comments
…pply_chat_template (huggingface#39494) * added code for handling video object ,as dictionary of frames and metadata, in chat template * added new test where videos are passed as objects (dict of frames, metadata) in the chat template * modified hardcoded video_len check that does not match with increased number of tests cases. * Modify hardcoded video_len check that fails with increased number of tests * update documentation of multi-modal chat templating with extra information about including video object in chat template. * add array handling in load_video() * temporary test video inlcuded * skip testing smolvlm with videos that are list of frames * update documentation & make fixup * Address review comments
…pply_chat_template (huggingface#39494) * added code for handling video object ,as dictionary of frames and metadata, in chat template * added new test where videos are passed as objects (dict of frames, metadata) in the chat template * modified hardcoded video_len check that does not match with increased number of tests cases. * Modify hardcoded video_len check that fails with increased number of tests * update documentation of multi-modal chat templating with extra information about including video object in chat template. * add array handling in load_video() * temporary test video inlcuded * skip testing smolvlm with videos that are list of frames * update documentation & make fixup * Address review comments
…pply_chat_template (huggingface#39494) * added code for handling video object ,as dictionary of frames and metadata, in chat template * added new test where videos are passed as objects (dict of frames, metadata) in the chat template * modified hardcoded video_len check that does not match with increased number of tests cases. * Modify hardcoded video_len check that fails with increased number of tests * update documentation of multi-modal chat templating with extra information about including video object in chat template. * add array handling in load_video() * temporary test video inlcuded * skip testing smolvlm with videos that are list of frames * update documentation & make fixup * Address review comments
…pply_chat_template (huggingface#39494) * added code for handling video object ,as dictionary of frames and metadata, in chat template * added new test where videos are passed as objects (dict of frames, metadata) in the chat template * modified hardcoded video_len check that does not match with increased number of tests cases. * Modify hardcoded video_len check that fails with increased number of tests * update documentation of multi-modal chat templating with extra information about including video object in chat template. * add array handling in load_video() * temporary test video inlcuded * skip testing smolvlm with videos that are list of frames * update documentation & make fixup * Address review comments
…pply_chat_template (huggingface#39494) * added code for handling video object ,as dictionary of frames and metadata, in chat template * added new test where videos are passed as objects (dict of frames, metadata) in the chat template * modified hardcoded video_len check that does not match with increased number of tests cases. * Modify hardcoded video_len check that fails with increased number of tests * update documentation of multi-modal chat templating with extra information about including video object in chat template. * add array handling in load_video() * temporary test video inlcuded * skip testing smolvlm with videos that are list of frames * update documentation & make fixup * Address review comments
…pply_chat_template (huggingface#39494) * added code for handling video object ,as dictionary of frames and metadata, in chat template * added new test where videos are passed as objects (dict of frames, metadata) in the chat template * modified hardcoded video_len check that does not match with increased number of tests cases. * Modify hardcoded video_len check that fails with increased number of tests * update documentation of multi-modal chat templating with extra information about including video object in chat template. * add array handling in load_video() * temporary test video inlcuded * skip testing smolvlm with videos that are list of frames * update documentation & make fixup * Address review comments
What does this PR do?
Fixes #36560, This PR allows inclusion of in-memory video objects, as dictionary of frames and metadata, in the chat template.
Previously:
Chat template accepted only file-paths or urls in the chat_template. If user (a developer using transformers library) collected videos from a continuous stream or any input devices, user had to store the video in a file and provide file path in chat messages.
Now (after this PR):
Users can collect video frames from streams or devices, provide metadata (describing fps), and directly pass those in the chat_template as a dictionary object. It frees the user from saving the video in files, and increases efficiency by reducing extra IO operation to reload the video again from files.
Notes:
Additionally, this PR also fixes hardcoded values used for testing (in assertions) apply_chat_template_videos for models like internvl, qwen2_vl, qwen2_5_vl, qwen2_5_omni.
Before submitting
Pull Request section? Yes
to it if that's the case.
Yes. Issue link: Allow video objects (np array etc.) in apply_chat_template (not just paths or urls) #36560
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Specifically mentioning @zucchini-nlp for review. Feel free to tag other members/contributors who may be interested to review this PR.