Add Fast Image Processor for Video-LLaVA by ankithsavio · Pull Request #37023 · huggingface/transformers

ankithsavio · 2025-03-27T00:43:31Z

Related #36978
Adds a fast image processor to Video-LLaVA, with appropriate tests.
Please let me know how it goes 😇

github-actions · 2025-03-27T00:43:41Z

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

yonigozlan

Hi @ankithsavio , thanks for working on this! Amazing job, I don't see anything to change. Are all the tests passing?
We are adding a video processor as a separate class, but I think we could still use this for backward compatibility. What do you think, @zucchini-nlp?

ankithsavio · 2025-03-27T16:36:47Z

Hi @yonigozlan, thank you for reviewing. Yes, all the tests are passing on my end. I have also included appropriate tests for the changes. Keeping this for backward compatibility sounds good to me!

zucchini-nlp · 2025-03-28T08:57:21Z

Hey @ankithsavio ! Thanks for the PR!

Indeed we are in the process of adding fast video processors, and the PR is almost ready to be merged (hopefully 🤞🏻 ). But the video processor is responsible only for video inputs, while VideoLlava can handle images as well. Therefore adding a fast image processor is a good idea

That said, restricting its input signature to images would be better imo. I'll merge the video PR before fast processors become the default, and the model will then have to use different classes for images vs. videos.

ankithsavio · 2025-03-29T08:37:26Z

Hi @zucchini-nlp, Thanks for the review!
Just to clarify, are you suggesting that the fast image processor for video-llava should strictly handle only image inputs and not support videos? Should I modify the input signature to enforce this restriction explicitly, or would you prefer a different approach?
Looking forward to your thoughts!

zucchini-nlp · 2025-03-31T07:14:34Z

> Should I modify the input signature to enforce this restriction explicitly, or would you prefer a different approach?

Image processors already accept only images, and we added a keyword arg videos for certain models only. For fast processors, I prefer not to accept videos as a possible arg.
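A minimal sketch of what an image-only signature might look like, written in plain Python rather than with the actual transformers classes. `ImageOnlyProcessor`, its `preprocess` method, and the `Image` alias are hypothetical names used only for illustration. The point is that `videos` is simply not an accepted argument, so a caller who passes it gets an error at call time.

```python
from typing import List, Sequence

Image = List[List[float]]  # hypothetical stand-in: a 2-D grid of pixel values


class ImageOnlyProcessor:
    """Hypothetical image-only processor; not the transformers implementation."""

    def __init__(self, rescale_factor: float = 1 / 255):
        self.rescale_factor = rescale_factor

    def preprocess(self, images: Sequence[Image]) -> List[Image]:
        # Only `images` is accepted; there is deliberately no `videos` keyword.
        if not images:
            raise ValueError("Expected at least one image.")
        return [
            [[pixel * self.rescale_factor for pixel in row] for row in image]
            for image in images
        ]


processor = ImageOnlyProcessor()
batch = processor.preprocess(images=[[[0.0, 255.0], [51.0, 102.0]]])

try:
    # Passing videos fails because the signature does not allow it.
    processor.preprocess(images=[], videos=[[[[0.0]]]])
except TypeError as err:
    rejected = str(err)
```

With this shape no runtime branching on input type is needed: the restriction lives in the signature itself.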

ankithsavio · 2025-03-31T10:27:44Z

Hi @zucchini-nlp, I’ve updated the code to handle only images as input. This will allow cleaner integration with the upcoming fast video processor. I’ve also modified the tests to account for the differences between the two image processors. Please let me know if anything else needs adjustment.

zucchini-nlp · 2025-03-31T14:40:54Z

Thanks, lgtm in relation to videos. @yonigozlan will review the PR for fast processing :)

ankithsavio · 2025-04-03T15:23:58Z

Hi @yonigozlan, just following up to see if you had a chance to review the new updates. Let me know if there's anything else that needs adjustment. Thank you!

samrae7 · 2025-04-04T07:30:30Z

@ankithsavio

Can you help me understand something about the PR I'm trying to do:

I went down a similar path to you and tried to add the FastImageProcessor to ViViT, which is a video model, and tried to handle videos as well (although I think that's why I still have two tests failing: I haven't properly overridden preprocess to handle the video format as opposed to the image format). Draft PR

Based on feedback above you changed your PR to only handle images, not videos, in the Fast Processor. I looked at the commit where you did that and I can't see how or where it conditionally only handles images. Could you help me understand?

@zucchini-nlp Based on the feedback you gave here, and given ViViT is all about video, should I do the same for ViViT and only handle images, or just leave it alone completely for now?

Thanks

ankithsavio · 2025-04-04T10:47:58Z

Hi @samrae7 happy to help!

> where you did that and I can't see how or where it conditionally only handles images. Could you help me understand?

As mentioned in #36978, the basic fast image processor that was generated was sufficient in this case as it only handles images. However, the tests required some tweaking since the fast processor behaves a bit differently from the base image processor used by video_llava.
As for the videos, the maintainers plan to use different fast processors to handle images and videos separately.
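The thread does not show the actual test changes. As a generic illustration of why a slow and a fast processor may need a tolerance-based comparison rather than exact equality, here is a small sketch in plain Python. It assumes the fast path fuses rescaling and normalization into one step; `slow_path`, `fast_path`, `assert_equivalent`, and the constants are all hypothetical.

```python
import math
from typing import List

MEAN, STD, SCALE = 0.5, 0.25, 1 / 255  # hypothetical values for illustration


def slow_path(pixels: List[float]) -> List[float]:
    # Two separate steps: rescale to [0, 1], then normalize.
    rescaled = [p * SCALE for p in pixels]
    return [(p - MEAN) / STD for p in rescaled]


def fast_path(pixels: List[float]) -> List[float]:
    # One fused step: fold the rescale factor into the mean and std.
    fused_mean, fused_std = MEAN / SCALE, STD / SCALE
    return [(p - fused_mean) / fused_std for p in pixels]


def assert_equivalent(a: List[float], b: List[float], atol: float = 1e-6) -> None:
    # Compare element-wise within a tolerance instead of requiring exact equality.
    assert len(a) == len(b)
    for x, y in zip(a, b):
        assert math.isclose(x, y, abs_tol=atol), (x, y)


pixels = [0.0, 64.0, 128.0, 255.0]
assert_equivalent(slow_path(pixels), fast_path(pixels))
```

Both paths compute the same quantity mathematically, but the order of floating-point operations differs, which is one reason an equivalence test might compare within a tolerance.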

samrae7 · 2025-04-07T06:11:00Z

> Hi @samrae7 happy to help!
>
> > where you did that and I can't see how or where it conditionally only handles images. Could you help me understand?
>
> As mentioned in #36978, the basic fast image processor that was generated was sufficient in this case as it only handles images. However, the tests required some tweaking since the fast processor behaves a bit differently from the base image processor used by video_llava. As for the videos, the maintainers plan to use different fast processors to handle images and videos separately.

Thanks for the reply @ankithsavio

My confusion was that I couldn't see any logic that says "if the input is an image, use this processor; otherwise, if it's a video, don't", but I guess that is up to the consumer to manage: they choose the processor appropriate to their use case.

Thanks again
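The caller-side choice described above could be sketched roughly as follows. This is not code from the PR: `preprocess_inputs`, the stand-in processors, and the output key names are assumptions made for illustration. Each modality is routed to its own processor by whoever calls them, and neither processor inspects the input type itself.

```python
from typing import Any, Callable, Dict, Optional, Sequence


def preprocess_inputs(
    image_processor: Callable[[Sequence[Any]], Any],
    video_processor: Callable[[Sequence[Any]], Any],
    images: Optional[Sequence[Any]] = None,
    videos: Optional[Sequence[Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical caller-side routing: each modality goes to its own processor."""
    if images is None and videos is None:
        raise ValueError("Provide at least one of `images` or `videos`.")
    outputs: Dict[str, Any] = {}
    if images is not None:
        outputs["pixel_values_images"] = image_processor(images)
    if videos is not None:
        outputs["pixel_values_videos"] = video_processor(videos)
    return outputs


# Stand-in processors that only report how many items they received.
def count_images(items: Sequence[Any]) -> str:
    return f"{len(items)} image(s)"


def count_videos(items: Sequence[Any]) -> str:
    return f"{len(items)} video(s)"


result = preprocess_inputs(count_images, count_videos, images=["img0", "img1"])
```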

github-actions Bot marked this pull request as draft March 27, 2025 00:43

ankithsavio marked this pull request as ready for review March 27, 2025 01:20

github-actions Bot requested review from ydshieh and yonigozlan March 27, 2025 01:21

yonigozlan reviewed Mar 27, 2025

yonigozlan mentioned this pull request Mar 27, 2025

[Contributions Welcome] Add Fast Image Processors #36978

Closed

81 tasks

ankithsavio added 5 commits March 29, 2025 09:39

add video_llava fast implementation

eefa077

add test and fixes

b4dc251

minor fix

48fc7c6

minor fixes

4e6ed4a

add docs

5f773fa

ankithsavio force-pushed the llava_fast branch from 103bc49 to 5f773fa March 29, 2025 09:56

change fast processor to image only, modify tests

95c2481

ankithsavio closed this Aug 22, 2025

ankithsavio wants to merge 6 commits into huggingface:main from ankithsavio:llava_fast


4 participants
