Skip to content

Add Fast Image Processor for Video-LLaVA#37023

Closed
ankithsavio wants to merge 6 commits intohuggingface:mainfrom
ankithsavio:llava_fast
Closed

Add Fast Image Processor for Video-LLaVA#37023
ankithsavio wants to merge 6 commits intohuggingface:mainfrom
ankithsavio:llava_fast

Conversation

@ankithsavio
Copy link
Copy Markdown

Related #36978
adds fast image processor to video-llava with appropriate tests
please let me know how it goes 😇

cc @yonigozlan

@github-actions
Copy link
Copy Markdown
Contributor

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

@github-actions github-actions Bot marked this pull request as draft March 27, 2025 00:43
@ankithsavio ankithsavio marked this pull request as ready for review March 27, 2025 01:20
@github-actions github-actions Bot requested review from ydshieh and yonigozlan March 27, 2025 01:21
Copy link
Copy Markdown
Member

@yonigozlan yonigozlan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ankithsavio , thanks for working on this! Amazing job, I don't see anything to change. Are all the tests passing?
We are adding video processor as a separate class, but I think we could still use this for backward compatibility, what do you think @zucchini-nlp ?

@ankithsavio
Copy link
Copy Markdown
Author

Hi @yonigozlan, Thank you reviewing. Yes, all the tests are passing on my end. I have also included appropriate tests for the changes. Keeping this for backward compatibility sounds good to me!

@zucchini-nlp
Copy link
Copy Markdown
Member

Hey @ankithsavio ! Thanks for the PR!

Indeed we are in the process of adding fast video processors, and the PR is almost ready to be merged (hopefully 🤞🏻 ). But the video processor is responsible only for video inputs, while VideoLlava can handle images as well. Therefore adding a fast image processor is a good idea

Though restricting its input signature to images will be better imo, I'll merge the video PR until fast processors become default and the model will have to use different classes for images vs videos

@ankithsavio
Copy link
Copy Markdown
Author

Hi @zucchini-nlp, Thanks for the review!
Just to clarify, are you suggesting that the fast image processor for video-llava should strictly handle only image inputs and not support videos? Should I modify the input signature to enforce this restriction explicitly, or would you prefer a different approach?
Looking forward to your thoughts!

@zucchini-nlp
Copy link
Copy Markdown
Member

Should I modify the input signature to enforce this restriction explicitly, or would you prefer a different approach?

Image processors already accept only images, and we added a keyword arg videos for certain models only. For fast processors, I prefer to not accept any videos as possible arg

@ankithsavio
Copy link
Copy Markdown
Author

Hi @zucchini-nlp, I’ve updated the code to handle only images as input - this will allow cleaner integration with the upcoming fast video processor. I’ve also modifies the tests to handle the changes between the two image processors. Please let me know if anything else needs adjustment.

@zucchini-nlp
Copy link
Copy Markdown
Member

Thanks, lgtm in relation to videos. @yonigozlan will review the PR for fast processing :)

@ankithsavio
Copy link
Copy Markdown
Author

Hi @yonigozlan, just following up to see if you had a chance to review the new updates. Let me know if there's anything else that needs adjustment. Thank you!

@samrae7
Copy link
Copy Markdown
Contributor

samrae7 commented Apr 4, 2025

@ankithsavio

Can you help me understand something about the PR I'm trying to do:

I went down a similar path to you and tried to add the FastImageProcessor to Vivit, which is a video model, and tried to handle videos as well (although I think that's why I still have two test failing - I haven't properly overwritten preprocess to handle video format as opposed to image).Draft PR

Based on feedback above you changed your PR to only handle images, not videos, in the Fast Processor. I looked at the commit where you did that and I can't see how or where it conditionally only handles images. Could you help me understand?

@zucchini-nlp Based on the feedback you gave here, and given ViVit is all about video, should I do the same for Vivit and only handle images, or just leave it alone completely for now?

Thanks

@ankithsavio
Copy link
Copy Markdown
Author

Hi @samrae7 happy to help!

where you did that and I can't see how or where it conditionally only handles images. Could you help me understand?

As mentioned in #36978, the basic fast image processor that was generated was sufficient in this case as it only handles images. However, the tests required some tweaking since the fast processor behaves a bit differently from the base image processor used by video_llava.
As for the videos, the maintainers plan to use different fast processors to handle images and videos separately.

@samrae7
Copy link
Copy Markdown
Contributor

samrae7 commented Apr 7, 2025

Hi @samrae7 happy to help!

where you did that and I can't see how or where it conditionally only handles images. Could you help me understand?

As mentioned in #36978, the basic fast image processor that was generated was sufficient in this case as it only handles images. However, the tests required some tweaking since the fast processor behaves a bit differently from the base image processor used by video_llava. As for the videos, the maintainers plan to use different fast processors to handle images and videos separately.

Thanks for the reply @ankithsavio

My confusion was that I couldn't see some logic that says "if the input is image use this processor, otherwise if it's video don't" but I guess that is up to the consumer to manage - they choose the processor appropriate to their use case

Thanks again

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants