Add Fast Image Processor for Video-LLaVA #37023
ankithsavio wants to merge 6 commits into huggingface:main
|
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the |
yonigozlan
left a comment
Hi @ankithsavio, thanks for working on this! Amazing job, I don't see anything to change. Are all the tests passing?
We are adding a video processor as a separate class, but I think we could still use this for backward compatibility. What do you think, @zucchini-nlp?
|
Hi @yonigozlan, thank you for reviewing. Yes, all the tests are passing on my end. I have also included appropriate tests for the changes. Keeping this for backward compatibility sounds good to me! |
|
Hey @ankithsavio! Thanks for the PR! Indeed, we are in the process of adding fast video processors, and the PR is almost ready to be merged (hopefully 🤞🏻). But the video processor is responsible only for video inputs, while VideoLlava can handle images as well. Therefore, adding a fast image processor is a good idea. Though restricting its input signature to |
|
Hi @zucchini-nlp, Thanks for the review! |
Image processors already accept only images, and we added a keyword arg |
|
Hi @zucchini-nlp, I’ve updated the code to handle only images as input; this will allow cleaner integration with the upcoming fast video processor. I’ve also modified the tests to account for the differences between the two image processors. Please let me know if anything else needs adjustment. |
|
Thanks, lgtm in relation to |
|
Hi @yonigozlan, just following up to see if you had a chance to review the new updates. Let me know if there's anything else that needs adjustment. Thank you! |
|
Can you help me understand something about the PR I'm trying to do? I went down a similar path to you and tried to add the FastImageProcessor to ViViT, which is a video model, and tried to handle videos as well (although I think that's why I still have two tests failing: I haven't properly overridden preprocess to handle video format as opposed to image). Draft PR
Based on the feedback above, you changed your PR to handle only images, not videos, in the Fast Processor. I looked at the commit where you did that and I can't see how or where it conditionally handles only images. Could you help me understand?
@zucchini-nlp Based on the feedback you gave here, and given that ViViT is all about video, should I do the same for ViViT and handle only images, or just leave it alone completely for now? Thanks |
|
Hi @samrae7 happy to help!
As mentioned in #36978, the basic fast image processor that was generated was sufficient in this case as it only handles images. However, the tests required some tweaking since the fast processor behaves a bit differently from the base image processor used by |
Thanks for the reply, @ankithsavio. My confusion was that I couldn't see any logic that says "if the input is an image, use this processor; otherwise, if it's a video, don't". But I guess that is up to the consumer to manage: they choose the processor appropriate to their use case. Thanks again.
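The point settled in this exchange, that the caller picks the processor and the processor itself does not branch on input type, can be illustrated with a minimal sketch. The class names `ImageOnlyProcessor` and `VideoOnlyProcessor`, the helper `preprocess_inputs`, and the output keys are hypothetical stand-ins written for this illustration; they are not the transformers classes and do not reflect the code in this PR.

```python
# Hypothetical stand-ins for illustration only; NOT the transformers API.


class ImageOnlyProcessor:
    """Accepts a batch of images and has no notion of video."""

    def preprocess(self, images):
        # Each image is a nested list shaped (height, width, channels).
        return {"pixel_values_images": [self._rescale(img) for img in images]}

    @staticmethod
    def _rescale(image):
        # Map 0..255 integer pixel values to 0.0..1.0 floats.
        return [[[value / 255.0 for value in pixel] for pixel in row] for row in image]


class VideoOnlyProcessor:
    """Accepts a batch of videos, each a list of frames."""

    def __init__(self):
        self._frame_processor = ImageOnlyProcessor()

    def preprocess(self, videos):
        # Reuse the image path frame by frame; the video class owns the video key.
        processed = [
            self._frame_processor.preprocess(frames)["pixel_values_images"]
            for frames in videos
        ]
        return {"pixel_values_videos": processed}


def preprocess_inputs(images=None, videos=None):
    """The caller, not either processor, decides which path each input takes."""
    outputs = {}
    if images is not None:
        outputs.update(ImageOnlyProcessor().preprocess(images))
    if videos is not None:
        outputs.update(VideoOnlyProcessor().preprocess(videos))
    return outputs


if __name__ == "__main__":
    image = [[[255, 0, 0], [0, 255, 0]]]  # one row, two pixels, three channels
    print(sorted(preprocess_inputs(images=[image])))
    print(sorted(preprocess_inputs(videos=[[image, image]])))
```

Neither stand-in class inspects its input to decide whether it is an image or a video; the only branching lives in the caller, which matches the conclusion reached in the thread.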
Related #36978
adds fast image processor to video-llava with appropriate tests
please let me know how it goes 😇
cc @yonigozlan