Support batch size > 1 image-text inference #36682

zucchini-nlp merged 27 commits into huggingface:main from
Conversation
Force-pushed 2d81f59 to 03e338e
Question before reviewing: why do we pass an empty list for a no-image prompt? What if we just do
@zucchini-nlp Assuming the batch size is 2, we expect the length of the image list to be the same as the batch size
zucchini-nlp left a comment
I see, makes sense. Also cc @yonigozlan since you added these functions, do you see any edge cases if we check `any`?
Otherwise LGTM
Hi @hiyouga! Thanks for flagging this issue. I agree we should support such inputs. The issue I see is that we wouldn't catch an error now if we have
@yonigozlan agreed, I think we can expect users to use a consistent format within one input. @hiyouga there's a failing test which I think is caused by this PR, can you take a look?
Force-pushed ac56330 to 0b9acfc
Hi @zucchini-nlp, I have made the necessary changes to
    images = [self.image1]
    with self.assertRaises(ValueError):
        processor(text=text, images=images, padding=True)
Didn't get why this doesn't throw an error anymore; IMO passing flat images is ambiguous, and we throw errors instead of trying to infer which text corresponds to which image.
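To illustrate the ambiguity being discussed, here is a hypothetical sketch (the placeholder strings and variable names are illustrative, not the actual processor API): with a flat list, the mapping from images to prompts is guesswork; with a nested list, it is explicit.

```python
# Hypothetical sketch of why a flat image list is ambiguous for batched text.
text = ["<image> Caption this.", "<image> And this too."]

# Flat list: does each prompt get one image, or does the first get both?
flat_images = ["img_a", "img_b"]

# Nested list: the pairing is explicit, one inner list per prompt.
nested_images = [["img_a"], ["img_b"]]

assert len(nested_images) == len(text)  # outer length matches the batch size
pairing = {prompt: imgs for prompt, imgs in zip(text, nested_images)}
```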
zucchini-nlp left a comment
@hiyouga great, thanks for handling the tests!
I see why we need to flatten images with the new changes, but I don't like calling it every time an image is needed. I'd suggest saving one image in a variable at the beginning and adding a small comment on why we do that, so future us don't delete it :)
Force-pushed e2c82a4 to 5d4a4fb
Waiting for the CI to be green to merge 😄
@qubvel It seems that the llama4 integration breaks all the processor unit tests: https://github.com/huggingface/transformers/commits/main/
ArthurZucker left a comment
Can you document what this enables? Like in the pipeline md?
@ArthurZucker This PR mainly enables the
ArthurZucker left a comment
Sorry, you are right, it's kind of obvious that it should support batch > 1 images; what I mean is to have a small doc example somewhere for people to play with! Let's fix the conflicts and get this merged 🔥
@hiyouga I am taking over this PR due to a demand from the TRL team to support the feature. Would be great to merge it soon.
@ArthurZucker let's merge this to fix VLM training in TRL. If anyone wants to have another look: I added a test case, fixed a few new models, and changed all occurrences of

I will merge at the end of the week if no one has comments.
What does this PR do?
This PR follows #35558 and #40263.
Consider a batch of image lists where the first example has 1 image and the second example has 0 images, e.g.:
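A minimal sketch of such a mixed batch (the placeholder strings stand in for actual images, and the prompt wording is invented; the nesting layout is the point):

```python
# Placeholder strings stand in for real PIL images; the layout is the point.
text = [
    "<image> Describe this picture.",   # example 1: one image
    "What is the capital of France?",   # example 2: no image
]
images = [["image1"], []]  # one image list per example; the second is empty

assert len(images) == len(text)  # outer list length equals the batch size
counts = [len(imgs) for imgs in images]
print(counts)  # prints [1, 0]
```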
Using the latest code, it raises a ValueError: "Invalid input type. Must be a single image, a list of images, or a list of batches of images.". In this PR, we use `any` instead of `all` to judge whether it is a valid nested list of images. Note that this behavior is the same as the one in transformers 4.48.0: https://github.com/huggingface/transformers/blob/v4.48.0/src/transformers/models/mllama/image_processing_mllama.py#L535-L541
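A simplified sketch of the `all` → `any` change (`is_valid_nested_image_batch` is an illustrative stand-in, not the actual transformers helper):

```python
def is_valid_nested_image_batch(images):
    """Illustrative stand-in for the changed check, not the actual
    transformers helper. A batch is a list of per-example image lists."""
    if not isinstance(images, list) or not all(isinstance(ex, list) for ex in images):
        return False
    # Old behavior (all): every inner list must be non-empty, so [[img], []]
    # is rejected. New behavior (any): one non-empty example is enough.
    return any(len(ex) > 0 for ex in images)

print(is_valid_nested_image_batch([["img"], []]))  # True: mixed batch accepted
print(is_valid_nested_image_batch([[], []]))       # False: no images at all
print(is_valid_nested_image_batch("img"))          # False: not a nested list
```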
(transformers/src/transformers/models/mllama/image_processing_mllama.py, lines 535 to 541 in 6bc0fbc)
Who can review?
@zucchini-nlp