
Support batch size > 1 image-text inference #36682

Merged
zucchini-nlp merged 27 commits into huggingface:main from hiyouga:patch-14
Sep 1, 2025

Conversation

@hiyouga
Contributor

@hiyouga hiyouga commented Mar 12, 2025

What does this PR do?

This PR follows #35558 and #40263.

Consider a batch of image lists, where the first example has 1 image and the second example has 0 images, e.g.:

images = [
  [Image],
  []
]

Using the latest code, this raises a ValueError: "Invalid input type. Must be a single image, a list of images, or a list of batches of images."

In this PR, we use any instead of all to determine whether the input is a valid nested list of images. Note that this behavior matches the one in transformers 4.48.0:

https://github.com/huggingface/transformers/blob/v4.48.0/src/transformers/models/mllama/image_processing_mllama.py#L535-L541

# If it's a list of batches, it's already in the right format
elif (
    isinstance(images, (list, tuple))
    and all(isinstance(images_i, (list, tuple)) for images_i in images)
    and any(is_valid_list_of_images(images_i) for images_i in images)
):
    output_images = images
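To illustrate the difference, here is a minimal, self-contained sketch of the check. The is_valid_list_of_images helper below is a simplified stand-in for the transformers function of the same name (here, any non-list object counts as an "image"), not the library source:

```python
from typing import Any

def is_valid_list_of_images(images: Any) -> bool:
    # Simplified stand-in: a non-empty list/tuple whose elements are
    # not themselves lists/tuples. The real helper checks image types.
    return (
        isinstance(images, (list, tuple))
        and len(images) > 0
        and all(not isinstance(im, (list, tuple)) for im in images)
    )

def is_nested_list_of_images(images: Any) -> bool:
    # The PR's check: every element must be a list, but only *any* of
    # them needs to contain images, so empty per-example lists pass.
    return (
        isinstance(images, (list, tuple))
        and all(isinstance(images_i, (list, tuple)) for images_i in images)
        and any(is_valid_list_of_images(images_i) for images_i in images)
    )

# With `all`, [["img"], []] would be rejected because [] is not a valid
# list of images; with `any`, it is accepted.
print(is_nested_list_of_images([["img"], []]))  # True
print(is_nested_list_of_images([[], []]))       # False: no example has an image
```

With all in place of any, the second example's empty list would fail the check and the whole batch would be rejected, which is exactly the ValueError described above.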

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@zucchini-nlp

@github-actions github-actions bot marked this pull request as draft March 12, 2025 17:56
@github-actions
Contributor

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the Ready for review button (at the bottom of the PR page).

@hiyouga hiyouga marked this pull request as ready for review March 12, 2025 17:58
@hiyouga hiyouga force-pushed the patch-14 branch 3 times, most recently from 2d81f59 to 03e338e Compare March 13, 2025 11:48
@zucchini-nlp
Member

Question before reviewing: why do we pass an empty list for a no-image prompt? What if we just do images = [[Image]] instead of images = [[Image], []]?

@hiyouga
Contributor Author

hiyouga commented Mar 13, 2025

@zucchini-nlp Assuming the batch size is 2, we expect the length of the list of image lists to match the batch size.

Member

@zucchini-nlp zucchini-nlp left a comment


I see, makes sense. Also cc @yonigozlan since you added these functions, do you see any edge cases if we check with any instead of all?

Otherwise LGTM

@yonigozlan
Member

Hi @hiyouga! Thanks for flagging this issue. I agree we should support inputs such as [[image1], []]. Right now it seems to be causing some issues with the SmolVLM processor, but this is more of a problem with SmolVLM than with this PR.

The issue I see is that we would no longer catch an error for inputs like [[image1], image2], when we should. But we cannot catch every possible wrong input format, so this might not be too bad. WDYT @zucchini-nlp?

@zucchini-nlp
Member

@yonigozlan agreed, I think we can expect users to use a consistent format within one input.

@hiyouga there's a failing test which I think is caused by this PR, can you take a look?

@hiyouga hiyouga force-pushed the patch-14 branch 7 times, most recently from ac56330 to 0b9acfc Compare March 14, 2025 16:21
@hiyouga
Contributor Author

hiyouga commented Mar 14, 2025

Hi @zucchini-nlp, I have made the necessary changes to Gemma3ImageProcessor, Idefics2ImageProcessor, Idefics3ImageProcessor, and SmolVLMImageProcessor so that they support inputs like [[image], []] and [[], [image]].

Comment on lines 292 to 294
images = [self.image1]
with self.assertRaises(ValueError):
    processor(text=text, images=images, padding=True)
Member


I didn't get why this doesn't throw an error anymore. IMO passing flat images is ambiguous, and we should throw an error instead of trying to infer which text corresponds to which image.

Member

@zucchini-nlp zucchini-nlp left a comment


@hiyouga great, thanks for handling the tests!

I see why we need to flatten images with the new changes, but I don't like calling it every time an image is needed. I'd suggest saving one image in a variable at the beginning and adding a small comment on why we do that, so future us doesn't delete it :)

@hiyouga hiyouga force-pushed the patch-14 branch 7 times, most recently from e2c82a4 to 5d4a4fb Compare March 17, 2025 16:19
@qubvel
Contributor

qubvel commented Apr 7, 2025

Waiting for the CI to be green to merge 😄

@hiyouga
Contributor Author

hiyouga commented Apr 7, 2025

@qubvel It seems that the llama4 integration breaks all the processor unit tests: https://github.com/huggingface/transformers/commits/main/

Collaborator

@ArthurZucker ArthurZucker left a comment


Can you document what this enables? Like in the pipeline md?

@hiyouga
Copy link
Contributor Author

hiyouga commented Apr 8, 2025

@ArthurZucker This PR mainly enables the ImageTextToTextPipeline to accept both image-text and text-only inputs within a single batch. However, I'm not sure where I should add the documentation. Could you provide some instructions?
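As a hedged sketch of the mixed batch this enables: the chat-message schema below follows the usual transformers conversational format, and the model name in the commented-out call is an assumption for illustration, not something confirmed by this PR.

```python
# One example with an image and one text-only example in the same batch.
messages_with_image = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/cat.png"},
            {"type": "text", "text": "What animal is in this picture?"},
        ],
    }
]
messages_text_only = [
    {
        "role": "user",
        "content": [{"type": "text", "text": "Write a haiku about autumn."}],
    }
]
batch = [messages_with_image, messages_text_only]

# The actual pipeline call would look roughly like this
# (requires downloading a model, hence commented out):
# from transformers import pipeline
# pipe = pipeline("image-text-to-text", model="HuggingFaceTB/SmolVLM-Instruct")
# outputs = pipe(text=batch, batch_size=2)

print(len(batch))  # 2: only the first example carries an image
```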

Collaborator

@ArthurZucker ArthurZucker left a comment


Sorry, you are right, it's kind of obvious that it should support batch > 1 images. What I mean is to have a small doc example somewhere for people to play with it! Let's fix the conflicts and get this merged 🔥

@zucchini-nlp
Member

@hiyouga I am taking over this PR, due to a request from the TRL team to support this feature. It would be great to merge it soon.

@zucchini-nlp
Member

zucchini-nlp commented Aug 26, 2025

@ArthurZucker let's merge this to fix VLM training in TRL

If anyone wants to have another look: I added a test case, fixed a few new models, and replaced all occurrences of make_list_of_images with make_flat_list_of_images. The former should be removed after a small deprecation period since it is no longer used anywhere; the two have the same functionality.

I will merge at the end of the week if no one has comments.
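For readers unfamiliar with the helper, here is a rough sketch of the contract make_flat_list_of_images provides. This is a simplified illustration of the behavior, not the library source; the real function also validates actual image types:

```python
from typing import Any, List

def make_flat_list_of_images(images: Any) -> List[Any]:
    # Sketch of the contract: accept a single image, a flat list of
    # images, or a list of per-example lists (possibly containing empty
    # lists, as in [[image], []]), and return one flat list of images.
    if isinstance(images, (list, tuple)):
        if all(isinstance(i, (list, tuple)) for i in images):
            # Nested case: concatenate per-example lists; empty lists
            # simply contribute nothing.
            return [im for images_i in images for im in images_i]
        return list(images)
    return [images]

print(make_flat_list_of_images([["a"], []]))       # ['a']
print(make_flat_list_of_images([[], ["b", "c"]]))  # ['b', 'c']
```

Flattening like this is what lets downstream processors treat [[image], []] and [image] uniformly once the per-example bookkeeping is done.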

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@github-actions
Contributor

github-actions bot commented Sep 1, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: beit, bit, chinese_clip, conditional_detr, convnext, deformable_detr, deit, depth_pro, detr, donut, dpt, efficientnet, emu3, eomt, flava, fuyu

@zucchini-nlp zucchini-nlp enabled auto-merge (squash) September 1, 2025 12:16
@zucchini-nlp zucchini-nlp merged commit 564be6d into huggingface:main Sep 1, 2025
24 checks passed