
Support batch size > 1 image-text inference #36682

Merged
zucchini-nlp merged 27 commits into huggingface:main from hiyouga:patch-14
Sep 1, 2025

Conversation

@hiyouga
Contributor

@hiyouga hiyouga commented Mar 12, 2025

What does this PR do?

This PR follows #35558 and #40263.

Consider a batch of image lists, where the first example has 1 image and the second example has 0 images, e.g.:

images = [
  [Image],
  []
]

Using the latest code, this raises a ValueError: "Invalid input type. Must be a single image, a list of images, or a list of batches of images."

In this PR, we use any instead of all to determine whether the input is a valid nested list of images. Note that this behavior matches the one in transformers 4.48.0:

https://github.com/huggingface/transformers/blob/v4.48.0/src/transformers/models/mllama/image_processing_mllama.py#L535-L541

# If it's a list of batches, it's already in the right format
elif (
    isinstance(images, (list, tuple))
    and all(isinstance(images_i, (list, tuple)) for images_i in images)
    and any(is_valid_list_of_images(images_i) for images_i in images)
):
    output_images = images
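To illustrate the difference, here is a minimal, self-contained sketch of the check. The is_valid_list_of_images helper below is a simplified stand-in for the transformers function of the same name (here, any non-list object counts as an "image"), not the library source:

```python
from typing import Any

def is_valid_list_of_images(images: Any) -> bool:
    # Simplified stand-in: a non-empty list/tuple whose elements are
    # not themselves lists/tuples. The real helper checks image types.
    return (
        isinstance(images, (list, tuple))
        and len(images) > 0
        and all(not isinstance(im, (list, tuple)) for im in images)
    )

def is_nested_list_of_images(images: Any) -> bool:
    # The PR's check: every element must be a list, but only *any* of
    # them needs to contain images, so empty per-example lists pass.
    return (
        isinstance(images, (list, tuple))
        and all(isinstance(images_i, (list, tuple)) for images_i in images)
        and any(is_valid_list_of_images(images_i) for images_i in images)
    )

# With `all`, [["img"], []] would be rejected because [] is not a valid
# list of images; with `any`, it is accepted.
print(is_nested_list_of_images([["img"], []]))  # True
print(is_nested_list_of_images([[], []]))       # False: no example has an image
```

With all in place of any, the second example's empty list would fail the check and the whole batch would be rejected, which is exactly the ValueError described above.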

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@zucchini-nlp

@github-actions github-actions bot marked this pull request as draft March 12, 2025 17:56
@github-actions
Contributor

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the Ready for review button (at the bottom of the PR page).

@hiyouga hiyouga marked this pull request as ready for review March 12, 2025 17:58
@hiyouga hiyouga force-pushed the patch-14 branch 3 times, most recently from 2d81f59 to 03e338e Compare March 13, 2025 11:48
@zucchini-nlp
Member

Question before reviewing: why do we pass an empty list for a no-image prompt? What if we just do images = [[Image]] instead of images = [[Image], []]?

@hiyouga
Contributor Author

hiyouga commented Mar 13, 2025

@zucchini-nlp Assuming the batch size is 2, we expect the length of the list of image lists to match the batch size.

Member

@zucchini-nlp zucchini-nlp left a comment


I see, makes sense. Also cc @yonigozlan since you added these functions, do you see any edge cases if we check with any instead of all?

Otherwise LGTM

@yonigozlan
Member

Hi @hiyouga! Thanks for flagging this issue. I agree we should support inputs such as [[image1], []]. Right now it seems to be causing some issues with the SmolVLM processor, but this is more of a problem with SmolVLM than with this PR.

The issue I see is that we would no longer catch an error for inputs like [[image1], image2], when we should. But we cannot catch every possible wrong input format, so this might not be too bad. WDYT @zucchini-nlp?

@zucchini-nlp
Member

@yonigozlan agreed, I think we can expect users to use a consistent format within one input.

@hiyouga there's a failing test which I think is caused by this PR, can you take a look?

@hiyouga hiyouga force-pushed the patch-14 branch 7 times, most recently from ac56330 to 0b9acfc Compare March 14, 2025 16:21
@hiyouga
Contributor Author

hiyouga commented Mar 14, 2025

Hi @zucchini-nlp, I have made the necessary changes to Gemma3ImageProcessor, Idefics2ImageProcessor, Idefics3ImageProcessor, and SmolVLMImageProcessor so that they support inputs like [[image], []] and [[], [image]].

Comment on lines 292 to 294
images = [self.image1]
with self.assertRaises(ValueError):
    processor(text=text, images=images, padding=True)
Member


I didn't get why this doesn't throw an error anymore. IMO passing flat images is ambiguous, and we should throw an error instead of trying to infer which text corresponds to which image.

Member

@zucchini-nlp zucchini-nlp left a comment


@hiyouga great, thanks for handling the tests!

I see why we need to flatten images with the new changes, but I don't like calling it every time an image is needed. I'd suggest saving one image in a variable at the beginning and adding a small comment on why we do that, so future us doesn't delete it :)

@hiyouga hiyouga force-pushed the patch-14 branch 7 times, most recently from e2c82a4 to 5d4a4fb Compare March 17, 2025 16:19
@qubvel
Contributor

qubvel commented Apr 7, 2025

Waiting for the CI to be green to merge 😄

@hiyouga
Contributor Author

hiyouga commented Apr 7, 2025

@qubvel It seems that the llama4 integration breaks all the processor unit tests: https://github.com/huggingface/transformers/commits/main/

Collaborator

@ArthurZucker ArthurZucker left a comment


Can you document what this enables? Like in the pipeline md?

@hiyouga
Copy link
Contributor Author

hiyouga commented Apr 8, 2025

@ArthurZucker This PR mainly enables the ImageTextToTextPipeline to accept both image-text and text-only inputs within a single batch. However, I'm not sure where I should add the documentation. Could you provide some instructions?
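As a hedged sketch of the mixed batch this enables: the chat-message schema below follows the usual transformers conversational format, and the model name in the commented-out call is an assumption for illustration, not something confirmed by this PR.

```python
# One example with an image and one text-only example in the same batch.
messages_with_image = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/cat.png"},
            {"type": "text", "text": "What animal is in this picture?"},
        ],
    }
]
messages_text_only = [
    {
        "role": "user",
        "content": [{"type": "text", "text": "Write a haiku about autumn."}],
    }
]
batch = [messages_with_image, messages_text_only]

# The actual pipeline call would look roughly like this
# (requires downloading a model, hence commented out):
# from transformers import pipeline
# pipe = pipeline("image-text-to-text", model="HuggingFaceTB/SmolVLM-Instruct")
# outputs = pipe(text=batch, batch_size=2)

print(len(batch))  # 2: only the first example carries an image
```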

Collaborator

@ArthurZucker ArthurZucker left a comment


Sorry, you are right, it's kind of obvious that it should support batch > 1 images. What I mean is to have a small doc example somewhere for people to play with it! Let's fix the conflicts and get this merged 🔥

@zucchini-nlp
Member

@hiyouga I am taking over this PR, due to a request from the TRL team to support this feature. It would be great to merge it soon.

@zucchini-nlp
Member

zucchini-nlp commented Aug 26, 2025

@ArthurZucker let's merge this to fix VLM training in TRL

If anyone wants to have another look: I added a test case, fixed a few new models, and replaced all occurrences of make_list_of_images with make_flat_list_of_images. The former should be removed after a small deprecation period since it is no longer used anywhere; the two have the same functionality.

I will merge at the end of the week if no one has comments.
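For readers unfamiliar with the helper, here is a rough sketch of the contract make_flat_list_of_images provides. This is a simplified illustration of the behavior, not the library source; the real function also validates actual image types:

```python
from typing import Any, List

def make_flat_list_of_images(images: Any) -> List[Any]:
    # Sketch of the contract: accept a single image, a flat list of
    # images, or a list of per-example lists (possibly containing empty
    # lists, as in [[image], []]), and return one flat list of images.
    if isinstance(images, (list, tuple)):
        if all(isinstance(i, (list, tuple)) for i in images):
            # Nested case: concatenate per-example lists; empty lists
            # simply contribute nothing.
            return [im for images_i in images for im in images_i]
        return list(images)
    return [images]

print(make_flat_list_of_images([["a"], []]))       # ['a']
print(make_flat_list_of_images([[], ["b", "c"]]))  # ['b', 'c']
```

Flattening like this is what lets downstream processors treat [[image], []] and [image] uniformly once the per-example bookkeeping is done.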

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@github-actions
Contributor

github-actions bot commented Sep 1, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: beit, bit, chinese_clip, conditional_detr, convnext, deformable_detr, deit, depth_pro, detr, donut, dpt, efficientnet, emu3, eomt, flava, fuyu

@zucchini-nlp zucchini-nlp enabled auto-merge (squash) September 1, 2025 12:16
@zucchini-nlp zucchini-nlp merged commit 564be6d into huggingface:main Sep 1, 2025
24 checks passed