Fix missing video inputs for PerceptionLM. #39971
zucchini-nlp merged 3 commits into huggingface:main
Conversation
molbap
left a comment
LGTM for the fix, cc @zucchini-nlp who made the initial change!
For the non-standard image inputs, OK but would be better with a test that goes with it
zucchini-nlp
left a comment
Okay, thanks! I think we need to standardize output shapes from the image processor to be consistent, though.
Maybe we can always return 5D pixels, or already-flattened 4D pixels? Whichever way looks good; we have models doing both.
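To illustrate the two layouts being discussed, here is a minimal numpy sketch (shapes are hypothetical, not taken from the PerceptionLM processor): a 5D layout keeps the tiles axis explicit, while the flattened 4D layout merges batch and tiles into one axis.

```python
import numpy as np

# Hypothetical shapes: a batch of 2 images, each split into 4 tiles.
batch, tiles, C, H, W = 2, 4, 3, 224, 224
pixels_5d = np.zeros((batch, tiles, C, H, W), dtype=np.float32)

# Option A: keep the explicit 5D layout (batch, tiles, C, H, W).
# Option B: flatten batch and tiles into a single leading axis,
# giving the 4D layout (batch * tiles, C, H, W).
pixels_4d = pixels_5d.reshape(-1, C, H, W)

print(pixels_4d.shape)  # (8, 3, 224, 224)
```

Either convention works as long as the processor and the model agree on one of them.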
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@zucchini-nlp The reason shape unification is done in the models rather than in image_processing is that I noticed the model sees a different input shape in training than in eval/inference.
@zucchini-nlp Let me split the PR and merge the more urgent fix first?
[For maintainers] Suggested jobs to run (before merge): run-slow: perception_lm
@zucchini-nlp My bad. Just realized this comes from the collate_fn in my training script (I added one dimension). Let me open another PR for this simple fix to the image processor, and update the corresponding training script in the model card.
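For context, this is the kind of thing being described: a collate_fn that stacks per-sample arrays adds a leading batch axis, so the model can see 5D pixels in training while the processor emits 4D pixels at inference. The sketch below is a hypothetical reconstruction, not the actual training script:

```python
import numpy as np

# Hypothetical collate_fn: stacking per-sample (tiles, C, H, W) arrays
# prepends a batch axis, yielding 5D (batch, tiles, C, H, W) pixels.
def collate_fn(batch):
    return np.stack([sample["pixel_values"] for sample in batch])

samples = [
    {"pixel_values": np.zeros((4, 3, 224, 224), dtype=np.float32)}
    for _ in range(2)
]
print(collate_fn(samples).shape)  # (2, 4, 3, 224, 224)
```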
* Fix missing video inputs for PerceptionLM.
* Minor fix for vanilla input image (only C,H,W, no tiles dim).
* Revert "Minor fix for vanilla input image (only C,H,W, no tiles dim)." This reverts commit 181d87b.
Critical: Fixes missing video input for PerceptionLM (accidentally removed in a previous PR).
Minor: Add support for vanilla images that only have C,H,W dims but no tiles dim.
These are non-default image shapes for PLM, but they are useful in demos and on low-resource devices,
e.g., in the just-added "PLM Simple Fine-tuning Example" under
https://huggingface.co/facebook/Perception-LM-1B#plm-usage
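A minimal sketch of the minor fix described above (shapes and the normalization step are illustrative assumptions, not the actual transformers code): a vanilla (C, H, W) image can be brought to the tiled layout by inserting a singleton tiles axis.

```python
import numpy as np

# A "vanilla" image with only (C, H, W) dims and no tiles axis.
image = np.zeros((3, 448, 448), dtype=np.float32)

# Normalize to the tiled layout by inserting a singleton tiles axis:
# (C, H, W) -> (1, C, H, W). Images that already carry a tiles axis
# pass through unchanged.
if image.ndim == 3:
    image = image[np.newaxis, ...]

print(image.shape)  # (1, 3, 448, 448)
```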