Add Phi3.5 Vision Model #41977

Open

yaswanth19 wants to merge 40 commits into huggingface:main from yaswanth19:add-phi3-vision

Conversation

yaswanth19 (Contributor) commented Nov 2, 2025

Closes #36036

Rocketknight1 (Member)

cc @zucchini-nlp when you get a chance!

zucchini-nlp (Member)

Phi3.5 💀 I will take a look some time this week

yaswanth19 (Contributor, Author)

@zucchini-nlp PR should be ready for review next week. Will ping you at that time.

yaswanth19 (Contributor, Author) commented Nov 10, 2025

@zucchini-nlp PR is ready for initial review 🤗. CI has been broken since the weekend, but it was all green previously, so no issues there.

Comment on lines +246 to +256
def get_image_features(self, pixel_values: torch.Tensor, image_sizes, num_images, num_crops):
# Process image using CLIP model.
vision_outputs = self.vision_model(pixel_values, output_hidden_states=True)

# Extract the hidden states from the second last layer.
hidden_state = vision_outputs.hidden_states[-2][:, 1:]
hidden_state = hidden_state.reshape(num_images, num_crops, -1, self.image_dim_out)

# Transform the image features to text embedding space.
image_features = self.transform_image_embeds(hidden_state, image_sizes)
return image_features
yaswanth19 (Contributor, Author), Nov 10, 2025

Mainly because of this function: the image features are transformed and projected in a somewhat non-standard way, and the reliance on image sizes makes it difficult to run inference with num_return_sequences > 1, since the image inputs are then not synced with the repeated input_ids. The affected tests, such as beam search, are therefore skipped.

Member

I see where it can go wrong: generation has a huge bias towards text-like inputs and assumes the first dimension is the batch size. OK, let's skip it for now; it's not a super common feature.
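
To make the mismatch concrete, here's a minimal sketch (shapes assumed for illustration, not the actual transformers internals):

import torch

# Generation repeats text inputs along dim 0 for num_return_sequences...
input_ids = torch.randint(0, 100, (1, 16))     # (batch_size, seq_len)
# ...but the image inputs here are laid out as (num_images, num_crops, ...).
pixel_values = torch.randn(1, 5, 3, 336, 336)  # (num_images, num_crops, C, H, W)

num_return_sequences = 3
expanded_ids = input_ids.repeat_interleave(num_return_sequences, dim=0)  # (3, 16)
# pixel_values is still (1, 5, ...): the leading dims no longer line up,
# which is why the beam search / num_return_sequences tests are skipped.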

Comment thread: src/transformers/models/phi3_v/processing_phi3_v.py (outdated)
zucchini-nlp (Member) left a comment

@yaswanth19 thanks for the PR, looks much cleaner already!

I left some comments, mostly nit-picking for better standardization. Also I believe there's one test failing with Phi3.5V :)

Comment thread: docs/source/en/model_doc/phi3_v.md (outdated)
Comment thread: docs/source/en/model_doc/phi3_v.md (outdated)
Comment thread: src/transformers/models/auto/modeling_auto.py
Comment thread: src/transformers/models/phi3_v/convert_phi3_v_weights_to_hf.py (outdated)
Comment thread: src/transformers/models/phi3_v/image_processing_phi3_v_fast.py (outdated)
"""Add the newline token embeds to the image feature patches"""
num_images, h, w, hid_dim = image_features.shape
newline_embeddings = self.sub_GN.expand(num_images, h, -1, -1)
image_features_newline = torch.cat([image_features, newline_embeddings], dim=2)
Member

we might need to adjust the devices and dtypes before merging

Member

Not addressed. The image feats and the newline embed can end up on different devices with multi-GPU, or with different dtypes if the user changes the config's torch_dtype param.

We can move the newline to the same device/dtype as the image embeddings.
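
A minimal sketch of that fix (assuming sub_GN is the learned newline parameter; not the final PR code):

# Move the newline embedding to the image features' device/dtype before
# concatenating, so multi-GPU and non-default torch_dtype setups both work.
newline_embeddings = self.sub_GN.to(
    device=image_features.device, dtype=image_features.dtype
).expand(num_images, h, -1, -1)
image_features_newline = torch.cat([image_features, newline_embeddings], dim=2)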

Comment thread: src/transformers/models/phi3_v/modular_phi3_v.py
Comment thread: src/transformers/models/phi3_v/modular_phi3_v.py (outdated)
Comment thread: tests/models/phi3_v/test_image_processing_phi3_v.py (outdated)
processor_dict = self.prepare_processor_dict()
self.assertTrue(processor_loaded.chat_template == processor_dict.get("chat_template", None))

@unittest.skip("Not possible as processor creates a custom attention mask.")
Member

The mask format doesn't look custom, even though it's prepared manually instead of being delegated to the tokenizer.

yaswanth19 (Contributor, Author), Nov 15, 2025

I am skipping this test because it requires offset mapping, which is quite difficult to fetch because of the way we tokenize the prompt.

yaswanth19 (Contributor, Author)

@zucchini-nlp Ready for another review. The test failures are unrelated; can you please also trigger the slow CI here?

yaswanth19 (Contributor, Author) commented Nov 20, 2025

@zucchini-nlp Can you trigger the slow tests so that I can push changes for the multi-GPU tests if required? You can do the review later, depending on your bandwidth.

zucchini-nlp (Member)

run-slow: phi3_v

zucchini-nlp (Member)

great, reviewing tomorrow!

github-actions (Contributor)

This comment contains run-slow, running the specified jobs:

models: ["models/phi3_v"]
quantizations: []

HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

github-actions (Contributor)

CI Results

Workflow Run ⚙️

Model CI Report

❌ Failed tests

  • phi3_v:
    tests/models/phi3_v/test_modeling_phi3_v.py::Phi3VIntegrationTest::test_model_text_generation
    tests/models/phi3_v/test_modeling_phi3_v.py::Phi3VIntegrationTest::test_model_text_generation_batched
    tests/models/phi3_v/test_modeling_phi3_v.py::Phi3VIntegrationTest::test_model_text_generation_with_multi_image

zucchini-nlp (Member) left a comment

Thanks a lot for iterating! The PR looks good overall, with a few nits to address. After resolving these comments, I think you can request a review from the core maintainers (ArthurZucker or Cyrilvallez).

We are currently focusing on the major v5 release, so the core maintainer review may be a bit delayed; the model will probably be merged after the release. Also, there was a big refactor on weight loading recently, so if your tests are failing for unrelated reasons feel free to rebase.


# Phi-3.5 Vision

Member

nit: can we add a small intro for the model here before the abstract?

@@ -0,0 +1,359 @@
# coding=utf-8
Member

Not sure if Microsoft will be willing to host converted weights on the Hub. We just recently added a dynamic weight converter as a public API; I think it will be the way to go for Phi3V.

If you want to play around with it and update the conversion, this is the file that needs to be modified. IMO it's totally fine if you are out of bandwidth; we'll add the conversion mapping after v5 for this model and Phi4-Multimodal.


# Calculate the number of image tokens for each image based on height and width dynamically.
num_img_tokens = [
int(((h // size["height"]) * (w // size["width"]) + 1) * 144 + 1 + (h // size["height"] + 1) * 12)
Member

I think it's still nice to comment clearly where 144 comes from. I guess it has to do with the model's patch size when embedding the image features, or with the pooling stride.
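
For reference, one possible reading of the constants (an assumption based on the Phi-3.5-vision setup, not confirmed in this thread), written as the kind of comment being asked for:

# 144 = (336 // 14 // 2) ** 2: tokens per 336x336 crop after ViT patchification
# (patch size 14 -> 24x24 patches) and 2x2 pooling (-> 12x12 tokens).
# The `* 12` term would then add one newline token per pooled row.
# (Assumed derivation; verify against the original Phi-3.5-vision code.)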

image_processor=None,
tokenizer=None,
chat_template=None,
image_token="<|image|>",
Member

We don't want to have two sources of truth for the same attribute, so if users want to change the image_token they can do so in the tokenizer. For example: tokenizer.image_token_id = 10000.
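
A minimal sketch of that pattern (attribute names assumed for illustration):

# Read the image token from the tokenizer instead of keeping a second copy
# on the processor, so there is a single source of truth.
image_token = getattr(tokenizer, "image_token", "<|image|>")
image_token_id = tokenizer.convert_tokens_to_ids(image_token)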

Comment on lines +54 to +57
attributes = ["image_processor", "tokenizer"]
image_processor_class = "AutoImageProcessor"
tokenizer_class = "AutoTokenizer"

Member

These are not needed after the recent refactor; we will infer the attributes dynamically from the __init__ signature and always load from the auto-mapping.
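
So the class header could shrink to something like this (a sketch, assuming the base class infers attributes from the __init__ signature and resolves classes via auto-mapping):

class Phi3VProcessor(ProcessorMixin):
    # No `attributes` / `*_class` declarations; they are inferred from the
    # __init__ signature and loaded via auto-mapping after the refactor.
    def __init__(self, image_processor=None, tokenizer=None, chat_template=None, **kwargs):
        super().__init__(image_processor, tokenizer, chat_template=chat_template)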

"""Add the newline token embeds to the image feature patches"""
num_images, h, w, hid_dim = image_features.shape
newline_embeddings = self.sub_GN.expand(num_images, h, -1, -1)
image_features_newline = torch.cat([image_features, newline_embeddings], dim=2)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not addressed. Image feats and the newline embed can end up in different devices with mutliGPU or dtypes if user changes config's torch_dtype param

We can move the newline to the same device/dtype as image embeddings

Comment on lines +246 to +256
def get_image_features(self, pixel_values: torch.Tensor, image_sizes, num_images, num_crops):
# Process image using CLIP model.
vision_outputs = self.vision_model(pixel_values, output_hidden_states=True)

# Extract the hidden states from the second last layer.
hidden_state = vision_outputs.hidden_states[-2][:, 1:]
hidden_state = hidden_state.reshape(num_images, num_crops, -1, self.image_dim_out)

# Transform the image features to text embedding space.
image_features = self.transform_image_embeds(hidden_state, image_sizes)
return image_features
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see where it can go wrong, generation has a huge bias towards text-like inputs and assumes the first dimension is batch size. Oke, let's skip for now, it's not a super common feature

Comment on lines +176 to +193
@unittest.skip("Not possible now as processor creates a custom attention mask.")
def test_assisted_decoding_matches_greedy_search_0_random(self):
pass

@unittest.skip("Not possible now as processor creates a custom attention mask.")
def test_assisted_decoding_matches_greedy_search_1_same(self):
pass

@unittest.skip("Not possible now as processor creates a custom attention mask.")
def test_prompt_lookup_decoding_matches_greedy_search(self):
pass

@unittest.skip("Not possible now as processor creates a custom attention mask.")
def test_assisted_decoding_sample(self):
pass

@unittest.skip("Not possible now as processor creates a custom attention mask.")
def test_apply_chat_template_assistant_mask(self):
Member

The skip reason needs an update, or we can delete the skips if this is already supported.

Comment on lines +86 to +88
@unittest.skip("Not possible as processor can't create an assistant mask.")
def test_apply_chat_template_assistant_mask(self):
pass
Member

Same here; I expect it works now after changing the code for placeholder expansion.

def test_apply_chat_template_assistant_mask(self):
pass

def test_unstructured_kwargs_batched(self):
Member

Curious why we needed to overwrite this one test while other similar tests are passing? A comment explaining the reason would be great.

github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, phi3_v

github-actions (Contributor)

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=41977&sha=796319
