Add Phi3.5 Vision Model #41977

Open

yaswanth19 wants to merge 40 commits into huggingface:main from yaswanth19:add-phi3-vision

Conversation

yaswanth19 (Contributor) commented Nov 2, 2025

Closes #36036

Rocketknight1 (Member)

cc @zucchini-nlp when you get a chance!

zucchini-nlp (Member)

Phi3.5 💀 I will take a look some time this week

yaswanth19 (Contributor, Author)

@zucchini-nlp PR should be ready for review next week. Will ping you at that time.

yaswanth19 (Contributor, Author) commented Nov 10, 2025

@zucchini-nlp PR is ready for initial review 🤗. CI has been broken since the weekend, but it was all green previously, so no issues there.

Comment on lines +246 to +256
def get_image_features(self, pixel_values: torch.Tensor, image_sizes, num_images, num_crops):
# Process image using CLIP model.
vision_outputs = self.vision_model(pixel_values, output_hidden_states=True)

# Extract the hidden states from the second last layer.
hidden_state = vision_outputs.hidden_states[-2][:, 1:]
hidden_state = hidden_state.reshape(num_images, num_crops, -1, self.image_dim_out)

# Transform the image features to text embedding space.
image_features = self.transform_image_embeds(hidden_state, image_sizes)
return image_features
yaswanth19 (Contributor, Author), Nov 10, 2025

Mainly because of this function: the image features are transformed and projected in a somewhat non-standard way, and the reliance on image sizes makes it difficult to run inference with num_return_sequences > 1, since the image inputs are then not synced with the repeated input_ids. The affected tests, such as beam search, are therefore skipped.

Member

I see where it can go wrong: generation has a huge bias towards text-like inputs and assumes the first dimension is the batch size. OK, let's skip it for now; it's not a super common feature.
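
To make the mismatch concrete, here's a minimal sketch (shapes assumed for illustration, not the actual transformers internals):

import torch

# Generation repeats text inputs along dim 0 for num_return_sequences...
input_ids = torch.randint(0, 100, (1, 16))     # (batch_size, seq_len)
# ...but the image inputs here are laid out as (num_images, num_crops, ...).
pixel_values = torch.randn(1, 5, 3, 336, 336)  # (num_images, num_crops, C, H, W)

num_return_sequences = 3
expanded_ids = input_ids.repeat_interleave(num_return_sequences, dim=0)  # (3, 16)
# pixel_values is still (1, 5, ...): the leading dims no longer line up,
# which is why the beam search / num_return_sequences tests are skipped.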

Comment thread: src/transformers/models/phi3_v/processing_phi3_v.py (outdated)
zucchini-nlp (Member) left a comment

@yaswanth19 thanks for the PR, looks much cleaner already!

I left some comments, mostly nit-picking for better standardization. Also I believe there's one test failing with Phi3.5V :)

Comment thread: docs/source/en/model_doc/phi3_v.md (outdated)
Comment thread: docs/source/en/model_doc/phi3_v.md (outdated)
Comment thread: src/transformers/models/auto/modeling_auto.py
Comment thread: src/transformers/models/phi3_v/convert_phi3_v_weights_to_hf.py (outdated)
Comment thread: src/transformers/models/phi3_v/image_processing_phi3_v_fast.py (outdated)
"""Add the newline token embeds to the image feature patches"""
num_images, h, w, hid_dim = image_features.shape
newline_embeddings = self.sub_GN.expand(num_images, h, -1, -1)
image_features_newline = torch.cat([image_features, newline_embeddings], dim=2)
Member

we might need to adjust the devices and dtypes before merging

Member

Not addressed. The image feats and the newline embed can end up on different devices with multi-GPU, or with different dtypes if the user changes the config's torch_dtype param.

We can move the newline to the same device/dtype as the image embeddings.
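
A minimal sketch of that fix (assuming sub_GN is the learned newline parameter; not the final PR code):

# Move the newline embedding to the image features' device/dtype before
# concatenating, so multi-GPU and non-default torch_dtype setups both work.
newline_embeddings = self.sub_GN.to(
    device=image_features.device, dtype=image_features.dtype
).expand(num_images, h, -1, -1)
image_features_newline = torch.cat([image_features, newline_embeddings], dim=2)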

Comment thread: src/transformers/models/phi3_v/modular_phi3_v.py
Comment thread: src/transformers/models/phi3_v/modular_phi3_v.py (outdated)
Comment thread: tests/models/phi3_v/test_image_processing_phi3_v.py (outdated)
processor_dict = self.prepare_processor_dict()
self.assertTrue(processor_loaded.chat_template == processor_dict.get("chat_template", None))

@unittest.skip("Not possible as processor creates a custom attention mask.")
Member

The mask format doesn't look custom, even though it's prepared manually instead of being delegated to the tokenizer.

yaswanth19 (Contributor, Author), Nov 15, 2025

I am skipping this test because it requires offset mapping, which is quite difficult to fetch because of the way we tokenize the prompt.

yaswanth19 (Contributor, Author)

@zucchini-nlp Ready for another review. The test failures are unrelated; can you please also trigger the slow CI here?

yaswanth19 (Contributor, Author) commented Nov 20, 2025

@zucchini-nlp Can you trigger the slow tests so that I can push changes for the multi-GPU tests if required? You can do the review later, depending on your bandwidth.

zucchini-nlp (Member)

run-slow: phi3_v

zucchini-nlp (Member)

great, reviewing tomorrow!

github-actions (Contributor)

This comment contains run-slow, running the specified jobs:

models: ["models/phi3_v"]
quantizations: []

HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

github-actions (Contributor)

CI Results

Workflow Run ⚙️

Model CI Report

❌ Failed tests

  • phi3_v:
    tests/models/phi3_v/test_modeling_phi3_v.py::Phi3VIntegrationTest::test_model_text_generation
    tests/models/phi3_v/test_modeling_phi3_v.py::Phi3VIntegrationTest::test_model_text_generation_batched
    tests/models/phi3_v/test_modeling_phi3_v.py::Phi3VIntegrationTest::test_model_text_generation_with_multi_image

zucchini-nlp (Member) left a comment

Thanks a lot for iterating! The PR looks good overall, with a few nits to address. After resolving these comments, I think you can request a review from the core maintainers (ArthurZucker or Cyrilvallez).

We are currently focusing on the major v5 release, so the core maintainer review may be a bit delayed; the model will probably be merged after the release. Also, there was a big refactor on weight loading recently, so if your tests are failing for unrelated reasons feel free to rebase.


# Phi-3.5 Vision

Member

nit: can we add a small intro for the model here before the abstract?

@@ -0,0 +1,359 @@
# coding=utf-8
Member

Not sure if Microsoft will be willing to host converted weights on the Hub. We just recently added a dynamic weight converter as a public API; I think it will be the way to go for Phi3V.

If you want to play around with it and update the conversion, this is the file that needs to be modified. IMO it's totally fine if you are out of bandwidth; we'll add the conversion mapping after v5 for this model and Phi4-Multimodal.


# Calculate the number of image tokens for each image based on height and width dynamically.
num_img_tokens = [
int(((h // size["height"]) * (w // size["width"]) + 1) * 144 + 1 + (h // size["height"] + 1) * 12)
Member

I think it's still nice to comment clearly where 144 comes from. I guess it has to do with the model's patch size when embedding the image features, or with the pooling stride.
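
For reference, one possible reading of the constants (an assumption based on the Phi-3.5-vision setup, not confirmed in this thread), written as the kind of comment being asked for:

# 144 = (336 // 14 // 2) ** 2: tokens per 336x336 crop after ViT patchification
# (patch size 14 -> 24x24 patches) and 2x2 pooling (-> 12x12 tokens).
# The `* 12` term would then add one newline token per pooled row.
# (Assumed derivation; verify against the original Phi-3.5-vision code.)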

image_processor=None,
tokenizer=None,
chat_template=None,
image_token="<|image|>",
Member

We don't want to have two sources of truth for the same attribute, so if users want to change the image_token they can do so in the tokenizer. For example: tokenizer.image_token_id = 10000.
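
A minimal sketch of that pattern (attribute names assumed for illustration):

# Read the image token from the tokenizer instead of keeping a second copy
# on the processor, so there is a single source of truth.
image_token = getattr(tokenizer, "image_token", "<|image|>")
image_token_id = tokenizer.convert_tokens_to_ids(image_token)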

Comment on lines +54 to +57
attributes = ["image_processor", "tokenizer"]
image_processor_class = "AutoImageProcessor"
tokenizer_class = "AutoTokenizer"

Member

These are not needed after the recent refactor; we will infer the attributes dynamically from the __init__ signature and always load from the auto-mapping.
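
So the class header could shrink to something like this (a sketch, assuming the base class infers attributes from the __init__ signature and resolves classes via auto-mapping):

class Phi3VProcessor(ProcessorMixin):
    # No `attributes` / `*_class` declarations; they are inferred from the
    # __init__ signature and loaded via auto-mapping after the refactor.
    def __init__(self, image_processor=None, tokenizer=None, chat_template=None, **kwargs):
        super().__init__(image_processor, tokenizer, chat_template=chat_template)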

"""Add the newline token embeds to the image feature patches"""
num_images, h, w, hid_dim = image_features.shape
newline_embeddings = self.sub_GN.expand(num_images, h, -1, -1)
image_features_newline = torch.cat([image_features, newline_embeddings], dim=2)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not addressed. Image feats and the newline embed can end up in different devices with mutliGPU or dtypes if user changes config's torch_dtype param

We can move the newline to the same device/dtype as image embeddings

Comment on lines +246 to +256
def get_image_features(self, pixel_values: torch.Tensor, image_sizes, num_images, num_crops):
# Process image using CLIP model.
vision_outputs = self.vision_model(pixel_values, output_hidden_states=True)

# Extract the hidden states from the second last layer.
hidden_state = vision_outputs.hidden_states[-2][:, 1:]
hidden_state = hidden_state.reshape(num_images, num_crops, -1, self.image_dim_out)

# Transform the image features to text embedding space.
image_features = self.transform_image_embeds(hidden_state, image_sizes)
return image_features
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see where it can go wrong, generation has a huge bias towards text-like inputs and assumes the first dimension is batch size. Oke, let's skip for now, it's not a super common feature

Comment on lines +176 to +193
@unittest.skip("Not possible now as processor creates a custom attention mask.")
def test_assisted_decoding_matches_greedy_search_0_random(self):
pass

@unittest.skip("Not possible now as processor creates a custom attention mask.")
def test_assisted_decoding_matches_greedy_search_1_same(self):
pass

@unittest.skip("Not possible now as processor creates a custom attention mask.")
def test_prompt_lookup_decoding_matches_greedy_search(self):
pass

@unittest.skip("Not possible now as processor creates a custom attention mask.")
def test_assisted_decoding_sample(self):
pass

@unittest.skip("Not possible now as processor creates a custom attention mask.")
def test_apply_chat_template_assistant_mask(self):
Member

The skip reason needs an update, or we can delete the skips if this is already supported.

Comment on lines +86 to +88
@unittest.skip("Not possible as processor can't create an assistant mask.")
def test_apply_chat_template_assistant_mask(self):
pass
Member

Same here; I expect it works now after changing the code for placeholder expansion.

def test_apply_chat_template_assistant_mask(self):
pass

def test_unstructured_kwargs_batched(self):
Member

Curious why we needed to overwrite this one test while other similar tests are passing? A comment explaining the reason would be great.

github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, phi3_v

github-actions (Contributor)

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=41977&sha=796319
