Add Phi3.5 Vision Model #41977
Conversation
cc @zucchini-nlp when you get a chance!

Phi3.5 💀 I will take a look sometime this week

@zucchini-nlp The PR should be ready for review next week. Will ping you at that time.

@zucchini-nlp The PR is ready for initial review 🤗. CI has been broken since the weekend, but it was all green previously, so no issues there.
```python
def get_image_features(self, pixel_values: torch.Tensor, image_sizes, num_images, num_crops):
    # Process the image using the CLIP vision model.
    vision_outputs = self.vision_model(pixel_values, output_hidden_states=True)

    # Extract the hidden states from the second-to-last layer, dropping the CLS token.
    hidden_state = vision_outputs.hidden_states[-2][:, 1:]
    hidden_state = hidden_state.reshape(num_images, num_crops, -1, self.image_dim_out)

    # Transform the image features to the text embedding space.
    image_features = self.transform_image_embeds(hidden_state, image_sizes)
    return image_features
```
Mainly because of this function, where the image features are transformed and projected in a somewhat non-standard way. The use of image sizes also makes it difficult to run inference with num_return_sequences>1, since the image inputs are then not synced with the repeated input_ids. The relevant tests, such as beam search, are therefore skipped.
I see where it can go wrong: generation has a huge bias towards text-like inputs and assumes the first dimension is the batch size. Okay, let's skip it for now, it's not a super common feature.
zucchini-nlp left a comment
@yaswanth19 thanks for the PR, looks much cleaner already!
I left some comments, mostly nit-picking for better standardization. Also I believe there's one test failing with Phi3.5V :)
| """Add the newline token embeds to the image feature patches""" | ||
| num_images, h, w, hid_dim = image_features.shape | ||
| newline_embeddings = self.sub_GN.expand(num_images, h, -1, -1) | ||
| image_features_newline = torch.cat([image_features, newline_embeddings], dim=2) |
we might need to adjust the devices and dtypes before merging
not addressed. Image feats and the newline embed can end up on different devices with multi-GPU, or with different dtypes if the user changes the config's torch_dtype param.
We can move the newline embedding to the same device/dtype as the image embeddings.
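A minimal sketch of that fix, reusing the names from the quoted snippet (whether `sub_GN` is a parameter or a buffer is an assumption here):

```python
# Align the newline embedding with the image features before concatenating,
# so multi-GPU placement or a user-set torch_dtype can't cause a mismatch.
newline_embeddings = self.sub_GN.to(device=image_features.device, dtype=image_features.dtype)
newline_embeddings = newline_embeddings.expand(num_images, h, -1, -1)
image_features_newline = torch.cat([image_features, newline_embeddings], dim=2)
```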
```python
processor_dict = self.prepare_processor_dict()
self.assertTrue(processor_loaded.chat_template == processor_dict.get("chat_template", None))

@unittest.skip("Not possible as processor creates a custom attention mask.")
```
the mask format doesn't look custom, even though it's prepared manually instead of being passed to the tokenizer
I am skipping this test because it requires offset mapping, which is quite difficult to fetch because of the way we tokenize the prompt.
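For reference, this is the kind of offset mapping the common test relies on; a small sketch with a fast tokenizer (the prompt string is just an example):

```python
# offset_mapping gives the (start, end) character span of each token in the
# original string; fast tokenizers can return it directly.
enc = tokenizer("A photo of <|image|>", return_offsets_mapping=True)
print(enc["offset_mapping"])  # e.g. [(0, 1), (2, 7), (8, 10), ...]
# When the prompt is split around image placeholders and tokenized piecewise,
# these spans no longer line up with the final input_ids, hence the skip.
```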
@zucchini-nlp Ready for another review. The test failures are unrelated. Can you please also trigger the slow CI here?

@zucchini-nlp Can you trigger the slow tests, so that I can push changes for the multi-GPU tests if required? You can do the review later based on your bandwidth.

run-slow: phi3_v

great, reviewing tomorrow!

This comment contains models: ["models/phi3_v"]

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
CI Results
Model CI Report: ❌ Failed tests
Thanks a lot for iterating! The PR looks good overall, with a few nits to address. After resolving these comments, I think you can request a review from the core maintainers (ArthurZucker or Cyrilvallez).
We are currently focusing on the major v5 release, so the core maintainer review can be a bit delayed; the model will probably be merged after the release. Also, there was a big refactor of weight loading recently, so if your tests are failing for unrelated reasons, feel free to rebase.
```markdown
</div>

# Phi-3.5 Vision
```
nit: can we add a small intro for the model here before the abstract?
```diff
@@ -0,0 +1,359 @@
+# coding=utf-8
```
not sure if Microsoft will be willing to host converted weights on the Hub. We just recently added a dynamic weight converter as a public API; I think it will be the way to go for Phi3V.
If you want to play around with it and update the conversion, this is the file that needs to be modified. IMO it's totally fine if you are out of bandwidth; we'll add the conversion mapping after v5 for this model and Phi4-Multimodal.
```python
# Calculate the number of image tokens for each image based on height and width dynamically.
num_img_tokens = [
    int(((h // size["height"]) * (w // size["width"]) + 1) * 144 + 1 + (h // size["height"] + 1) * 12)
```
I think it's still nice to comment clearly where 144 comes from. I guess it has to do with the model's patch size when embedding image features, or with the pooling stride.
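For what it's worth, a hedged sketch of how the constant could be documented, assuming the CLIP ViT-L/14 backbone at 336x336 resolution that Phi-3.5-vision uses (24x24 patches, pooled 2x2 to a 12x12 grid); the exact breakdown should be verified against the original processing code:

```python
# Each 336x336 crop is embedded into 24x24 CLIP patches and pooled 2x2,
# giving a 12x12 grid, i.e. 144 image tokens per crop:
tokens_per_crop = (336 // 14 // 2) ** 2  # -> 144
# The "+ 1" inside the parentheses presumably accounts for the extra
# global-image crop, and "(h // 336 + 1) * 12" for the newline tokens
# appended to each of the 12 token rows (sub-image rows plus global image).
```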
```python
image_processor=None,
tokenizer=None,
chat_template=None,
image_token="<|image|>",
```
we don't want to have two sources of truth for the same attribute, so if users want to change the image_token they can do so in the tokenizer. For example, tokenizer.image_token_id = 10000.
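A minimal sketch of the single-source-of-truth pattern (the attribute names follow the reviewer's example; whether the tokenizer exposes `image_token` directly is an assumption):

```python
# Inside the processor, read the image token from the tokenizer instead of
# keeping a separate image_token kwarg:
image_token = getattr(tokenizer, "image_token", "<|image|>")
image_token_id = tokenizer.convert_tokens_to_ids(image_token)

# Users who want a different token override it on the tokenizer, e.g.:
# tokenizer.image_token_id = 10000
```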
```python
attributes = ["image_processor", "tokenizer"]
image_processor_class = "AutoImageProcessor"
tokenizer_class = "AutoTokenizer"
```
these are not needed after the recent refactor; we will infer the attributes dynamically from the __init__ signature and always load from the auto-mapping.
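A minimal sketch of the post-refactor shape this implies (the class name `Phi3VProcessor` and the exact `super().__init__` call are assumptions for illustration):

```python
from transformers.processing_utils import ProcessorMixin

class Phi3VProcessor(ProcessorMixin):
    # No `attributes` / `*_class` class variables: the sub-processors are
    # inferred from the __init__ signature and loaded via the auto classes.
    def __init__(self, image_processor=None, tokenizer=None, chat_template=None, **kwargs):
        super().__init__(image_processor, tokenizer, chat_template=chat_template, **kwargs)
```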
| """Add the newline token embeds to the image feature patches""" | ||
| num_images, h, w, hid_dim = image_features.shape | ||
| newline_embeddings = self.sub_GN.expand(num_images, h, -1, -1) | ||
| image_features_newline = torch.cat([image_features, newline_embeddings], dim=2) |
There was a problem hiding this comment.
not addressed. Image feats and the newline embed can end up in different devices with mutliGPU or dtypes if user changes config's torch_dtype param
We can move the newline to the same device/dtype as image embeddings
| @unittest.skip("Not possible now as processor creates a custom attention mask.") | ||
| def test_assisted_decoding_matches_greedy_search_0_random(self): | ||
| pass | ||
|
|
||
| @unittest.skip("Not possible now as processor creates a custom attention mask.") | ||
| def test_assisted_decoding_matches_greedy_search_1_same(self): | ||
| pass | ||
|
|
||
| @unittest.skip("Not possible now as processor creates a custom attention mask.") | ||
| def test_prompt_lookup_decoding_matches_greedy_search(self): | ||
| pass | ||
|
|
||
| @unittest.skip("Not possible now as processor creates a custom attention mask.") | ||
| def test_assisted_decoding_sample(self): | ||
| pass | ||
|
|
||
| @unittest.skip("Not possible now as processor creates a custom attention mask.") | ||
| def test_apply_chat_template_assistant_mask(self): |
the skip reason needs an update, or we can delete the skips if it's already supported
| @unittest.skip("Not possible as processor can't create an assistant mask.") | ||
| def test_apply_chat_template_assistant_mask(self): | ||
| pass |
same here, I expect it works now after changing the code for placeholder expansion
```python
def test_apply_chat_template_assistant_mask(self):
    pass

def test_unstructured_kwargs_batched(self):
```
curious why we needed to overwrite this one test while other similar tests are passing? A comment explaining the reason would be great.
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, phi3_v

View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=41977&sha=796319
Closes #36036