Add Phi-3.5-vision #36036
Conversation
Looks good so far! The TODO list from here is:
Once everything is green, then:
Once the new tests are green as well, then:
At that point, the PR should be finished!
Force-pushed 4911ef9 to 7e16056
Fixes #36166!
I've run into an issue when trying to run the model. From Modeling_phi3_v.py I'm getting "AttributeError: 'DynamicCache' object has no attribute 'get_max_length'". This might be due to a change to the DynamicCache class, but I really can't tell!
@Dahlbomii hey! Yes, we deprecated get_max_length on the cache classes.
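Presumably the replacement API is get_max_cache_shape(), which superseded get_max_length(); a minimal sketch of the patch the remote code would need, under that assumption:

from transformers.cache_utils import DynamicCache

cache = DynamicCache()
# Old, removed: max_len = cache.get_max_length()
# New API; returns None for a DynamicCache, which has no fixed maximum:
max_len = cache.get_max_cache_shape()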
Woah - this means the remote code version of Phi-3.5-vision-instruct is broken on recent versions of transformers!
cc @zucchini-nlp @gante I can't figure this one out either! The …
Seems like Phi-3.5V was not updated for the latest changes we had in generation; probably it has its own custom generation loop. For this test, we can call …
@zucchini-nlp yes, that was it, sorry! They don't override generate.
Force-pushed 1f31710 to 0848b01
Force-pushed 52f6b48 to 56c8b9f
@Rocketknight1 for reasons that escape me, when the test processor tries to call self.get_component for the tokenizer, it breaks. Any insight as to why?
@Rocketknight1 ALRIGHT I got most of the tests into the green, but I have no idea what I'm doing with the chat template stuff!
Force-pushed 404d2a7 to f4684f7
Force-pushed 08109e0 to c26b108
@Rocketknight1 With that, I think it might be ready for review.
Rocketknight1 left a comment
Congrats on the PR! This looks almost ready, and the modular bit looks like it was really annoying. There are a lot of classes that look like they should be inheritable from somewhere else in the codebase, but they're just different enough that you can't.
cc @zucchini-nlp I made some comments, but you're more familiar with VLMs than me; can you review and see if anything else should be changed before we ping a core maintainer?
class Phi3RMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        """
        Phi3RMSNorm is equivalent to T5LayerNorm
        """
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)
I think this can just be inherited from a class like T5LayerNorm!
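A minimal modular-style sketch of the suggested inheritance, assuming the hypothetical class name Phi3VRMSNorm; Phi3RMSNorm already exists in the library and matches the quoted code exactly. The same one-liner pattern would cover the MLP, attention, and decoder-layer comments below.

from transformers.models.phi3.modeling_phi3 import Phi3RMSNorm

class Phi3VRMSNorm(Phi3RMSNorm):
    # Nothing to override: the quoted implementation is identical to Phi3's.
    pass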
class Phi3RotaryEmbedding(nn.Module):
    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
        super().__init__()

        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
        self.base = base
        self.register_buffer("inv_freq", None, persistent=False)

    @torch.no_grad()
    def forward(self, x, position_ids, seq_len=None):
        # x: [bs, num_attention_heads, seq_len, head_size]
        if self.inv_freq is None:
            self.inv_freq = 1.0 / (
                self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64, device=x.device).float() / self.dim)
            )
        inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
        position_ids_expanded = position_ids[:, None, :].float()
        # Force float32 since bfloat16 loses precision on long contexts
        # See https://github.com/huggingface/transformers/pull/29285
        device_type = x.device.type
        device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
        with torch.autocast(device_type=device_type, enabled=False):
            freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
            emb = torch.cat((freqs, freqs), dim=-1)
            cos = emb.cos()
            sin = emb.sin()
        return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
Although the code is a little different, I think this could be inherited from another RotaryEmbedding class in the library, with the same outputs. Maybe open_llama?
I think it is pretty much the same as phi3 or phi3-MoE. We also support rope scaling and dynamic rope with a unified API; adding a decorator will do the thing.
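The decorator referred to here is presumably transformers' dynamic_rope_update from modeling_rope_utils; a hedged sketch of how it is applied (the class name and the elided bodies are placeholders):

import torch
from transformers.modeling_rope_utils import dynamic_rope_update

class Phi3VRotaryEmbedding(torch.nn.Module):
    ...  # inv_freq / rope_type setup, as in Phi3RotaryEmbedding

    @torch.no_grad()
    @dynamic_rope_update  # re-derives inv_freq when dynamic rope scaling is configured
    def forward(self, x, position_ids):
        ...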
class Phi3MLP(nn.Module):
    def __init__(self, config):
        super().__init__()

        self.config = config
        self.gate_up_proj = nn.Linear(config.hidden_size, 2 * config.intermediate_size, bias=False)
        self.down_proj = nn.Linear(config.intermediate_size, config.hidden_size, bias=False)

        self.activation_fn = ACT2FN[config.hidden_act]

    def forward(self, hidden_states: torch.FloatTensor) -> torch.FloatTensor:
        up_states = self.gate_up_proj(hidden_states)

        gate, up_states = up_states.chunk(2, dim=-1)
        up_states = up_states * self.activation_fn(gate)

        return self.down_proj(up_states)
I think this could be inherited from Phi3!
        return self.down_proj(up_states)


class Phi3Attention(nn.Module):
Although the code is a little different, I think you could maybe inherit the layer from Phi3 and not need this code here! (But you'd have to test to be sure)
+1, seems pretty much the same. The qkv can be split in the conversion script if needed, and we can inherit from Phi3, which also adds the new attention interface (easier TGI, vLLM integrations).
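A hedged conversion-script sketch of the qkv split mentioned above; the key names and row layout are assumptions based on the usual fused-projection format:

def split_qkv(state_dict, num_heads, num_kv_heads, head_dim):
    # The original checkpoint fuses q/k/v into one weight; Phi3Attention
    # expects separate projections, so slice the fused rows apart.
    qkv = state_dict.pop("qkv_proj.weight")
    q_size = num_heads * head_dim
    kv_size = num_kv_heads * head_dim
    state_dict["q_proj.weight"] = qkv[:q_size]
    state_dict["k_proj.weight"] = qkv[q_size : q_size + kv_size]
    state_dict["v_proj.weight"] = qkv[q_size + kv_size :]
    return state_dict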
}


class Phi3DecoderLayer(nn.Module):
Although the code is a little different, I think this could inherit from the equivalent class in Phi3 without output changes (maybe!)
_supports_sdpa = False
_supports_cache_class = True

_version = "0.0.5"
Suggested change: remove the line _version = "0.0.5"
Probably unnecessary!
@Dahlbomii thanks a lot for working on this!
The PR seems to be adapted mostly from the custom code on the Hub, and I realize that custom code is outdated and doesn't follow transformers standards. It would be nice to do a little more cleanup before merging, by unifying the text and vision backbones as AutoModel (I guess they are identical to CLIP and Phi3) and leaving only the Base/ConditionalLM multimodal class in modular.
I left a few comments below on how we can do that. LMK if that makes sense.
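A rough sketch of the layout being suggested, assuming the config gains vision_config/text_config sub-configs (llava-style composition, not the final design):

from transformers import AutoModel, PreTrainedModel

class Phi3VModel(PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        # Backbones come from the auto classes instead of being re-implemented:
        self.vision_tower = AutoModel.from_config(config.vision_config)  # e.g. CLIP
        self.language_model = AutoModel.from_config(config.text_config)  # e.g. Phi3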
@@ -0,0 +1,59 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Let's add the abstract and finalize the docs before pinging a core maintainer.
| ("persimmon", "PersimmonForCausalLM"), | ||
| ("phi", "PhiForCausalLM"), | ||
| ("phi3", "Phi3ForCausalLM"), | ||
| ("phi3_v", "Phi3VForCausalLM"), |
has to be in the ImageTextToText mapping when it works with image+text. AutoCausalLM is currently reserved for the text modality only.
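A sketch of the corresponding change in modeling_auto.py; the class name is kept from this PR, and the exact placement in the mapping is illustrative:

# In MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES, not the causal-LM mapping:
("phi3_v", "Phi3VForCausalLM"),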
def _init_rope(self):
    if self.rope_scaling is None:
        self.rotary_emb = Phi3RotaryEmbedding(
            self.head_dim,
            max_position_embeddings=self.max_position_embeddings,
            base=self.rope_theta,
        )
    else:
        scaling_type = self.config.rope_scaling["type"]
        if scaling_type == "su":
            self.rotary_emb = Phi3SuScaledRotaryEmbedding(self.head_dim, self.config)
        elif scaling_type == "yarn":
            self.rotary_emb = Phi3YarnScaledRotaryEmbedding(self.head_dim, self.config)
        else:
            raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
we init RoPE once per model in the base class; this per-layer _init_rope is the old way and has to be removed.
    images: ImageInput = None,
    padding: Union[bool, str, PaddingStrategy] = False,
    truncation: Union[bool, str, TruncationStrategy] = None,
    max_length=None,
    return_tensors: Optional[Union[str, TensorType]] = TensorType.PYTORCH,
    add_special_tokens: bool = True,
) -> BatchFeature:
let's add kwargs following the standard processing API. For an example, see llava.
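A hedged sketch of that kwargs pattern, modeled on llava-style processors; the class name and defaults are placeholders:

from transformers.processing_utils import ProcessingKwargs, Unpack

class Phi3VProcessorKwargs(ProcessingKwargs, total=False):
    _defaults = {
        "text_kwargs": {"padding": False},
        "images_kwargs": {},
    }

# In the processor itself:
# def __call__(self, images=None, text=None, **kwargs: Unpack[Phi3VProcessorKwargs]):
#     output_kwargs = self._merge_kwargs(Phi3VProcessorKwargs, tokenizer_init_kwargs=self.tokenizer.init_kwargs, **kwargs)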
def calc_num_image_tokens(self, images: ImageInput):
    """Calculate the number of image tokens for each image.
    Args:
        images (`ImageInput`):
            Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
            passing in images with pixel values between 0 and 1, set `do_rescale=False`.
    """
    return self.image_processor.calc_num_image_tokens(images)

def calc_num_image_tokens_from_image_size(self, width, height):
    """Calculate the number of image tokens for an image with given width and height.
    Args:
        width (`int`):
            Width of the image.
        height (`int`):
            Height of the image.
    """
    return self.image_processor.calc_num_image_tokens_from_image_size(width, height)
can we move these all from the image processor to the processor?
input_ids = torch.tensor(input_ids, dtype=torch.long).unsqueeze(0)
attention_mask = (input_ids > -1000000).to(torch.long)
inputs are converted to tensors only if asked by users, so we have to put them in a BatchFeature and let it handle all type conversion.
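A small sketch of what that looks like; the helper name is hypothetical and the variables are taken from the quoted code:

from transformers.feature_extraction_utils import BatchFeature

def build_outputs(input_ids, attention_mask, return_tensors=None):
    # BatchFeature converts to framework tensors only when the caller passes
    # return_tensors ("pt", "np", ...), instead of hard-coding torch tensors.
    return BatchFeature(
        data={"input_ids": input_ids, "attention_mask": attention_mask},
        tensor_type=return_tensors,
    )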
pattern = r"<\|image_\d+\|>"
prompt_chunks = [self.tokenizer(chunk).input_ids for chunk in re.split(pattern, texts)]

if "num_img_tokens" in images:
    num_img_tokens = images["num_img_tokens"]
else:
    assert "num_crops" in images, "num_crops must be provided in images if num_img_tokens is not provided"
    num_crops = images["num_crops"]
    num_img_tokens = [_num_crops * self.num_img_tokens for _num_crops in num_crops]

images, image_sizes = images["pixel_values"], images["image_sizes"]

# image_tags needs to start from 1 to n
image_tags = re.findall(pattern, texts)
# image_ids = [int(s.split("|")[1].split("_")[-1]) * -1 for s in image_tags]
# image_ids_pad = [[iid]*num_img_tokens[i] for i, iid in enumerate(image_ids)]
image_ids = [int(s.split("|")[1].split("_")[-1]) for s in image_tags]
unique_image_ids = sorted(set(image_ids))
# image_ids must start from 1, and must be continuous int, e.g. [1, 2, 3], cannot be [1, 4, 5]
# check the condition
assert unique_image_ids == list(range(1, len(unique_image_ids) + 1)), (
    f"image_ids must start from 1, and must be continuous int, e.g. [1, 2, 3], cannot be {unique_image_ids}"
)
# total images must be the same as the number of image tags
assert len(unique_image_ids) == len(images), (
    f"total images must be the same as the number of image tags, got {len(unique_image_ids)} image tags and {len(images)} images"
this looks too complicated! From what I see, adding a special number to the image tokens is not needed, since what happens here is
text = "<|image|> What is this?"
text_after_separator = "<placeholder> <placeholder> [<placeholder> ...] What is this?"
This is the same as in many other VLMs, and the simplest case is LLaVA. Let's clean up a bit and make the necessary changes to the processor config when needed.
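A hedged sketch of the simpler LLaVA-style expansion being described; the helper name and the placeholder token string are assumptions:

import re

def expand_image_tokens(text, num_img_tokens, placeholder="<|image|>"):
    # Replace the i-th numbered image tag with num_img_tokens[i] copies of a
    # single placeholder token, so per-image numbering is never needed.
    chunks = re.split(r"<\|image_\d+\|>", text)
    out = chunks[0]
    for n, chunk in zip(num_img_tokens, chunks[1:]):
        out += placeholder * n + chunk
    return out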
    token_labels,
    choice_labels,
) = config_and_inputs
inputs_dict = {"input_ids": input_ids, "attention_mask": input_mask}
pixel_values has to be in the inputs, so we test the same way the model will be used by users.
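For example, something like this in the model tester (shapes are purely illustrative):

import torch

input_ids = torch.randint(0, 99, (2, 7))
input_mask = torch.ones(2, 7, dtype=torch.long)
pixel_values = torch.rand(2, 3, 336, 336)  # placeholder shape
# pixel_values rides along so the common tests exercise the multimodal path:
inputs_dict = {"input_ids": input_ids, "attention_mask": input_mask, "pixel_values": pixel_values}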
cc @Dahlbomii removing my review until @zucchini-nlp's comments are addressed!
Any news on this? Phi-3.5-vision has been out for a while, just wanted to know an ETA if there is one.
I will try to finish this PR next week.
I'm sorry, I left this on the back burner for far longer than I thought! @zucchini-nlp I don't know if you've already started working on this, but I could pick it back up now if it's still up in the air! Whatever works best, really.
Force-pushed c26b108 to c78a141
| ("paligemma", ("SiglipImageProcessor", "SiglipImageProcessorFast")), | ||
| ("perceiver", ("PerceiverImageProcessor", "PerceiverImageProcessorFast")), | ||
| ("perception_lm", (None, "PerceptionLMImageProcessorFast")), | ||
| ("phi3_v", ("Phi3VImageProcessor",)), |
Suggested change:
- ("phi3_v", ("Phi3VImageProcessor",)),
+ ("phi3_v", ("Phi3VImageProcessor", None)),
I think a lot of the test failures are caused by this line here! We changed the format of this list, so every model needs an ImageProcessor and an ImageProcessorFast, which can be None.
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, phi3_v
@zucchini-nlp I could use a little help; I've been trying to get the model running again for testing, but another PR removed get_usable_length from cache_utils and I'm unsure how to adjust the model for this change. If it helps, I tracked down the removal: #39106.
Hey @Dahlbomii I can see that a lot of modules need standardizing and refactoring to the transformers API. I can help with the PR to push changes quickly, given my previous experience, so that the model can be merged faster. LMK if you're open to that, and if so please add me as a collaborator to this PR.
@Dahlbomii any progress on the PR? We would love to get it merged soon, as it will help us support the new Phi-Grounding model. If not, I'm happy to take over or ask @yaswanth19 for help if they have bandwidth.
@zucchini-nlp Since there hasn't been much activity here for some time, I will start working on this and raise a new PR 🫡
I've been working on getting things back in line with codebase changes, but I still hadn't finished enough to push any commits! I'm sure @yaswanth19 could handle things faster though; I can still make you a collaborator if that helps! I can also share some of what I've been looking through: phi3_v's SuScaledRotaryEmbedding class is perfectly in line with the rope update's 'longrope' type, so a change to the config and using phi3's RotaryEmbedding class works out fine there. However, I couldn't figure out how to preserve phi3_v's YarnScaledRotaryEmbedding, as it still uses the long and short rope factors. Maybe your experience can help with that class; I've been at a loss.
Thanks @Dahlbomii for the confirmation. AFAIK the text backbone is the phi3 language model itself, so you don't have to worry about rope: simply reuse the phi3 model or use modular.
Thanks, although I will open a new PR, as this one is a bit clumsy and sits in a middle ground between custom code and ideal code.
@Rocketknight1, we should close this as a duplicate of #41977.
Closing!
Draft PR for now, still need to add tests and convert to modular
cc @Rocketknight1
Fixes #36071