SDPA for OwlViT #42136
Conversation
I ran `RUN_SLOW=1 python -m pytest tests/models/owlvit/test_modeling_owlvit.py` against the original OwlViT implementation and it failed the same tests as my current implementation. I'm not sure what to infer from that.
vasqu
left a comment
Sorry, but I've got to be strict about this. We no longer implement separate classes for all the attention flavors but one unified one. I think ViT is a good example in this case, e.g. see https://github.com/huggingface/transformers/blob/main/src/transformers/models/vit/modeling_vit.py
Before changing this to these standards I won't take a proper look for now.
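The unified pattern referenced above can be sketched roughly as follows. This is an illustrative toy, not the actual transformers internals: the registry name `ATTENTION_FUNCTIONS`, the class `UnifiedAttention`, and the constructor argument are all hypothetical stand-ins for the real `ALL_ATTENTION_FUNCTIONS` dispatch.

```python
import torch
import torch.nn.functional as F

def eager_attention(q, k, v):
    # plain softmax(QK^T / sqrt(d)) V
    weights = torch.softmax(q @ k.transpose(-1, -2) * q.shape[-1] ** -0.5, dim=-1)
    return weights @ v

def sdpa_attention(q, k, v):
    # PyTorch's fused scaled dot-product attention
    return F.scaled_dot_product_attention(q, k, v)

# hypothetical registry mirroring the ALL_ATTENTION_FUNCTIONS idea
ATTENTION_FUNCTIONS = {"eager": eager_attention, "sdpa": sdpa_attention}

class UnifiedAttention(torch.nn.Module):
    """One attention class; the backend is chosen by name at runtime."""

    def __init__(self, attn_implementation="eager"):
        super().__init__()
        self.attn_implementation = attn_implementation

    def forward(self, q, k, v):
        attn_fn = ATTENTION_FUNCTIONS[self.attn_implementation]
        return attn_fn(q, k, v)
```

The point of the refactor is that switching implementations becomes a config choice rather than a separate subclass per backend.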
Got it. Thanks a lot!
I made similar changes as in ViT and removed the separate SDPA class. Let me know what you think!
vasqu
left a comment
Added some comments but in general it would be best to have a green CI before requesting a review. Atm, things are likely not working as expected
```diff
-causal_attention_mask = _create_4d_causal_attention_mask(
-    input_shape, hidden_states.dtype, device=hidden_states.device
+# OWL-ViT uses a bidirectional (non-causal) encoder.
+attention_mask = create_bidirectional_mask(
+    config=self.config,
+    input_embeds=hidden_states,
+    attention_mask=attention_mask,
+)
 # expand attention_mask
 if attention_mask is not None:
     # [num_samples, seq_len] -> [num_samples, 1, tgt_seq_len, src_seq_len]
     attention_mask = _prepare_4d_attention_mask(attention_mask, hidden_states.dtype)
```
This seems to suffer from the same issue as in #41750. It does not use a bidirectional mask, but a causal mask:
- The first mask is a basic causal mask
- The second is a padding mask
- These are added on top of each other, creating a causal mask with padding included
This may also need to adjust the `is_causal` argument dynamically, as in the PR I linked, although I'm not sure if it's just causal in general
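The causal-plus-padding composition described above can be sketched as follows. This is a minimal sketch assuming additive float masks (large negative values block attention); the helper names are illustrative, not the actual transformers mask utilities.

```python
import torch

def expand_padding_mask(attention_mask, dtype):
    # [batch, seq_len] 1/0 keep-mask -> additive [batch, 1, 1, seq_len] mask
    inverted = 1.0 - attention_mask[:, None, None, :].to(dtype)
    return inverted * torch.finfo(dtype).min

def make_causal_mask(seq_len, dtype):
    # strictly upper-triangular additive mask: position i cannot attend to j > i
    mask = torch.triu(torch.full((seq_len, seq_len), torch.finfo(dtype).min), diagonal=1)
    return mask[None, None, :, :]

# adding the two yields a causal mask with padding folded in
padding = torch.tensor([[1, 1, 0]])  # last token is padding
combined = make_causal_mask(3, torch.float32) + expand_padding_mask(padding, torch.float32)
```

Position (1, 0) stays visible (value 0), future positions are blocked by the causal term, and the padded last column is blocked everywhere by the padding term.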
Thanks! I made some changes to the code after referring to CLIP: removing output_attentions, return_dict, and causal_attention_mask. I also copied the eager attention part and the attention reshaping from CLIP, and added the flash and flex attention paths too.
I think the current CI is failing because the OWL-ViT config file conflicts with the current encoder implementation. Could you guide me here? Thanks a lot!
Hi, I investigated the failing OwlViTForObjectDetectionTest::test_eager_matches_sdpa_inference_09_fp32_pad_left.
The failure is due to the test invoking OwlViTForObjectDetection.forward() without providing pixel_values.
OwlViTForObjectDetection requires pixel_values (image tensors) for its vision backbone. When the test omits them, the model raises a ValueError because pixel_values is None.
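The failure mode described can be illustrated with a simplified, hypothetical guard (this is not the actual OwlViT code, just a sketch of the behavior):

```python
def forward(pixel_values=None, input_ids=None):
    # Hypothetical simplified guard mirroring the failure mode described:
    # the vision backbone cannot run without image tensors.
    if pixel_values is None:
        raise ValueError("You have to specify pixel_values")
    return {"pixel_values": pixel_values, "input_ids": input_ids}
```

A test that only supplies input_ids hits the ValueError before any attention code runs.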
Also, when I run make fix-copies, it adds the output_attentions and create_causal_mask parameters back to OwlViTEncoderLayer.forward().
Responded here #42136 (comment)
Resolving my previous comments since the state has changed quite a bit from last time
[For maintainers] Suggested jobs to run (before merge) run-slow: owlv2, owlvit
- Add missing can_return_tuple import to owlvit and owlv2 modeling files
- Remove duplicate _can_record_outputs in OwlViTPreTrainedModel and Owlv2PreTrainedModel
- Remove unused OWLVITModelTesterMixin class from test file

Made-with: Cursor
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=42136&sha=c41625
The merge left both the old (bmm-based) and new (ALL_ATTENTION_FUNCTIONS) attention code in OwlViTAttention.forward and Owlv2Attention.forward. Remove the old dead code that references the deleted _shape method.

Made-with: Cursor
- Match CLIP return types: EncoderLayer -> torch.FloatTensor, Encoder -> BaseModelOutput
- Align test ConfigTester hidden_size=32 (divisible by num_heads)

Made-with: Cursor
@vasqu pls take a look, thank you!!
vasqu
left a comment
Got a few small details, but overall looks good! Thanks a lot for sticking with this, I really didn't make it easy for you either
```diff
-queries = self.q_proj(hidden_states).view(*hidden_shape).transpose(1, 2)
-keys = self.k_proj(hidden_states).view(*hidden_shape).transpose(1, 2)
-values = self.v_proj(hidden_states).view(*hidden_shape).transpose(1, 2)
+query_states = self.q_proj(hidden_states).view(*hidden_shape).transpose(1, 2)
+key_states = self.k_proj(hidden_states).view(*hidden_shape).transpose(1, 2)
+value_states = self.v_proj(hidden_states).view(*hidden_shape).transpose(1, 2)
```
super nit: but that naming is just more standard across the library
Done, renamed to query_states/key_states/value_states.
```diff
 self.config = config
+embed_dim = config.hidden_size
 self.embeddings = OwlViTVisionEmbeddings(config)
-self.pre_layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+self.pre_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps)
 self.encoder = OwlViTEncoder(config)
-self.post_layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+self.post_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps)
```
This change seems unnecessary? Or does it come from copies?
```diff
 # Get image embeddings
-last_hidden_state = outputs.vision_model_output[0]
+last_hidden_state = outputs.vision_model_output.last_hidden_state
```
This can break, no? We have no can_return_tuple decorator, and if someone passes return_dict=False, this will fail.
I'd rather revert these changes here at least
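Why index access is the safer choice here can be shown with a minimal stand-in. The namedtuple below is a toy stand-in for a real ModelOutput, which supports both attribute and index access; with return_dict=False a model instead returns a plain tuple, where only index access works.

```python
from collections import namedtuple

# Toy stand-in for a ModelOutput: supports both attribute and index access.
VisionOutput = namedtuple("VisionOutput", ["last_hidden_state", "pooler_output"])

structured = VisionOutput(last_hidden_state="hidden", pooler_output="pooled")
plain_tuple = ("hidden", "pooled")  # what return_dict=False would yield

# index access works on both...
assert structured[0] == plain_tuple[0] == "hidden"
# ...but attribute access only works on the structured output
assert structured.last_hidden_state == "hidden"
assert not hasattr(plain_tuple, "last_hidden_state")
```

So `outputs.vision_model_output[0]` keeps working for both return styles, while `.last_hidden_state` would raise AttributeError on the tuple path.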
```diff
-input_ids: torch.Tensor,
 pixel_values: torch.FloatTensor,
+input_ids: torch.Tensor | None = None,
```
This seems breaking to me, any reason we need it?
Reverted to original signature.
When I put input_ids back as the first required param, the main_input_name = "pixel_values" on OwlViTForObjectDetection no longer matched, causing the test failure.
So I had to:
- Remove main_input_name = "pixel_values" from the class
- Change additional_model_inputs in the test from ["input_ids", "attention_mask"] to ["pixel_values", "attention_mask"], since input_ids is now the main input and the test needs to provide pixel_values as an additional input
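The constraint being worked around can be sketched with `inspect`. The toy class and the check below are illustrative (the real test suite's logic may differ): the idea is that `main_input_name` is expected to line up with the first parameter of `forward`.

```python
import inspect

class ToyDetectionModel:
    # Hypothetical model: main_input_name should match the first forward parameter.
    main_input_name = "input_ids"

    def forward(self, input_ids=None, pixel_values=None, attention_mask=None):
        return input_ids

# first parameter after `self` in the forward signature
params = list(inspect.signature(ToyDetectionModel.forward).parameters)
first_input = params[1]
```

With `main_input_name = "pixel_values"` but `input_ids` first in the signature, a check like this mismatches, which is the failure described above.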
Same comments apply here so not mentioning things twice
- Rename queries/keys/values to query_states/key_states/value_states
- Revert VisionTransformer embed_dim local var (unnecessary)
- Revert attribute access (.last_hidden_state, .text_embeds) back to index access to avoid breaking with return_dict=False
- Revert ForObjectDetection.forward param order to original

Made-with: Cursor
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=42136&sha=8fb57e
Remove main_input_name="pixel_values" from OwlViTForObjectDetection since forward keeps input_ids first. Update additional_model_inputs in detection tests to provide pixel_values instead of input_ids. Made-with: Cursor
haha, no worries!! thank u for helping out!!! :)))
run-slow: owlv2, owlvit
This comment contains models: ["models/owlv2", "models/owlvit"]
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
What does this PR do?
Implements SDPA for OWL-ViT.
Fixes #28103
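As a rough illustration of what SDPA support involves, the sketch below uses PyTorch's `F.scaled_dot_product_attention` directly (this is a sketch, not the PR's code): the `is_causal=True` fast path is equivalent to passing an explicit lower-triangular boolean mask, which is why mask handling and the `is_causal` flag had to be kept consistent in the review discussion above.

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 2, 4, 8)
k = torch.randn(1, 2, 4, 8)
v = torch.randn(1, 2, 4, 8)

# Boolean mask: True = may attend. Lower-triangular = causal.
causal_bool = torch.tril(torch.ones(4, 4, dtype=torch.bool))

out_flag = F.scaled_dot_product_attention(q, k, v, is_causal=True)
out_mask = F.scaled_dot_product_attention(q, k, v, attn_mask=causal_bool)
```

Both calls produce the same output up to floating-point tolerance, so a model can switch between the flag and an explicit mask depending on whether padding needs to be folded in.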
Who can review?
@vasqu @younesbelkada