
Sdpa for owlvit #42136

Merged
vasqu merged 77 commits into huggingface:main from Aravind-11:sdpa_for_OWL_ViT
Mar 17, 2026

Conversation

@Aravind-11 (Contributor) commented Nov 10, 2025

What does this PR do?

Implements SDPA for OWL-ViT.

Fixes #28103

Who can review?

@vasqu @younesbelkada

@Aravind-11 (Contributor, Author)

Implements SDPA for OWL-ViT. Revamp of #28818.

Fixes #28103

I ran RUN_SLOW=1 python -m pytest tests/models/owlvit/test_modeling_owlvit.py against the original OWL-ViT implementation, and it seemed to fail the same tests as my current implementation. I'm not sure what to infer from that.

@vasqu (Contributor) left a comment

Sorry, but I've got to be strict about this. We no longer implement separate classes for all the attention flavors, but one unified one. I think ViT is a good example in this case, e.g. see https://github.com/huggingface/transformers/blob/main/src/transformers/models/vit/modeling_vit.py

Until this is changed to match these standards, I won't take a proper look for now.
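For context, the unified pattern keeps a single attention class and picks the kernel at runtime from the configured implementation. A minimal sketch of the idea, assuming simplified names and signatures (the real code in modeling_vit.py differs in detail, and the actual registry is ALL_ATTENTION_FUNCTIONS in transformers' modeling_utils):

```python
# Hedged sketch of the unified attention pattern (simplified illustration).
import torch
from torch import nn


def eager_attention_forward(module, query, key, value, attention_mask, scaling, dropout=0.0):
    # Plain matmul-softmax attention: the "eager" backend.
    attn_weights = torch.matmul(query, key.transpose(-1, -2)) * scaling
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask
    attn_weights = nn.functional.softmax(attn_weights, dim=-1)
    attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
    attn_output = torch.matmul(attn_weights, value)
    return attn_output.transpose(1, 2).contiguous(), attn_weights


def sdpa_attention_forward(module, query, key, value, attention_mask, scaling, dropout=0.0):
    # SDPA backend: delegates to torch.nn.functional.scaled_dot_product_attention.
    attn_output = nn.functional.scaled_dot_product_attention(
        query, key, value, attn_mask=attention_mask,
        dropout_p=dropout if module.training else 0.0, scale=scaling,
    )
    return attn_output.transpose(1, 2).contiguous(), None


# Hypothetical minimal registry standing in for ALL_ATTENTION_FUNCTIONS.
ATTENTION_FUNCTIONS = {"eager": eager_attention_forward, "sdpa": sdpa_attention_forward}


class UnifiedAttention(nn.Module):
    """One attention class for all backends; no per-flavor subclasses."""

    def __init__(self, hidden_size=64, num_heads=4, attn_implementation="sdpa"):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.scale = self.head_dim**-0.5
        self.attn_implementation = attn_implementation
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        self.k_proj = nn.Linear(hidden_size, hidden_size)
        self.v_proj = nn.Linear(hidden_size, hidden_size)
        self.out_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states, attention_mask=None):
        batch, seq_len, _ = hidden_states.shape
        shape = (batch, seq_len, self.num_heads, self.head_dim)
        query = self.q_proj(hidden_states).view(*shape).transpose(1, 2)
        key = self.k_proj(hidden_states).view(*shape).transpose(1, 2)
        value = self.v_proj(hidden_states).view(*shape).transpose(1, 2)

        # Dispatch to the selected backend instead of subclassing per flavor.
        attention_interface = ATTENTION_FUNCTIONS[self.attn_implementation]
        attn_output, attn_weights = attention_interface(
            self, query, key, value, attention_mask, scaling=self.scale
        )
        attn_output = attn_output.reshape(batch, seq_len, -1)
        return self.out_proj(attn_output), attn_weights
```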

@Aravind-11 (Contributor, Author)

> Sorry, but I've got to be strict about this. We no longer implement separate classes for all the attention flavors, but one unified one. [...]

Got it. Thanks a lot!

@Aravind-11 (Contributor, Author)

I made changes similar to ViT and removed the separate SDPA class. Let me know what you think!

@vasqu (Contributor) left a comment

Added some comments, but in general it would be best to have a green CI before requesting a review. At the moment, things are likely not working as expected.

Comment on lines -716 to -722

```diff
-        causal_attention_mask = _create_4d_causal_attention_mask(
-            input_shape, hidden_states.dtype, device=hidden_states.device
-        )
-        # expand attention_mask
-        if attention_mask is not None:
-            # [num_samples, seq_len] -> [num_samples, 1, tgt_seq_len, src_seq_len]
-            attention_mask = _prepare_4d_attention_mask(attention_mask, hidden_states.dtype)
+        # OWL-ViT uses a bidirectional (non-causal) encoder.
+        attention_mask = create_bidirectional_mask(
+            config=self.config,
+            input_embeds=hidden_states,
+            attention_mask=attention_mask,
+        )
```
@vasqu (Contributor)

This seems to suffer from the same issue as in #41750.

It does not use a bidirectional mask, but a causal mask:

  • The first mask is a base causal mask
  • The second is a padding mask
  • These are added together, creating a causal mask with padding included
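A hedged sketch of that combination, assuming a simple additive float-mask convention (illustrative only; the actual transformers mask utilities differ):

```python
# Illustrative: adding a base causal mask and a padding mask yields a causal
# mask with padding included, as described above.
import torch

batch, seq_len = 2, 5
dtype = torch.float32
min_value = torch.finfo(dtype).min

# Base causal mask: 0 on/below the diagonal, a large negative value above it.
causal = torch.triu(torch.full((seq_len, seq_len), min_value, dtype=dtype), diagonal=1)
causal = causal[None, None, :, :]  # [1, 1, seq, seq]

# Padding mask: 1 for real tokens, 0 for padding (second sample is left-padded).
attention_mask = torch.tensor([[1, 1, 1, 1, 1],
                               [0, 0, 1, 1, 1]])
padding = (1.0 - attention_mask[:, None, None, :].to(dtype)) * min_value  # [batch, 1, 1, seq]

# Added on top of each other: causal structure plus masked-out padding columns.
combined = causal + padding  # [batch, 1, seq, seq]
print(combined[1, 0])
```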

@vasqu (Contributor)

This may also need to adjust the is_causal argument dynamically, as in the PR I linked - although I'm not sure if it's just causal in general.
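A sketch of what dynamic is_causal handling typically looks like with SDPA (an assumed pattern, not the exact code from the linked PR): the fast causal path can only be taken when no explicit mask is passed and the query length is greater than 1.

```python
# Hedged sketch: choose is_causal at call time rather than hard-coding it.
import torch
import torch.nn.functional as F

def sdpa_with_dynamic_is_causal(query, key, value, attention_mask=None, causal=True):
    # is_causal and attn_mask are mutually exclusive in SDPA, and single-token
    # queries (decode steps) need no causal masking.
    is_causal = causal and attention_mask is None and query.shape[2] > 1
    return F.scaled_dot_product_attention(
        query, key, value, attn_mask=attention_mask, is_causal=is_causal
    )
```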

@Aravind-11 (Contributor, Author)

Thanks! I made some changes to the code after referring to CLIP: removing the output_attentions, return_dict, and causal_attention_mask handling. I also copied the eager attention part and the attention reshaping from CLIP, and added the flash and flex attention paths too.

I think the current CI is failing because the OWL-ViT config file conflicts with the current encoder implementation. Could you guide me here? Thanks a lot!

@Aravind-11 (Contributor, Author)

Hi, I investigated the failing OwlViTForObjectDetectionTest::test_eager_matches_sdpa_inference_09_fp32_pad_left.

The failure is due to the test invoking OwlViTForObjectDetection.forward() without providing pixel_values.

OwlViTForObjectDetection requires pixel_values (image tensors) for its vision backbone. When the test omits them, the model raises ValueError: 'pixel_values' is None.
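The guard in question is presumably the standard transformers pattern (a sketch; the exact message in the OWL-ViT source may differ):

```python
# Assumed guard at the top of the vision forward pass.
if pixel_values is None:
    raise ValueError("You have to specify pixel_values")
```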

@Aravind-11 (Contributor, Author)

Also, when I run make fix-copies, it adds the output_attentions and create_causal_mask parameters back to OwlViTEncoderLayer.forward().

@vasqu (Contributor)

Responded here: #42136 (comment)

Resolving my previous comments, since the state has changed quite a bit from last time.
@github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: owlv2, owlvit


- Add missing can_return_tuple import to owlvit and owlv2 modeling files
- Remove duplicate _can_record_outputs in OwlViTPreTrainedModel and Owlv2PreTrainedModel
- Remove unused OWLVITModelTesterMixin class from test file

Made-with: Cursor

@github-actions (Contributor)

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=42136&sha=c41625

The merge left both old (bmm-based) and new (ALL_ATTENTION_FUNCTIONS)
attention code in OwlViTAttention.forward and Owlv2Attention.forward.
Remove the old dead code that references the deleted _shape method.

Made-with: Cursor

- Match CLIP return types: EncoderLayer -> torch.FloatTensor,
  Encoder -> BaseModelOutput
- Align test ConfigTester hidden_size=32 (divisible by num_heads)

Made-with: Cursor
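(A quick check of the constraint behind that test change, since attention splits hidden_size evenly across heads; illustrative values only:)

```python
# hidden_size must be divisible by num_heads so each head gets an equal slice.
hidden_size, num_heads = 32, 4
assert hidden_size % num_heads == 0
head_dim = hidden_size // num_heads  # 8
```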

@Aravind-11 (Contributor, Author)

@vasqu please take a look, thank you!

@vasqu (Contributor) left a comment

Got a few small details, but overall this looks good! Thanks a lot for sticking with this; I really didn't make it easy for you either.

Comment on lines +439 to +441

@vasqu (Contributor)

Suggested change:

```diff
-        queries = self.q_proj(hidden_states).view(*hidden_shape).transpose(1, 2)
-        keys = self.k_proj(hidden_states).view(*hidden_shape).transpose(1, 2)
-        values = self.v_proj(hidden_states).view(*hidden_shape).transpose(1, 2)
+        query_states = self.q_proj(hidden_states).view(*hidden_shape).transpose(1, 2)
+        key_states = self.k_proj(hidden_states).view(*hidden_shape).transpose(1, 2)
+        value_states = self.v_proj(hidden_states).view(*hidden_shape).transpose(1, 2)
```

Super nit, but that naming is just more standard across the library.

@Aravind-11 (Contributor, Author)

Done, renamed to query_states/key_states/value_states.

Comment on lines +742 to +748

```diff
         self.config = config
+        embed_dim = config.hidden_size

         self.embeddings = OwlViTVisionEmbeddings(config)
-        self.pre_layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+        self.pre_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps)
         self.encoder = OwlViTEncoder(config)
-        self.post_layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+        self.post_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps)
```

@vasqu (Contributor)

This change seems unnecessary? Or does it come from copies?

@Aravind-11 (Contributor, Author)

Reverted.


```diff
         # Get image embeddings
-        last_hidden_state = outputs.vision_model_output[0]
+        last_hidden_state = outputs.vision_model_output.last_hidden_state
```

@vasqu (Contributor)

This can break, no? We have no can_return_tuple decorator here, and if someone passes return_dict=False, this will fail.

I would rather revert these changes, here at least.
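To illustrate the concern (a hedged sketch, not the actual OWL-ViT code): with return_dict=False the submodel returns a plain tuple, so index access works either way while attribute access raises.

```python
# Index access works for both ModelOutput and tuple; attribute access only
# works for ModelOutput.
import torch
from transformers.modeling_outputs import BaseModelOutput

hidden = torch.zeros(1, 4, 8)

out = BaseModelOutput(last_hidden_state=hidden)   # return_dict=True case
assert out[0] is out.last_hidden_state            # both accesses agree

out_tuple = (hidden,)                             # return_dict=False case
_ = out_tuple[0]                                  # fine
# out_tuple.last_hidden_state                     # AttributeError on a tuple
```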

@vasqu (Contributor)

And below.

@Aravind-11 (Contributor, Author)

Done.

Comment on lines +1413 to +1406

```diff
-        input_ids: torch.Tensor,
         pixel_values: torch.FloatTensor,
+        input_ids: torch.Tensor | None = None,
```

@vasqu (Contributor)

This seems breaking to me; any reason we need it?

@Aravind-11 (Contributor, Author)

Reverted to original signature.

@Aravind-11 (Contributor, Author)

When I put input_ids back as the first required param, the main_input_name = "pixel_values" on OwlViTForObjectDetection no longer matched, which caused the test failure.

@Aravind-11 (Contributor, Author)

So I had to:

- Remove main_input_name = "pixel_values" from the class
- Change additional_model_inputs in the test from ["input_ids", "attention_mask"] to ["pixel_values", "attention_mask"] - since input_ids is now the main input, the test needs to provide pixel_values as an additional input
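For context on that constraint: main_input_name is the class attribute transformers uses to mark a model's primary input, and the common tests expect it to line up with the first forward() parameter. A hedged sketch of the convention (illustrative class, not the real OWL-ViT code):

```python
# The test suite builds the primary input from main_input_name, so it should
# match the first (non-self) parameter of forward().
import inspect

class SketchModel:
    main_input_name = "input_ids"

    def forward(self, input_ids=None, pixel_values=None, attention_mask=None):
        ...

params = list(inspect.signature(SketchModel.forward).parameters)  # ["self", "input_ids", ...]
assert params[1] == SketchModel.main_input_name
```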

@vasqu (Contributor)

Same comments apply here, so I'm not mentioning things twice.

@Aravind-11 (Contributor, Author)

Got it.

- Rename queries/keys/values to query_states/key_states/value_states
- Revert VisionTransformer embed_dim local var (unnecessary)
- Revert attribute access (.last_hidden_state, .text_embeds) back to
  index access to avoid breaking with return_dict=False
- Revert ForObjectDetection.forward param order to original

Made-with: Cursor
@github-actions (Contributor)

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=42136&sha=8fb57e

Remove main_input_name="pixel_values" from OwlViTForObjectDetection
since forward keeps input_ids first. Update additional_model_inputs
in detection tests to provide pixel_values instead of input_ids.

Made-with: Cursor

@Aravind-11 (Contributor, Author)

> Got a few small details, but overall this looks good! Thanks a lot for sticking with this [...]

Haha, no worries!! Thank you for helping out!!! :)))

@vasqu (Contributor) commented Mar 17, 2026

run-slow: owlv2, owlvit

@github-actions (Contributor)

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/owlv2", "models/owlvit"]
quantizations: []

@github-actions (Contributor)

CI Results

Workflow Run ⚙️

Commit Info

Context  Commit    Description
RUN      31a2e413  workflow commit (merge commit)
PR       1df037b3  branch commit (from PR)
main     acc89e74  base commit (on main)

✅ No failing test specific to this PR 🎉 👏 !

@vasqu vasqu enabled auto-merge March 17, 2026 19:42
@vasqu vasqu added this pull request to the merge queue Mar 17, 2026
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Merged via the queue into huggingface:main with commit f1f34de Mar 17, 2026
25 checks passed


Development

Successfully merging this pull request may close these issues.

OWL-VIT Vision Foundation Model deployment in the edge cases - Need SDPA support for OWL-ViT Model optimization for Edge Deployment

4 participants