
Refactor OwlViT to modular Transformers #45073

Open

Aravind-11 wants to merge 3 commits into huggingface:main from Aravind-11:modular-owlvit

Conversation

@Aravind-11 (Contributor) commented Mar 27, 2026

What does this PR do?

  • Add modular_owlvit.py inheriting the CLIP vision/text embeddings, MLP, encoder layer, and encoder (a rough sketch follows below)
  • Import the box IoU helpers from loss_for_object_detection and eager_attention from BERT
  • Regenerate modeling_owlvit.py via modular_model_converter (single-file policy)
  • Remove duplicated Copied-from blocks in favor of modular composition
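
As a rough illustration of the modular pattern (a sketch only; the class list and exact parents are assumptions based on the description above, not the verbatim contents of this PR), modular_owlvit.py mostly declares pass-through subclasses of the CLIP building blocks and lets modular_model_converter expand them into the generated modeling_owlvit.py:

```python
# Hypothetical sketch of modular_owlvit.py under the modular Transformers
# pattern; names mirror the CLIP originals and may differ from the real file.
from transformers.models.clip.modeling_clip import (
    CLIPMLP,
    CLIPEncoder,
    CLIPEncoderLayer,
    CLIPTextEmbeddings,
    CLIPVisionEmbeddings,
)


class OwlViTVisionEmbeddings(CLIPVisionEmbeddings):
    pass


class OwlViTTextEmbeddings(CLIPTextEmbeddings):
    pass


class OwlViTMLP(CLIPMLP):
    pass


class OwlViTEncoderLayer(CLIPEncoderLayer):
    pass


class OwlViTEncoder(CLIPEncoder):
    pass
```

The converter then unrolls each inherited body into modeling_owlvit.py, so the single-file policy is preserved while the source of truth stays deduplicated.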

Who can review?
@vasqu

@Aravind-11 (Contributor, Author) commented Mar 27, 2026

@vasqu, is this what you meant by a modular refactor, or did you want the refactor done in the modeling_owlvit code itself?

Commit: Align eager softmax (float32) with SDPA and fix test_eager_matches_sdpa_inference for bf16.
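
For context, a minimal sketch of what this alignment means (simplified signature; the real helper also takes the module, attention mask, and extra kwargs): the eager path upcasts the softmax to float32 and casts back, which is what keeps it numerically in line with F.scaled_dot_product_attention under bf16.

```python
import torch
import torch.nn.functional as F


def eager_attention_forward(query, key, value, scaling, dropout=0.0):
    # attention scores in the input dtype (e.g. bf16)
    attn_weights = torch.matmul(query, key.transpose(-1, -2)) * scaling
    # softmax in float32, then back to the input dtype: this is the part
    # that makes eager match SDPA numerics for bf16 inputs
    attn_weights = F.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
    attn_weights = F.dropout(attn_weights, p=dropout)
    return torch.matmul(attn_weights, value)
```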
@github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: owlvit

Commit:
- Make OwlViTAttention a pass-through subclass of CLIPAttention (see the sketch after this list)
- Import contrastive_loss from CLIP instead of duplicating it
- Remove unused logger/logging and dead imports (Callable, ALL_ATTENTION_FUNCTIONS)
- Regenerate modeling_owlvit.py via the modular converter
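
A minimal sketch of what the first two items could look like in the modular file (owlvit_loss and the exact reuse are assumptions for illustration, not the verbatim diff):

```python
from transformers.models.clip.modeling_clip import CLIPAttention, contrastive_loss


class OwlViTAttention(CLIPAttention):
    # pass-through: the modular converter re-emits the CLIP implementation
    # under the OwlViT name, so no body is needed here
    pass


def owlvit_loss(similarity):
    # symmetric contrastive loss over both axes, reusing the CLIP helper
    # instead of duplicating it
    caption_loss = contrastive_loss(similarity)
    image_loss = contrastive_loss(similarity.t())
    return (caption_loss + image_loss) / 2.0
```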
@vasqu (Contributor) left a comment


It goes in the right direction, but we also have a refactor over here: #44431. So it probably makes more sense to wait for that to land first and then adjust the code here according to that.

@vasqu (Contributor) commented Apr 9, 2026

@Aravind-11 we just merged the clip refactor PR 🤗 so now we can adjust it to that pattern more closely

@Aravind-11 (Contributor, Author) commented:

> @Aravind-11 we just merged the clip refactor PR 🤗 so now we can adjust it to that pattern more closely

Will get to this next week!! 😋

@Aravind-11 (Contributor, Author) commented:

Hi @vasqu, should I wait further? I noticed that PR #44431 is still awaiting updates.

@Aravind-11 (Contributor, Author) commented:

Also, I had a question. I noticed that the common attention_interface backend spans multiple files, and it takes 4-5 different function calls in total to get to the actual SDPA / flash call. Some utils seem to prepare the inputs over and over; why is it done this way? I get that it helps with a lot of model and training customisation, but wouldn't it be easier to reduce these calls and compress some of them into fewer ones?

@vasqu (Contributor) commented Apr 14, 2026

Yea it got a bit more complicated and it opened a can of worms 😬 not sure when / how we properly fix this, but shouldn't take too long as it is high prio imo

Re attn interface: Can you give me some call graphs that make you think that these are so nested? Usually it is only 1-2 levels of depth around the original attns

  • Flex wrapper --> torch flex call
  • Flash wrapper --> modeling utils --> underlying FA function (might invoke some other preps)
  • SDPA wrapper --> direct call to F.SDPA

Flash attention is super unique tho in its structure and not really avoidable atp. Open for improvements tho!
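
For intuition, here is a toy model of that dispatch (illustrative names, not the exact Transformers internals; the real registry is ALL_ATTENTION_FUNCTIONS): one registry lookup, then a thin wrapper that for SDPA calls straight into torch.

```python
import torch
import torch.nn.functional as F

# toy registry standing in for the real ALL_ATTENTION_FUNCTIONS
ATTENTION_FUNCTIONS = {
    # SDPA wrapper --> direct call to F.scaled_dot_product_attention
    "sdpa": lambda q, k, v, mask=None: F.scaled_dot_product_attention(q, k, v, attn_mask=mask),
}


def attention_forward(backend, q, k, v, mask=None):
    # single lookup, single wrapper call
    return ATTENTION_FUNCTIONS[backend](q, k, v, mask)


q = k = v = torch.randn(1, 8, 16, 64)  # (batch, heads, seq_len, head_dim)
out = attention_forward("sdpa", q, k, v)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```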

@Aravind-11 (Contributor, Author) commented:

> Yea it got a bit more complicated and it opened a can of worms 😬 not sure when / how we properly fix this, but shouldn't take too long as it is high prio imo
>
> Re attn interface: Can you give me some call graphs that make you think that these are so nested? Usually it is only 1-2 levels of depth around the original attns
>
> • Flex wrapper --> torch flex call
> • Flash wrapper --> modeling utils --> underlying FA function (might invoke some other preps)
> • SDPA wrapper --> direct call to F.SDPA
>
> Flash attention is super unique tho in its structure and not really avoidable atp. Open for improvements tho!

Got it!! LMK if I can be of any help 😅😅

Yes, I was talking about the FA module. It goes to integrations/flash_attention.py, which prepares the inputs, and then to modeling_flash_attention_utils to run the flash attention implementation.

I thought it could just be implemented in integrations itself, but that would break the consistency of the integrations folder. I'm just surprised that torch doesn't provide independent FA / flex modules.

@vasqu (Contributor) commented Apr 17, 2026

To be short: Flash attention is very unique in the way it handles inputs 😅 so it needs super special treatment for all the features and edge cases to stay consistent with our features

I would not advise touching it tbh, it is so intertwined that you can easily mess up
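
To make that "special treatment" concrete, a small sketch of the kind of input prep flash attention needs that SDPA does not (toy example; the real prep lives in modeling_flash_attention_utils): padded batches are unpadded into one packed sequence plus cumulative lengths before the kernel is called.

```python
import torch
import torch.nn.functional as F

# two sequences of lengths 3 and 2, padded to length 4
attention_mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])
seqlens = attention_mask.sum(dim=1)                  # tensor([3, 2])
cu_seqlens = F.pad(seqlens.cumsum(0), (1, 0)).int()  # tensor([0, 3, 5])

# the flash kernels take q/k/v packed as (total_tokens, heads, head_dim)
# plus these offsets, instead of the padded (batch, seq, heads, head_dim)
# layout that SDPA consumes directly
print(cu_seqlens)
```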

@Aravind-11 (Contributor, Author) commented:

> To be short: Flash attention is very unique in the way it handles inputs 😅 so it needs super special treatment for all the features and edge cases to stay consistent with our features
>
> I would not advise touching it tbh, it is so intertwined that you can easily mess up

You're right 😂.
