
Refactor OwlViT to modular Transformers #45073

Open

Aravind-11 wants to merge 3 commits into huggingface:main from Aravind-11:modular-owlvit

Conversation

@Aravind-11 (Contributor) commented Mar 27, 2026

What does this PR do?

  • Add modular_owlvit.py inheriting the CLIP vision/text embeddings, MLP, encoder layer, and encoder (a rough sketch follows below)
  • Import the box IoU helpers from loss_for_object_detection and eager_attention from BERT
  • Regenerate modeling_owlvit.py via modular_model_converter (single-file policy)
  • Remove duplicated Copied-from blocks in favor of modular composition
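
As a rough illustration of the modular pattern (a sketch only; the class list and exact parents are assumptions based on the description above, not the verbatim contents of this PR), modular_owlvit.py mostly declares pass-through subclasses of the CLIP building blocks and lets modular_model_converter expand them into the generated modeling_owlvit.py:

```python
# Hypothetical sketch of modular_owlvit.py under the modular Transformers
# pattern; names mirror the CLIP originals and may differ from the real file.
from transformers.models.clip.modeling_clip import (
    CLIPMLP,
    CLIPEncoder,
    CLIPEncoderLayer,
    CLIPTextEmbeddings,
    CLIPVisionEmbeddings,
)


class OwlViTVisionEmbeddings(CLIPVisionEmbeddings):
    pass


class OwlViTTextEmbeddings(CLIPTextEmbeddings):
    pass


class OwlViTMLP(CLIPMLP):
    pass


class OwlViTEncoderLayer(CLIPEncoderLayer):
    pass


class OwlViTEncoder(CLIPEncoder):
    pass
```

The converter then unrolls each inherited body into modeling_owlvit.py, so the single-file policy is preserved while the source of truth stays deduplicated.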

Who can review?
@vasqu

@Aravind-11 (Contributor, Author) commented Mar 27, 2026

@vasqu, is this what you meant by a modular refactor, or did you want the refactor done in the modeling_owlvit code itself?

Commit: Align eager softmax (float32) with SDPA and fix test_eager_matches_sdpa_inference for bf16.
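
For context, a minimal sketch of what this alignment means (simplified signature; the real helper also takes the module, attention mask, and extra kwargs): the eager path upcasts the softmax to float32 and casts back, which is what keeps it numerically in line with F.scaled_dot_product_attention under bf16.

```python
import torch
import torch.nn.functional as F


def eager_attention_forward(query, key, value, scaling, dropout=0.0):
    # attention scores in the input dtype (e.g. bf16)
    attn_weights = torch.matmul(query, key.transpose(-1, -2)) * scaling
    # softmax in float32, then back to the input dtype: this is the part
    # that makes eager match SDPA numerics for bf16 inputs
    attn_weights = F.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
    attn_weights = F.dropout(attn_weights, p=dropout)
    return torch.matmul(attn_weights, value)
```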
@github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: owlvit

Commit:
- Make OwlViTAttention a pass-through subclass of CLIPAttention (see the sketch after this list)
- Import contrastive_loss from CLIP instead of duplicating it
- Remove unused logger/logging and dead imports (Callable, ALL_ATTENTION_FUNCTIONS)
- Regenerate modeling_owlvit.py via the modular converter
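
A minimal sketch of what the first two items could look like in the modular file (owlvit_loss and the exact reuse are assumptions for illustration, not the verbatim diff):

```python
from transformers.models.clip.modeling_clip import CLIPAttention, contrastive_loss


class OwlViTAttention(CLIPAttention):
    # pass-through: the modular converter re-emits the CLIP implementation
    # under the OwlViT name, so no body is needed here
    pass


def owlvit_loss(similarity):
    # symmetric contrastive loss over both axes, reusing the CLIP helper
    # instead of duplicating it
    caption_loss = contrastive_loss(similarity)
    image_loss = contrastive_loss(similarity.t())
    return (caption_loss + image_loss) / 2.0
```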
@vasqu (Contributor) left a comment


It goes in the right direction, but we also have a refactor over here: #44431. So it probably makes more sense to wait for that to land first and then adjust the code here according to that.

@vasqu (Contributor) commented Apr 9, 2026

@Aravind-11 we just merged the clip refactor PR 🤗 so now we can adjust it to that pattern more closely

@Aravind-11 (Contributor, Author) commented:

> @Aravind-11 we just merged the clip refactor PR 🤗 so now we can adjust it to that pattern more closely

Will get to this next week!! 😋

@Aravind-11 (Contributor, Author) commented:

Hi @vasqu, should I wait further? I noticed that PR #44431 is still awaiting updates.

@Aravind-11 (Contributor, Author) commented:

Also, I had a question. I noticed that the common attention_interface backend spans multiple files, and it takes 4-5 different function calls in total to get to the actual SDPA / flash call. Some utils seem to prepare the inputs over and over; why is it done this way? I get that it helps with a lot of model and training customisation, but wouldn't it be easier to reduce these calls and compress some of them into fewer ones?

@vasqu (Contributor) commented Apr 14, 2026

Yea it got a bit more complicated and it opened a can of worms 😬 not sure when / how we properly fix this, but shouldn't take too long as it is high prio imo

Re attn interface: Can you give me some call graphs that make you think that these are so nested? Usually it is only 1-2 levels of depth around the original attns

  • Flex wrapper --> torch flex call
  • Flash wrapper --> modeling utils --> underlying FA function (might invoke some other preps)
  • SDPA wrapper --> direct call to F.SDPA

Flash attention is super unique tho in its structure and not really avoidable atp. Open for improvements tho!
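
For intuition, here is a toy model of that dispatch (illustrative names, not the exact Transformers internals; the real registry is ALL_ATTENTION_FUNCTIONS): one registry lookup, then a thin wrapper that for SDPA calls straight into torch.

```python
import torch
import torch.nn.functional as F

# toy registry standing in for the real ALL_ATTENTION_FUNCTIONS
ATTENTION_FUNCTIONS = {
    # SDPA wrapper --> direct call to F.scaled_dot_product_attention
    "sdpa": lambda q, k, v, mask=None: F.scaled_dot_product_attention(q, k, v, attn_mask=mask),
}


def attention_forward(backend, q, k, v, mask=None):
    # single lookup, single wrapper call
    return ATTENTION_FUNCTIONS[backend](q, k, v, mask)


q = k = v = torch.randn(1, 8, 16, 64)  # (batch, heads, seq_len, head_dim)
out = attention_forward("sdpa", q, k, v)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```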

@Aravind-11 (Contributor, Author) commented:

> Yea it got a bit more complicated and it opened a can of worms 😬 not sure when / how we properly fix this, but shouldn't take too long as it is high prio imo
>
> Re attn interface: Can you give me some call graphs that make you think that these are so nested? Usually it is only 1-2 levels of depth around the original attns
>
> • Flex wrapper --> torch flex call
> • Flash wrapper --> modeling utils --> underlying FA function (might invoke some other preps)
> • SDPA wrapper --> direct call to F.SDPA
>
> Flash attention is super unique tho in its structure and not really avoidable atp. Open for improvements tho!

Got it!! LMK if I can be of any help 😅😅

Yes, I was talking about the FA module. It goes to integrations/flash_attention.py, which prepares the inputs, and then to modeling_flash_attention_utils to run the flash attention implementation.

I thought it could just be implemented in integrations itself, but that would break the consistency of the integrations folder. I'm just surprised that torch doesn't provide independent FA / flex modules.

@vasqu (Contributor) commented Apr 17, 2026

To be short: Flash attention is very unique in the way it handles inputs 😅 so it needs super special treatment for all the features and edge cases to stay consistent with our features

I would not advise touching it tbh, it is so intertwined that you can easily mess up
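
To make that "special treatment" concrete, a small sketch of the kind of input prep flash attention needs that SDPA does not (toy example; the real prep lives in modeling_flash_attention_utils): padded batches are unpadded into one packed sequence plus cumulative lengths before the kernel is called.

```python
import torch
import torch.nn.functional as F

# two sequences of lengths 3 and 2, padded to length 4
attention_mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])
seqlens = attention_mask.sum(dim=1)                  # tensor([3, 2])
cu_seqlens = F.pad(seqlens.cumsum(0), (1, 0)).int()  # tensor([0, 3, 5])

# the flash kernels take q/k/v packed as (total_tokens, heads, head_dim)
# plus these offsets, instead of the padded (batch, seq, heads, head_dim)
# layout that SDPA consumes directly
print(cu_seqlens)
```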

@Aravind-11 (Contributor, Author) commented:

> To be short: Flash attention is very unique in the way it handles inputs 😅 so it needs super special treatment for all the features and edge cases to stay consistent with our features
>
> I would not advise touching it tbh, it is so intertwined that you can easily mess up

You're right 😂.
