Refactor OwlViT to modular Transformers #45073
Aravind-11 wants to merge 3 commits into huggingface:main from
Conversation
@vasqu, is this what you meant by the modular refactor, or did you want the refactor done in the modeling_owlvit code itself?
- Add modular_owlvit.py inheriting CLIP vision/text embeddings, MLP, encoder layer, and encoder
- Import box IoU helpers from loss_for_object_detection; eager attention from BERT
- Regenerate modeling_owlvit.py via modular_model_converter (single-file policy)
- Remove duplicated Copied-from blocks in favor of modular composition
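As a rough illustration of the commit above, a modular file of this shape would consist mostly of pass-through subclasses of the CLIP building blocks; the class names and import paths below are illustrative only, and the generated modeling_owlvit.py remains the source of truth:

```python
# Sketch of a modular_owlvit.py built on top of CLIP (illustrative only).
from transformers.models.clip.modeling_clip import (
    CLIPMLP,
    CLIPEncoder,
    CLIPEncoderLayer,
    CLIPTextEmbeddings,
    CLIPVisionEmbeddings,
)


class OwlViTVisionEmbeddings(CLIPVisionEmbeddings):
    # No overrides: the modular converter expands this into a full,
    # standalone class inside the generated modeling_owlvit.py.
    pass


class OwlViTTextEmbeddings(CLIPTextEmbeddings):
    pass


class OwlViTMLP(CLIPMLP):
    pass


class OwlViTEncoderLayer(CLIPEncoderLayer):
    pass


class OwlViTEncoder(CLIPEncoder):
    pass
```

Regeneration would then be something along the lines of `python utils/modular_model_converter.py --files_to_parse src/transformers/models/owlvit/modular_owlvit.py` (the exact invocation may differ).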
Force-pushed from 088ab0e to 8d84019
Align eager softmax (float32) with SDPA and fix test_eager_matches_sdpa_inference for bf16.
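For context, the float32 softmax alignment referred to here follows the usual eager-attention pattern: compute the scores in the model dtype, run the softmax in float32, and cast back, so bf16 eager outputs stay close to SDPA. A simplified sketch (not the exact generated code):

```python
import torch
import torch.nn.functional as F


def eager_attention(query, key, value, scaling, attention_mask=None):
    # [batch, heads, q_len, k_len] attention scores in the input dtype (e.g. bf16)
    attn_weights = torch.matmul(query, key.transpose(-1, -2)) * scaling
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask
    # softmax in float32 for numerical stability, then cast back to match SDPA behaviour
    attn_weights = F.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
    attn_output = torch.matmul(attn_weights, value)
    return attn_output, attn_weights
```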
[For maintainers] Suggested jobs to run (before merge): run-slow: owlvit
Force-pushed from 39ee9d7 to e96e92c
- Make OwlViTAttention a pass-through subclass of CLIPAttention
- Import contrastive_loss from CLIP instead of duplicating it
- Remove unused logger/logging and dead imports (Callable, ALL_ATTENTION_FUNCTIONS)
- Regenerate modeling_owlvit.py via the modular converter
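A minimal sketch of the two reuse points named in this commit, assuming the CLIP symbols are imported from modeling_clip (illustrative, not the merged code):

```python
from transformers.models.clip.modeling_clip import CLIPAttention, contrastive_loss


class OwlViTAttention(CLIPAttention):
    # Pass-through subclass: no behaviour changes; the converter materializes
    # the full attention implementation in modeling_owlvit.py.
    pass


def owlvit_loss(similarity):
    # Symmetric contrastive loss reused from CLIP instead of duplicating it.
    caption_loss = contrastive_loss(similarity)
    image_loss = contrastive_loss(similarity.t())
    return (caption_loss + image_loss) / 2.0
```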
Force-pushed from e96e92c to d4054ba
@Aravind-11 we just merged the CLIP refactor PR 🤗 so now we can adjust this one to follow that pattern more closely.
Will get to this next week!! 😋
Also, I had a question. I noticed that the common attention_interface backend seems to call into multiple files, and it takes 4-5 different function calls in total to get to the actual SDPA / flash call. Some utils seem to be preparing the inputs over and over; why is it done this way? I get that it helps with a lot of model and training customisation, but wouldn't it be easier to reduce and compress some of these into fewer calls?
Yea it got a bit more complicated and it opened a can of worms 😬 not sure when / how we properly fix this, but it shouldn't take too long as it is high prio imo.
Re attn interface: can you give me some call graphs that make you think these are so nested? Usually it is only 1-2 levels of depth around the original attns. Flash attention is super unique tho in its structure and not really avoidable atp. Open for improvements tho!
Got it!! LMK if I can be of any help 😅😅 Yes, I was talking about the FA module. It goes to integrations/flash_attention.py, which prepares the inputs, then goes to modeling_flash_attention_utils to run the flash attention implementation. I thought it could just be implemented in integrations itself, but that would break the consistency of the integrations module. I'm just surprised torch doesn't provide independent FA / flex modules.
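For readers following along, the dispatch being discussed looks roughly like this (simplified; exact module paths and signatures in Transformers may differ):

```python
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS


def dispatch_attention(module, query, key, value, attention_mask=None, **kwargs):
    impl = module.config._attn_implementation  # e.g. "sdpa" or "flash_attention_2"
    # "eager" stays inside the modeling file itself; every other backend is
    # looked up by name in the shared registry:
    #   "sdpa"              -> integrations/sdpa_attention.py
    #   "flash_attention_2" -> integrations/flash_attention.py (prepares inputs),
    #                          which then calls modeling_flash_attention_utils
    #                          (unpad/repad, actual flash kernel invocation)
    attention_interface = ALL_ATTENTION_FUNCTIONS[impl]
    # The chosen wrapper runs the actual SDPA / flash kernel.
    return attention_interface(module, query, key, value, attention_mask, **kwargs)
```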
To be short: flash attention is very unique in the way it handles inputs 😅 so it needs super special treatment for all the features and edge cases to stay consistent with our features. I would not advise touching it tbh, it is so intertwined that you can easily mess up.
Ur right 😂.
What does this PR do?
Who can review?
@vasqu