Refactor CLIP-like models #44431
Conversation
Let me wrap up #43590 first, but this is very much needed 🙏 I've seen similar stuff not only on CLIP tbh, but can't remember where.
Thanks for reopening @zucchini-nlp! Sorry I couldn't get to it first, happy to review if needed!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
run-slow: altclip, chinese_clip, clip, clipseg, kosmos2, metaclip_2, siglip, siglip2

This comment contains models: ["models/altclip", "models/chinese_clip", "models/clip", "models/clipseg", "models/kosmos2", "models/metaclip_2", "models/siglip", "models/siglip2"]

CI Results · Commit Info

Model CI Report: ❌ 3 new failed tests from this PR 😭
vasqu
left a comment

Some initial comments, just to make sure: we do not need any conversions?
Other than these, super nice work. We need this so much!
    f"`text_config_dict` is provided which will be used to initialize `AltCLIPTextConfig`. The "
    f"`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The "

I don't like that modular didn't change this.
    @auto_docstring
    class Aimv2Model(Aimv2PreTrainedModel):
        config: Aimv2Config
        _no_split_modules = ["Aimv2TextEmbeddings", "Aimv2EncoderLayer", "Aimv2VisionEmbeddings"]

Already defined in Aimv2PreTrainedModel and cleaned up thanks to modular.
    class MetaClip2TextTransformer(MetaClip2PreTrainedModel):

This was a PreTrainedModel, but without a high-level import available. I don't know if anyone was importing and using it, so this may be breaking.
Do we want BC for this model?
    super().__init__(config)
    self.vision_model = MLCDVisionTransformer(config)
    # Initialize weights and apply final processing

With the base model prefix, the model should be loadable back. Deleting this because MLCDVisionModel has no function except being a wrapper around MLCDVisionTransformer.
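The base-model-prefix mechanism mentioned above can be sketched with a toy loader. This is a hypothetical simplification, not the actual `from_pretrained` logic; `reconcile_prefix` and all key names here are made up for illustration:

```python
def reconcile_prefix(model_keys, checkpoint, base_model_prefix):
    """Toy prefix reconciliation: if a checkpoint was saved from the bare
    base model but is loaded into a wrapper (or vice versa), add or strip
    `base_model_prefix` so the keys line up with the target model."""
    prefix = base_model_prefix + "."
    out = {}
    for key, value in checkpoint.items():
        if key in model_keys:                  # exact match, nothing to do
            out[key] = value
        elif prefix + key in model_keys:       # checkpoint lacks the prefix
            out[prefix + key] = value
        elif key.startswith(prefix) and key[len(prefix):] in model_keys:
            out[key[len(prefix):]] = value     # checkpoint has an extra prefix
    return out

# A checkpoint saved from the bare transformer still loads into the wrapper:
loaded = reconcile_prefix({"vision_model.encoder.weight"}, {"encoder.weight": 1}, "vision_model")
```

This is why deleting a wrapper that only forwards to the inner transformer can stay load-compatible, as long as the prefix bookkeeping matches.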
@vasqu this is ready and modular now. The CI will be red for a while for unrelated reasons.
vasqu
left a comment

The core is super solid, I'm just nitpicky as always. I think there are a few valid comments, but most are definitely not major.
Looks like a bad rebase 👀

Which is why I love rebasing in GitHub haha, will force-squash all commits then.
Force-pushed from 4f60399 to 56a76c2.
    # Copyright 2022 WenXiang ZhongzhiCheng LedellWu LiuGuang BoWenZhang and The HuggingFace Inc. team. All rights reserved.
    #

Model and config had different headers. I guess BAAI is the right one, since the models are released under BAAI.
run-slow: clip, clipseg, chinese_clip, altclip, siglip, siglip2, metaclip_2

This comment contains models: ["models/altclip", "models/chinese_clip", "models/clip", "models/clipseg", "models/metaclip_2", "models/siglip", "models/siglip2"]

CI Results · Commit Info

Model CI Report: ❌ 9 new failed tests from this PR 😭
run-slow: altclip, chinese_clip, clip, clipseg, metaclip_2, siglip, siglip2, x_clip

This comment contains models: ["models/altclip", "models/chinese_clip", "models/clip", "models/clipseg", "models/metaclip_2", "models/siglip", "models/siglip2", "models/x_clip"]

CI Results · Commit Info

The test failure analysis could not be completed. Please check the workflow run for details.

Slow tests are passing, checked with the latest nightly.
[For maintainers] Suggested jobs to run (before merge): run-slow: aimv2, altclip, blip, chinese_clip, clip, clipseg, clvp, git, groupvit, idefics, kosmos2, metaclip_2, mlcd, owlv2, owlvit, siglip
vasqu
left a comment

Flaky test only, so merging as it's quite an important PR to have.
@vladmandic pretty sure this will break sd.next

I'll check, thanks for the heads-up. But yes, most likely.

@zucchini-nlp @vasqu this breaks all existing checkpoints for clip and other models.

We absolutely cannot change the module graph without adding the necessary mappings or something.
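One common way to handle such mappings is a key-rename pass over the checkpoint's state dict at load time. A minimal sketch, assuming a hypothetical `prefix_map` (this is not the actual mapping mechanism in transformers):

```python
def remap_state_dict(state_dict, prefix_map):
    """Rename checkpoint keys whose dotted prefix changed when modules moved."""
    remapped = {}
    for key, value in state_dict.items():
        for old_prefix, new_prefix in prefix_map.items():
            if key.startswith(old_prefix):
                key = new_prefix + key[len(old_prefix):]
                break  # apply at most one rename per key
        remapped[key] = value
    return remapped

# Old checkpoints nested the text tower one level deeper:
old_sd = {"text_model.text_model.embeddings.weight": 0}
new_sd = remap_state_dict(old_sd, {"text_model.text_model.": "text_model."})
```

With such a map shipped alongside the refactor, old checkpoints keep loading into the flattened module graph.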
FYI Cyril: I am aware and working on it (got swayed by something else), so we don't have duplicate PRs.
* squash!
* fix copies for losses
* reverse mapping test
* xclip
* fxi repo
* altclip needs eager attn to pass the test?! and ofc non-causal masl
* final fix repo
* layernorm typo / docsting
* clips can't agree on causality
* ugh, skip xclip
* fix repo
* comments
* and more
* fixing stuff
* i didn't push it yesterday?
* style
* break out of infinte dependency loop
* fxi repo and hopefully merge today
* again
What does this PR do?
Re-opening a PR on cleaning up CLIP-like models' backbones. Let's merge it now; I've been seeing quite a lot of people reporting it, and I am not sure when it would be resolved by the big vision refactor.

Basically, it just removes unused modules and gets rid of the nested `text_model.text_model`. That allows `capture_outputs` to work easily without patching.
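The nesting removal can be sketched with plain classes and a toy analogue of dotted-path parameter naming (none of this is the actual transformers code; the classes and `named_leaves` helper are made up for illustration):

```python
class TextTransformer:
    def __init__(self):
        self.embeddings = "embeddings"  # stand-in for real parameters

# Before: a wrapper re-attaches the transformer under another `text_model`,
# so paths come out as "text_model.text_model...".
class NestedWrapper:
    def __init__(self):
        self.text_model = TextTransformer()

class OldClipLike:
    def __init__(self):
        self.text_model = NestedWrapper()

# After: the transformer hangs directly off the top-level model.
class NewClipLike:
    def __init__(self):
        self.text_model = TextTransformer()

def named_leaves(obj, prefix=""):
    """Walk attributes the way named_parameters walks modules, yielding dotted paths."""
    for name, value in vars(obj).items():
        path = f"{prefix}{name}"
        if hasattr(value, "__dict__") and vars(value):
            yield from named_leaves(value, path + ".")
        else:
            yield path

old_paths = list(named_leaves(OldClipLike()))  # doubled "text_model" segment
new_paths = list(named_leaves(NewClipLike()))  # single, predictable prefix
```

Hook-based tools address submodules by dotted path, so a stable, un-nested path like `text_model.embeddings` is much easier to target than one with a doubled `text_model` segment.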