Add CLIP-like models in conversion to VLMs #45361
zucchini-nlp wants to merge 13 commits into huggingface:main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for your reactivity and for addressing this issue so quickly.
Unfortunately, I have tested your PR branch and the trl tests still fail.
The load report is different now:
LlavaForConditionalGeneration LOAD REPORT from: trl-internal-testing/tiny-LlavaForConditionalGeneration
Key | Status |
-------------------------------------------------------------------+------------+-
vision_tower. | UNEXPECTED |
model.vision_tower.encoder.layers.{0, 1}.layer_norm2.bias | MISSING |
model.vision_tower.encoder.layers.{0, 1}.self_attn.q_proj.weight | MISSING |
model.vision_tower.encoder.layers.{0, 1}.layer_norm2.weight | MISSING |
model.vision_tower.pre_layrnorm.weight | MISSING |
model.vision_tower.encoder.layers.{0, 1}.mlp.fc2.bias | MISSING |
model.vision_tower.embeddings.patch_embedding.weight | MISSING |
model.vision_tower.encoder.layers.{0, 1}.mlp.fc1.weight | MISSING |
model.vision_tower.encoder.layers.{0, 1}.self_attn.out_proj.weight | MISSING |
model.vision_tower.encoder.layers.{0, 1}.self_attn.q_proj.bias | MISSING |
model.vision_tower.pre_layrnorm.bias | MISSING |
model.vision_tower.embeddings.class_embedding | MISSING |
model.vision_tower.encoder.layers.{0, 1}.self_attn.k_proj.bias | MISSING |
model.vision_tower.encoder.layers.{0, 1}.layer_norm1.weight | MISSING |
model.vision_tower.encoder.layers.{0, 1}.mlp.fc2.weight | MISSING |
model.vision_tower.encoder.layers.{0, 1}.self_attn.out_proj.bias | MISSING |
model.vision_tower.encoder.layers.{0, 1}.self_attn.v_proj.bias | MISSING |
model.vision_tower.encoder.layers.{0, 1}.self_attn.k_proj.weight | MISSING |
model.vision_tower.post_layernorm.bias | MISSING |
model.vision_tower.encoder.layers.{0, 1}.self_attn.v_proj.weight | MISSING |
model.vision_tower.post_layernorm.weight | MISSING |
model.vision_tower.embeddings.position_embedding.weight | MISSING |
model.vision_tower.encoder.layers.{0, 1}.mlp.fc1.bias | MISSING |
model.vision_tower.encoder.layers.{0, 1}.layer_norm1.bias | MISSING |
Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING: those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
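The MISSING/UNEXPECTED pairing in the report above is the classic signature of a key rename: the same parameters appear on both sides under two different prefixes. This is not transformers' actual implementation, but a minimal sketch of how a load report classifies keys (the key names are illustrative):

```python
def load_report(model_keys, checkpoint_keys):
    # Keys the model declares but the checkpoint lacks are MISSING
    # (they get freshly initialized); checkpoint keys with no matching
    # model parameter are UNEXPECTED.
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    return {
        "missing": sorted(model_keys - checkpoint_keys),
        "unexpected": sorted(checkpoint_keys - model_keys),
    }

# Hypothetical mismatch: the checkpoint stores the vision tower without
# the new "model." prefix, so every renamed parameter is reported twice,
# once per side.
report = load_report(
    model_keys=["model.vision_tower.post_layernorm.weight"],
    checkpoint_keys=["vision_tower.post_layernorm.weight"],
)
```

When the rename conversion works, both lists come back empty for checkpoints saved in either layout.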
It's weird that the replacement by group didn't work

run-slow: llava, clip, llava_next_video, gemma3

This comment contains models: ["models/clip", "models/gemma3", "models/llava", "models/llava_next_video"]
Should work now, hopefully. Coming back on Monday to check with Cyril |
CI Results
The test failure analysis could not be completed. Please check the workflow run for details.
Thanks again for your quick turnaround on this fix, @zucchini-nlp. I have re-tested on my side, and I can confirm that the load report issues we were seeing before have now disappeared! 👍 However, it looks like there is still a change in parameter naming.
Is this renaming intended? This is the reason why our tests are still failing, as they rely on the previous key structure.
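For downstream tests pinned to the old key structure, one option while the naming settles is a small rename pass over the state dict. The exact old-to-new mapping depends on the final PR; the prefix pair below is an assumption for illustration only:

```python
import re

# Assumed mapping for this sketch: plain "vision_tower." keys gained a
# "model." prefix in the new layout.
RENAMES = [(re.compile(r"^vision_tower\."), "model.vision_tower.")]

def rename_keys(state_dict):
    """Return a copy of the state dict with legacy key names rewritten."""
    renamed = {}
    for key, value in state_dict.items():
        for pattern, replacement in RENAMES:
            key = pattern.sub(replacement, key)
        renamed[key] = value
    return renamed

old = {
    "vision_tower.post_layernorm.weight": "...",
    "language_model.embed_tokens.weight": "...",
}
new = rename_keys(old)
```

Keys that don't match any pattern, like the language-model keys here, pass through unchanged.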
[For maintainers] Suggested jobs to run (before merge) run-slow: altclip, chinese_clip |
This reverts commit 55f3f33.
When re-saving the model, the conversion isn't applied in reverse, so the renaming doesn't happen on save. Regex doesn't allow better matching, and we don't yet have proper prefix handling. @Cyrilvallez, I experimented with adding a prefix name for submodules at run-time by inspecting
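One way to read the run-time prefix-inspection idea: a flat regex over key names cannot tell a standalone CLIP checkpoint from CLIP embedded as a VLM submodule, but checking which submodule prefixes the target model actually owns can. A hedged sketch, with all prefix names illustrative:

```python
import re

def convert_keys(checkpoint_keys, submodule_prefixes):
    # Only rewrite legacy keys when the model actually owns the target
    # submodule; a standalone CLIP model has no "model.vision_tower"
    # submodule, so its checkpoint keys pass through untouched.
    converted = []
    for key in checkpoint_keys:
        if "model.vision_tower" in submodule_prefixes:
            key = re.sub(r"^vision_tower\.", "model.vision_tower.", key)
        converted.append(key)
    return converted

# Hypothetical prefix sets, as one might collect from named submodules.
vlm_prefixes = {"model", "model.vision_tower", "model.language_model"}
clip_prefixes = {"vision_model", "text_model"}
```

The same conversion rule then renames keys for the VLM but leaves a plain CLIP checkpoint alone.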
Thanks for your work on this issue. Any update planned? |
@albertvillanova You can test out #45448, which will supersede this one!
Thanks for the update, @Cyrilvallez. I am commenting on your PR. |
The model structure of CLIP and similar models sharing the architecture has changed in Transformers. See:
- huggingface/transformers#45361
- huggingface/transformers#45448
The test was updated to reflect the change.
What does this PR do?
Fixes huggingface/trl#5497, also fixes #45390
TL;DR: the base model prefix was never appended when the model is part of a bigger VLM, which was the case for LLaVa. Loading a standalone CLIP checkpoint is not affected though, which is why we missed it before merging.
I checked the "load-save-load back" pipeline with several models picked at random: Llava, InternVL, CLIP, Siglip, T5Gemma2, Gemma3, GotOCR, AltClip, ClipSeg. I hope other models are saved on the Hub in a similar way.
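The "load-save-load back" check can be sketched as a round-trip invariant over key names, with plain functions standing in for the real from_pretrained/save_pretrained calls and an assumed, illustrative prefix mapping:

```python
import re

def load_convert(key):
    # On load, legacy keys gain the new "model." prefix (assumed
    # mapping, for illustration only).
    return re.sub(r"^vision_tower\.", "model.vision_tower.", key)

def save_convert(key):
    # On save, the conversion must be inverted so a re-saved checkpoint
    # keeps the on-disk layout the loader expects.
    return re.sub(r"^model\.vision_tower\.", "vision_tower.", key)

def round_trip_ok(checkpoint_keys):
    """load -> save -> load must reproduce the in-memory key layout."""
    in_memory = [load_convert(k) for k in checkpoint_keys]
    re_saved = [save_convert(k) for k in in_memory]
    return [load_convert(k) for k in re_saved] == in_memory
```

If saving skipped the inverse conversion, the re-saved checkpoint would carry the in-memory names, and consumers expecting the on-disk layout would see the MISSING/UNEXPECTED pairs reported earlier in this thread.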
cc @albertvillanova