Add CLIP-like models in conversion to VLMs #45361

Closed
zucchini-nlp wants to merge 13 commits into huggingface:main from zucchini-nlp:clip-conversion

Conversation

@zucchini-nlp (Member) commented Apr 10, 2026

What does this PR do?

Fixes huggingface/trl#5497, also fixes #45390

TL;DR: the base model prefix is never appended when the model is part of a bigger VLM, which was the case for LLaVa. Loading a CLIP checkpoint is not affected though, which is why we missed it before merging.

Checked "load-save-load back" pipeline with several models at random: Llava, InternVL, CLIP, Siglip, T5Gemma2, Gemma3, GotOCR, AltClip, ClipSeg. I hope other models are saved in the hub in a similar way

cc @albertvillanova

@HuggingFaceDocBuilderDev commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@albertvillanova (Member) left a comment

Thanks for your responsiveness and for addressing this issue so quickly.

Unfortunately, I have tested your PR branch and the trl tests are still failing.

The load report is different now:

LlavaForConditionalGeneration LOAD REPORT from: trl-internal-testing/tiny-LlavaForConditionalGeneration
Key                                                                | Status     | 
-------------------------------------------------------------------+------------+-
vision_tower.                                                      | UNEXPECTED | 
model.vision_tower.encoder.layers.{0, 1}.layer_norm2.bias          | MISSING    | 
model.vision_tower.encoder.layers.{0, 1}.self_attn.q_proj.weight   | MISSING    | 
model.vision_tower.encoder.layers.{0, 1}.layer_norm2.weight        | MISSING    | 
model.vision_tower.pre_layrnorm.weight                             | MISSING    | 
model.vision_tower.encoder.layers.{0, 1}.mlp.fc2.bias              | MISSING    | 
model.vision_tower.embeddings.patch_embedding.weight               | MISSING    | 
model.vision_tower.encoder.layers.{0, 1}.mlp.fc1.weight            | MISSING    | 
model.vision_tower.encoder.layers.{0, 1}.self_attn.out_proj.weight | MISSING    | 
model.vision_tower.encoder.layers.{0, 1}.self_attn.q_proj.bias     | MISSING    | 
model.vision_tower.pre_layrnorm.bias                               | MISSING    | 
model.vision_tower.embeddings.class_embedding                      | MISSING    | 
model.vision_tower.encoder.layers.{0, 1}.self_attn.k_proj.bias     | MISSING    | 
model.vision_tower.encoder.layers.{0, 1}.layer_norm1.weight        | MISSING    | 
model.vision_tower.encoder.layers.{0, 1}.mlp.fc2.weight            | MISSING    | 
model.vision_tower.encoder.layers.{0, 1}.self_attn.out_proj.bias   | MISSING    | 
model.vision_tower.encoder.layers.{0, 1}.self_attn.v_proj.bias     | MISSING    | 
model.vision_tower.encoder.layers.{0, 1}.self_attn.k_proj.weight   | MISSING    | 
model.vision_tower.post_layernorm.bias                             | MISSING    | 
model.vision_tower.encoder.layers.{0, 1}.self_attn.v_proj.weight   | MISSING    | 
model.vision_tower.post_layernorm.weight                           | MISSING    | 
model.vision_tower.embeddings.position_embedding.weight            | MISSING    | 
model.vision_tower.encoder.layers.{0, 1}.mlp.fc1.bias              | MISSING    | 
model.vision_tower.encoder.layers.{0, 1}.layer_norm1.bias          | MISSING    | 

Notes:
- UNEXPECTED:	can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING:	those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.

@zucchini-nlp (Member, Author) commented

It's weird that the replacement by group didn't work
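
(To spell out what "replacement by group" means here: a regex rename rule that captures part of a key and re-uses it in the replacement string. The pattern below is purely illustrative, not the actual rule in the conversion mapping.)

```python
import re

# Hypothetical group-based rename: re-anchor a checkpoint key under "model."
# while keeping the rest of the key via a capture group.
pattern = r"^vision_tower\.(.*)$"
replacement = r"model.vision_tower.\1"

old_key = "vision_tower.vision_model.encoder.layers.1.self_attn.k_proj.weight"
print(re.sub(pattern, replacement, old_key))
# -> model.vision_tower.vision_model.encoder.layers.1.self_attn.k_proj.weight
```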

@zucchini-nlp (Member, Author) commented

run-slow: llava, clip, llava_next_video, gemma3

@github-actions (Contributor) commented

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/clip", "models/gemma3", "models/llava", "models/llava_next_video"]
quantizations: []

@zucchini-nlp (Member, Author) commented

Should work now, hopefully. Coming back on Monday to check with Cyril

@github-actions (Contributor) commented

CI Results

Workflow Run ⚙️

Commit Info

Context | Commit   | Description
--------+----------+-------------------------------
RUN     | 8d084253 | workflow commit (merge commit)
PR      | 90b6fbef | branch commit (from PR)
main    | 5c7190f0 | base commit (on main)

⚠️ Model CI failed to report results

The test failure analysis could not be completed. Please check the workflow run for details.

@albertvillanova (Member) commented Apr 11, 2026

Thanks again for your quick turnaround on this fix, @zucchini-nlp.

I have re-tested on my side, and I can confirm that the load report issues we were seeing before have now disappeared! 👍

However, it looks like there is still a change in parameter naming: the vision_model nesting was eliminated (as I commented in the trl issue: huggingface/trl#5497 (comment)). For example:

  • model.vision_tower.vision_model.encoder.layers.1.self_attn.k_proj.weight before
  • model.vision_tower.encoder.layers.1.self_attn.k_proj.weight now

Is this renaming intended? This is why our tests are still failing: they rely on the previous key structure.
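
For reference, one way to see the difference is to print the parameter names directly; the snippet below is illustrative, using the tiny test checkpoint from the load report above:

```python
from transformers import LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained(
    "trl-internal-testing/tiny-LlavaForConditionalGeneration"
)

# With this PR the "vision_model." level no longer appears, e.g.
# "model.vision_tower.encoder.layers.1.self_attn.k_proj.weight"
for name in model.state_dict():
    if "vision_tower" in name and "k_proj.weight" in name:
        print(name)
```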

@github-actions (Contributor) commented

[For maintainers] Suggested jobs to run (before merge)

run-slow: altclip, chinese_clip

@zucchini-nlp (Member, Author) commented

When re-saving the model, the conversion isn't applied in reverse, so the renaming doesn't happen. Regex doesn't allow better matching, and we don't yet have proper prefix handling.

@Cyrilvallez, I experimented with adding the prefix names of submodules at run-time by inspecting named_parameters(). It actually works fine and seems to revert keys correctly, though I am not sure that's what we want to do long-term. LMK if you want to get that commit in here (it is here: 55f3f33)
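
Roughly, the idea looks like the sketch below; this is not the actual commit, just a hypothetical helper to illustrate the run-time inspection:

```python
from transformers import PreTrainedModel

def collect_submodule_prefixes(model: PreTrainedModel) -> dict[str, str]:
    """Map each nested PreTrainedModel's path inside the VLM to its own
    base_model_prefix, so that keys saved without that prefix can be
    mapped back when converting to/from the standalone checkpoint layout."""
    param_names = [name for name, _ in model.named_parameters()]
    prefixes = {}
    for module_name, module in model.named_modules():
        if not module_name or not isinstance(module, PreTrainedModel):
            continue
        # Only record submodels that actually own parameters in this model
        if any(name.startswith(module_name + ".") for name in param_names):
            prefixes[module_name] = module.base_model_prefix
    return prefixes
```

For a LlavaForConditionalGeneration instance this would, for example, associate the vision tower path with CLIP's "vision_model" prefix, which is exactly the piece of information needed to revert the renaming.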

@albertvillanova (Member) commented

Thanks for your work on this issue. Any update planned?

@Cyrilvallez (Member) commented

@albertvillanova You can test out #45448, which will supersede this one!

@albertvillanova (Member) commented

Thanks for the update, @Cyrilvallez.

I am commenting on your PR.

BenjaminBossan added a commit to BenjaminBossan/peft that referenced this pull request Apr 20, 2026
The model structure of Clip and similar models sharing the architecture
has changed in Transformers. See:

- huggingface/transformers#45361
- huggingface/transformers#45448

The test was updated to reflect the change.
BenjaminBossan added a commit to huggingface/peft that referenced this pull request Apr 20, 2026
The model structure of Clip and similar models sharing the architecture
has changed in Transformers. See:

- huggingface/transformers#45361
- huggingface/transformers#45448

The test was updated to reflect the change.