Add CLIP-like models in conversion to VLMs #45361

Closed
zucchini-nlp wants to merge 13 commits into huggingface:main from zucchini-nlp:clip-conversion

Conversation

@zucchini-nlp (Member) commented Apr 10, 2026

What does this PR do?

Fixes huggingface/trl#5497, also fixes #45390

TL;DR: the base model prefix is never appended when the model is part of a bigger VLM, which was the case for LLaVa. Loading a CLIP checkpoint is not affected though, which is why we missed it before merging.

Checked "load-save-load back" pipeline with several models at random: Llava, InternVL, CLIP, Siglip, T5Gemma2, Gemma3, GotOCR, AltClip, ClipSeg. I hope other models are saved in the hub in a similar way

cc @albertvillanova

@HuggingFaceDocBuilderDev commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@albertvillanova (Member) left a comment

Thanks for your responsiveness and for addressing this issue so quickly.

Unfortunately, I have tested your PR branch and the trl tests are still failing.

The load report is different now:

LlavaForConditionalGeneration LOAD REPORT from: trl-internal-testing/tiny-LlavaForConditionalGeneration
Key                                                                | Status     | 
-------------------------------------------------------------------+------------+-
vision_tower.                                                      | UNEXPECTED | 
model.vision_tower.encoder.layers.{0, 1}.layer_norm2.bias          | MISSING    | 
model.vision_tower.encoder.layers.{0, 1}.self_attn.q_proj.weight   | MISSING    | 
model.vision_tower.encoder.layers.{0, 1}.layer_norm2.weight        | MISSING    | 
model.vision_tower.pre_layrnorm.weight                             | MISSING    | 
model.vision_tower.encoder.layers.{0, 1}.mlp.fc2.bias              | MISSING    | 
model.vision_tower.embeddings.patch_embedding.weight               | MISSING    | 
model.vision_tower.encoder.layers.{0, 1}.mlp.fc1.weight            | MISSING    | 
model.vision_tower.encoder.layers.{0, 1}.self_attn.out_proj.weight | MISSING    | 
model.vision_tower.encoder.layers.{0, 1}.self_attn.q_proj.bias     | MISSING    | 
model.vision_tower.pre_layrnorm.bias                               | MISSING    | 
model.vision_tower.embeddings.class_embedding                      | MISSING    | 
model.vision_tower.encoder.layers.{0, 1}.self_attn.k_proj.bias     | MISSING    | 
model.vision_tower.encoder.layers.{0, 1}.layer_norm1.weight        | MISSING    | 
model.vision_tower.encoder.layers.{0, 1}.mlp.fc2.weight            | MISSING    | 
model.vision_tower.encoder.layers.{0, 1}.self_attn.out_proj.bias   | MISSING    | 
model.vision_tower.encoder.layers.{0, 1}.self_attn.v_proj.bias     | MISSING    | 
model.vision_tower.encoder.layers.{0, 1}.self_attn.k_proj.weight   | MISSING    | 
model.vision_tower.post_layernorm.bias                             | MISSING    | 
model.vision_tower.encoder.layers.{0, 1}.self_attn.v_proj.weight   | MISSING    | 
model.vision_tower.post_layernorm.weight                           | MISSING    | 
model.vision_tower.embeddings.position_embedding.weight            | MISSING    | 
model.vision_tower.encoder.layers.{0, 1}.mlp.fc1.bias              | MISSING    | 
model.vision_tower.encoder.layers.{0, 1}.layer_norm1.bias          | MISSING    | 

Notes:
- UNEXPECTED:	can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING:	those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.

@zucchini-nlp (Member, Author) commented

It's weird that the replacement by group didn't work
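
(To spell out what "replacement by group" means here: a regex rename rule that captures part of a key and re-uses it in the replacement string. The pattern below is purely illustrative, not the actual rule in the conversion mapping.)

```python
import re

# Hypothetical group-based rename: re-anchor a checkpoint key under "model."
# while keeping the rest of the key via a capture group.
pattern = r"^vision_tower\.(.*)$"
replacement = r"model.vision_tower.\1"

old_key = "vision_tower.vision_model.encoder.layers.1.self_attn.k_proj.weight"
print(re.sub(pattern, replacement, old_key))
# -> model.vision_tower.vision_model.encoder.layers.1.self_attn.k_proj.weight
```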

@zucchini-nlp (Member, Author) commented

run-slow: llava, clip, llava_next_video, gemma3

@github-actions (Contributor) commented

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/clip", "models/gemma3", "models/llava", "models/llava_next_video"]
quantizations: []

@zucchini-nlp (Member, Author) commented

Should work now, hopefully. Coming back on Monday to check with Cyril

@github-actions (Contributor) commented

CI Results

Workflow Run ⚙️

Commit Info

Context | Commit   | Description
--------+----------+-------------------------------
RUN     | 8d084253 | workflow commit (merge commit)
PR      | 90b6fbef | branch commit (from PR)
main    | 5c7190f0 | base commit (on main)

⚠️ Model CI failed to report results

The test failure analysis could not be completed. Please check the workflow run for details.

@albertvillanova (Member) commented Apr 11, 2026

Thanks again for your quick turnaround on this fix, @zucchini-nlp.

I have re-tested on my side, and I can confirm that the load report issues we were seeing before have now disappeared! 👍

However, it looks like there is still a change in parameter naming: the vision_model nesting was eliminated (as I commented in the trl issue: huggingface/trl#5497 (comment)). For example:

  • model.vision_tower.vision_model.encoder.layers.1.self_attn.k_proj.weight before
  • model.vision_tower.encoder.layers.1.self_attn.k_proj.weight now

Is this renaming intended? This is why our tests are still failing: they rely on the previous key structure.
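
For reference, one way to see the difference is to print the parameter names directly; the snippet below is illustrative, using the tiny test checkpoint from the load report above:

```python
from transformers import LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained(
    "trl-internal-testing/tiny-LlavaForConditionalGeneration"
)

# With this PR the "vision_model." level no longer appears, e.g.
# "model.vision_tower.encoder.layers.1.self_attn.k_proj.weight"
for name in model.state_dict():
    if "vision_tower" in name and "k_proj.weight" in name:
        print(name)
```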

@github-actions (Contributor) commented

[For maintainers] Suggested jobs to run (before merge)

run-slow: altclip, chinese_clip

@zucchini-nlp (Member, Author) commented

When re-saving the model, the conversion isn't applied in reverse, so the renaming doesn't happen. Regex doesn't allow better matching, and we don't yet have proper prefix handling.

@Cyrilvallez, I experimented with adding the prefix names of submodules at run-time by inspecting named_parameters(). It actually works fine and seems to revert keys correctly, though I am not sure that's what we want to do long-term. LMK if you want to get that commit in here (it is here: 55f3f33)
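
Roughly, the idea looks like the sketch below; this is not the actual commit, just a hypothetical helper to illustrate the run-time inspection:

```python
from transformers import PreTrainedModel

def collect_submodule_prefixes(model: PreTrainedModel) -> dict[str, str]:
    """Map each nested PreTrainedModel's path inside the VLM to its own
    base_model_prefix, so that keys saved without that prefix can be
    mapped back when converting to/from the standalone checkpoint layout."""
    param_names = [name for name, _ in model.named_parameters()]
    prefixes = {}
    for module_name, module in model.named_modules():
        if not module_name or not isinstance(module, PreTrainedModel):
            continue
        # Only record submodels that actually own parameters in this model
        if any(name.startswith(module_name + ".") for name in param_names):
            prefixes[module_name] = module.base_model_prefix
    return prefixes
```

For a LlavaForConditionalGeneration instance this would, for example, associate the vision tower path with CLIP's "vision_model" prefix, which is exactly the piece of information needed to revert the renaming.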

@albertvillanova (Member) commented

Thanks for your work on this issue. Any update planned?

@Cyrilvallez (Member) commented

@albertvillanova You can test out #45448, which will supersede this one!

@albertvillanova (Member) commented

Thanks for the update, @Cyrilvallez.

I am commenting on your PR.

BenjaminBossan added a commit to BenjaminBossan/peft that referenced this pull request Apr 20, 2026
The model structure of Clip and similar models sharing the architecture
has changed in Transformers. See:

- huggingface/transformers#45361
- huggingface/transformers#45448

The test was updated to reflect the change.
BenjaminBossan added a commit to huggingface/peft that referenced this pull request Apr 20, 2026
The model structure of Clip and similar models sharing the architecture
has changed in Transformers. See:

- huggingface/transformers#45361
- huggingface/transformers#45448

The test was updated to reflect the change.