🚨 Generalize get_decoder() for multimodal and delete redundant code 🔪 #42156

Merged
zucchini-nlp merged 12 commits into huggingface:main from zucchini-nlp:get-submodels
Nov 19, 2025
Conversation

@zucchini-nlp (Member) commented Nov 12, 2025

What does this PR do?

As per title, blocked by #41589 for VLMs! After this, we should be able to use get_decoder() to get the LM part of any model and have much less duplicate code. The same goes for get_encoder(), which returns the encoder if the model has a separate encoding module. Unlike the decoder, there can be a specific encoder per modality, so the helper accepts the modality as an argument.

A universal helper reduces duplicate code, nudges us to use standardized names for major modules, and can be used by third-party libraries. Right now we have five ways to name a vision encoder!

🚨 Breaking changes (I guess we can break helpers for v5):

  • VLMs will no longer have a property exposing self.language_model directly on the task model; users will need to call self.get_decoder() instead (see the sketch below)
  • Deleted get_text_encoder and get_audio_encoder in some audio models because their functionality is now covered by get_encoder()
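A minimal sketch of the migration, assuming a LLaVA-style checkpoint (the checkpoint name is only an example):

```python
from transformers import AutoModelForImageTextToText

# Example checkpoint; any LLaVA-style model follows the same pattern.
model = AutoModelForImageTextToText.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Before this PR:  lm = model.language_model
# After this PR:
lm = model.get_decoder()  # the text backbone of the multimodal model
```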

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

zucchini-nlp changed the title from "[WIP] Generalize get_decoder() for multimodal and delete redundant code 🔪" to "🚨 Generalize get_decoder() for multimodal and delete redundant code 🔪" on Nov 13, 2025
from transformers import GPT2Config, GPT2LMHeadModel

cfg = GPT2Config()  # any config works for this check
model = GPT2LMHeadModel(cfg)
dec = model.get_decoder()

assert dec is model, f"GPT2 get_decoder() should return self (fallback), got {type(dec)}"
@zucchini-nlp (Member Author) commented Nov 13, 2025

The previous helper didn't cover all edge cases! This should be the base model, if we compare with other LLMs (e.g. Llama).
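For reference, a small sketch of the behaviour being compared against (the tiny config values are illustrative only):

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Tiny config so the model builds quickly; the values are illustrative only.
cfg = LlamaConfig(num_hidden_layers=2, hidden_size=64, num_attention_heads=2,
                  num_key_value_heads=2, intermediate_size=128, vocab_size=1000)
llama = LlamaForCausalLM(cfg)

# For text-only LLMs the helper returns the base (headless) model, not the
# task wrapper itself; GPT-2 should behave analogously after the fix.
assert llama.get_decoder() is llama.model
```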

@molbap (Contributor) left a comment

Very nice unbloating 🔪
OK for me; it would just be cool to add this to the make style / ruff / quality checks to reduce cognitive load.

Symmetric setter. Mirrors the lookup logic used in `get_encoder`.
"""

# NOTE: new models need to use existing names for layers if possible, so this list doesn't grow infinitely
Contributor

To note, this should be enforced in the code-consistency part of make fixup to save ourselves the hassle.

Member Author

Hmm, isn't it going to be a huge limitation for contributors if we force it and auto-rename with fix-copies? Imo we need to communicate it when reviewing and explain why it's important. It's only a few people reviewing VLMs currently, so it might be easier.

Contributor

I was thinking the make fixup message (or rather the code-quality check on CI, same thing) would be informative enough, saying "decoder layer names should be part of this list: ..." rather than auto-renaming. It could be a ruff warning if we think it's too restrictive as an error?

Member Author

Hmm, let me see where I can fit this in a non-disruptive way. Not sure users actually read the warnings; we should be stricter in the review process in any case imo 😆

@zucchini-nlp (Member Author)
Merge conflicts after a big refactor 😢

@github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: aria, autoformer, aya_vision, bart, bigbird_pegasus, blenderbot, blenderbot_small, blip_2, cohere2_vision, colqwen2, conditional_detr, d_fine, dab_detr, deformable_detr, detr, dia

@zucchini-nlp (Member Author)

hey @jackzhxng, I remember you requested this feature for torch.export. Now you can do:

multimodal_model.get_decoder() -> returns the decoding LM
multimodal_model.get_encoder() -> returns the encoding LM if any
multimodal_model.get_encoder(modality="image") -> returns the encoding vision tower if any
multimodal_model.get_encoder(modality="audio") -> returns the encoding audio tower if any

also cc @hmellor, we discussed this re vLLM as well

@zucchini-nlp zucchini-nlp merged commit e2fb8d6 into huggingface:main Nov 19, 2025
23 checks passed
@BenjaminBossan (Member)

Hi @zucchini-nlp, this PR causes an issue with PEFT, as (at least some) decoder models now have get_encoder. As an example:

from transformers import AutoModelForCausalLM

model_id = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_id)
assert not hasattr(model, "get_encoder")
# after this PR, model.get_encoder() returns model.model

This works with the previous commit (a5c903f877fda21e739027eed133e03162eb7712) but fails after this PR (e2fb8d6062a05f69f976cf6e39618df6c31a3bfd). Is this change intended?

@jackzhxng (Contributor)

@zucchini-nlp this is amazing thank you!

@zucchini-nlp (Member Author)

@BenjaminBossan ideally it should return self and not the base model. I see where it comes from and will fix it. Then you can check with something like: if self.get_encoder() is not self: encoder = self.get_encoder()

All models will have get_encoder() implemented, but if a model has no actual encoder it should return self.
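A minimal sketch of that intended convention (checkpoint names are only examples, and the fallback behaviour depends on the follow-up fix mentioned further below):

```python
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM

# Decoder-only model: no separate encoder, so get_encoder() is meant to fall back to self.
opt = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
encoder = opt.get_encoder() if opt.get_encoder() is not opt else None

# Encoder-decoder model: the helper returns the actual encoder module.
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
assert t5.get_encoder() is not t5
```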

@BenjaminBossan (Member)

> I see where it comes from and will fix it.

Great, please let me know when the PR is there.

> All models will have get_encoder() implemented, but if a model has no actual encoder it should return self

We can modify PEFT to take this into account. But at least to me, this API feels a bit strange, to be honest.

@zucchini-nlp (Member Author)

@BenjaminBossan yeah, it is because we have get_encoder() in PreTrainedModel, so all models will have it as an attribute. It might feel weird, though it is the same as it was with get_decoder(), and I think it saves us from duplicating the same code in all models.

BenjaminBossan added a commit to BenjaminBossan/peft that referenced this pull request Nov 20, 2025
When using mixed adapter batches (i.e. using different LoRA adapters in
the same batch), users have to pass adapter_names. When simultaneously
using beam search, these adapter names have to be extended by the number
of beams. For encoder-decoder models, even when applying beam search,
the encoder part of the model should, however, not use the extended
adapter_names. This is because the encoder still uses the original,
non-extended samples.

The need for this used to be checked by calling model.get_encoder().
However, with transformers v5, every PretrainedModel will have a
get_encoder method. The new convention is that it will return self if
there is no encoder. This is now what's being checked.

huggingface/transformers#42156

Note that said PR contains a small bug that leads to self not always
being returned. Therefore, for the full fix of the issue on transformers
main, we also need to await this PR:

huggingface/transformers#42295
BenjaminBossan added a commit to huggingface/peft that referenced this pull request Nov 20, 2025
Conzel pushed a commit to Conzel/peft that referenced this pull request Nov 25, 2025
@jackzhxng (Contributor)

@zucchini-nlp I believe Voxtral actually still has a top-level language_model - https://github.com/huggingface/transformers/blob/main/src/transformers/models/voxtral/modeling_voxtral.py#L379

Additionally, would it be possible for get_decoder to return the causal variant of the model, which includes the lm_head?

@zucchini-nlp (Member Author)

Ah, Voxtral has a causal model as its backbone. We need to explicitly override get_decoder for that model in this case, will do.
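A rough, generic sketch of the kind of override described here (not the actual Voxtral code; the class and attribute names are illustrative):

```python
import torch.nn as nn

class ToyMultimodalModel(nn.Module):
    """Toy stand-in for a model whose text backbone is a *ForCausalLM wrapper."""

    def __init__(self, language_model: nn.Module):
        super().__init__()
        self.language_model = language_model  # causal-LM wrapper with its own get_decoder()

    def get_decoder(self) -> nn.Module:
        # Reach through the wrapper so callers always get the headless base model.
        return self.language_model.get_decoder()
```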

lancerts added a commit to linkedin/Liger-Kernel that referenced this pull request Dec 15, 2025
## Summary

This PR fixes access to missing attributes for multimodal models in
`src/liger_kernel/transformers/monkey_patch.py`. The main change is to
consistently access attributes (like `language_model`, `vision_tower`,
and `visual`) through the submodel `.model` attribute of the parent
model, rather than directly from the parent model itself.

This fixes AttributeError after this PR was merged in transformers:
- huggingface/transformers#42156

See associated issue in TRL:
- huggingface/trl#4601

Fix #960.

## Details

Fix: Consistent attribute access via `.model`

* Updated all references to submodules such as `language_model`,
`vision_tower`, and `visual` to use the `.model` attribute (e.g.,
`model.model.language_model` instead of `model.language_model`) across
all kernel application functions for models including LLava, Mllama,
Gemma3, PaliGemma, Qwen2 VL, Qwen2.5 VL, Qwen3 VL, Qwen3 VL MoE, GLM4V,
GLM4V MoE, and InternVL.

Normalization and patching logic updates

* Adjusted normalization and patching calls to operate on submodels
accessed via `.model`, ensuring that layer normalization and RMS
normalization are consistently applied to the correct components.

These changes make the codebase more maintainable and robust against
future changes in model class implementations.
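For illustration, a minimal sketch of the access pattern described above, assuming a Gemma3-style checkpoint (the checkpoint name is only an example):

```python
from transformers import Gemma3ForConditionalGeneration

model = Gemma3ForConditionalGeneration.from_pretrained("google/gemma-3-4b-it")

# Before: submodules were read off the task wrapper directly
# language_model = model.language_model
# After transformers#42156: go through the base model
language_model = model.model.language_model
vision_tower = model.model.vision_tower
```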

## Testing Done

- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Shao Tang <tangshao28@gmail.com>
@zucchini-nlp (Member Author)

Oh, "which includes the lm_head", I misread that!

Nah, the base model is supposed to be the model without any task head on top, so in Voxtral it is going to be the language model without a head.
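A minimal sketch of that convention with a small text-only model (the checkpoint name is only an example; the exact module returned can vary per architecture):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-125m"  # small example checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tok("hello there", return_tensors="pt")
decoder = model.get_decoder()                   # headless backbone, no logits
hidden = decoder(**inputs).last_hidden_state
logits = model.lm_head(hidden)                  # the task head stays on the wrapper
```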

Arlaz pushed a commit to obvious-research/peft that referenced this pull request Jan 9, 2026
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026
… 🔪 (huggingface#42156)

* update some models

* update the rest

* add helper for encoder

* delete encoder code from models

* fix copies

* fix some tests but VLM will fail

* add encoder tests similar to decoder

* no print

* fix overwritten models

* and a million exceptions with old audio models, revert back