
fix(ernie4_5_vl_moe): resolve three config loading failures for ERNIE-4.5-VL MoE models #45275

Closed

avarga1 wants to merge 5 commits into huggingface:main from avarga1:fix/ernie4-5-vl-moe-config-loading

Conversation

@avarga1

@avarga1 avarga1 commented Apr 7, 2026

Problem

AutoConfig.from_pretrained("baidu/ERNIE-4.5-VL-28B-A3B-Paddle", trust_remote_code=True) raises errors that prevent the model from loading at all. Three separate bugs compound each other:

Bug 1 — model_type mismatch (KeyError on load)

The published checkpoint uses "model_type": "ernie4_5_moe_vl" in its config.json, but the transformers class is registered as "ernie4_5_vl_moe". Since there is no auto_map in the checkpoint's config, the mapping lookup fails with a KeyError, which AutoConfig surfaces as:

ValueError: The checkpoint you are trying to load has model type `ernie4_5_moe_vl`
but Transformers does not recognize this architecture.

Fix: Add "ernie4_5_moe_vl" alias in SPECIAL_MODEL_TYPE_TO_MODULE_NAME (pointing to the ernie4_5_vl_moe module), CONFIG_MAPPING_NAMES, and MODEL_NAMES_MAPPING.
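
Roughly, the new entries look like this (a sketch based on the mapping names in this PR; the exact surrounding entries in configuration_auto.py may differ):

SPECIAL_MODEL_TYPE_TO_MODULE_NAME = {
    "ernie4_5_moe_vl": "ernie4_5_vl_moe",  # checkpoint's model_type -> transformers module name
}

CONFIG_MAPPING_NAMES = {
    "ernie4_5_vl_moe": "Ernie4_5_VLMoeConfig",  # existing entry
    "ernie4_5_moe_vl": "Ernie4_5_VLMoeConfig",  # new alias matching the published checkpoint
}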

Bug 2 — rope_theta validation skipped (silent misconfiguration)

PreTrainedConfig.__init__ only triggered convert_rope_params_to_dict when rope_theta was present in **kwargs. However, Ernie4_5Config.__init__ consumes rope_theta as a named parameter (sets self.rope_theta = 500000) before calling super().__init__(**kwargs) — so rope_theta is never in kwargs. The RoPE standardization branch never fires.

Fix: Also check getattr(self, "rope_theta", None) is not None so the conversion path fires correctly when rope_theta was already set as an instance attribute.
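
A toy illustration of why the kwargs-only check never sees the value (hypothetical classes, not the real transformers code):

class Base:
    def __init__(self, **kwargs):
        # Stand-in for PreTrainedConfig.__init__: only reacts to rope_theta passed via **kwargs.
        if kwargs.get("rope_theta") is not None:
            print("standardizing RoPE params")

class Child(Base):
    def __init__(self, rope_theta=500000, **kwargs):
        self.rope_theta = rope_theta  # consumed as a named parameter...
        super().__init__(**kwargs)    # ...so it never appears in Base's **kwargs

Child()  # prints nothing: the standardization branch is skipped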

Bug 3 — moe_num_experts type too narrow (StrictDataclassFieldValidationError)

Ernie4_5_VLMoeTextConfig declares moe_num_experts: int | None = 64, but the published checkpoint supplies "moe_num_experts": [64, 64] — a per-layer list. The @strict dataclass validator rejects the list.

Fix: Widen the type annotation to int | list[int] | None.
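
Sketched as a standalone dataclass (the real class derives from Ernie4_5_MoeConfig and carries many more fields; only the field touched here is shown):

from dataclasses import dataclass

@dataclass
class Ernie4_5_VLMoeTextConfig:
    # Accept a single expert count or a per-layer list such as [64, 64].
    moe_num_experts: int | list[int] | None = 64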

Verification

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("baidu/ERNIE-4.5-VL-28B-A3B-Paddle", trust_remote_code=True)
# Before: raises ValueError (model_type not recognized)
# After:  Ernie4_5_VLMoeConfig — loads cleanly

assert type(cfg).__name__ == "Ernie4_5_VLMoeConfig"
assert cfg.text_config.rope_parameters["rope_theta"] == 500000
assert cfg.text_config.moe_num_experts == [64, 64]

Related

…L MoE models

Three issues prevented AutoConfig from loading baidu/ERNIE-4.5-VL-28B-A3B-Paddle:

1. model_type mismatch: the published checkpoint uses "ernie4_5_moe_vl" but
   transformers registers the class as "ernie4_5_vl_moe". Add "ernie4_5_moe_vl"
   alias in SPECIAL_MODEL_TYPE_TO_MODULE_NAME, CONFIG_MAPPING_NAMES, and
   MODEL_NAMES_MAPPING so AutoConfig resolves it to Ernie4_5_VLMoeConfig.

2. rope_theta validation failure: PreTrainedConfig.__init__ only triggered
   convert_rope_params_to_dict when rope_theta was present in **kwargs, but
   Ernie4_5Config.__init__ consumes rope_theta as a named parameter before
   calling super().__init__(). Also check getattr(self, "rope_theta", None)
   so the RoPE standardization path fires correctly.

3. moe_num_experts type error: Ernie4_5_VLMoeTextConfig declared the field as
   int | None but the checkpoint supplies a list [64, 64] for per-layer expert
   counts. Widen the type to int | list[int] | None.
@avarga1 avarga1 force-pushed the fix/ernie4-5-vl-moe-config-loading branch from 98ad854 to 1219f83 on April 7, 2026 02:56
@zucchini-nlp
Member

We support the model without trust_remote_code=True though, is there any reason you want to load with custom code?

@avarga1
Author

avarga1 commented Apr 7, 2026

Fair point — trust_remote_code=True was just leftover from my debugging session and shouldn't have been in the verification snippet. The three bugs are all in transformers' native code (auto-mapping alias, PreTrainedConfig rope_theta path, and the config dataclass type annotation) — none of them require remote code to reproduce.

Updated the snippet:

from transformers import AutoConfig
cfg = AutoConfig.from_pretrained("baidu/ERNIE-4.5-VL-28B-A3B-Paddle")
assert type(cfg).__name__ == "Ernie4_5_VLMoeConfig"
assert cfg.text_config.rope_parameters["rope_theta"] == 500000
assert cfg.text_config.moe_num_experts == [64, 64]

Also just pushed a fix for the check_repository_consistency CI failure — the moe_num_experts type override needed to be in the modular source file (modular_ernie4_5_vl_moe.py), not just the generated config.
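
Roughly what that override looks like in the modular source (a sketch; field and class names are taken from this PR's diff, and the import path is assumed):

# modular_ernie4_5_vl_moe.py -- sketch of the override only
from transformers import Ernie4_5_MoeConfig  # assumed top-level export

class Ernie4_5_VLMoeTextConfig(Ernie4_5_MoeConfig):
    # Widened here in the modular source so the auto-generated
    # configuration_ernie4_5_vl_moe.py is regenerated with the same type.
    moe_num_experts: int | list[int] | None = 64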

("ernie", "ErnieConfig"),
("ernie4_5", "Ernie4_5Config"),
("ernie4_5_moe", "Ernie4_5_MoeConfig"),
("ernie4_5_moe_vl", "Ernie4_5_VLMoeConfig"),
Member

@vasqu for this, I remember you were changing model types recently

Contributor

Yea this looks like the only valid change IF we change the config model type of the hub PRs - this would imply that we support 2 model types for one model which is not in our code base, i.e. a dirty workaround.

Imo, we should sync with vLLM support first / change the model type there. But that needs v5 support first, so I'd like to withhold on this PR for now and potentially "fix" on vLLM side instead

Comment thread src/transformers/configuration_utils.py Outdated
Comment on lines +268 to +270
-elif kwargs.get("rope_scaling") and kwargs.get("rope_theta"):
+elif kwargs.get("rope_scaling") and (
+    kwargs.get("rope_theta") or getattr(self, "rope_theta", None) is not None
+):
Member

the rope is set to 500k even without this line. I see that the text config class has a rope_parameters field with default None so we will go by the first if path

@avarga1
Author

avarga1 commented Apr 7, 2026

Good catch — reverted. The rope_theta path is already handled via rope_parameters on the text config, so that change was unnecessary. PR is now just the two remaining fixes: the model_type alias and the moe_num_experts type widening.

@github-actions
Contributor

github-actions Bot commented Apr 7, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, ernie4_5_vl_moe

@avarga1
Author

avarga1 commented Apr 7, 2026

The tests_processors failure (AttributeError: NewTokenizer has no attribute special_attribute_present) appears to be a pre-existing flaky test unrelated to this PR — it's in test_processor_auto.py::AutoFeatureExtractorTest::test_from_pretrained_dynamic_processor and involves dynamic Hub tokenizer registration, which this PR doesn't touch. Happy to rerun if needed.

@github-actions
Contributor

github-actions Bot commented Apr 7, 2026

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45275&sha=c55854

@zucchini-nlp
Member

Thanks, let's wait for @vasqu who added the model and has more context on the recent naming changes

Contributor

@vasqu vasqu left a comment

I would like to wait for v5 getting into vLLM and then see how we go about it + adjust the config for the integration there. As of now, it does not make much sense to me to have this merged as we need to rely on revisions either way and these are possible without this

Comment on lines 120 to +152
@@ -149,6 +149,7 @@ class Ernie4_5_VLMoeTextConfig(Ernie4_5_MoeConfig):
 pad_token_id: int | None = None
 eos_token_id: int | list[int] | None = None
 bos_token_id: int | None = None
+moe_num_experts: int | list[int] | None = 64
Contributor

This is still for remote only, no? The transformers version should not have a list for these as they are always the same size

Member

@zucchini-nlp zucchini-nlp Apr 9, 2026

nope, this one is valid for strict-type validation 😓

update: actually no, hf-converted configs won't have list of ints (see https://huggingface.co/baidu/ERNIE-4.5-VL-28B-A3B-PT/discussions/11/files)

Contributor

discussed internally, it's only the remote configs as well

("ernie", "ErnieConfig"),
("ernie4_5", "Ernie4_5Config"),
("ernie4_5_moe", "Ernie4_5_MoeConfig"),
("ernie4_5_moe_vl", "Ernie4_5_VLMoeConfig"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea this looks like the only valid change IF we change the config model type of the hub PRs - this would imply that we support 2 model types for one model which is not in our code base, i.e. a dirty workaround.

Imo, we should sync with vLLM support first / change the model type there. But that needs v5 support first, so I'd like to withhold on this PR for now and potentially "fix" on vLLM side instead

@avarga1
Author

avarga1 commented Apr 9, 2026

Thanks, that makes sense.

I agree that if this only applies to remote configs and the cleaner fix belongs on the vLLM / hub side after v5 support lands, then it probably shouldn't be forced into core Transformers right now.

I opened this mainly because the current loading path exposed a few compounding mismatches, but I'm happy to defer if the right place to resolve them is upstream in the integration flow instead of here.

If helpful, I can narrow this PR to only the change(s) that are still considered valid, or close it and revisit once the vLLM side is aligned.

@vasqu
Contributor

vasqu commented Apr 9, 2026

Hey @avarga1, is there anything needed from this PR then?

Everything should work fine as long as you don't use trust_remote_code=True and pass the correct revision (see the docs for usage examples). Let us know if there is indeed something breaking. It is indeed smarter to wait for now imo

@avarga1
Author

avarga1 commented Apr 9, 2026

Makes sense — I ran into this while integrating a model I'm training locally and trust_remote_code was required for it to load. Happy to keep that workaround on my end for now and revisit properly post-v5. Closing.
