fix(ernie4_5_vl_moe): resolve three config loading failures for ERNIE-4.5-VL MoE models #45275
avarga1 wants to merge 5 commits into huggingface:main
Conversation
Force-pushed from b93e005 to 98ad854
Commit: fix(ernie4_5_vl_moe): resolve three config loading failures for ERNIE-4.5-VL MoE models

Three issues prevented AutoConfig from loading baidu/ERNIE-4.5-VL-28B-A3B-Paddle:

1. model_type mismatch: the published checkpoint uses "ernie4_5_moe_vl" but transformers registers the class as "ernie4_5_vl_moe". Add an "ernie4_5_moe_vl" alias in SPECIAL_MODEL_TYPE_TO_MODULE_NAME, CONFIG_MAPPING_NAMES, and MODEL_NAMES_MAPPING so AutoConfig resolves it to Ernie4_5_VLMoeConfig.
2. rope_theta validation failure: PreTrainedConfig.__init__ only triggered convert_rope_params_to_dict when rope_theta was present in **kwargs, but Ernie4_5Config.__init__ consumes rope_theta as a named parameter before calling super().__init__(). Also check getattr(self, "rope_theta", None) so the RoPE standardization path fires correctly.
3. moe_num_experts type error: Ernie4_5_VLMoeTextConfig declared the field as int | None but the checkpoint supplies a list [64, 64] for per-layer expert counts. Widen the type to int | list[int] | None.
Force-pushed from 98ad854 to 1219f83
We support the model without …
Fair point — updated the snippet:

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("baidu/ERNIE-4.5-VL-28B-A3B-Paddle")
assert type(cfg).__name__ == "Ernie4_5_VLMoeConfig"
assert cfg.text_config.rope_parameters["rope_theta"] == 500000
assert cfg.text_config.moe_num_experts == [64, 64]
```

Also just pushed a fix for the …
| ("ernie", "ErnieConfig"), | ||
| ("ernie4_5", "Ernie4_5Config"), | ||
| ("ernie4_5_moe", "Ernie4_5_MoeConfig"), | ||
| ("ernie4_5_moe_vl", "Ernie4_5_VLMoeConfig"), |
@vasqu for this, I remember you were changing model types recently
Yea this looks like the only valid change IF we change the config model type of the hub PRs - this would imply that we support 2 model types for one model, which is not in our code base, i.e. a dirty workaround.
Imo, we should sync with vLLM support first / change the model type there. But that needs v5 support first, so I'd like to hold off on this PR for now and potentially "fix" it on the vLLM side instead.
```diff
- elif kwargs.get("rope_scaling") and kwargs.get("rope_theta"):
+ elif kwargs.get("rope_scaling") and (
+     kwargs.get("rope_theta") or getattr(self, "rope_theta", None) is not None
+ ):
```
The rope_theta is set to 500k even without this line. I see that the text config class has a rope_parameters field with a default of None, so we go down the first if path.
Good catch — reverted. The …
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, ernie4_5_vl_moe
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45275&sha=c55854
Thanks, let's wait for @vasqu, who added the model and has more context on the recent naming changes.
vasqu left a comment
I would like to wait for v5 getting into vLLM and then see how we go about it + adjust the config for the integration there. As of now, it does not make much sense to me to have this merged, as we need to rely on revisions either way and these are possible without this.
```diff
@@ -149,6 +149,7 @@ class Ernie4_5_VLMoeTextConfig(Ernie4_5_MoeConfig):
     pad_token_id: int | None = None
     eos_token_id: int | list[int] | None = None
     bos_token_id: int | None = None
+    moe_num_experts: int | list[int] | None = 64
```
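As a toy illustration of what the widened annotation buys (a stand-in for the strict dataclass validation only; the real check lives in the `@strict` machinery the configs use, so the helper below is purely illustrative):

```python
# Toy stand-in for strict dataclass field validation (illustrative only;
# the real check is performed by the @strict decorator on the config).
def check(name: str, value, allowed: tuple[type, ...]) -> None:
    if value is not None and not isinstance(value, allowed):
        raise TypeError(f"{name}={value!r} does not match {allowed}")

# Before the fix: only int is allowed, so the checkpoint's per-layer list fails.
try:
    check("moe_num_experts", [64, 64], (int,))
except TypeError as err:
    print(err)  # moe_num_experts=[64, 64] does not match (<class 'int'>,)

# After the fix: int | list[int] | None admits the per-layer list.
check("moe_num_experts", [64, 64], (int, list))  # passes
```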
This is still for remote only, no? The transformers version should not have a list for these as they are always the same size
nope, this one is valid for strict-type validation 😓
update: actually no, hf-converted configs won't have a list of ints (see https://huggingface.co/baidu/ERNIE-4.5-VL-28B-A3B-PT/discussions/11/files)
discussed internally, it's only the remote configs as well
| ("ernie", "ErnieConfig"), | ||
| ("ernie4_5", "Ernie4_5Config"), | ||
| ("ernie4_5_moe", "Ernie4_5_MoeConfig"), | ||
| ("ernie4_5_moe_vl", "Ernie4_5_VLMoeConfig"), |
There was a problem hiding this comment.
Yea this looks like the only valid change IF we change the config model type of the hub PRs - this would imply that we support 2 model types for one model which is not in our code base, i.e. a dirty workaround.
Imo, we should sync with vLLM support first / change the model type there. But that needs v5 support first, so I'd like to withhold on this PR for now and potentially "fix" on vLLM side instead
Thanks, that makes sense. I agree that if this only applies to remote configs and the cleaner fix belongs on the vLLM / hub side after v5 support lands, then it probably shouldn't be forced into core Transformers right now. I opened this mainly because the current loading path exposed a few compounding mismatches, but I'm happy to defer if the right place to resolve them is upstream in the integration flow instead of here. If helpful, I can narrow this PR to only the change(s) that are still considered valid, or close it and revisit once the vLLM side is aligned.
Hey @avarga1, is there anything needed from this PR then? Everything should work fine as long as you don't use …
Makes sense — I ran into this while integrating a model I'm training locally and …
## Problem

`AutoConfig.from_pretrained("baidu/ERNIE-4.5-VL-28B-A3B-Paddle", trust_remote_code=True)` raises errors that prevent the model from loading at all. Three separate bugs compound each other.

### Bug 1 — `model_type` mismatch (KeyError on load)

The published checkpoint uses `"model_type": "ernie4_5_moe_vl"` in its `config.json`, but the transformers class is registered as `"ernie4_5_vl_moe"`. Since there is no `auto_map` in the checkpoint's config, `AutoConfig` hits a `KeyError` and raises.

**Fix:** Add an `"ernie4_5_moe_vl"` alias in `SPECIAL_MODEL_TYPE_TO_MODULE_NAME` (pointing to the `ernie4_5_vl_moe` module), `CONFIG_MAPPING_NAMES`, and `MODEL_NAMES_MAPPING`.
### Bug 2 — `rope_theta` validation skipped (silent misconfiguration)

`PreTrainedConfig.__init__` only triggered `convert_rope_params_to_dict` when `rope_theta` was present in `**kwargs`. However, `Ernie4_5Config.__init__` consumes `rope_theta` as a named parameter (it sets `self.rope_theta = 500000`) before calling `super().__init__(**kwargs)` — so `rope_theta` is never in `kwargs`. The RoPE standardization branch never fires.

**Fix:** Also check `getattr(self, "rope_theta", None) is not None` so the conversion path fires correctly when `rope_theta` was already set as an instance attribute.
### Bug 3 — `moe_num_experts` type too narrow (StrictDataclassFieldValidationError)

`Ernie4_5_VLMoeTextConfig` declares `moe_num_experts: int | None = 64`, but the published checkpoint supplies `"moe_num_experts": [64, 64]` — a per-layer list. The `@strict` dataclass validator rejects the list.

**Fix:** Widen the type annotation to `int | list[int] | None`.

## Verification
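The snippet from the discussion above exercises all three fixes:

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("baidu/ERNIE-4.5-VL-28B-A3B-Paddle")
assert type(cfg).__name__ == "Ernie4_5_VLMoeConfig"
assert cfg.text_config.rope_parameters["rope_theta"] == 500000
assert cfg.text_config.moe_num_experts == [64, 64]
```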
## Related