:rotating_light: Validate config attributes by zucchini-nlp · Pull Request #41250 · huggingface/transformers

zucchini-nlp · 2025-10-01T11:50:47Z

What does this PR do?

As per title. Continues from #40793 and supersedes #36534

NOTE: config classes can't accept positional args anymore! I don't think anyone would use pos args anyway but marring the PR as breaking

Note

High Risk
Refactors PreTrainedConfig and many model config classes to @dataclass + huggingface_hub @strict validation, which can change initialization/serialization behavior and reject previously-accepted configs. Also enforces save-time validation and updates defaults/deprecations (e.g., use_return_dict), risking backward-compatibility across model loading and downstream integrations.

Overview
Adds strict config validation. PreTrainedConfig is converted to a @dataclass with huggingface_hub’s @strict, introduces built-in validators (architecture consistency, special token id ranges, layer type checks, output_attentions vs attn_implementation), and runs validate() automatically on save_pretrained.

Modernizes and standardizes model configs. Many model configuration classes are migrated from custom __init__ logic to dataclass fields + __post_init__, moving compatibility logic (e.g., defaulting sub-configs, key/value casting for JSON) into post-init and adding model-specific validate_architecture where needed.

API/behavior tweaks. Deprecates use_return_dict in favor of return_dict (and updates multiple model forward paths accordingly), adjusts RoPE validation ignore-key handling, narrows AutoTokenizer fallback exception handling, and bumps the minimum huggingface-hub requirement to >=1.5.0.

^{Written by Cursor Bugbot for commit 07095f3. This will update automatically on new commits. Configure here.}

HuggingFaceDocBuilderDev · 2025-10-01T11:59:43Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

zucchini-nlp · 2025-12-10T13:12:32Z

Blocked by #41541 (comment) for now

zucchini-nlp · 2026-02-03T13:12:25Z

Tieme to revive this branch

…mote code, tmrw

zucchini-nlp · 2026-02-04T17:05:43Z

Nice, much better and easy to maintain BC with remote code now!

ArthurZucker

Very very nice!

ArthurZucker · 2026-02-05T07:51:48Z

+        # Keys are always strings in JSON so convert ids to int here for id2label and pruned_heads
+        if self.id2label is None:
+            self._create_id_label_maps(kwargs.get("num_labels", 2))
+        else:
+            if kwargs.get("num_labels") is not None and len(self.id2label) != kwargs.get("num_labels"):
+                logger.warning(
+                    f"You passed `num_labels={kwargs.get('num_labels')}` which is incompatible to "
+                    f"the `id2label` map of length `{len(self.id2label)}`."
+                )
+            self.id2label = {int(key): value for key, value in self.id2label.items()}


is it a good time to get rid of these general attributes and only have them for models that actually require them?

@strict

Vendor the HyperCLOVAX Vision config into vLLM to fix transformers v5 compatibility. The upstream remote code config does not handle empty initialization (text_config=None), which breaks v5's @strict config validation added in huggingface/transformers#41250. Fixes: vllm-project#38387 TODO: Remove vendored config once HyperCLOVAX is upstreamed to transformers. Tracking PR: huggingface/transformers#44956 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@strict

Vendor the HyperCLOVAX Vision config into vLLM to fix transformers v5 compatibility. The upstream remote code config does not handle empty initialization (text_config=None), which breaks v5's @strict config validation added in huggingface/transformers#41250. Fixes: vllm-project#38387 TODO: Remove vendored config once HyperCLOVAX is upstreamed to transformers. Tracking PR: huggingface/transformers#44956 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Fang Han <fhan0520@gmail.com>

AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-0.8B", dtype=torch.float32) on transformers v5.5.4 returns bfloat16 tensors for hidden_states because since v5.4.0 (huggingface/transformers#41250), the `dtype` keyword argument is consumed by AutoConfig and not forwarded to the model when the config has a nested text_config. Weights then load from the checkpoint in bfloat16, and every hidden_state comes back bf16. Two changes to test_qwen3_5: 1. Use `.float()` before `np.allclose` so it doesn't raise `TypeError: Got unsupported ScalarType BFloat16`, matching the existing pattern in tests/tx/models/test_qwen3.py. 2. Loosen the per-layer and final hidden_states tolerances from `rtol=1e-3, atol=1e-3` to `rtol=5e-3, atol=5e-3`. The HF reference is effectively bfloat16 at those checkpoints, so the prior float32-scale tolerance was unachievable — the observed layer-1 drift was ~3.6e-3 on a value of ~0.34. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-0.8B", dtype=torch.float32) on transformers v5.6.0 returns bfloat16 tensors for hidden_states because since v5.4.0 (huggingface/transformers#41250), the `dtype` keyword argument is consumed by AutoConfig and not forwarded to the model when the config has a nested text_config. Weights then load from the checkpoint in bfloat16, and every hidden_state comes back bf16. Two changes to test_qwen3_5: 1. Use `.float()` before `np.allclose` so it doesn't raise `TypeError: Got unsupported ScalarType BFloat16`, matching the existing pattern in tests/tx/models/test_qwen3.py. 2. Loosen the per-layer and final hidden_states tolerances from `rtol=1e-3, atol=1e-3` to `rtol=5e-3, atol=5e-3`. The HF reference is effectively bfloat16 at those checkpoints, so the prior float32-scale tolerance was unachievable — the observed layer-1 drift was ~3.6e-3 on a value of ~0.34. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-0.8B", dtype=torch.float32) on transformers>=5.4.0 returns bfloat16 hidden_states because the `dtype` keyword argument is consumed by AutoConfig and not forwarded to the model when the config has a nested `text_config` — and Qwen3.5 does — per huggingface/transformers#41250. Weights then load from the checkpoint in bfloat16 and the model runs that way end-to-end, diverging from the fp32 JAX model by ~6% at the final hidden state on the 0.8B checkpoint — too much even for the prior rtol=2e-2. Drop the silently-ignored `dtype=torch.float32` kwarg and chain `.float()` on the loaded model so the HF reference actually runs in fp32, mirroring the working pattern in test_qwen3.py (plain Qwen3 has no nested text_config, so the kwarg is still honored there). Tolerances return to rtol=1e-3, atol=1e-3. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-0.8B", dtype=torch.float32) on transformers>=5.4.0 returns bfloat16 hidden_states because the `dtype` keyword argument is consumed by AutoConfig and not forwarded to the model when the config has a nested `text_config` — and Qwen3.5 does — per huggingface/transformers#41250. Weights then load from the checkpoint in bfloat16 and the model runs that way end-to-end, diverging from the fp32 JAX model by ~6% at the final hidden state on the 0.8B checkpoint — too much even for the prior rtol=2e-2. Drop the silently-ignored `dtype=torch.float32` kwarg and chain `.float()` on the loaded model so the HF reference actually runs in fp32. Also loosen the final-hidden-state tolerance to rtol=atol=4e-3: the .float() cast restores fp32 inference, but Qwen3.5's gated-delta-rule layers (exp/cumsum/tril) still accumulate more JAX-vs-PyTorch fp32 rounding across the stack than plain attention does, so the outlier element at the last layer exceeds 1e-3 on CI (Ubuntu/MKL) even when it passes locally. Earlier-layer assertions stay at their original tolerances. No matching change is needed in test_qwen3.py: Qwen3Config has no nested text_config (get_text_config() returns self), so the `dtype=torch.float32` kwarg is still forwarded to the model there, and that path is actually preferable to .float()-after-load because it allocates weights directly in fp32 instead of loading bf16 and up-casting. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-0.8B", dtype=torch.float32) on transformers>=5.4.0 returns bfloat16 hidden_states because the `dtype` keyword argument is consumed by AutoConfig and not forwarded to the model when the config has a nested `text_config` — and Qwen3.5 does — per huggingface/transformers#41250. Weights then load from the checkpoint in bfloat16 and the model runs that way end-to-end, diverging from the fp32 JAX model by ~6% at the final hidden state on the 0.8B checkpoint — too much even for the prior rtol=2e-2. Drop the silently-ignored `dtype=torch.float32` kwarg and chain `.float()` on the loaded model so the HF reference actually runs in fp32. Also loosen the final-hidden-state tolerance to rtol=atol=1e-2: the .float() cast restores fp32 inference, but Qwen3.5's gated-delta-rule layers (exp/cumsum/tril) still accumulate more JAX-vs-PyTorch fp32 rounding across the stack than plain attention does, so the outlier element at the last layer reaches ~7e-3 on CI even though it passes at a tighter bound locally. Earlier-layer assertions stay at their original tolerances. No matching change is needed in test_qwen3.py: Qwen3Config has no nested text_config (get_text_config() returns self), so the `dtype=torch.float32` kwarg is still forwarded to the model there, and that path is actually preferable to .float()-after-load because it allocates weights directly in fp32 instead of loading bf16 and up-casting. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-0.8B", dtype=torch.float32) on transformers>=5.4.0 returns bfloat16 hidden_states because the `dtype` keyword argument is consumed by AutoConfig and not forwarded to the model when the config has a nested `text_config` — and Qwen3.5 does — per huggingface/transformers#41250. Weights then load from the checkpoint in bfloat16 and the model runs that way end-to-end, diverging from the fp32 JAX model by ~6% at the final hidden state on the 0.8B checkpoint — too much even for the prior rtol=2e-2. Drop the silently-ignored `dtype=torch.float32` kwarg and chain `.float()` on the loaded model so the HF reference actually runs in fp32. Also loosen the final-hidden-state tolerance to rtol=atol=2e-2: the .float() cast restores fp32 inference, but Qwen3.5's gated-delta-rule layers (exp/cumsum/tril) still accumulate more JAX-vs-PyTorch fp32 rounding across the stack than plain attention does, and the CI outlier exceeds 1e-2 even though local runs fit a tighter bound. The final assertion now also prints the worst-element signed diff on failure so future drift is diagnosable without a local repro. Earlier-layer assertions stay at their original tolerances. No matching change is needed in test_qwen3.py: Qwen3Config has no nested text_config (get_text_config() returns self), so the `dtype=torch.float32` kwarg is still forwarded to the model there, and that path is actually preferable to .float()-after-load because it allocates weights directly in fp32 instead of loading bf16 and up-casting. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drop support for the `transformers` 4.x line and raise the minimum supported version to 5.4.0 in `pyproject.toml`. Why: - Upstream `transformers` switched `PreTrainedConfig` to dataclass in v5.4.0 (huggingface/transformers#41250). - Keeping 4.x would preserve incompatible behavior for this migration target. BREAKING CHANGE: `transformers` 4.x is no longer supported; upgrade to 5.4.0 or later.

AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-0.8B", dtype=torch.float32) on transformers>=5.4.0 returns bfloat16 hidden_states because the `dtype` keyword argument is consumed by AutoConfig and not forwarded to the model when the config has a nested `text_config` — and Qwen3.5 does — per huggingface/transformers#41250. Weights then load from the checkpoint in bfloat16 and the model runs that way end-to-end, diverging from the fp32 JAX model by ~6% at the final hidden state on the 0.8B checkpoint — too much even for the prior rtol=2e-2. Drop the silently-ignored `dtype=torch.float32` kwarg and chain `.float()` on the loaded model so the HF reference actually runs in fp32. Also loosen the final-hidden-state tolerance to rtol=atol=2e-2: the .float() cast restores fp32 inference, but Qwen3.5's gated-delta-rule layers (exp/cumsum/tril) still accumulate more JAX-vs-PyTorch fp32 rounding across the stack than plain attention does, and the CI outlier exceeds 1e-2 even though local runs fit a tighter bound. The final assertion now also prints the worst-element signed diff on failure so future drift is diagnosable without a local repro. Earlier-layer assertions stay at their original tolerances. No matching change is needed in test_qwen3.py: Qwen3Config has no nested text_config (get_text_config() returns self), so the `dtype=torch.float32` kwarg is still forwarded to the model there, and that path is actually preferable to .float()-after-load because it allocates weights directly in fp32 instead of loading bf16 and up-casting. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

initial commit

2ecd9f5

zucchini-nlp added 9 commits October 1, 2025 19:00

just push for now

5758b7c

maybe not do it for all models, lets see how many models fail now

e9f940b

update

684e799

lets see what esle fails now

43eb47c

nit

868cac6

merge main

b7db732

style

17739ff

delete rope validation

19a2ba1

bart

6095a39

ArthurZucker reviewed Dec 10, 2025

View reviewed changes

Comment thread src/transformers/models/bart/configuration_bart.py Outdated

zucchini-nlp added 5 commits February 3, 2026 14:29

rebase

26892c1

make style

bfe2998

provate rope valid for now, hub complains

d039be1

more updates

a82d894

i love backwards compatibility! Let's check if this will work with re…

b7b0492

…mote code, tmrw

zucchini-nlp mentioned this pull request Feb 4, 2026

Pass kwargs to post init in dataclasses huggingface/huggingface_hub#3771

Merged

pin hf hub 1.4.0

b9aec45

ArthurZucker approved these changes Feb 5, 2026

View reviewed changes

zucchini-nlp added 7 commits February 6, 2026 15:30

merge main

7edc1a2

want to check tests

40d2128

why do we even keep use_return_dict from 6 hyear ago?

e241202

special eos token can be a list in many cases, fix type hints

7b24e38

batch

b79200f

batch

b4e93e3

batch

f011bd4

Maximellerbach mentioned this pull request Mar 27, 2026

fix(deps): breaking change from transformers 5.4.0 huggingface/lerobot#3231

Merged

HanFa mentioned this pull request Mar 27, 2026

[Transformers v5] HCXVisionForCausalLM vllm-project/vllm#38387

Open

HanFa mentioned this pull request Mar 29, 2026

[Transformers v5] Vendor HCXVisionConfig for compatibility vllm-project/vllm#38447

Open

NanoCode012 mentioned this pull request Mar 30, 2026

feat: move to uv first axolotl-ai-cloud/axolotl#3545

Merged

2 tasks

Rocketknight1 mentioned this pull request Mar 30, 2026

v5.4.0 breaks PretrainedConfig field in pydantic model #45070

Open

4 tasks

hmellor mentioned this pull request Apr 1, 2026

[Transformers v5] SarvamMLAForCausalLM vllm-project/vllm#38734

Open

Zelys-DFKH mentioned this pull request Apr 2, 2026

[Transformers v5] Add SarvamMLAConfig to fix SarvamMLAForCausalLM (#38734) vllm-project/vllm#38767

Closed

Vikrantpalle mentioned this pull request Apr 2, 2026

Fix sarvam forward compatibility with transformers v5 vllm-project/vllm#38804

Open

5 tasks

SurbhiJainUSC mentioned this pull request Apr 3, 2026

Update generated requirements for pre-training with JAX 0.9.2 AI-Hypercomputer/maxtext#3403

Merged

4 tasks

erictang000 mentioned this pull request Apr 3, 2026

[dependencies] Upgrade transformers to >=5.0.0,<=5.3.0 NovaSky-AI/SkyRL#1426

Merged

chernistry mentioned this pull request Apr 6, 2026

[Transformers v5] Fix SarvamMLAForCausalLM config vllm-project/vllm#39105

Closed

sharonyu-115 mentioned this pull request Apr 11, 2026

Qwen3_5MoeVisionConfig missing deepstack_visual_indexes field — silently dropped by @strict #45375

Open

4 tasks

Rocketknight1 mentioned this pull request Apr 13, 2026

fix(config): add deepstack_visual_indexes to Qwen3_5MoeVisionConfig #45379

Open

zovonoir mentioned this pull request Apr 28, 2026

Fix Qwen3.5 model config type error ROCm/ATOM#655

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

🚨 Validate config attributes#41250

🚨 Validate config attributes#41250
zucchini-nlp merged 114 commits intohuggingface:mainfrom
zucchini-nlp:config-validation

zucchini-nlp commented Oct 1, 2025 •

edited by cursor Bot

Loading

Uh oh!

HuggingFaceDocBuilderDev commented Oct 1, 2025

Uh oh!

Uh oh!

zucchini-nlp commented Dec 10, 2025

Uh oh!

zucchini-nlp commented Feb 3, 2026

Uh oh!

zucchini-nlp commented Feb 4, 2026

Uh oh!

ArthurZucker left a comment

Uh oh!

ArthurZucker Feb 5, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Conversation

zucchini-nlp commented Oct 1, 2025 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Uh oh!

HuggingFaceDocBuilderDev commented Oct 1, 2025

Uh oh!

Uh oh!

zucchini-nlp commented Dec 10, 2025

Uh oh!

zucchini-nlp commented Feb 3, 2026

Uh oh!

zucchini-nlp commented Feb 4, 2026

Uh oh!

ArthurZucker left a comment

Choose a reason for hiding this comment

Uh oh!

ArthurZucker Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

zucchini-nlp commented Oct 1, 2025 •

edited by cursor Bot

Loading