
[nemotron_h] respect _no_reinit flag on dt_bias and out_proj.weight#45591

Open
vai-minzhou wants to merge 3 commits into huggingface:main from vai-minzhou:fix-nemotronh-init-overwrite

Conversation

@vai-minzhou

Summary

NemotronHPreTrainedModel._init_weights unconditionally overwrites two trained parameters every time it is invoked:

  • NemotronHMamba2Mixer.dt_bias — reset to a fresh inv_softplus(random dt) draw
  • {…}.out_proj.weight — reset to a kaiming-uniform scaled by 1/sqrt(num_hidden_layers)

It sets module.dt_bias._no_reinit = True after the copy, but that flag is only checked in the nn.Linear bias branch of the same function; it is never read for dt_bias itself, and out_proj.weight doesn't set the flag at all.

On transformers>=5.0, _init_weights runs a second time after from_pretrained has finished loading the checkpoint (the post-load pass that initialises tensors still on meta). For NemotronHForCausalLM that silently overwrites the on-disk values for dt_bias and out_proj.weight with fresh random ones, while all other tensors keep their trained values.

The resulting model outputs repetitive filler streams like ` and and and , and and ,` for any input; correct behaviour is preserved only when loading through vLLM (which bypasses _init_weights) or via an older transformers release.
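The flag mechanics described above can be sketched with a minimal stand-in (the `Param` class and dict-based "module" below are hypothetical illustrations, not the upstream code):

```python
import random

class Param:
    """Minimal stand-in for a tensor parameter that can carry ad-hoc flags."""
    def __init__(self, value):
        self.value = value

def init_weights(module):
    # Mirrors the buggy pattern: dt_bias is always overwritten, and the
    # _no_reinit flag is set afterwards but never checked for dt_bias itself.
    module["dt_bias"].value = random.random()
    module["dt_bias"]._no_reinit = True

mixer = {"dt_bias": Param(0.0)}
init_weights(mixer)               # fresh init (intended)
trained = 1.234
mixer["dt_bias"].value = trained  # checkpoint load restores the trained value
init_weights(mixer)               # post-load pass silently overwrites it again
print(mixer["dt_bias"].value == trained)  # → False
```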

Reproduction

import json, pathlib, torch
from safetensors.torch import load_file
from transformers import AutoConfig, AutoModelForCausalLM

path = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"   # any Nemotron-H ckpt (use a local snapshot dir: the index read below opens local files)
cfg = AutoConfig.from_pretrained(path, trust_remote_code=True)
cfg._attn_implementation = "eager"
m = AutoModelForCausalLM.from_pretrained(path, config=cfg, torch_dtype=torch.bfloat16)

idx = json.load(open(pathlib.Path(path) / "model.safetensors.index.json"))["weight_map"]
k = "backbone.layers.0.mixer.dt_bias"
on_disk = load_file(f"{path}/{idx[k]}")[k]
in_mem  = m.backbone.layers[0].mixer.dt_bias
print((on_disk.float() - in_mem.float().cpu()).abs().max().item())
# → ~26.8 before this patch, 0 after

Prompting "Hello, how are you? I am" on an unpatched load returns `' and'`, `' in'`, `' the'`, `' first'`, `','` as the top-5 next tokens, a symptom of Mamba2 with randomised dt_bias and mis-scaled out_proj. After the patch, trained values are preserved and the model generates normally.

The fix

Both changes live in NemotronHPreTrainedModel._init_weights:

  1. dt_bias branch: early-return if dt_bias._no_reinit is already set (the flag is set at the end of the current branch, so the first pass initialises normally and the second pass becomes a no-op).
  2. out_proj.weight branch: skip when p._no_reinit is set, and set p._no_reinit = True after the initial kaiming scale so a second invocation is a no-op.

Fresh-init training is unaffected; only the second (post-load) invocation is made idempotent. The same edit is mirrored into modular_nemotron_h.py and modeling_nemotron_h.py.
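The two guarded branches above can be sketched in isolation (a simplified stand-in, assuming a dict-based "module" and a plain `Param` holder rather than real nn.Parameter objects):

```python
import random

class Param:
    """Minimal stand-in for a parameter that can carry ad-hoc flags."""
    def __init__(self, value):
        self.value = value

def init_weights(module):
    dt_bias = module["dt_bias"]
    # 1. dt_bias branch: skip if the flag from a previous pass is set.
    if not getattr(dt_bias, "_no_reinit", False):
        dt_bias.value = random.random()
        dt_bias._no_reinit = True
    out_proj = module["out_proj_weight"]
    # 2. out_proj.weight branch: same guard, flag set after the first scale.
    if not getattr(out_proj, "_no_reinit", False):
        out_proj.value = random.random()
        out_proj._no_reinit = True

m = {"dt_bias": Param(None), "out_proj_weight": Param(None)}
init_weights(m)              # first pass: initialises and sets the flags
loaded = 1.234
m["dt_bias"].value = loaded  # checkpoint load restores the trained value
init_weights(m)              # second pass is now a no-op
print(m["dt_bias"].value == loaded)  # → True
```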

Test plan

  • Unpatched load: |on_disk - in_mem|.max() for layer-0 dt_bias ≈ 26.8, next-token logits return stop-word garbage.
  • Patched load: diff is 0, next-token logits look sane, eval on our NemotronH-based classifier no longer collapses to 1000/1000 parse failures.
  • CI: run tests/models/nemotron_h/ — no behaviour change for fresh-init, only the idempotence of the re-init pass changes.

Please let me know if you'd like the fix to take a different shape (e.g. short-circuit _init_weights entirely when the module's parameters are all materialised, or move the guard to a shared utility in modeling_utils). Happy to adjust.

_init_weights() on `NemotronHPreTrainedModel` unconditionally overwrites
`dt_bias` (random `inv_softplus(dt)`) and `out_proj.weight` (kaiming_uniform
scaled by 1/sqrt(n_layer)) every time it is invoked on a mamba block.
It sets `module.dt_bias._no_reinit = True` after the copy, but the flag is
never checked by either code path (only the Linear-bias branch reads it).

On transformers>=5.0, `_init_weights` is triggered a second time after
`from_pretrained()` has loaded the checkpoint (the post-load safety pass
that initializes tensors still on `meta`). For `NemotronHForCausalLM`
that silently overwrites the checkpoint values for `dt_bias` and
`out_proj.weight` with fresh random draws. The model then outputs
repetitive stop-word streams like ` and and and and ,` for any input.

Minimal repro with any Nemotron-H checkpoint:

    from transformers import AutoConfig, AutoModelForCausalLM
    from safetensors.torch import load_file
    import json, pathlib

    path = ".../NVIDIA-Nemotron-Cascade-2-30B-A3B-BF16"  # or Nano
    cfg = AutoConfig.from_pretrained(path); cfg._attn_implementation='eager'
    m = AutoModelForCausalLM.from_pretrained(path, config=cfg, torch_dtype='bfloat16')
    idx = json.loads((pathlib.Path(path) / 'model.safetensors.index.json').read_text())['weight_map']
    k = 'backbone.layers.0.mixer.dt_bias'
    on_disk = load_file(f'{path}/{idx[k]}')[k]
    in_mem  = m.backbone.layers[0].mixer.dt_bias
    print((on_disk.float() - in_mem.float().cpu()).abs().max())   # ~26.8

This patch makes `_init_weights` honour `_no_reinit` on both `dt_bias` and
`out_proj.weight` (the only two params that re-init unconditionally), and
sets `_no_reinit = True` on `out_proj.weight` after the initial kaiming
scale so a second pass is a no-op. Ordinary fresh-init training is
unaffected; only the second invocation becomes idempotent.

Signed-off-by: Min Zhou <minzhou@virtueai.com>
@Rocketknight1
Member

Hey, I'm not sure about this PR! We already have the _is_hf_initialized attribute, so I'm worried about the no_reinit flag that does the same thing, even though it seems like it's already in the codebase. Can you dig a little deeper and figure out why we don't just use _is_hf_initialized here?

Per @Rocketknight1's review: replace the ad-hoc `_no_reinit` flag with the
existing `_is_hf_initialized` flag that `from_pretrained` already sets on
checkpoint-loaded parameters. Guard each Mamba2 init target
(A_log / D / dt_bias) and the residual-scaled `out_proj.weight`
independently, so parameters restored from a checkpoint survive any
subsequent `_init_weights` pass.
@vai-minzhou
Author

Thanks for the review! You're right — _is_hf_initialized is the canonical flag and is already set on checkpoint-loaded parameters by from_pretrained. Pushed a new commit that:

  • replaces each _no_reinit check with _is_hf_initialized, guarded per-parameter (A_log, D, dt_bias, and out_proj.weight), so a loaded value survives any subsequent _init_weights pass;
  • drops the _no_reinit = True assignments that this PR previously added (no longer needed — the checkpoint loader already sets the flag).

Left the pre-existing getattr(module.bias, "_no_reinit", False) check on nn.Linear.bias untouched since it's orthogonal to this fix.

Repro that motivated the original bug: with the old code, loading a finetuned NemotronH checkpoint left |diff(loaded, effective)| ≈ 26 on dt_bias and substantial drift on out_proj.weight, because _init_weights ran a second time after the safety pass and drew fresh random values. With this fix those params are byte-identical to the checkpoint.
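The per-parameter guard can be sketched like this (a simplified stand-in: in the real code the flag is set by from_pretrained on nn.Parameter objects, and the names below are illustrative):

```python
class Param:
    """Minimal stand-in for a parameter that can carry the loader flag."""
    def __init__(self, value):
        self.value = value

def init_param(p, fresh_value):
    # Skip any parameter the checkpoint loader has already materialised.
    if getattr(p, "_is_hf_initialized", False):
        return
    p.value = fresh_value

def load_from_checkpoint(p, value):
    # from_pretrained sets this flag on every checkpoint-loaded parameter.
    p.value = value
    p._is_hf_initialized = True

a_log, dt_bias = Param(None), Param(None)
load_from_checkpoint(dt_bias, 1.234)  # dt_bias present in the checkpoint
init_param(a_log, 0.5)                # a_log missing: fresh init proceeds
init_param(dt_bias, 99.0)             # dt_bias loaded: init is a no-op
print(a_log.value, dt_bias.value)     # → 0.5 1.234
```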

Comment on lines 1007 to 1008
Member


The code is still referencing _no_reinit here - can we just replace with _is_hf_initialized everywhere in the file, or will that cause problems?

@vai-minzhou
Author

Just pushed 5aee35f that swaps the remaining _no_reinit getter on nn.Linear.bias to _is_hf_initialized. Confirmed safe — after my earlier patch removed the only _no_reinit = True assignment, that flag is set nowhere in the repo (grep -rn "_no_reinit *= *True" . returns nothing), so the old check was effectively dead. Using _is_hf_initialized makes the file consistent and gains a small upside: a checkpoint-loaded bias (rare for the layers in question, but possible) now survives a re-init pass.

@github-actions
Contributor

github-actions Bot commented May 1, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: nemotron_h
