Cumulative defect fixes from recent Transformers PRs #41
`flash_attention_forward` unconditionally called `s_aux.to(query.dtype)`, which crashed with `AttributeError: 'NoneType' object has no attribute 'to'` for models that don't use attention sinks (e.g. Gemma). Mirrors the parallel guard added in huggingface#40434 for `flash_paged.py`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
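As a minimal sketch, the guard looks like the following, assuming a heavily simplified `flash_attention_forward` (the real signature in transformers takes key/value tensors and many more arguments):

```python
import torch

def flash_attention_forward(query: torch.Tensor, s_aux=None, **kwargs):
    # Simplified excerpt for illustration only. Before the fix,
    # s_aux.to(query.dtype) ran unconditionally and raised
    # AttributeError: 'NoneType' object has no attribute 'to'
    # for models without attention sinks (s_aux is None, e.g. Gemma).
    if s_aux is not None:
        s_aux = s_aux.to(query.dtype)
    # ... rest of the flash-attention computation elided ...
```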
`_init_weights()` on `NemotronHPreTrainedModel` unconditionally overwrites `dt_bias` (random `inv_softplus(dt)`) and `out_proj.weight` (kaiming_uniform scaled by `1/sqrt(n_layer)`) every time it is invoked on a mamba block. It sets `module.dt_bias._no_reinit = True` after the copy, but the flag is never checked by either code path (only the Linear-bias branch reads it).

On transformers>=5.0, `_init_weights` is triggered a second time after `from_pretrained()` has loaded the checkpoint (the post-load safety pass that initializes any tensors still left on the `meta` device). For `NemotronHForCausalLM` this silently overwrites the checkpoint values of `dt_bias` and `out_proj.weight` with fresh random draws. The model then outputs repetitive stop-word streams like ` and and and and ,` for any input.
Minimal repro with any Nemotron-H checkpoint:

```python
from transformers import AutoConfig, AutoModelForCausalLM
from safetensors.torch import load_file
import json, pathlib

path = ".../NVIDIA-Nemotron-Cascade-2-30B-A3B-BF16"  # or Nano
cfg = AutoConfig.from_pretrained(path)
cfg._attn_implementation = "eager"
m = AutoModelForCausalLM.from_pretrained(path, config=cfg, torch_dtype="bfloat16")

# Compare the on-disk checkpoint value of dt_bias with what actually
# ended up in the loaded model.
idx = json.loads((pathlib.Path(path) / "model.safetensors.index.json").read_text())["weight_map"]
k = "backbone.layers.0.mixer.dt_bias"
on_disk = load_file(f"{path}/{idx[k]}")[k]
in_mem = m.backbone.layers[0].mixer.dt_bias
print((on_disk.float() - in_mem.float().cpu()).abs().max())  # ~26.8 instead of 0
```
This patch makes `_init_weights` honour `_no_reinit` on both `dt_bias` and `out_proj.weight` (the only two params that re-init unconditionally), and sets `_no_reinit = True` on `out_proj.weight` after the initial kaiming scale, so a second pass is a no-op. Ordinary fresh-init training is unaffected; the change only makes repeated invocations idempotent.
Signed-off-by: Min Zhou <minzhou@virtueai.com>
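A minimal, self-contained sketch of the guarded re-init described above, using a toy module (names like `ToyMixer` and `init_mixer_weights` are illustrative, not the actual transformers code):

```python
import math
import torch
import torch.nn as nn

class ToyMixer(nn.Module):
    """Stand-in for the mamba mixer: only the two affected parameters."""
    def __init__(self, d_inner=8, d_model=4):
        super().__init__()
        self.dt_bias = nn.Parameter(torch.empty(d_inner))
        self.out_proj = nn.Linear(d_inner, d_model, bias=False)

def init_mixer_weights(module, n_layer=2):
    # dt_bias: random draw mapped through the inverse of softplus,
    # skipped when a previous pass already flagged the tensor.
    if not getattr(module.dt_bias, "_no_reinit", False):
        dt = torch.empty_like(module.dt_bias).uniform_(1e-3, 0.1)
        inv_softplus = dt + torch.log(-torch.expm1(-dt))
        with torch.no_grad():
            module.dt_bias.copy_(inv_softplus)
        module.dt_bias._no_reinit = True
    # out_proj: kaiming_uniform scaled by 1/sqrt(n_layer), then flagged
    # so the post-load safety pass becomes a no-op.
    if not getattr(module.out_proj.weight, "_no_reinit", False):
        nn.init.kaiming_uniform_(module.out_proj.weight, a=math.sqrt(5))
        with torch.no_grad():
            module.out_proj.weight /= math.sqrt(n_layer)
        module.out_proj.weight._no_reinit = True

m = ToyMixer()
init_mixer_weights(m)            # fresh init (e.g. training from scratch)
snapshot = m.dt_bias.detach().clone()
init_mixer_weights(m)            # second pass (post-load safety pass)
assert torch.equal(snapshot, m.dt_bias.detach())  # values survive
```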
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
# Conflicts:
#	src/transformers/loss/loss_utils.py
Direct merge conflicted after Trainer refactors; applied the minimal config-saving change from 57cb2b9.
All-defects flow status

Processed terminal records: 500
Merged / applied defect records: 204
Rejected / not included records: 296
Cumulative defect fixes from recent Transformers PRs
This PR is generated by the all-defects mergeability flow. It accumulates defect-fix PRs from huggingface/transformers that could be applied cleanly to the current base (flow: `all-defects`; base: `evalstate/transformers:main` at `ae9e74bc2a`).

Status counts
Category counts
Validation
Each applied defect fix was followed by the configured lightweight validation profile:

- `compileall -q src/transformers`
- `utils/checkers.py` (ruff_check, ruff_format, init_isort, sort_auto_mappings)
- `utils/tests_fetcher.py ... && pytest ...` when impacted pytest targets are selected

Note: this is intentionally not an end-to-end or slow-test validation pass.
Details
A detailed status table is posted as a PR comment and is also available locally in:
- `.mergeability/defect-merge-state.jsonl`
- `.mergeability/pr-classifications.jsonl`
- `all-defects-report.md`