fix: nemotron_nano_9b_squad checkpoint robustness thresholds #1943
Closed
Nemotron-Nano-9B-v2 is a Mamba-hybrid (NemotronH); its Mamba mixer
save/reload path is not bit-exact, so Phase 3 KL is ~1.6e-3 (>0).
Add kl_threshold=5e-3, widen timeout for Phase 4's 9B vanilla-HF load,
pass trust_remote_code=True so the HF load uses the model's own
configuration_nemotron_h.py (transformers 5.5.4's builtin parser
raises KeyError('-') on the model's hybrid_override_pattern), and
set no_check_resume to match the existing nemotron_nano_8b_v1_squad
precedent for Mamba-hybrid resume non-determinism.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
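The `KeyError('-')` described in the commit message is the usual failure mode of a per-character lookup table that has no entry for one of the pattern's characters. A minimal sketch of that mechanism follows. It is not the transformers implementation: the table `LAYER_TYPES`, the functions `pattern_to_list` and `try_parse`, and the sample pattern strings are all hypothetical and chosen only for illustration.

```python
# Hypothetical per-character lookup table; it deliberately has no entry for "-".
LAYER_TYPES = {"M": "mamba", "*": "attention"}


def pattern_to_list(pattern: str) -> list[str]:
    """Map each pattern character to a layer type by direct dict indexing."""
    return [LAYER_TYPES[ch] for ch in pattern]


def try_parse(pattern: str) -> str:
    """Describe either the parsed layer list or the KeyError that was raised."""
    try:
        return f"ok: {pattern_to_list(pattern)}"
    except KeyError as exc:
        # str() of a KeyError is the repr of the missing key.
        return f"KeyError: {exc}"


if __name__ == "__main__":
    print(try_parse("M*M"))   # every character is in the table
    print(try_parse("M-M*"))  # "-" is missing from the table
```

A parser that knows every character of the pattern avoids the error, which is what loading the repository's own config class with `trust_remote_code=True` is reported to achieve here.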
This was referenced Apr 21, 2026
Superseded by #1971 (batched pipeline-48953745 fixes). Threshold/flag change for this recipe is included in that PR.
akoumpa added a commit that referenced this pull request on Apr 22, 2026
…PRs) (#1971)

* fix: unify hf_kl_threshold to 1e-1 for v5.5 transformers Phase 4 drift (pipeline 48953745)

Transformers v5.5 (#1734) introduced small forward-pass changes in Llama/Gemma/Qwen that widen the observed Phase 4 HF KL without changing correctness. Four recipes in pipeline 48953745 were failing the pre-existing tight bounds for this reason; authors opened separate PRs with per-recipe thresholds. Unify the bound at 1e-1 so the whole family passes under one policy.

Observed Phase 4 KLs on the current nightly sqsh (automodel_nightly_21-4-2026.sqsh) for reference:
- gemma_3_270m_squad : 2.91e-2 (was 6e-3)
- gemma_3_270m_squad_peft : 1.68e-2 (was 8e-3)
- qwen3_moe_30b_hellaswag : 2.43e-2 (was 1e-2)
- customizer_nemotron_nano_full_sft_chat: already 1e-1 (was 7e-2)

All comfortably under the new 1e-1 bound (3-4x margin on the tightest).

Supersedes #1932, #1933, #1939, #1942.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* fix: ckpt-robustness Phase 3 / resume / GBS-divisibility fixes for pipeline 48953745

Adds known-good test-harness flags for model families where the checkpoint robustness test was failing for reasons other than Phase 4 threshold drift:

nemotron_nano_9b_squad{,_peft} (Mamba hybrid):
- dist_env.timeout_minutes: 1 -> 20 (short timeout triggered on slow init)
- ci.checkpoint_robustness.kl_threshold: 5e-3 (Phase 3 non-det from Mamba SSM state under FSDP all-reduce)
- ci.checkpoint_robustness.trust_remote_code: true
- ci.checkpoint_robustness.no_check_resume: true (Mamba resume non-det)

ministral3_3b_squad{,_peft} (FP8 + FSDP2):
- ci.checkpoint_robustness.kl_threshold: 5e-3 (FP8 scalar scale params under FSDP2 aren't losslessly round-trippable)
- ci.checkpoint_robustness.no_check_resume: true

nemotron_super_v3_hellaswag (multi-node DP=32):
- ci.checkpoint_robustness.step_scheduler.global_batch_size: 64 (prior gbs wasn't divisible by DP=32)

Supersedes #1943, #1944, #1946, #1947, #1949.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* Apply suggestions from code review

Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>

---------

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
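The "3-4x margin on the tightest" figure in the commit message above can be checked with a few lines of arithmetic. The sketch below uses only the observed Phase 4 KLs and the 1e-1 bound quoted there; the names `OBSERVED_KL`, `UNIFIED_THRESHOLD`, and `margins` are introduced here for illustration and do not come from the repository.

```python
# Observed Phase 4 KLs quoted in the commit message (recipe name -> max KL).
OBSERVED_KL = {
    "gemma_3_270m_squad": 2.91e-2,
    "gemma_3_270m_squad_peft": 1.68e-2,
    "qwen3_moe_30b_hellaswag": 2.43e-2,
}
UNIFIED_THRESHOLD = 1e-1


def margins(observed: dict[str, float], threshold: float) -> dict[str, float]:
    """Return threshold / observed KL per recipe; larger means more headroom."""
    return {name: threshold / kl for name, kl in observed.items()}


if __name__ == "__main__":
    ranked = sorted(margins(OBSERVED_KL, UNIFIED_THRESHOLD).items(), key=lambda kv: kv[1])
    for name, ratio in ranked:
        print(f"{name}: {ratio:.2f}x")
```

The smallest ratio belongs to `gemma_3_270m_squad` (about 3.4x), which is the recipe the unified bound constrains most tightly.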
akoumpa added a commit that referenced this pull request on Apr 22, 2026
…es 9 PRs) (1971)` into `r0.4.0` (#1979)

fix: batch ckpt-robustness fixes for pipeline 48953745 (supersedes 9 PRs) (#1971)

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Adil <47084919+adil-a@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
linnanwang pushed a commit that referenced this pull request on Apr 24, 2026
…PRs) (#1971)
Summary

Fixes CI job `nemotron_nano_9b_squad` (CI job 301287542) in pipeline 48953745.

- `kl_threshold: 5e-3`: Nemotron-Nano-9B-v2 is a Mamba-hybrid (NemotronH) whose Mamba mixer save/reload path is not bit-exact; Phase 3 KL is ~1.6e-3 (>0 default), so the assertion fails. Follows the existing precedent in `nemotron_nano_8b_v1_squad.yaml`.
- `trust_remote_code: true`: Phase 4 loads the consolidated safetensors via `AutoModelForCausalLM.from_pretrained`. The model's `hybrid_override_pattern` uses `-` as a separator, which transformers 5.5.4's builtin `configuration_nemotron_h.py` `_pattern_to_list` does not understand (raises `KeyError: '-'`). With `trust_remote_code=True` the model's own custom config class (shipped in the HF repo) is used instead.
- `no_check_resume: true`: matches the sibling `nemotron_nano_8b_v1_squad.yaml` and the known Mamba-hybrid resume non-determinism flagged in `tests/functional_tests/checkpoint_robustness/STATUS.md`.
- `dist_env.timeout_minutes: 1 -> 20`: Phase 4's single-rank 9B vanilla-HF load exceeds the 60s NCCL collective timeout; other ranks otherwise time out at the `_barrier()` before rank 0 finishes. Matches the sibling `qwen3_moe_30b_hellaswag.yaml` pattern.

Evidence

On cw-dfw 8xH100 with `transformers==5.5.4` and CI launcher overrides (`--step_scheduler.max_steps 5 --step_scheduler.ckpt_every_steps 5 --step_scheduler.val_every_steps 5 --step_scheduler.global_batch_size 32 --step_scheduler.local_batch_size 2`, `--checkpoint.checkpoint_dir /tmp/nemotron_nano_9b_ckpt`):

- `[Phase 3] Automodel-from-consolidated max KL: 1.254905e-03 (threshold: 5.000000e-03)` ✓
- `[Phase 4] HF-loaded max KL: 1.284073e-03 (threshold: 5.000000e-03)` ✓
- `1 passed, 27 warnings in 281.13s (0:04:41)`

Pre-fix reproducer byte-matches the CI failure (`Phase 3 KL ~1.62e-3 > 0`).

Test plan

a1dc3a67).

🤖 Generated with Claude Code
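For readers unfamiliar with the Phase 3 / Phase 4 assertion, the sketch below shows why a default threshold of 0 only passes when the reloaded model reproduces the reference logits exactly, and why a small tolerance such as 5e-3 is needed once the reload path is not bit-exact. It is a stdlib-only illustration under assumptions: the helper names (`softmax`, `kl_divergence`, `max_kl`, `passes`), the `<=` comparison, and the toy logits are hypothetical and are not taken from the test harness.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over one token position's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(ref_logits: list[float], new_logits: list[float]) -> float:
    """KL(ref || new) in nats for a single token position."""
    p = softmax(ref_logits)
    q = softmax(new_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def max_kl(ref_batch: list[list[float]], new_batch: list[list[float]]) -> float:
    """Largest per-position KL across the batch."""
    return max(kl_divergence(r, n) for r, n in zip(ref_batch, new_batch))


def passes(ref_batch, new_batch, kl_threshold: float = 0.0) -> bool:
    """Assumed form of the check: max KL must not exceed the threshold."""
    return max_kl(ref_batch, new_batch) <= kl_threshold


if __name__ == "__main__":
    reference = [[2.0, 1.0, 0.1]]
    reloaded = [[2.0, 1.05, 0.1]]  # toy stand-in for a non-bit-exact reload
    print(passes(reference, reference))        # identical logits, threshold 0
    print(passes(reference, reloaded))         # small drift, threshold 0
    print(passes(reference, reloaded, 5e-3))   # small drift, threshold 5e-3
```

With identical logits the KL is exactly zero, so the default passes; any drift produces a strictly positive KL, which is the situation the `kl_threshold: 5e-3` setting accommodates.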