fix: nemotron_nano_9b_squad_peft checkpoint robustness thresholds#1944

Closed
adil-a wants to merge 1 commit into main from adil-a/fix-48953745-nemotron-nano-9b-squad-peft

Conversation


adil-a (Collaborator) commented Apr 21, 2026

Summary

Fixes the nemotron_nano_9b_squad_peft checkpoint robustness CI job (pipeline 48953745, job 301287621). The same Mamba-hybrid save/reload non-determinism that affects the SFT sibling (#1943) also hits the PEFT path: save_consolidated: true serializes and reloads the full NemotronH base (Mamba2 mixer state) regardless of whether a LoRA adapter is attached, so the Phase 3 automodel-from-consolidated KL is ~1.7e-3 (> 0), the same class of failure as the SFT run.

Applies the same four YAML fixes as #1943:

  • kl_threshold=5e-3 (~3x observed 1.7e-3)
  • trust_remote_code=true so Phase 4's vanilla-HF load of nvidia/NVIDIA-Nemotron-Nano-9B-v2 uses the repo-shipped configuration_nemotron_h.py (transformers 5.5.4's builtin _pattern_to_list raises KeyError('-') on the hybrid pattern)
  • no_check_resume=true (Mamba hybrid resume is non-deterministic, matches the nemotron_nano_8b_v1_squad precedent)
  • dist_env.timeout_minutes: 1 -> 20 so Phase 4's rank-0-only base-model + adapter merge load doesn't trip the 60s NCCL barrier on other ranks
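Taken together, the four changes map to a recipe fragment along these lines (key paths as quoted in the superseding commit message in this thread; exact nesting in the recipe file may differ):

```yaml
# Sketch of the YAML changes; verify key paths against the actual recipe.
dist_env:
  timeout_minutes: 20        # was 1; Phase 4 rank-0-only merge load exceeds 60s
ci:
  checkpoint_robustness:
    kl_threshold: 5.0e-3     # ~3x the observed 1.7e-3 Phase 3 KL
    trust_remote_code: true  # use repo-shipped configuration_nemotron_h.py
    no_check_resume: true    # Mamba hybrid resume is non-deterministic
```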

No code changes.

Test plan

  • Reproduced CI failure byte-identically on cw-dfw 8xH100: Phase 3 max KL = 1.732453e-03 (CI: 1.401710e-03), same class of failure.
  • Verified fix: Phase 3 max KL = 1.441502e-03 (≤ 5e-3), Phase 4 max KL = 7.076394e-04 (≤ 5e-3), 1 passed, 26 warnings in 142.18s.
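The pass/fail criterion above is a max-over-tokens KL compared against kl_threshold. A minimal sketch of that style of check, with a hypothetical helper (the real harness operates on model logits; names here are illustrative only):

```python
import math

def max_token_kl(ref_logits, new_logits):
    """Max per-token KL(ref || new) over paired logit vectors.

    Hypothetical helper mirroring the Phase 3/4 check: softmax each
    pair of logit vectors, compute the KL divergence, take the max.
    """
    def softmax(xs):
        m = max(xs)  # subtract max for numerical stability
        exps = [math.exp(x - m) for x in xs]
        z = sum(exps)
        return [e / z for e in exps]

    worst = 0.0
    for ref, new in zip(ref_logits, new_logits):
        p, q = softmax(ref), softmax(new)
        kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
        worst = max(worst, kl)
    return worst

# Identical logits give zero KL; a tiny perturbation (standing in for the
# Mamba save/reload jitter) gives a small positive KL under the 5e-3 bound.
print(max_token_kl([[1.0, 2.0, 3.0]], [[1.0, 2.0, 3.0]]))
print(max_token_kl([[1.0, 2.0, 3.0]], [[1.0, 2.0, 3.01]]) < 5e-3)  # True
```

This is why a strictly zero threshold cannot hold once the reload path is non-deterministic: any perturbation of the restored state yields a KL that is small but positive.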

🤖 Generated with Claude Code

Same Mamba-hybrid save/reload non-determinism that affects the SFT
sibling (#1943) also hits the PEFT path: Phase 3
`automodel-from-consolidated` KL is ~1.7e-3 (>0) because
`save_consolidated: true` still serializes and reloads the full
NemotronH base (Mamba2 mixer state), independent of LoRA. Apply the
same four YAML fixes:

- kl_threshold=5e-3 (~3x observed 1.7e-3)
- trust_remote_code=True so Phase 4's vanilla-HF load of
  nvidia/NVIDIA-Nemotron-Nano-9B-v2 uses the repo-shipped
  configuration_nemotron_h.py (transformers 5.5.4's builtin
  _pattern_to_list raises KeyError('-') on the hybrid pattern)
- no_check_resume=True (Mamba hybrid resume is inherently
  non-deterministic, matches nemotron_nano_8b_v1_squad precedent)
- dist_env.timeout_minutes 1 -> 20 so Phase 4's rank-0-only base
  model + adapter merge load doesn't trip the 60s NCCL barrier

Verified on cw-dfw 8xH100 (TP=2, transformers 5.5.4, CI launcher
overrides): Phase 3 max KL = 1.441502e-03, Phase 4 max KL =
7.076394e-04, `1 passed`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>

copy-pr-bot Bot commented Apr 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.



adil-a commented Apr 21, 2026

Superseded by #1971 (batched pipeline-48953745 fixes). Threshold/flag change for this recipe is included in that PR.

@adil-a adil-a closed this Apr 21, 2026
akoumpa added a commit that referenced this pull request Apr 22, 2026
…PRs) (#1971)

* fix: unify hf_kl_threshold to 1e-1 for v5.5 transformers Phase 4 drift (pipeline 48953745)

Transformers v5.5 (#1734) introduced small forward-pass changes in
Llama/Gemma/Qwen that widen the observed Phase 4 HF KL without
changing correctness. Four recipes in pipeline 48953745 were failing the
pre-existing tight bounds for this reason; authors opened separate PRs
with per-recipe thresholds.

Unify the bound at 1e-1 so the whole family passes under one policy.
Observed Phase 4 KLs on the current nightly sqsh
(automodel_nightly_21-4-2026.sqsh) for reference:
- gemma_3_270m_squad             : 2.91e-2 (was 6e-3)
- gemma_3_270m_squad_peft        : 1.68e-2 (was 8e-3)
- qwen3_moe_30b_hellaswag        : 2.43e-2 (was 1e-2)
- customizer_nemotron_nano_full_sft_chat: already 1e-1 (was 7e-2)

All comfortably under the new 1e-1 bound (3-4x margin on the tightest).

Supersedes #1932, #1933, #1939, #1942.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* fix: ckpt-robustness Phase 3 / resume / GBS-divisibility fixes for pipeline 48953745

Adds known-good test-harness flags for model families where the checkpoint
robustness test was failing for reasons other than Phase 4 threshold drift:

nemotron_nano_9b_squad{,_peft} (Mamba hybrid):
  - dist_env.timeout_minutes: 1 -> 20 (short timeout triggered on slow init)
  - ci.checkpoint_robustness.kl_threshold: 5e-3 (Phase 3 non-det from Mamba
    SSM state under FSDP all-reduce)
  - ci.checkpoint_robustness.trust_remote_code: true
  - ci.checkpoint_robustness.no_check_resume: true (Mamba resume non-det)

ministral3_3b_squad{,_peft} (FP8 + FSDP2):
  - ci.checkpoint_robustness.kl_threshold: 5e-3 (FP8 scalar scale params
    under FSDP2 aren't losslessly round-trippable)
  - ci.checkpoint_robustness.no_check_resume: true

nemotron_super_v3_hellaswag (multi-node DP=32):
  - ci.checkpoint_robustness.step_scheduler.global_batch_size: 64
    (prior gbs wasn't divisible by DP=32)

Supersedes #1943, #1944, #1946, #1947, #1949.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* Apply suggestions from code review

Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>

---------

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
akoumpa added a commit that referenced this pull request Apr 22, 2026
…es 9 PRs) (1971)` into `r0.4.0` (#1979)

fix: batch ckpt-robustness fixes for pipeline 48953745 (supersedes 9 PRs) (#1971)

(Commit message identical to the #1971 entry above.)

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Adil <47084919+adil-a@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
linnanwang pushed a commit that referenced this pull request Apr 24, 2026
…PRs) (#1971)

(Commit message identical to the #1971 entry above.)