cp: fix: batch ckpt-robustness fixes for pipeline 48953745 (supersedes 9 PRs) (1971) into r0.4.0 #1979

Merged
akoumpa merged 1 commit into r0.4.0 from cherry-pick-1971-r0.4.0 on Apr 22, 2026

@svcnvidia-nemo-ci
Contributor

beep boop [🤖]: Hi @adil-a 👋,

we've cherry picked #1971 into  for you! 🚀

Please review and approve this cherry pick at your convenience!

…PRs) (#1971)

* fix: unify hf_kl_threshold to 1e-1 for v5.5 transformers Phase 4 drift (pipeline 48953745)

Transformers v5.5 (#1734) introduced small forward-pass changes in
Llama/Gemma/Qwen that widen the observed Phase 4 HF KL without
changing correctness. Four recipes in pipeline 48953745 were failing the
pre-existing tight bounds for this reason; authors opened separate PRs
with per-recipe thresholds.

Unify the bound at 1e-1 so the whole family passes under one policy.
Observed Phase 4 KLs on the current nightly sqsh
(automodel_nightly_21-4-2026.sqsh) for reference:
- gemma_3_270m_squad             : 2.91e-2 (was 6e-3)
- gemma_3_270m_squad_peft        : 1.68e-2 (was 8e-3)
- qwen3_moe_30b_hellaswag        : 2.43e-2 (was 1e-2)
- customizer_nemotron_nano_full_sft_chat: already 1e-1 (was 7e-2)

All comfortably under the new 1e-1 bound (3-4x margin on the tightest).
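The margin claim can be checked directly from the numbers listed above. The following is a minimal sketch, not part of the test harness: the `margin` helper and the dictionary layout are illustrative assumptions, and only the three recipes with a reported observed KL are included.

```python
# Observed Phase 4 HF KL values, copied from the list in this commit message.
observed_kl = {
    "gemma_3_270m_squad": 2.91e-2,
    "gemma_3_270m_squad_peft": 1.68e-2,
    "qwen3_moe_30b_hellaswag": 2.43e-2,
}

# Unified bound proposed by this commit.
HF_KL_THRESHOLD = 1e-1


def margin(threshold: float, observed: float) -> float:
    """Return how many times larger the threshold is than the observed KL.

    Hypothetical helper for illustration; the harness may define margin differently.
    """
    return threshold / observed


margins = {name: margin(HF_KL_THRESHOLD, kl) for name, kl in observed_kl.items()}

# The recipe with the smallest margin is the one closest to the bound.
tightest = min(margins, key=margins.get)

for name, m in sorted(margins.items(), key=lambda kv: kv[1]):
    print(f"{name}: {m:.2f}x headroom under {HF_KL_THRESHOLD:g}")
```

Under these numbers the tightest recipe is `gemma_3_270m_squad`, whose headroom falls between 3x and 4x, in line with the "3-4x margin" stated above.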

Supersedes #1932, #1933, #1939, #1942.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* fix: ckpt-robustness Phase 3 / resume / GBS-divisibility fixes for pipeline 48953745

Adds known-good test-harness flags for model families where the checkpoint
robustness test was failing for reasons other than Phase 4 threshold drift:

nemotron_nano_9b_squad{,_peft} (Mamba hybrid):
  - dist_env.timeout_minutes: 1 -> 20 (short timeout triggered on slow init)
  - ci.checkpoint_robustness.kl_threshold: 5e-3 (Phase 3 non-det from Mamba
    SSM state under FSDP all-reduce)
  - ci.checkpoint_robustness.trust_remote_code: true
  - ci.checkpoint_robustness.no_check_resume: true (Mamba resume non-det)
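To make the dotted keys above concrete, here is a rough sketch of how they nest. It assumes the recipe is a nested mapping and that each dotted key names a path into it; the `set_dotted` helper is hypothetical and is not how the harness necessarily loads its recipe files.

```python
from typing import Any


def set_dotted(config: dict, dotted_key: str, value: Any) -> dict:
    """Set config["a"]["b"]["c"] = value for dotted_key "a.b.c" (illustrative helper)."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        # Create intermediate mappings on demand.
        node = node.setdefault(part, {})
    node[leaf] = value
    return config


# Values taken from the nemotron_nano_9b_squad{,_peft} list above.
overrides = {
    "dist_env.timeout_minutes": 20,
    "ci.checkpoint_robustness.kl_threshold": 5e-3,
    "ci.checkpoint_robustness.trust_remote_code": True,
    "ci.checkpoint_robustness.no_check_resume": True,
}

config: dict = {}
for key, value in overrides.items():
    set_dotted(config, key, value)

print(config)
```

The three `ci.checkpoint_robustness.*` entries end up as siblings under one mapping, while `dist_env.timeout_minutes` lives in a separate top-level section.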

ministral3_3b_squad{,_peft} (FP8 + FSDP2):
  - ci.checkpoint_robustness.kl_threshold: 5e-3 (FP8 scalar scale params
    under FSDP2 aren't losslessly round-trippable)
  - ci.checkpoint_robustness.no_check_resume: true

nemotron_super_v3_hellaswag (multi-node DP=32):
  - ci.checkpoint_robustness.step_scheduler.global_batch_size: 64
    (prior gbs wasn't divisible by DP=32)
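The divisibility constraint behind this change can be sketched as follows. This is an assumption-labeled illustration: `check_gbs` and the `micro_batch_size` parameter are hypothetical names, the micro batch size of 1 is assumed rather than taken from the recipe, and the failing value 48 is made up for the example because the prior gbs is not stated here.

```python
def check_gbs(global_batch_size: int, dp_size: int, micro_batch_size: int = 1) -> int:
    """Return the gradient-accumulation steps implied by a global batch size.

    Raises ValueError when the global batch cannot be split evenly across
    data-parallel ranks. Hypothetical helper, not the framework's own check.
    """
    samples_per_step = dp_size * micro_batch_size
    if global_batch_size % samples_per_step != 0:
        raise ValueError(
            f"global_batch_size={global_batch_size} is not divisible by "
            f"dp_size * micro_batch_size = {samples_per_step}"
        )
    return global_batch_size // samples_per_step


# New value from this commit: 64 splits evenly across DP=32.
print(check_gbs(64, 32))

# A hypothetical non-divisible value is rejected.
try:
    check_gbs(48, 32)
except ValueError as err:
    print(f"rejected: {err}")
```

With `global_batch_size: 64` and DP=32, each rank processes an equal share of every global batch, which is the property the prior setting lacked.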

Supersedes #1943, #1944, #1946, #1947, #1949.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* Apply suggestions from code review

Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>

---------

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@svcnvidia-nemo-ci
Contributor Author

/ok to test 5a21ede

@copy-pr-bot
copy-pr-bot Bot commented Apr 22, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@akoumpa merged commit 0364ea8 into r0.4.0 on Apr 22, 2026
12 of 14 checks passed
@akoumpa deleted the cherry-pick-1971-r0.4.0 branch on April 22, 2026 05:00
Labels

cherry-pick Run CICD Trigger Testing CICD
