Skip to content

fix: nemotron_super_v3_hellaswag checkpoint robustness batch size#1949

Closed
adil-a wants to merge 1 commit intomainfrom
adil-a/fix-48953745-nemotron-super-v3-hellaswag
Closed

fix: nemotron_super_v3_hellaswag checkpoint robustness batch size#1949
adil-a wants to merge 1 commit intomainfrom
adil-a/fix-48953745-nemotron-super-v3-hellaswag

Conversation

@adil-a
Copy link
Copy Markdown
Collaborator

@adil-a adil-a commented Apr 21, 2026

Summary

  • Phase 1 of the checkpoint-robustness test crashes in StepScheduler.__init__ with AssertionError: global_batch_size (32) must be divisible by local_batch_size * dp_size (2 * 32).
  • The CI robustness launcher (tests/ci_tests/scripts/finetune_launcher.sh) hardcodes --step_scheduler.local_batch_size 2 and --step_scheduler.global_batch_size 32, but this config runs on 4 nodes x 8 GPUs (dp_size=32 with fsdp2, tp=1, cp=1, ep=32), so 32 is not divisible by 2 * 32 = 64.
  • YAML-only fix: add step_scheduler.global_batch_size: 64 under ci.checkpoint_robustness. The test harness (_extract_custom_args in test_checkpoint_robustness_llm.py) extends dotted YAML entries to the tail of the argv, so the YAML value overrides the launcher's CLI value. 64 % (2 * 32) == 0, grad-accum = 1.

Fixes CI job 301287545 in pipeline 48953745.

Test plan

  • Locally simulated _extract_custom_args + arg parser: confirmed --step_scheduler.global_batch_size 64 is appended after the launcher's 32 and becomes the final value.
  • cw-dfw multi-node repro was attempted but blocked by the local sqsh container lacking transformers.models.gemma4 (pre-dates PR fix: Update recipe_owner for gemma4 #1925); CI container has it. The fix is a single YAML bump with verified override order.
  • CI re-run.

🤖 Generated with Claude Code

Phase 1 crashes in StepScheduler.__init__ with
`global_batch_size (32) must be divisible by local_batch_size * dp_size
(2 * 32)`. The CI robustness launcher hardcodes
`--step_scheduler.local_batch_size 2` and `--step_scheduler.global_batch_size
32`, but this config runs on 4 nodes x 8 GPUs (dp_size=32 with fsdp2, tp=1,
cp=1, ep=32), so 32 is not divisible by 2 * 32 = 64.

Override `step_scheduler.global_batch_size: 64` in `ci.checkpoint_robustness`.
The test harness appends `ci.checkpoint_robustness` dotted-key entries to
the argv tail after the launcher-provided flags, so the YAML value wins.
With gbs=64, 64 % (2 * 32) == 0 and grad-accum = 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@adil-a
Copy link
Copy Markdown
Collaborator Author

adil-a commented Apr 21, 2026

Superseded by #1971 (batched pipeline-48953745 fixes). Threshold/flag change for this recipe is included in that PR.

@adil-a adil-a closed this Apr 21, 2026
akoumpa added a commit that referenced this pull request Apr 22, 2026
…PRs) (#1971)

* fix: unify hf_kl_threshold to 1e-1 for v5.5 transformers Phase 4 drift (pipeline 48953745)

Transformers v5.5 (#1734) introduced small forward-pass changes in
Llama/Gemma/Qwen that widen the observed Phase 4 HF KL without
changing correctness. Four recipes in pipeline 48953745 were failing the
pre-existing tight bounds for this reason; authors opened separate PRs
with per-recipe thresholds.

Unify the bound at 1e-1 so the whole family passes under one policy.
Observed Phase 4 KLs on the current nightly sqsh
(automodel_nightly_21-4-2026.sqsh) for reference:
- gemma_3_270m_squad             : 2.91e-2 (was 6e-3)
- gemma_3_270m_squad_peft        : 1.68e-2 (was 8e-3)
- qwen3_moe_30b_hellaswag        : 2.43e-2 (was 1e-2)
- customizer_nemotron_nano_full_sft_chat: already 1e-1 (was 7e-2)

All comfortably under the new 1e-1 bound (3-4x margin on the tightest).

Supersedes #1932, #1933, #1939, #1942.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* fix: ckpt-robustness Phase 3 / resume / GBS-divisibility fixes for pipeline 48953745

Adds known-good test-harness flags for model families where the checkpoint
robustness test was failing for reasons other than Phase 4 threshold drift:

nemotron_nano_9b_squad{,_peft} (Mamba hybrid):
  - dist_env.timeout_minutes: 1 -> 20 (short timeout triggered on slow init)
  - ci.checkpoint_robustness.kl_threshold: 5e-3 (Phase 3 non-det from Mamba
    SSM state under FSDP all-reduce)
  - ci.checkpoint_robustness.trust_remote_code: true
  - ci.checkpoint_robustness.no_check_resume: true (Mamba resume non-det)

ministral3_3b_squad{,_peft} (FP8 + FSDP2):
  - ci.checkpoint_robustness.kl_threshold: 5e-3 (FP8 scalar scale params
    under FSDP2 aren't losslessly round-trippable)
  - ci.checkpoint_robustness.no_check_resume: true

nemotron_super_v3_hellaswag (multi-node DP=32):
  - ci.checkpoint_robustness.step_scheduler.global_batch_size: 64
    (prior gbs wasn't divisible by DP=32)

Supersedes #1943, #1944, #1946, #1947, #1949.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* Apply suggestions from code review

Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>

---------

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
akoumpa added a commit that referenced this pull request Apr 22, 2026
…es 9 PRs) (1971)` into `r0.4.0` (#1979)

fix: batch ckpt-robustness fixes for pipeline 48953745 (supersedes 9 PRs) (#1971)

* fix: unify hf_kl_threshold to 1e-1 for v5.5 transformers Phase 4 drift (pipeline 48953745)

Transformers v5.5 (#1734) introduced small forward-pass changes in
Llama/Gemma/Qwen that widen the observed Phase 4 HF KL without
changing correctness. Four recipes in pipeline 48953745 were failing the
pre-existing tight bounds for this reason; authors opened separate PRs
with per-recipe thresholds.

Unify the bound at 1e-1 so the whole family passes under one policy.
Observed Phase 4 KLs on the current nightly sqsh
(automodel_nightly_21-4-2026.sqsh) for reference:
- gemma_3_270m_squad             : 2.91e-2 (was 6e-3)
- gemma_3_270m_squad_peft        : 1.68e-2 (was 8e-3)
- qwen3_moe_30b_hellaswag        : 2.43e-2 (was 1e-2)
- customizer_nemotron_nano_full_sft_chat: already 1e-1 (was 7e-2)

All comfortably under the new 1e-1 bound (3-4x margin on the tightest).

Supersedes #1932, #1933, #1939, #1942.



* fix: ckpt-robustness Phase 3 / resume / GBS-divisibility fixes for pipeline 48953745

Adds known-good test-harness flags for model families where the checkpoint
robustness test was failing for reasons other than Phase 4 threshold drift:

nemotron_nano_9b_squad{,_peft} (Mamba hybrid):
  - dist_env.timeout_minutes: 1 -> 20 (short timeout triggered on slow init)
  - ci.checkpoint_robustness.kl_threshold: 5e-3 (Phase 3 non-det from Mamba
    SSM state under FSDP all-reduce)
  - ci.checkpoint_robustness.trust_remote_code: true
  - ci.checkpoint_robustness.no_check_resume: true (Mamba resume non-det)

ministral3_3b_squad{,_peft} (FP8 + FSDP2):
  - ci.checkpoint_robustness.kl_threshold: 5e-3 (FP8 scalar scale params
    under FSDP2 aren't losslessly round-trippable)
  - ci.checkpoint_robustness.no_check_resume: true

nemotron_super_v3_hellaswag (multi-node DP=32):
  - ci.checkpoint_robustness.step_scheduler.global_batch_size: 64
    (prior gbs wasn't divisible by DP=32)

Supersedes #1943, #1944, #1946, #1947, #1949.



* Apply suggestions from code review



---------

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Adil <47084919+adil-a@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
linnanwang pushed a commit that referenced this pull request Apr 24, 2026
…PRs) (#1971)

* fix: unify hf_kl_threshold to 1e-1 for v5.5 transformers Phase 4 drift (pipeline 48953745)

Transformers v5.5 (#1734) introduced small forward-pass changes in
Llama/Gemma/Qwen that widen the observed Phase 4 HF KL without
changing correctness. Four recipes in pipeline 48953745 were failing the
pre-existing tight bounds for this reason; authors opened separate PRs
with per-recipe thresholds.

Unify the bound at 1e-1 so the whole family passes under one policy.
Observed Phase 4 KLs on the current nightly sqsh
(automodel_nightly_21-4-2026.sqsh) for reference:
- gemma_3_270m_squad             : 2.91e-2 (was 6e-3)
- gemma_3_270m_squad_peft        : 1.68e-2 (was 8e-3)
- qwen3_moe_30b_hellaswag        : 2.43e-2 (was 1e-2)
- customizer_nemotron_nano_full_sft_chat: already 1e-1 (was 7e-2)

All comfortably under the new 1e-1 bound (3-4x margin on the tightest).

Supersedes #1932, #1933, #1939, #1942.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* fix: ckpt-robustness Phase 3 / resume / GBS-divisibility fixes for pipeline 48953745

Adds known-good test-harness flags for model families where the checkpoint
robustness test was failing for reasons other than Phase 4 threshold drift:

nemotron_nano_9b_squad{,_peft} (Mamba hybrid):
  - dist_env.timeout_minutes: 1 -> 20 (short timeout triggered on slow init)
  - ci.checkpoint_robustness.kl_threshold: 5e-3 (Phase 3 non-det from Mamba
    SSM state under FSDP all-reduce)
  - ci.checkpoint_robustness.trust_remote_code: true
  - ci.checkpoint_robustness.no_check_resume: true (Mamba resume non-det)

ministral3_3b_squad{,_peft} (FP8 + FSDP2):
  - ci.checkpoint_robustness.kl_threshold: 5e-3 (FP8 scalar scale params
    under FSDP2 aren't losslessly round-trippable)
  - ci.checkpoint_robustness.no_check_resume: true

nemotron_super_v3_hellaswag (multi-node DP=32):
  - ci.checkpoint_robustness.step_scheduler.global_batch_size: 64
    (prior gbs wasn't divisible by DP=32)

Supersedes #1943, #1944, #1946, #1947, #1949.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* Apply suggestions from code review

Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>

---------

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant