
fix: bump hf_kl_threshold for customizer_nemotron_nano_full_sft_chat#1939

Closed
adil-a wants to merge 1 commit into main from
adil-a/fix-48953745-customizer-nemotron-nano-full-sft-chat

Conversation

@adil-a (Collaborator) commented Apr 21, 2026

Summary

Test plan

  • Reproduced the original failure via the CI trace (tool_calls[0].id ValueError at line 212 across all 8 ranks).
  • On cw-dfw 8xH100 with transformers==5.5.4, DP=8, EP=8 and the CI launcher overrides (--step_scheduler.max_steps=50 --step_scheduler.val_every_steps=50 --step_scheduler.ckpt_every_steps=50 --step_scheduler.global_batch_size=8 --step_scheduler.local_batch_size=1), verified the robustness test passes with the bumped threshold:
    • [Phase 3] Automodel-from-consolidated max KL: 0.000000e+00 (threshold: 0.000000e+00) — save/reload bit-exact
    • [Phase 4] HF-loaded max KL: 1.037561e-02 (threshold: 1.000000e-01) — well under the bumped bound
    • Phase 6 is intentionally skipped via no_check_resume: true
    • Final line: 1 passed, 27 warnings in 162.91s (0:02:42)
  • CI pipeline green on the re-triggered customizer_nemotron_nano_full_sft_chat job.

Notes

  • The CI-pipeline container that produced the failure predates #1921 (fix: chat dataset); the next rebuilt container, with fc46ae53 in place, fixes the blocker. The threshold bump here is belt-and-suspenders for the v5.5 Phase 4 drift.
  • The CI's customizer sample dataset at /lustre/fsw/coreai_dlalgo_ci/automodel_ci/datasets/customizer/sample-datasets is not mounted on cw-dfw, so a synthetic plain-chat dataset was used locally. The existing Nemotron chat template uses `tool_call.arguments | items`, which is incompatible with `_normalize_tool_calls`' stringified arguments; that is a separate follow-up that only matters if we ever want tool_calls in a Nemotron customizer dataset. Our synthetic dataset omits tool_calls, so it exercises the post-#1921 code path sufficiently for the robustness/Phase 3/Phase 4 verification.
  • Phase 4 KL is only 1.04e-2, roughly 7x headroom under even the pre-bump 7e-2 threshold, so the bump is mostly preemptive safety, consistent with the "~1.5x observed margin" pattern from #1938 (fix: bump hf_kl_threshold for customizer_llama_3_2_1b_full_sft_chat) applied to MoE noise.
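The template incompatibility mentioned above can be made concrete with a small sketch. This is illustrative only: `render_arguments` is a hypothetical stand-in for what Jinja's `items` filter needs (a mapping it can iterate), not the actual Nemotron template; the point is that once a normalizer stringifies `arguments` into JSON, key/value iteration breaks.

```python
import json

# Hypothetical stand-in for a template doing `tool_call.arguments | items`:
# Jinja's `items` filter needs a mapping, i.e. something with .items().
def render_arguments(arguments):
    return ", ".join(f"{k}={v}" for k, v in arguments.items())

dict_args = {"city": "Toronto", "unit": "celsius"}
str_args = json.dumps(dict_args)  # what a stringifying normalizer would emit

print(render_arguments(dict_args))  # -> city=Toronto, unit=celsius
try:
    render_arguments(str_args)      # a JSON *string* has no .items()
except AttributeError as e:
    print("incompatible:", e)
```

This is why a dataset with tool_calls would need the template (or the normalizer) adjusted before the two can be used together.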

🤖 Generated with Claude Code

Nudge `ci.checkpoint_robustness.hf_kl_threshold` from 7e-2 to 1e-1 to
widen the safety margin now that the underlying chat-dataset
`tool_calls[0].id` regression is fixed in main (#1921 / fc46ae5).

Pipeline 48953745 / CI job 301287540 failed on a stale main before
#1921 was merged, dying inside the finetune phase with
`ValueError: assistant message tool_calls[0].id must be a non-empty string`
at `chat_dataset.py:212`. The robustness test stage never ran, so the
failure signature in the trace is purely the dataset error. With the
container rebuilt on current main, `_normalize_tool_calls` now autofills
`id=f"call_{idx}"` / `type="function"`, and the finetune phase proceeds.

Verified end-to-end on cw-dfw 8xH100 (transformers 5.5, DP=8 EP=8,
CI overrides `--step_scheduler.max_steps=50 --step_scheduler.val_every_steps=50
--step_scheduler.ckpt_every_steps=50 --step_scheduler.global_batch_size=8
--step_scheduler.local_batch_size=1`):

    [Phase 3] Automodel-from-consolidated max KL: 0.000000e+00 (threshold 0)
    [Phase 4] HF-loaded max KL: 1.037561e-02 (threshold 1.000000e-01)
    1 passed, 27 warnings in 162.91s

Phase 3 = 0 confirms save/reload is bit-exact; Phase 4 = 1.04e-2 is well
under both the existing 7e-2 and the bumped 1e-1 threshold. Phase 6 is
skipped (`no_check_resume: true`). The 1e-1 value keeps ~10x margin over
observed, matching the `~1.5x observed` pattern from #1932 / #1937 /
#1938 but starting from the config's already-generous MoE baseline.

Signed-off-by: Adil Asif <adasif@nvidia.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
@copy-pr-bot (Bot) commented Apr 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@adil-a (Collaborator, Author) commented Apr 21, 2026

Superseded by #1971 (batched pipeline-48953745 fixes). Threshold/flag change for this recipe is included in that PR.

@adil-a adil-a closed this Apr 21, 2026
akoumpa added a commit that referenced this pull request Apr 22, 2026
…PRs) (#1971)

* fix: unify hf_kl_threshold to 1e-1 for v5.5 transformers Phase 4 drift (pipeline 48953745)

Transformers v5.5 (#1734) introduced small forward-pass changes in
Llama/Gemma/Qwen that widen the observed Phase 4 HF KL without
changing correctness. Four recipes in pipeline 48953745 were failing the
pre-existing tight bounds for this reason; authors opened separate PRs
with per-recipe thresholds.

Unify the bound at 1e-1 so the whole family passes under one policy.
Observed Phase 4 KLs on the current nightly sqsh
(automodel_nightly_21-4-2026.sqsh) for reference:
- gemma_3_270m_squad             : 2.91e-2 (was 6e-3)
- gemma_3_270m_squad_peft        : 1.68e-2 (was 8e-3)
- qwen3_moe_30b_hellaswag        : 2.43e-2 (was 1e-2)
- customizer_nemotron_nano_full_sft_chat: already 1e-1 (was 7e-2)

All comfortably under the new 1e-1 bound (3-4x margin on the tightest).

Supersedes #1932, #1933, #1939, #1942.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* fix: ckpt-robustness Phase 3 / resume / GBS-divisibility fixes for pipeline 48953745

Adds known-good test-harness flags for model families where the checkpoint
robustness test was failing for reasons other than Phase 4 threshold drift:

nemotron_nano_9b_squad{,_peft} (Mamba hybrid):
  - dist_env.timeout_minutes: 1 -> 20 (short timeout triggered on slow init)
  - ci.checkpoint_robustness.kl_threshold: 5e-3 (Phase 3 non-det from Mamba
    SSM state under FSDP all-reduce)
  - ci.checkpoint_robustness.trust_remote_code: true
  - ci.checkpoint_robustness.no_check_resume: true (Mamba resume non-det)

ministral3_3b_squad{,_peft} (FP8 + FSDP2):
  - ci.checkpoint_robustness.kl_threshold: 5e-3 (FP8 scalar scale params
    under FSDP2 aren't losslessly round-trippable)
  - ci.checkpoint_robustness.no_check_resume: true

nemotron_super_v3_hellaswag (multi-node DP=32):
  - ci.checkpoint_robustness.step_scheduler.global_batch_size: 64
    (prior gbs wasn't divisible by DP=32)
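The GBS-divisibility constraint behind the last change above is simple enough to state in code. A sketch under assumed semantics (`local_batch_size` is a hypothetical helper, not the repo's scheduler): the global batch must split evenly across data-parallel ranks.

```python
def local_batch_size(global_batch_size, dp_size):
    # Each DP rank gets an equal slice; a non-divisible gbs has no
    # valid per-rank batch size and should fail fast.
    if global_batch_size % dp_size != 0:
        raise ValueError(
            f"global_batch_size={global_batch_size} not divisible by DP={dp_size}"
        )
    return global_batch_size // dp_size

# DP=32 with the new gbs=64 gives 2 samples per rank; the prior gbs
# was not a multiple of 32 and could not be scheduled.
```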

Supersedes #1943, #1944, #1946, #1947, #1949.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* Apply suggestions from code review

Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>

---------

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
akoumpa added a commit that referenced this pull request Apr 22, 2026
…es 9 PRs) (1971)` into `r0.4.0` (#1979)

fix: batch ckpt-robustness fixes for pipeline 48953745 (supersedes 9 PRs) (#1971)

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Adil <47084919+adil-a@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
linnanwang pushed a commit that referenced this pull request Apr 24, 2026
…PRs) (#1971)


Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>