
fix: bump hf_kl_threshold for customizer_nemotron_nano_full_sft_chat#1939

Closed
adil-a wants to merge 1 commit into main from
adil-a/fix-48953745-customizer-nemotron-nano-full-sft-chat

Conversation

@adil-a (Collaborator) commented Apr 21, 2026

Summary

Test plan

  • Reproduced the original failure via the CI trace (tool_calls[0].id ValueError at line 212 across all 8 ranks).
  • On cw-dfw 8xH100 with transformers==5.5.4, DP=8, EP=8 and the CI launcher overrides (--step_scheduler.max_steps=50 --step_scheduler.val_every_steps=50 --step_scheduler.ckpt_every_steps=50 --step_scheduler.global_batch_size=8 --step_scheduler.local_batch_size=1), verified the robustness test passes with the bumped threshold:
    • [Phase 3] Automodel-from-consolidated max KL: 0.000000e+00 (threshold: 0.000000e+00) — save/reload bit-exact
    • [Phase 4] HF-loaded max KL: 1.037561e-02 (threshold: 1.000000e-01) — well under the bumped bound
    • Phase 6 is intentionally skipped via no_check_resume: true
    • Final line: 1 passed, 27 warnings in 162.91s (0:02:42)
  • CI pipeline green on the re-triggered customizer_nemotron_nano_full_sft_chat job.

Notes

  • The CI-pipeline container that produced the failure predates #1921 (fix: chat dataset); the next rebuilt container, with fc46ae53 in place, fixes the blocker. The threshold bump here is belt-and-suspenders for the v5.5 Phase 4 drift.
  • The CI's customizer sample dataset at /lustre/fsw/coreai_dlalgo_ci/automodel_ci/datasets/customizer/sample-datasets is not mounted on cw-dfw, so a synthetic plain-chat dataset was used locally. The existing Nemotron chat template uses `tool_call.arguments | items`, which is incompatible with `_normalize_tool_calls`' stringified arguments; that is a separate follow-up that only matters if we ever want tool_calls in a Nemotron customizer dataset. Our synthetic dataset omits tool_calls, so it exercises the post-#1921 code path sufficiently for the robustness/Phase 3/Phase 4 verification.
  • Phase 4 KL is only 1.04e-2, roughly 7x headroom under even the pre-bump 7e-2 threshold, so the bump is mostly preemptive safety, consistent with the "~1.5x observed margin" pattern from #1938 (fix: bump hf_kl_threshold for customizer_llama_3_2_1b_full_sft_chat) applied to MoE noise.
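The template incompatibility mentioned above can be made concrete with a small sketch. This is illustrative only: `render_arguments` is a hypothetical stand-in for what Jinja's `items` filter needs (a mapping it can iterate), not the actual Nemotron template; the point is that once a normalizer stringifies `arguments` into JSON, key/value iteration breaks.

```python
import json

# Hypothetical stand-in for a template doing `tool_call.arguments | items`:
# Jinja's `items` filter needs a mapping, i.e. something with .items().
def render_arguments(arguments):
    return ", ".join(f"{k}={v}" for k, v in arguments.items())

dict_args = {"city": "Toronto", "unit": "celsius"}
str_args = json.dumps(dict_args)  # what a stringifying normalizer would emit

print(render_arguments(dict_args))  # -> city=Toronto, unit=celsius
try:
    render_arguments(str_args)      # a JSON *string* has no .items()
except AttributeError as e:
    print("incompatible:", e)
```

This is why a dataset with tool_calls would need the template (or the normalizer) adjusted before the two can be used together.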

🤖 Generated with Claude Code

Nudge `ci.checkpoint_robustness.hf_kl_threshold` from 7e-2 to 1e-1 to
widen the safety margin now that the underlying chat-dataset
`tool_calls[0].id` regression is fixed in main (#1921 / fc46ae5).

Pipeline 48953745 / CI job 301287540 failed on a stale main before
#1921 was merged, dying inside the finetune phase with
`ValueError: assistant message tool_calls[0].id must be a non-empty string`
at `chat_dataset.py:212`. The robustness test stage never ran, so the
failure signature in the trace is purely the dataset error. With the
container rebuilt on current main, `_normalize_tool_calls` now autofills
`id=f"call_{idx}"` / `type="function"`, and the finetune phase proceeds.

Verified end-to-end on cw-dfw 8xH100 (transformers 5.5, DP=8 EP=8,
CI overrides `--step_scheduler.max_steps=50 --step_scheduler.val_every_steps=50
--step_scheduler.ckpt_every_steps=50 --step_scheduler.global_batch_size=8
--step_scheduler.local_batch_size=1`):

    [Phase 3] Automodel-from-consolidated max KL: 0.000000e+00 (threshold 0)
    [Phase 4] HF-loaded max KL: 1.037561e-02 (threshold 1.000000e-01)
    1 passed, 27 warnings in 162.91s

Phase 3 = 0 confirms save/reload is bit-exact; Phase 4 = 1.04e-2 is well
under both the existing 7e-2 and the bumped 1e-1 threshold. Phase 6 is
skipped (`no_check_resume: true`). The 1e-1 value keeps ~10x margin over
observed, matching the `~1.5x observed` pattern from #1932 / #1937 /
#1938 but starting from the config's already-generous MoE baseline.

Signed-off-by: Adil Asif <adasif@nvidia.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
@copy-pr-bot (Bot) commented Apr 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@adil-a (Collaborator, Author) commented Apr 21, 2026

Superseded by #1971 (batched pipeline-48953745 fixes). Threshold/flag change for this recipe is included in that PR.

@adil-a adil-a closed this Apr 21, 2026
akoumpa added a commit that referenced this pull request Apr 22, 2026
…PRs) (#1971)

* fix: unify hf_kl_threshold to 1e-1 for v5.5 transformers Phase 4 drift (pipeline 48953745)

Transformers v5.5 (#1734) introduced small forward-pass changes in
Llama/Gemma/Qwen that widen the observed Phase 4 HF KL without
changing correctness. Four recipes in pipeline 48953745 were failing the
pre-existing tight bounds for this reason; authors opened separate PRs
with per-recipe thresholds.

Unify the bound at 1e-1 so the whole family passes under one policy.
Observed Phase 4 KLs on the current nightly sqsh
(automodel_nightly_21-4-2026.sqsh) for reference:
- gemma_3_270m_squad             : 2.91e-2 (was 6e-3)
- gemma_3_270m_squad_peft        : 1.68e-2 (was 8e-3)
- qwen3_moe_30b_hellaswag        : 2.43e-2 (was 1e-2)
- customizer_nemotron_nano_full_sft_chat: already 1e-1 (was 7e-2)

All comfortably under the new 1e-1 bound (3-4x margin on the tightest).

Supersedes #1932, #1933, #1939, #1942.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* fix: ckpt-robustness Phase 3 / resume / GBS-divisibility fixes for pipeline 48953745

Adds known-good test-harness flags for model families where the checkpoint
robustness test was failing for reasons other than Phase 4 threshold drift:

nemotron_nano_9b_squad{,_peft} (Mamba hybrid):
  - dist_env.timeout_minutes: 1 -> 20 (short timeout triggered on slow init)
  - ci.checkpoint_robustness.kl_threshold: 5e-3 (Phase 3 non-det from Mamba
    SSM state under FSDP all-reduce)
  - ci.checkpoint_robustness.trust_remote_code: true
  - ci.checkpoint_robustness.no_check_resume: true (Mamba resume non-det)

ministral3_3b_squad{,_peft} (FP8 + FSDP2):
  - ci.checkpoint_robustness.kl_threshold: 5e-3 (FP8 scalar scale params
    under FSDP2 aren't losslessly round-trippable)
  - ci.checkpoint_robustness.no_check_resume: true

nemotron_super_v3_hellaswag (multi-node DP=32):
  - ci.checkpoint_robustness.step_scheduler.global_batch_size: 64
    (prior gbs wasn't divisible by DP=32)
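The GBS-divisibility constraint behind the last change above is simple enough to state in code. A sketch under assumed semantics (`local_batch_size` is a hypothetical helper, not the repo's scheduler): the global batch must split evenly across data-parallel ranks.

```python
def local_batch_size(global_batch_size, dp_size):
    # Each DP rank gets an equal slice; a non-divisible gbs has no
    # valid per-rank batch size and should fail fast.
    if global_batch_size % dp_size != 0:
        raise ValueError(
            f"global_batch_size={global_batch_size} not divisible by DP={dp_size}"
        )
    return global_batch_size // dp_size

# DP=32 with the new gbs=64 gives 2 samples per rank; the prior gbs
# was not a multiple of 32 and could not be scheduled.
```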

Supersedes #1943, #1944, #1946, #1947, #1949.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* Apply suggestions from code review

Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>

---------

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
akoumpa added a commit that referenced this pull request Apr 22, 2026
…es 9 PRs) (1971)` into `r0.4.0` (#1979)

fix: batch ckpt-robustness fixes for pipeline 48953745 (supersedes 9 PRs) (#1971)

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Adil <47084919+adil-a@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
linnanwang pushed a commit that referenced this pull request Apr 24, 2026
…PRs) (#1971)


Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>