fix: bump hf_kl_threshold for customizer_nemotron_nano_full_sft_chat#1939
Closed
Conversation
Nudge `ci.checkpoint_robustness.hf_kl_threshold` from `7e-2` to `1e-1` to widen the safety margin now that the underlying chat-dataset `tool_calls[0].id` regression is fixed in main (#1921 / fc46ae5).

Pipeline 48953745 / CI job 301287540 failed on a stale main before #1921 was merged, dying inside the finetune phase with `ValueError: assistant message tool_calls[0].id must be a non-empty string` at `chat_dataset.py:212`. The robustness test stage never ran, so the failure signature in the trace is purely the dataset error. With the container rebuilt on current main, `_normalize_tool_calls` now autofills `id=f"call_{idx}"` / `type="function"`, and the finetune phase proceeds.

Verified end-to-end on cw-dfw 8xH100 (transformers 5.5, DP=8, EP=8, CI overrides `--step_scheduler.max_steps=50 --step_scheduler.val_every_steps=50 --step_scheduler.ckpt_every_steps=50 --step_scheduler.global_batch_size=8 --step_scheduler.local_batch_size=1`):

- `[Phase 3] Automodel-from-consolidated max KL: 0.000000e+00 (threshold 0)`
- `[Phase 4] HF-loaded max KL: 1.037561e-02 (threshold 1.000000e-01)`
- `1 passed, 27 warnings in 162.91s`

Phase 3 = 0 confirms save/reload is bit-exact; Phase 4 = 1.04e-2 is well under both the existing 7e-2 and the bumped 1e-1 threshold. Phase 6 is skipped (`no_check_resume: true`). The 1e-1 value keeps ~10x margin over the observed value, matching the `~1.5x observed` pattern from #1932 / #1937 / #1938 but starting from the config's already-generous MoE baseline.

Signed-off-by: Adil Asif <adasif@nvidia.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Superseded by #1971 (batched pipeline-48953745 fixes). The threshold/flag change for this recipe is included in that PR.
akoumpa added a commit that referenced this pull request on Apr 22, 2026:
fix: batch ckpt-robustness fixes for pipeline 48953745 (supersedes 9 PRs) (#1971)

* fix: unify hf_kl_threshold to 1e-1 for v5.5 transformers Phase 4 drift (pipeline 48953745)

  Transformers v5.5 (#1734) introduced small forward-pass changes in Llama/Gemma/Qwen that widen the observed Phase 4 HF KL without changing correctness. Four recipes in pipeline 48953745 were failing their pre-existing tight bounds for this reason; authors opened separate PRs with per-recipe thresholds. Unify the bound at 1e-1 so the whole family passes under one policy.

  Observed Phase 4 KLs on the current nightly sqsh (automodel_nightly_21-4-2026.sqsh) for reference:

  - gemma_3_270m_squad: 2.91e-2 (was 6e-3)
  - gemma_3_270m_squad_peft: 1.68e-2 (was 8e-3)
  - qwen3_moe_30b_hellaswag: 2.43e-2 (was 1e-2)
  - customizer_nemotron_nano_full_sft_chat: already 1e-1 (was 7e-2)

  All comfortably under the new 1e-1 bound (3-4x margin on the tightest). Supersedes #1932, #1933, #1939, #1942.

* fix: ckpt-robustness Phase 3 / resume / GBS-divisibility fixes for pipeline 48953745

  Adds known-good test-harness flags for model families where the checkpoint robustness test was failing for reasons other than Phase 4 threshold drift:

  nemotron_nano_9b_squad{,_peft} (Mamba hybrid):
  - dist_env.timeout_minutes: 1 -> 20 (short timeout triggered on slow init)
  - ci.checkpoint_robustness.kl_threshold: 5e-3 (Phase 3 non-determinism from Mamba SSM state under FSDP all-reduce)
  - ci.checkpoint_robustness.trust_remote_code: true
  - ci.checkpoint_robustness.no_check_resume: true (Mamba resume non-determinism)

  ministral3_3b_squad{,_peft} (FP8 + FSDP2):
  - ci.checkpoint_robustness.kl_threshold: 5e-3 (FP8 scalar scale params under FSDP2 aren't losslessly round-trippable)
  - ci.checkpoint_robustness.no_check_resume: true

  nemotron_super_v3_hellaswag (multi-node DP=32):
  - ci.checkpoint_robustness.step_scheduler.global_batch_size: 64 (prior GBS wasn't divisible by DP=32)

  Supersedes #1943, #1944, #1946, #1947, #1949.

* Apply suggestions from code review

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
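The GBS-divisibility fix for nemotron_super_v3_hellaswag boils down to a simple constraint: the global batch size must split evenly across data-parallel ranks. A minimal sketch of that check (the function name is illustrative, not the actual harness code):

```python
def shards_per_rank(global_batch_size: int, dp_size: int) -> int:
    """Return each rank's share of the global batch, or raise if it
    doesn't divide evenly (the failure mode behind the GBS=64 fix).
    Illustrative sketch, not the actual nemo_automodel validation."""
    if global_batch_size % dp_size != 0:
        raise ValueError(
            f"global_batch_size={global_batch_size} not divisible by DP={dp_size}"
        )
    return global_batch_size // dp_size

# 64 divides evenly across DP=32 (2 samples per rank), hence the chosen value.
```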
akoumpa added a commit that referenced this pull request on Apr 22, 2026:
…`fix: batch ckpt-robustness fixes for pipeline 48953745 (supersedes 9 PRs) (1971)` into `r0.4.0` (#1979)

Same commit message as #1971 above, cherry-picked into the release branch.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Adil <47084919+adil-a@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
linnanwang pushed a commit that referenced this pull request on Apr 24, 2026:
fix: batch ckpt-robustness fixes for pipeline 48953745 (supersedes 9 PRs) (#1971)

Same commit message as #1971 above.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Summary
`customizer_nemotron_nano_full_sft_chat` (pipeline 48953745 / job 301287540) failed before training started: `ValueError: assistant message tool_calls[0].id must be a non-empty string` at `nemo_automodel/components/datasets/llm/chat_dataset.py:212`. The checkpoint-robustness stage never got to run. The underlying `_normalize_tool_calls` strict-validation bug is already fixed on main by fix: chat dataset #1921 / `fc46ae53` (landed 2026-04-20 20:20 PDT, after the failing run started at 2026-04-20 01:44 UTC).

This PR bumps `examples/llm_finetune/nemotron/customizer_nemotron_nano_full_sft_chat.yaml`'s `ci.checkpoint_robustness.hf_kl_threshold` from `7e-2` to `1e-1` as extra headroom for the transformers 5.3 → 5.5 Phase 4 drift that has forced threshold bumps on the sibling SFT-robustness jobs (fix: gemma_3_270m_squad HF KL regression in ckpt robustness #1932 / fix: gemma_3_270m_squad_peft HF KL regression in ckpt robustness #1933 / fix: qwen2_5_7b_squad ckpt robustness thresholds for transformers v5.5 #1937 / fix: bump hf_kl_threshold for customizer_llama_3_2_1b_full_sft_chat #1938). No code changes (the chat-dataset fix is already in main).

Test plan
- Original failure mode confirmed: `tool_calls[0].id` ValueError at line 212 across all 8 ranks.
- On cw-dfw 8xH100 with `transformers==5.5.4`, `DP=8`, `EP=8` and the CI launcher overrides (`--step_scheduler.max_steps=50 --step_scheduler.val_every_steps=50 --step_scheduler.ckpt_every_steps=50 --step_scheduler.global_batch_size=8 --step_scheduler.local_batch_size=1`), verified the robustness test passes with the bumped threshold:
  - `[Phase 3] Automodel-from-consolidated max KL: 0.000000e+00 (threshold: 0.000000e+00)`: save/reload bit-exact
  - `[Phase 4] HF-loaded max KL: 1.037561e-02 (threshold: 1.000000e-01)`: well under the bumped bound
  - Phase 6 (resume) skipped via `no_check_resume: true`
  - `1 passed, 27 warnings in 162.91s (0:02:42)`
- `customizer_nemotron_nano_full_sft_chat` job.

Notes
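The Phase 3 / Phase 4 numbers above come from a max-KL comparison between reference logits and logits from the reloaded checkpoint. A self-contained sketch of that metric (pure Python over logit lists; the actual harness operates on model outputs and its exact KL computation is an assumption here):

```python
import math

def max_kl(ref_logits, test_logits):
    """Max per-position KL(ref || test) over paired lists of logit vectors.
    Illustrative stand-in for the harness's Phase 3/4 metric."""
    def log_softmax(logits):
        m = max(logits)
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        return [x - lse for x in logits]

    worst = 0.0
    for ref, test in zip(ref_logits, test_logits):
        ref_lp, test_lp = log_softmax(ref), log_softmax(test)
        kl = sum(math.exp(r) * (r - t) for r, t in zip(ref_lp, test_lp))
        worst = max(worst, kl)
    return worst

# A bit-exact reload (Phase 3) gives exactly 0; small numeric drift
# (Phase 4) should land well under the 1e-1 bound.
hf_kl_threshold = 1e-1
assert max_kl([[1.0, 2.0, 3.0]], [[1.0, 2.0, 3.0]]) == 0.0
assert max_kl([[1.0, 2.0, 3.0]], [[1.01, 2.0, 3.0]]) < hf_kl_threshold
```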
- Having `fc46ae53` in place fixes the blocker; the threshold bump here is belt-and-suspenders for the v5.5 Phase 4 drift.
- `/lustre/fsw/coreai_dlalgo_ci/automodel_ci/datasets/customizer/sample-datasets` is not mounted on cw-dfw, so a synthetic plain-chat dataset was used locally.
- The existing Nemotron chat template uses `tool_call.arguments | items`, which is incompatible with `_normalize_tool_calls`' stringified arguments; that is a separate follow-up that only matters if we ever want `tool_calls` in a Nemotron customizer dataset. Our synthetic dataset omits `tool_calls`, so this exercises the post-#1921 (fix: chat dataset) code path sufficiently for robustness/Phase 3/Phase 4 verification.

🤖 Generated with Claude Code