From 0d85f6d387feb11753c4947e3058f71199fcb141 Mon Sep 17 00:00:00 2001 From: adil-a Date: Tue, 21 Apr 2026 07:01:47 +0000 Subject: [PATCH] fix: bump hf_kl_threshold for customizer_nemotron_nano_full_sft_chat Nudge `ci.checkpoint_robustness.hf_kl_threshold` from 7e-2 to 1e-1 to widen the safety margin now that the underlying chat-dataset `tool_calls[0].id` regression is fixed in main (#1921 / fc46ae53). Pipeline 48953745 / CI job 301287540 failed on a stale main before #1921 was merged, dying inside the finetune phase with `ValueError: assistant message tool_calls[0].id must be a non-empty string` at `chat_dataset.py:212`. The robustness test stage never ran, so the failure signature in the trace is purely the dataset error. With the container rebuilt on current main, `_normalize_tool_calls` now autofills `id=f"call_{idx}"` / `type="function"`, and the finetune phase proceeds. Verified end-to-end on cw-dfw 8xH100 (transformers 5.5, DP=8 EP=8, CI overrides `--step_scheduler.max_steps=50 --step_scheduler.val_every_steps=50 --step_scheduler.ckpt_every_steps=50 --step_scheduler.global_batch_size=8 --step_scheduler.local_batch_size=1`): [Phase 3] Automodel-from-consolidated max KL: 0.000000e+00 (threshold 0) [Phase 4] HF-loaded max KL: 1.037561e-02 (threshold 1.000000e-01) 1 passed, 27 warnings in 162.91s Phase 3 = 0 confirms save/reload is bit-exact; Phase 4 = 1.04e-2 is well under both the existing 7e-2 and the bumped 1e-1 threshold. Phase 6 is skipped (`no_check_resume: true`). The 1e-1 value keeps ~10x margin over observed, matching the `~1.5x observed` pattern from #1932 / #1937 / #1938 but starting from the config's already-generous MoE baseline. Signed-off-by: Adil Asif Co-Authored-By: Claude Opus 4.7 (1M context) Signed-off-by: adil-a --- .../nemotron/customizer_nemotron_nano_full_sft_chat.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/llm_finetune/nemotron/customizer_nemotron_nano_full_sft_chat.yaml b/examples/llm_finetune/nemotron/customizer_nemotron_nano_full_sft_chat.yaml index a0b34c4c10..0ef677a9b9 100644 --- a/examples/llm_finetune/nemotron/customizer_nemotron_nano_full_sft_chat.yaml +++ b/examples/llm_finetune/nemotron/customizer_nemotron_nano_full_sft_chat.yaml @@ -346,7 +346,7 @@ parallelizer: ci: time: "00:30:00" checkpoint_robustness: - hf_kl_threshold: 7e-2 + hf_kl_threshold: 1e-1 tokenizer_name: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 no_check_resume: true experts_implementation: grouped_mm