fix: bump hf_kl_threshold for customizer_llama_3_2_1b_full_sft_chat by adil-a · Pull Request #1938 · NVIDIA-NeMo/Automodel

adil-a · 2026-04-21T06:14:28Z

Summary

After the transformers 5.3 → 5.5 upgrade (ci: Update to transformers v5.5 #1734), Phase 4 of the customizer_llama_3_2_1b_full_sft_chat checkpoint robustness test reaches a max KL of ~6.93e-3, exceeding the pre-v5.5 threshold of 5e-3. Phase 3 is bit-exact (max KL = 0), so this is forward-pass drift in vanilla-HF Llama 3.2, not a save/reload correctness bug.
Bumps ci.checkpoint_robustness.hf_kl_threshold from 5e-3 to 2.5e-2 (~3.6× the observed 6.93e-3), matching the pattern used for other post-v5.5 SFT robustness jobs in #1932 (fix: gemma_3_270m_squad HF KL regression in ckpt robustness) and #1937 (fix: qwen2_5_7b_squad ckpt robustness thresholds for transformers v5.5).
No code changes.

Note: the underlying CI job also hit a separate chat-dataset tool_calls.id bug (already fixed in main by #1921 / fc46ae53). With that fix in place, the remaining failure on cw-dfw reproduced exactly as a Phase 4 threshold overshoot, handled here.
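
For readers unfamiliar with the check, here is a minimal sketch of a max-KL threshold comparison like the one Phases 3 and 4 report. It is not the test's actual implementation: the helper names, the KL direction (reference against candidate), the per-token reduction, and the `<=` pass condition are all assumptions made for illustration.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one token's logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def token_kl(ref_logits, cand_logits):
    # KL(ref || cand) for a single token position (direction is an assumption).
    ref_lp = log_softmax(ref_logits)
    cand_lp = log_softmax(cand_logits)
    return sum(math.exp(r) * (r - c) for r, c in zip(ref_lp, cand_lp))

def max_kl(ref, cand):
    # Worst-case KL over all token positions.
    return max(token_kl(r, c) for r, c in zip(ref, cand))

def passes(ref, cand, threshold):
    # Hypothetical pass condition: max KL must not exceed the threshold.
    return max_kl(ref, cand) <= threshold

# Identical logits give a max KL of exactly 0, the bit-exact case.
ref = [[0.1, 2.0, -1.0], [0.5, 0.5, 0.5]]
assert max_kl(ref, ref) == 0.0
```

Under this sketch, any forward-pass drift between the two models shows up as a positive max KL, which is why a bit-exact Phase 3 and a nonzero Phase 4 point at the forward pass rather than at save/reload.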

Test plan

Reproduced Phase 4 failure on cw-dfw 8×H100 with transformers==5.5.0 using CI launcher overrides — [Phase 4] max KL = 6.926899e-03 > threshold 5e-3.
Re-ran with bumped threshold — [Phase 3] max KL = 0.000000e+00, [Phase 4] max KL = 6.926899e-03 (threshold 2.500000e-02), [Phase 6] Step 5 / 6 / 7 diff = 0 (3 steps compared). Final: 1 passed, 24 warnings in 61.42s.
CI pipeline green on the re-triggered customizer_llama_3_2_1b_full_sft_chat job.

After the transformers 5.3 -> 5.5 upgrade (#1734) the vanilla HF Llama 3.2 1B forward diverges slightly from the FSDP2 + kernel-patched training-time forward at Phase 4 of the checkpoint robustness test. Phase 3 max KL is still exactly 0 (save/reload is bit-exact), but Phase 4 max KL climbs to ~6.9e-3, overshooting the pre-v5.5 5e-3 threshold.

Bumps ci.checkpoint_robustness.hf_kl_threshold from 5e-3 to 2.5e-2 (~3.6x the observed 6.93e-3), matching the pattern already applied to gemma_3_270m_squad (#1932) and qwen2_5_7b_squad (#1937).

Evidence on cw-dfw 8xH100 with CI launcher overrides, transformers 5.5.0:
[Phase 3] max KL = 0.000000e+00
[Phase 4] max KL = 6.926899e-03 (threshold 2.5e-2)
[Phase 6] Step 5 / 6 / 7 diff = 0.000000e+00 (3 steps compared)
1 passed, 24 warnings in 61.42s

Signed-off-by: Adil Asif <adasif@nvidia.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>

copy-pr-bot · 2026-04-21T06:14:32Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

adil-a · 2026-04-21T19:15:00Z

Closing as redundant. Verified empirically on a fresh CI sqsh image (automodel_nightly_21-4-2026.sqsh):

Observed Phase 4 HF KL: 1.34e-3
Old threshold (pre-bump): 5e-3
Margin: ~4× under

The threshold bump to 2.5e-2 is not needed — with #1921 baked into the current sqsh, the test passes cleanly under the original 5e-3 threshold.
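
The headroom figures quoted in this thread can be checked with a few lines of arithmetic. The sketch below uses only the numbers reported above; the `headroom` helper is a hypothetical name introduced for illustration, defined as threshold divided by observed max KL, so a value below 1 means the check fails.

```python
def headroom(threshold, observed):
    # Ratio of threshold to observed max KL; below 1.0 means the check fails.
    return threshold / observed

# Original failure: transformers 5.5.0 run on cw-dfw against the 5e-3 threshold.
failing = headroom(5e-3, 6.926899e-03)

# Same observation against the proposed 2.5e-2 threshold.
bumped = headroom(2.5e-2, 6.926899e-03)

# Fresh nightly sqsh: observed 1.34e-3 against the original 5e-3 threshold.
fresh = headroom(5e-3, 1.34e-3)

print(f"failing run: {failing:.2f}x, bumped: {bumped:.2f}x, fresh sqsh: {fresh:.2f}x")
```

With the fresh image the original threshold already leaves several times the observed value as margin, which is the basis for closing the PR without the bump.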

adil-a requested review from HuiyingLi, ZhiyuLi-Nvidia, akoumpa, hemildesai and pthombre as code owners April 21, 2026 06:14

This was referenced Apr 21, 2026

fix: bump hf_kl_threshold for customizer_nemotron_nano_full_sft_chat #1939

Closed

fix: widen hf_kl_threshold for customizer_gpt_oss_full_sft_chat #1940

Closed

fix: widen qwen3_moe_30b_hellaswag ckpt-robustness KL threshold to 3e-2 #1942

Closed

adil-a closed this Apr 21, 2026

adil-a wants to merge 1 commit into main from adil-a/fix-48953745-customizer-llama-3-2-1b-full-sft-chat
