fix: bump hf_kl_threshold for customizer_llama_3_2_1b_full_sft_chat#1938

Closed
adil-a wants to merge 1 commit into main from adil-a/fix-48953745-customizer-llama-3-2-1b-full-sft-chat

Conversation

Collaborator

@adil-a adil-a commented Apr 21, 2026

Summary

Note: the underlying CI job also hit a separate chat-dataset tool_calls.id bug (already fixed in main by #1921 / fc46ae53). With that fix in place, the remaining failure on cw-dfw reproduced exactly as a Phase 4 threshold overshoot, handled here.

Test plan

  • Reproduced Phase 4 failure on cw-dfw 8×H100 with transformers==5.5.0 using CI launcher overrides — [Phase 4] max KL = 6.926899e-03 > threshold 5e-3.
  • Re-ran with bumped threshold — [Phase 3] max KL = 0.000000e+00, [Phase 4] max KL = 6.926899e-03 (threshold 2.500000e-02), [Phase 6] Step 5 / 6 / 7 diff = 0 (3 steps compared). Final: 1 passed, 24 warnings in 61.42s.
  • CI pipeline green on the re-triggered customizer_llama_3_2_1b_full_sft_chat job.

After the transformers 5.3 -> 5.5 upgrade (#1734) the vanilla HF Llama 3.2
1B forward diverges slightly from the FSDP2 + kernel-patched training-time
forward at Phase 4 of the checkpoint robustness test. Phase 3 max KL is
still exactly 0 (save/reload is bit-exact), but Phase 4 max KL climbs to
~6.9e-3, overshooting the pre-v5.5 5e-3 threshold.
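For readers unfamiliar with the metric, a "max KL" check of this kind can be sketched as follows. This is a minimal, dependency-free illustration, not the test's actual code: the function names (`log_softmax`, `max_token_kl`, `within_threshold`), the KL direction (reference vs. candidate), and the per-token max reduction are all assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def max_token_kl(ref_logits, test_logits):
    """Largest per-token KL(ref || test) across all token positions.

    Both arguments are lists of rows, one row of logits per token.
    (Hypothetical helper: the real test may reduce differently.)
    """
    worst = 0.0
    for ref_row, test_row in zip(ref_logits, test_logits):
        log_p = log_softmax(ref_row)
        log_q = log_softmax(test_row)
        kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
        worst = max(worst, kl)
    return worst


def within_threshold(max_kl, threshold):
    """True when the observed max KL does not exceed the configured threshold."""
    return max_kl <= threshold
```

Under these assumptions, identical logits give a max KL of exactly 0, consistent with the bit-exact save/reload reported for Phase 3, while any divergence between the two forwards yields a positive value that is then compared against the threshold.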

Bumps ci.checkpoint_robustness.hf_kl_threshold from 5e-3 to 2.5e-2
(~3.6x margin over the observed 6.93e-3), matching the pattern already
applied to gemma_3_270m_squad (#1932) and qwen2_5_7b_squad (#1937).
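The headroom a threshold leaves over an observed value is just their ratio. A small sketch for checking that arithmetic (the `headroom` helper is hypothetical and not part of the repository):

```python
def headroom(threshold, observed):
    """Ratio of the threshold to the observed max KL (values > 1 mean the check passes)."""
    if observed <= 0:
        raise ValueError("observed must be positive to compute a ratio")
    return threshold / observed


# Values taken from this PR: old and new thresholds vs. the observed Phase 4 max KL.
observed_kl = 6.926899e-03
for threshold in (5e-3, 2.5e-2):
    print(f"threshold {threshold:g}: headroom {headroom(threshold, observed_kl):.2f}x")
```

A ratio below 1 corresponds to the overshoot that failed the test; a ratio above 1 is the safety margin left by the new threshold.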

Evidence on cw-dfw 8xH100 with CI launcher overrides, transformers 5.5.0:
  [Phase 3] max KL = 0.000000e+00
  [Phase 4] max KL = 6.926899e-03 (threshold 2.5e-2)
  [Phase 6] Step 5 / 6 / 7 diff = 0.000000e+00 (3 steps compared)
  1 passed, 24 warnings in 61.42s

Signed-off-by: Adil Asif <adasif@nvidia.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>

copy-pr-bot Bot commented Apr 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Collaborator Author

adil-a commented Apr 21, 2026

Closing as redundant. Empirically verified on fresh CI sqsh (automodel_nightly_21-4-2026.sqsh):

  • Observed Phase 4 HF KL: 1.34e-3
  • Old threshold (pre-bump): 5e-3
  • Margin: ~4× under

The threshold bump to 2.5e-2 is not needed — with #1921 baked into the current sqsh, the test passes cleanly under the original 5e-3 threshold.

@adil-a adil-a closed this Apr 21, 2026