fix: gemma_3_270m_squad_peft HF KL regression in ckpt robustness #1933

Closed

adil-a wants to merge 2 commits into main from adil-a/fix-48953745-gemma-3-270m-squad-peft

Conversation

adil-a (Collaborator) commented Apr 21, 2026

Summary

  • Bump ci.checkpoint_robustness.hf_kl_threshold for examples/llm_finetune/gemma/gemma_3_270m_squad_peft.yaml from 8e-3 -> 3.5e-2 to restore the gemma_3_270m_squad_peft checkpoint-robustness CI job that started failing after the transformers v5.5 upgrade (ci: Update to transformers v5.5 #1734).
  • This is not a save/reload correctness bug -- Phase 3 (automodel-from-consolidated) KL is exactly 0. The drift is in the forward pass itself: training-time FSDP2 with the kernel-patched FSDPGemma3ForCausalLM vs vanilla HF Gemma3ForCausalLM under v5.5's revised gemma3_text implementation. In Phase 4 the PEFT variant composes its LoRA adapter on top of a freshly loaded HF base, so it exercises the same Gemma3 v5.5 forward path as the non-PEFT sibling.
  • Follows the same pattern as the non-PEFT sibling fix #1932 (gemma_3_270m_squad, 6e-3 -> 2.5e-2) and #1867 (relax KL thresholds and remove invalid kwargs in Qwen3Next linear attn; qwen3_moe / gpt_oss).
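
The change itself is a one-line YAML edit. A hypothetical sketch of the relevant block follows; only the key path ci.checkpoint_robustness.hf_kl_threshold and the two values are from this PR, the surrounding structure is assumed:

```yaml
# examples/llm_finetune/gemma/gemma_3_270m_squad_peft.yaml (sketch)
ci:
  checkpoint_robustness:
    # was 8e-3; widened to absorb Phase 4 forward-pass drift under
    # transformers v5.5 (Phase 3 save/reload KL remains exactly 0)
    hf_kl_threshold: 3.5e-2
```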

Evidence

Pre-fix, CI job 301287633:

[Phase 3] Automodel-from-consolidated max KL: 0.000000e+00 (threshold: 0.000000e+00)
[Phase 4] HF-loaded max KL: 8.439951e-03 (threshold: 8.000000e-03)
AssertionError: KL divergence between original and HF-loaded model too large:
  max per-token KL = 8.439951e-03 > threshold 8.000000e-03

Reproduction on cw-dfw 8xH100 with transformers 5.5.4, applying the same CI-launcher overrides (--step_scheduler.max_steps 5 --step_scheduler.ckpt_every_steps 5 --step_scheduler.val_every_steps 5 --step_scheduler.global_batch_size 32 --step_scheduler.local_batch_size 2 --peft.use_triton false):

[Phase 3] Automodel-from-consolidated max KL: 0.000000e+00 (threshold: 0.000000e+00)
[Phase 4] HF-loaded max KL: 8.439951e-03 (threshold: 3.500000e-02)
[Phase 6] Step 5: baseline_loss=0.837454, resume_loss=0.843585, diff=6.131470e-03
[Phase 6] Step 6: baseline_loss=0.409137, resume_loss=0.413974, diff=4.837364e-03
[Phase 6] Step 7: baseline_loss=0.540729, resume_loss=0.535871, diff=4.858077e-03
[Phase 6] Training resumption verified (3 steps compared) OK
================== 1 passed, 24 warnings in 207.07s (0:03:27) ==================

Phase 3 KL is exactly 0, confirming the automodel save/reload path is bit-exact for the LoRA adapter. Phase 4 KL matches CI byte-for-byte (8.439951e-03). The bumped threshold (3.5e-2) also covers the worst case where the YAML is run with its default max_steps=100 (observed ~2.8e-2 on cw-dfw without the CI overrides).
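
For reference, "max per-token KL" here is the maximum over token positions of KL(p || q) between the original model's and the HF-loaded model's next-token distributions. A minimal NumPy sketch of that metric (illustrative only; the harness's actual implementation, reduction, and tensor layout are assumptions):

```python
import numpy as np

def max_per_token_kl(logits_a: np.ndarray, logits_b: np.ndarray) -> float:
    """Max over token positions of KL(softmax(logits_a) || softmax(logits_b)).

    Inputs are logit arrays of shape (..., vocab); KL is computed per
    position along the last axis, then reduced with max.
    """
    def log_softmax(x: np.ndarray) -> np.ndarray:
        # numerically stabilized log-softmax along the vocab axis
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    log_p = log_softmax(logits_a)
    log_q = log_softmax(logits_b)
    p = np.exp(log_p)
    kl = (p * (log_p - log_q)).sum(axis=-1)  # KL per token position
    return float(kl.max())

# Identical logits give exactly 0, matching the Phase 3 result above
same = np.random.default_rng(0).normal(size=(4, 16))
assert max_per_token_kl(same, same) == 0.0
```

Small forward-pass perturbations (like the v5.5 kernel differences) show up as a small but nonzero Phase 4 value under this metric, which is why a threshold bump rather than a code fix is appropriate here.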

Test plan

  • Reproduced the exact CI failure on cw-dfw with transformers 5.5 (max per-token KL = 8.439951e-03).
  • Applied the threshold bump and reran the same test end-to-end -- Phases 1-4 and Phase 6 all pass.
  • Verified Phase 3 (automodel-from-consolidated) KL is still exactly 0 -- no regression in save/reload correctness.

Generated with Claude Code.

Bump ci.checkpoint_robustness.hf_kl_threshold from 8e-3 to 3.5e-2 to
restore the gemma_3_270m_squad_peft checkpoint-robustness CI job that
started failing after the transformers v5.5 upgrade (#1734). Mirrors
the sibling non-PEFT fix (#1932) and earlier qwen3_moe/gpt_oss fix
(#1867).

Phase 3 (automodel-from-consolidated) KL is still 0 — this is a
forward-pass numerical drift in v5.5's Gemma3 text-only stack, not a
save/reload correctness bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
copy-pr-bot (Bot) commented Apr 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@akoumpa akoumpa added docs-only With great power comes great responsibility. r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge. labels Apr 21, 2026
adil-a (Collaborator, Author) commented Apr 21, 2026

Superseded by #1971 (batched pipeline-48953745 fixes). Threshold/flag change for this recipe is included in that PR.

@adil-a adil-a closed this Apr 21, 2026
akoumpa added a commit that referenced this pull request Apr 22, 2026
…PRs) (#1971)

* fix: unify hf_kl_threshold to 1e-1 for v5.5 transformers Phase 4 drift (pipeline 48953745)

Transformers v5.5 (#1734) introduced small forward-pass changes in
Llama/Gemma/Qwen that widen the observed Phase 4 HF KL without
changing correctness. Four recipes in pipeline 48953745 were failing the
pre-existing tight bounds for this reason; authors opened separate PRs
with per-recipe thresholds.

Unify the bound at 1e-1 so the whole family passes under one policy.
Observed Phase 4 KLs on the current nightly sqsh
(automodel_nightly_21-4-2026.sqsh) for reference:
- gemma_3_270m_squad             : 2.91e-2 (was 6e-3)
- gemma_3_270m_squad_peft        : 1.68e-2 (was 8e-3)
- qwen3_moe_30b_hellaswag        : 2.43e-2 (was 1e-2)
- customizer_nemotron_nano_full_sft_chat: already 1e-1 (was 7e-2)

All comfortably under the new 1e-1 bound (3-4x margin on the tightest).

Supersedes #1932, #1933, #1939, #1942.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* fix: ckpt-robustness Phase 3 / resume / GBS-divisibility fixes for pipeline 48953745

Adds known-good test-harness flags for model families where the checkpoint
robustness test was failing for reasons other than Phase 4 threshold drift:

nemotron_nano_9b_squad{,_peft} (Mamba hybrid):
  - dist_env.timeout_minutes: 1 -> 20 (short timeout triggered on slow init)
  - ci.checkpoint_robustness.kl_threshold: 5e-3 (Phase 3 non-det from Mamba
    SSM state under FSDP all-reduce)
  - ci.checkpoint_robustness.trust_remote_code: true
  - ci.checkpoint_robustness.no_check_resume: true (Mamba resume non-det)

ministral3_3b_squad{,_peft} (FP8 + FSDP2):
  - ci.checkpoint_robustness.kl_threshold: 5e-3 (FP8 scalar scale params
    under FSDP2 aren't losslessly round-trippable)
  - ci.checkpoint_robustness.no_check_resume: true

nemotron_super_v3_hellaswag (multi-node DP=32):
  - ci.checkpoint_robustness.step_scheduler.global_batch_size: 64
    (prior gbs wasn't divisible by DP=32)

Supersedes #1943, #1944, #1946, #1947, #1949.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* Apply suggestions from code review

Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>

---------

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
akoumpa added a commit that referenced this pull request Apr 22, 2026
…es 9 PRs) (1971)` into `r0.4.0` (#1979)

fix: batch ckpt-robustness fixes for pipeline 48953745 (supersedes 9 PRs) (#1971)


Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Adil <47084919+adil-a@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
linnanwang pushed a commit that referenced this pull request Apr 24, 2026
…PRs) (#1971)
