fix: gemma_3_270m_squad_peft HF KL regression in ckpt robustness #1933

Closed

adil-a wants to merge 2 commits into main from adil-a/fix-48953745-gemma-3-270m-squad-peft

Conversation

adil-a (Collaborator) commented Apr 21, 2026

Summary

  • Bump ci.checkpoint_robustness.hf_kl_threshold for examples/llm_finetune/gemma/gemma_3_270m_squad_peft.yaml from 8e-3 -> 3.5e-2 to restore the gemma_3_270m_squad_peft checkpoint-robustness CI job that started failing after the transformers v5.5 upgrade (ci: Update to transformers v5.5 #1734).
  • This is not a save/reload correctness bug -- Phase 3 (automodel-from-consolidated) KL is exactly 0. The drift is in the forward pass itself: training-time FSDP2 with the kernel-patched FSDPGemma3ForCausalLM vs vanilla HF Gemma3ForCausalLM under v5.5's revised gemma3_text implementation. In Phase 4 the PEFT variant composes its LoRA adapter on top of a freshly loaded HF base, so it exercises the same Gemma3 v5.5 forward path as the non-PEFT sibling.
  • Follows the same pattern as the non-PEFT sibling fix #1932 (gemma_3_270m_squad, 6e-3 -> 2.5e-2) and #1867 (relax KL thresholds and remove invalid kwargs in Qwen3Next linear attn; qwen3_moe / gpt_oss).
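
The change itself is a one-line YAML edit. A hypothetical sketch of the relevant block follows; only the key path ci.checkpoint_robustness.hf_kl_threshold and the two values are from this PR, the surrounding structure is assumed:

```yaml
# examples/llm_finetune/gemma/gemma_3_270m_squad_peft.yaml (sketch)
ci:
  checkpoint_robustness:
    # was 8e-3; widened to absorb Phase 4 forward-pass drift under
    # transformers v5.5 (Phase 3 save/reload KL remains exactly 0)
    hf_kl_threshold: 3.5e-2
```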

Evidence

Pre-fix, CI job 301287633:

[Phase 3] Automodel-from-consolidated max KL: 0.000000e+00 (threshold: 0.000000e+00)
[Phase 4] HF-loaded max KL: 8.439951e-03 (threshold: 8.000000e-03)
AssertionError: KL divergence between original and HF-loaded model too large:
  max per-token KL = 8.439951e-03 > threshold 8.000000e-03

Reproduction on cw-dfw 8xH100 with transformers 5.5.4, applying the same CI-launcher overrides (--step_scheduler.max_steps 5 --step_scheduler.ckpt_every_steps 5 --step_scheduler.val_every_steps 5 --step_scheduler.global_batch_size 32 --step_scheduler.local_batch_size 2 --peft.use_triton false):

[Phase 3] Automodel-from-consolidated max KL: 0.000000e+00 (threshold: 0.000000e+00)
[Phase 4] HF-loaded max KL: 8.439951e-03 (threshold: 3.500000e-02)
[Phase 6] Step 5: baseline_loss=0.837454, resume_loss=0.843585, diff=6.131470e-03
[Phase 6] Step 6: baseline_loss=0.409137, resume_loss=0.413974, diff=4.837364e-03
[Phase 6] Step 7: baseline_loss=0.540729, resume_loss=0.535871, diff=4.858077e-03
[Phase 6] Training resumption verified (3 steps compared) OK
================== 1 passed, 24 warnings in 207.07s (0:03:27) ==================

Phase 3 KL is exactly 0, confirming the automodel save/reload path is bit-exact for the LoRA adapter. Phase 4 KL matches CI byte-for-byte (8.439951e-03). The bumped threshold (3.5e-2) also covers the worst case where the YAML is run with its default max_steps=100 (observed ~2.8e-2 on cw-dfw without the CI overrides).
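
For reference, "max per-token KL" here is the maximum over token positions of KL(p || q) between the original model's and the HF-loaded model's next-token distributions. A minimal NumPy sketch of that metric (illustrative only; the harness's actual implementation, reduction, and tensor layout are assumptions):

```python
import numpy as np

def max_per_token_kl(logits_a: np.ndarray, logits_b: np.ndarray) -> float:
    """Max over token positions of KL(softmax(logits_a) || softmax(logits_b)).

    Inputs are logit arrays of shape (..., vocab); KL is computed per
    position along the last axis, then reduced with max.
    """
    def log_softmax(x: np.ndarray) -> np.ndarray:
        # numerically stabilized log-softmax along the vocab axis
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    log_p = log_softmax(logits_a)
    log_q = log_softmax(logits_b)
    p = np.exp(log_p)
    kl = (p * (log_p - log_q)).sum(axis=-1)  # KL per token position
    return float(kl.max())

# Identical logits give exactly 0, matching the Phase 3 result above
same = np.random.default_rng(0).normal(size=(4, 16))
assert max_per_token_kl(same, same) == 0.0
```

Small forward-pass perturbations (like the v5.5 kernel differences) show up as a small but nonzero Phase 4 value under this metric, which is why a threshold bump rather than a code fix is appropriate here.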

Test plan

  • Reproduced the exact CI failure on cw-dfw with transformers 5.5 (max per-token KL = 8.439951e-03).
  • Applied the threshold bump and reran the same test end-to-end -- Phases 1-4 and Phase 6 all pass.
  • Verified Phase 3 (automodel-from-consolidated) KL is still exactly 0 -- no regression in save/reload correctness.

Generated with Claude Code.

Bump ci.checkpoint_robustness.hf_kl_threshold from 8e-3 to 3.5e-2 to
restore the gemma_3_270m_squad_peft checkpoint-robustness CI job that
started failing after the transformers v5.5 upgrade (#1734). Mirrors
the sibling non-PEFT fix (#1932) and earlier qwen3_moe/gpt_oss fix
(#1867).

Phase 3 (automodel-from-consolidated) KL is still 0 — this is a
forward-pass numerical drift in v5.5's Gemma3 text-only stack, not a
save/reload correctness bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
copy-pr-bot (Bot) commented Apr 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@akoumpa akoumpa added docs-only With great power comes great responsibility. r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge. labels Apr 21, 2026
adil-a (Collaborator, Author) commented Apr 21, 2026

Superseded by #1971 (batched pipeline-48953745 fixes). Threshold/flag change for this recipe is included in that PR.

@adil-a adil-a closed this Apr 21, 2026
akoumpa added a commit that referenced this pull request Apr 22, 2026
…PRs) (#1971)

* fix: unify hf_kl_threshold to 1e-1 for v5.5 transformers Phase 4 drift (pipeline 48953745)

Transformers v5.5 (#1734) introduced small forward-pass changes in
Llama/Gemma/Qwen that widen the observed Phase 4 HF KL without
changing correctness. Four recipes in pipeline 48953745 were failing the
pre-existing tight bounds for this reason; authors opened separate PRs
with per-recipe thresholds.

Unify the bound at 1e-1 so the whole family passes under one policy.
Observed Phase 4 KLs on the current nightly sqsh
(automodel_nightly_21-4-2026.sqsh) for reference:
- gemma_3_270m_squad             : 2.91e-2 (was 6e-3)
- gemma_3_270m_squad_peft        : 1.68e-2 (was 8e-3)
- qwen3_moe_30b_hellaswag        : 2.43e-2 (was 1e-2)
- customizer_nemotron_nano_full_sft_chat: already 1e-1 (was 7e-2)

All comfortably under the new 1e-1 bound (3-4x margin on the tightest).

Supersedes #1932, #1933, #1939, #1942.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* fix: ckpt-robustness Phase 3 / resume / GBS-divisibility fixes for pipeline 48953745

Adds known-good test-harness flags for model families where the checkpoint
robustness test was failing for reasons other than Phase 4 threshold drift:

nemotron_nano_9b_squad{,_peft} (Mamba hybrid):
  - dist_env.timeout_minutes: 1 -> 20 (short timeout triggered on slow init)
  - ci.checkpoint_robustness.kl_threshold: 5e-3 (Phase 3 non-det from Mamba
    SSM state under FSDP all-reduce)
  - ci.checkpoint_robustness.trust_remote_code: true
  - ci.checkpoint_robustness.no_check_resume: true (Mamba resume non-det)

ministral3_3b_squad{,_peft} (FP8 + FSDP2):
  - ci.checkpoint_robustness.kl_threshold: 5e-3 (FP8 scalar scale params
    under FSDP2 aren't losslessly round-trippable)
  - ci.checkpoint_robustness.no_check_resume: true

nemotron_super_v3_hellaswag (multi-node DP=32):
  - ci.checkpoint_robustness.step_scheduler.global_batch_size: 64
    (prior gbs wasn't divisible by DP=32)

Supersedes #1943, #1944, #1946, #1947, #1949.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* Apply suggestions from code review

Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>

---------

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
akoumpa added a commit that referenced this pull request Apr 22, 2026
…es 9 PRs) (1971)` into `r0.4.0` (#1979)

fix: batch ckpt-robustness fixes for pipeline 48953745 (supersedes 9 PRs) (#1971)


Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Adil <47084919+adil-a@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
linnanwang pushed a commit that referenced this pull request Apr 24, 2026
…PRs) (#1971)
