fix: gemma_3_270m_squad_peft HF KL regression in ckpt robustness#1933
Closed
Conversation
Bump ci.checkpoint_robustness.hf_kl_threshold from 8e-3 to 3.5e-2 to restore the gemma_3_270m_squad_peft checkpoint-robustness CI job that started failing after the transformers v5.5 upgrade (#1734). Mirrors the sibling non-PEFT fix (#1932) and the earlier qwen3_moe/gpt_oss fix (#1867).

Phase 3 (automodel-from-consolidated) KL is still 0 — this is forward-pass numerical drift in v5.5's Gemma3 text-only stack, not a save/reload correctness bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
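The change amounts to a one-key override in the recipe YAML. A sketch of the relevant fragment — only the `ci.checkpoint_robustness.hf_kl_threshold` key path and values are confirmed by this PR; the surrounding structure is assumed:

```yaml
# examples/llm_finetune/gemma/gemma_3_270m_squad_peft.yaml (fragment, sketch)
ci:
  checkpoint_robustness:
    # Phase 4 (trained FSDP model vs HF reload) max per-token KL bound.
    # Was 8e-3; widened to absorb transformers v5.5 Gemma3 forward-pass drift.
    hf_kl_threshold: 3.5e-2
```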
This was referenced Apr 21, 2026
Collaborator
Author
Superseded by #1971 (batched pipeline-48953745 fixes). The threshold/flag change for this recipe is included in that PR.
akoumpa
added a commit
that referenced
this pull request
Apr 22, 2026
…PRs) (#1971)

* fix: unify hf_kl_threshold to 1e-1 for v5.5 transformers Phase 4 drift (pipeline 48953745)

Transformers v5.5 (#1734) introduced small forward-pass changes in Llama/Gemma/Qwen that widen the observed Phase 4 HF KL without changing correctness. Four recipes in pipeline 48953745 were failing the pre-existing tight bounds for this reason; authors opened separate PRs with per-recipe thresholds. Unify the bound at 1e-1 so the whole family passes under one policy.

Observed Phase 4 KLs on the current nightly sqsh (automodel_nightly_21-4-2026.sqsh) for reference:

- gemma_3_270m_squad: 2.91e-2 (was 6e-3)
- gemma_3_270m_squad_peft: 1.68e-2 (was 8e-3)
- qwen3_moe_30b_hellaswag: 2.43e-2 (was 1e-2)
- customizer_nemotron_nano_full_sft_chat: already 1e-1 (was 7e-2)

All comfortably under the new 1e-1 bound (3-4x margin on the tightest). Supersedes #1932, #1933, #1939, #1942.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* fix: ckpt-robustness Phase 3 / resume / GBS-divisibility fixes for pipeline 48953745

Adds known-good test-harness flags for model families where the checkpoint robustness test was failing for reasons other than Phase 4 threshold drift:

nemotron_nano_9b_squad{,_peft} (Mamba hybrid):
- dist_env.timeout_minutes: 1 -> 20 (short timeout triggered on slow init)
- ci.checkpoint_robustness.kl_threshold: 5e-3 (Phase 3 non-determinism from Mamba SSM state under FSDP all-reduce)
- ci.checkpoint_robustness.trust_remote_code: true
- ci.checkpoint_robustness.no_check_resume: true (Mamba resume non-determinism)

ministral3_3b_squad{,_peft} (FP8 + FSDP2):
- ci.checkpoint_robustness.kl_threshold: 5e-3 (FP8 scalar scale params under FSDP2 aren't losslessly round-trippable)
- ci.checkpoint_robustness.no_check_resume: true

nemotron_super_v3_hellaswag (multi-node DP=32):
- ci.checkpoint_robustness.step_scheduler.global_batch_size: 64 (prior GBS wasn't divisible by DP=32)

Supersedes #1943, #1944, #1946, #1947, #1949.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* Apply suggestions from code review

Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>

---------

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
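The second commit's harness overrides can be pictured as a YAML fragment like the following. Key paths and values are taken from the commit message; the nesting around them is an assumption, not the recipe files' verified layout:

```yaml
# nemotron_nano_9b_squad{,_peft} overrides (sketch reconstructed from the commit message)
dist_env:
  timeout_minutes: 20          # was 1; the short timeout fired during slow Mamba init
ci:
  checkpoint_robustness:
    kl_threshold: 5e-3         # Phase 3 non-determinism from Mamba SSM state under FSDP all-reduce
    trust_remote_code: true
    no_check_resume: true      # Mamba resume is non-deterministic
```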
akoumpa
added a commit
that referenced
this pull request
Apr 22, 2026
…es 9 PRs) (1971)` into `r0.4.0` (#1979)

fix: batch ckpt-robustness fixes for pipeline 48953745 (supersedes 9 PRs) (#1971)

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Adil <47084919+adil-a@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
linnanwang
pushed a commit
that referenced
this pull request
Apr 24, 2026
…PRs) (#1971)

fix: unify hf_kl_threshold to 1e-1 for v5.5 transformers Phase 4 drift (pipeline 48953745)
fix: ckpt-robustness Phase 3 / resume / GBS-divisibility fixes for pipeline 48953745

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Summary

Bump ci.checkpoint_robustness.hf_kl_threshold for examples/llm_finetune/gemma/gemma_3_270m_squad_peft.yaml from 8e-3 -> 3.5e-2 to restore the gemma_3_270m_squad_peft checkpoint-robustness CI job that started failing after the transformers v5.5 upgrade (ci: Update to transformers v5.5 #1734). The elevated Phase 4 KL is forward-pass numerical drift (FSDP Gemma3ForCausalLM vs vanilla HF Gemma3ForCausalLM under v5.5's revised gemma3_text implementation), not a correctness regression. The PEFT variant composes a LoRA adapter on top of a freshly-loaded HF base in Phase 4, which follows the same Gemma3 v5.5 forward path as the non-PEFT sibling.

Evidence

Pre-fix, CI job 301287633:

Reproduction on cw-dfw 8xH100 with transformers 5.5.4, applying the same CI-launcher overrides (--step_scheduler.max_steps 5 --step_scheduler.ckpt_every_steps 5 --step_scheduler.val_every_steps 5 --step_scheduler.global_batch_size 32 --step_scheduler.local_batch_size 2 --peft.use_triton false):

Phase 3 KL is exactly 0, confirming the automodel save/reload path is bit-exact for the LoRA adapter. Phase 4 KL matches CI byte-for-byte (8.439951e-03). The bumped threshold (3.5e-2) also covers the worst case where the YAML is run with its default max_steps=100 (observed ~2.8e-2 on cw-dfw without the CI overrides).

Test plan

- Phase 4 passes under the new 3.5e-2 threshold (max per-token KL = 8.439951e-03).
- Phase 3 KL = 0 -- no regression in save/reload correctness.

Generated with Claude Code.
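For intuition, the quantity gated by these thresholds boils down to a max per-token KL divergence between two models' next-token distributions. A minimal, self-contained sketch — function names, shapes, and the toy logits below are illustrative, not the CI harness's actual API, which runs full model forwards:

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token's logit vector.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def per_token_kl(logits_p, logits_q):
    # KL(p || q) for a single token position.
    p = softmax(logits_p)
    q = softmax(logits_q)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def max_per_token_kl(seq_p, seq_q):
    # seq_*: list of per-token logit vectors; report the worst token.
    return max(per_token_kl(a, b) for a, b in zip(seq_p, seq_q))

# Toy logits standing in for two models' outputs on the same inputs.
base = [[0.1, 1.2, -0.4, 2.0], [0.5, -1.0, 0.3, 0.0]]
scaled = [[x * 1.001 for x in tok] for tok in base]  # small numerical drift

kl_same = max_per_token_kl(base, base)     # identical forward pass -> 0
kl_drift = max_per_token_kl(base, scaled)  # tiny but nonzero
print(kl_same, kl_drift)
```

Identical distributions give KL of exactly 0 (the Phase 3 bit-exact case), while small forward-pass drift gives a small positive KL — the regime the widened 3.5e-2 Phase 4 bound is meant to absorb without masking real save/reload bugs.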