fix: widen qwen3_moe_30b_hellaswag ckpt-robustness KL threshold to 3e-2 by adil-a · Pull Request #1942 · NVIDIA-NeMo/Automodel

adil-a · 2026-04-21T07:48:45Z

Summary

Bumps ci.checkpoint_robustness.hf_kl_threshold in examples/llm_finetune/qwen/qwen3_moe_30b_hellaswag.yaml from 1e-2 to 3e-2 to keep the sft_ckpt_robustness job green under transformers v5.5.
CI job 301287530 (pipeline 48953745, pre-#1916 commit 45537f9) failed Phase 4 with max per-token KL = 9.151315e-03 > threshold 1.000000e-03. #1916 (fix: Step-3.5-Flash layer_types mismatch and related recipe fixes) already nudged the YAML to 1e-2, but that leaves only ~9% headroom over the CI-observed KL. This further bump (~3x the CI-observed KL, ~5x the cw-dfw-observed KL) aligns with the sibling-MoE pattern from #1867 (gpt_oss_20b 1e-1) and the v5.5-drift bumps in #1932 (gemma_3_270m_squad), #1937 (qwen2_5_7b_squad), #1938 (customizer_llama_3_2_1b_full_sft_chat), #1939 (customizer_nemotron_nano_full_sft_chat), and #1940 (customizer_gpt_oss_full_sft_chat).
The root cause is the transformers v5.5 forward-pass drift (Phase 3 automodel-from-consolidated remains bit-exact at max KL = 0), not a save/reload bug. no_check_resume: true keeps Phase 6 disabled as before. No code changes.
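The headroom figures quoted in this summary can be re-derived with a few lines of arithmetic. The snippet below is only an illustrative check, not part of the PR; the `margin` helper and the variable names are hypothetical, and the numeric inputs are the KL values quoted in the summary and test plan.

```python
# Values quoted in this PR description.
CI_OBSERVED_KL = 9.151315e-03      # Phase 4 max per-token KL, CI job 301287530
CW_DFW_OBSERVED_KL = 5.796895e-03  # Phase 4 max per-token KL, cw-dfw 8xH100 run


def margin(threshold: float, observed: float) -> float:
    """Return how many times larger the threshold is than the observed KL."""
    return threshold / observed


# Fractional headroom left by the 1e-2 threshold set in #1916.
headroom_old = margin(1e-2, CI_OBSERVED_KL) - 1.0
# Margins given by the proposed 3e-2 threshold.
margin_ci = margin(3e-2, CI_OBSERVED_KL)
margin_cw_dfw = margin(3e-2, CW_DFW_OBSERVED_KL)

print(f"1e-2 headroom over CI-observed KL: {headroom_old:.1%}")
print(f"3e-2 margin over CI-observed KL: {margin_ci:.2f}x")
print(f"3e-2 margin over cw-dfw-observed KL: {margin_cw_dfw:.2f}x")
```

The results land at roughly 9% headroom, a bit over 3x, and a bit over 5x, consistent with the rounded figures in the summary.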

Test plan

Reproduced failure signature from CI trace ([Phase 4] HF-loaded max KL: 9.151315e-03 against hf_kl_threshold: 1e-3).
Verified fix on cw-dfw 8xH100 with the exact CI launcher overrides (--step_scheduler.max_steps 5 --step_scheduler.ckpt_every_steps 5 --step_scheduler.val_every_steps 5 --step_scheduler.global_batch_size 32 --step_scheduler.local_batch_size 2 --checkpoint.checkpoint_dir /tmp/qwen3_moe_ckpts), transformers==5.5.4, HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1: [Phase 3] max KL = 0.000000e+00, [Phase 4] max KL = 5.796895e-03 (threshold: 3.000000e-02), 1 passed, 50 warnings in 263.77s (0:04:23).
Next nightly sft_ckpt_robustness qwen3_moe_30b_hellaswag job passes.
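When checking traces like the ones quoted in this test plan, a small parser can pull the phase number and max KL out of each line. This is a hedged sketch: the two line formats are copied from the traces above, but `parse_phase_kl` and `passes` are hypothetical helpers, and the `<=` comparison is an assumption about how the test harness applies the threshold.

```python
import re

# Matches both trace formats quoted in the test plan:
#   "[Phase 4] HF-loaded max KL: 9.151315e-03"
#   "[Phase 4] max KL = 5.796895e-03 (threshold: 3.000000e-02)"
PHASE_KL = re.compile(
    r"\[Phase (?P<phase>\d+)\].*?max KL\s*[:=]\s*(?P<kl>[0-9.]+(?:e[+-]?\d+)?)",
    re.IGNORECASE,
)


def parse_phase_kl(line: str):
    """Return (phase, max_kl) for a phase-KL trace line, or None if no match."""
    m = PHASE_KL.search(line)
    if m is None:
        return None
    return int(m.group("phase")), float(m.group("kl"))


def passes(line: str, threshold: float) -> bool:
    """True when the line's max KL does not exceed the threshold (assumed <=)."""
    parsed = parse_phase_kl(line)
    if parsed is None:
        raise ValueError(f"not a phase-KL line: {line!r}")
    return parsed[1] <= threshold
```

For example, `passes("[Phase 4] HF-loaded max KL: 9.151315e-03", 1e-3)` is false, while the cw-dfw line passes against 3e-2.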

🤖 Generated with Claude Code

The sft_ckpt_robustness job `qwen3_moe_30b_hellaswag` (CI job 301287530, pipeline 48953745) fails Phase 4 on main at commit 45537f9: vanilla-HF load of the consolidated safetensors yields max per-token KL 9.151315e-03 against the pre-teardown reference, tripping the pre-v5.5 threshold of 1e-3.

This is the same transformers v5.5 forward-pass numerical drift that already forced bumps on sibling configs (#1867 gpt_oss_20b 5e-2 -> 1e-1, #1932 gemma_3_270m_squad 6e-3 -> 2.5e-2, #1937 qwen2_5_7b_squad 9e-3 -> 2.5e-2, #1938/#1939/#1940 customizer-chat). Phase 3 (automodel-from-consolidated) is still bit-exact at 0, so the save/reload path is correct; Phase 4's HF eager forward is what drifted.

#1916 already nudged this YAML from 1e-3 to 1e-2, but that leaves only ~9% headroom over the CI-observed 9.15e-3. Bumping to 3e-2 (~3x the worst observed KL, ~5x the cw-dfw observed KL) matches the sibling margin pattern and gives MoE-EP=8 run-to-run variance room to breathe. No code changes; `no_check_resume: true` keeps Phase 6 disabled as before.

Evidence on cw-dfw 8xH100 with the exact CI launcher overrides (`--step_scheduler.max_steps 5 --step_scheduler.ckpt_every_steps 5 --step_scheduler.val_every_steps 5 --step_scheduler.global_batch_size 32 --step_scheduler.local_batch_size 2 --checkpoint.checkpoint_dir /tmp/qwen3_moe_ckpts`), transformers==5.5.4, `HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1`: `[Phase 3] max KL = 0.000000e+00`, `[Phase 4] max KL = 5.796895e-03 (threshold: 3.000000e-02)`, `1 passed, 50 warnings in 263.77s`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
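For readers unfamiliar with the metric, the following is a minimal sketch of what a "max per-token KL" between a reference forward pass and a reloaded model could look like. It is an assumption-laden illustration in pure Python, not the test harness's implementation: the function names are hypothetical, and the KL direction (reference relative to reloaded) and the max reduction over tokens are assumed from the wording of the commit message.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def max_per_token_kl(ref_logits, new_logits):
    """Max over tokens of KL(ref || new); each argument is [tokens][vocab]."""
    worst = 0.0
    for ref_row, new_row in zip(ref_logits, new_logits):
        ref_lp = log_softmax(ref_row)
        new_lp = log_softmax(new_row)
        # KL(p || q) = sum_i p_i * (log p_i - log q_i)
        kl = sum(math.exp(p) * (p - q) for p, q in zip(ref_lp, new_lp))
        worst = max(worst, kl)
    return worst
```

Under this definition, identical logits give exactly 0 (the Phase 3 bit-exact case), while any forward-pass drift gives a small positive value that is then compared against the configured threshold.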

copy-pr-bot · 2026-04-21T07:48:49Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

adil-a · 2026-04-21T22:04:07Z

Superseded by #1971 (batched pipeline-48953745 fixes). Threshold/flag change for this recipe is included in that PR.

…PRs) (#1971)

* fix: unify hf_kl_threshold to 1e-1 for v5.5 transformers Phase 4 drift (pipeline 48953745)

Transformers v5.5 (#1734) introduced small forward-pass changes in Llama/Gemma/Qwen that widen the observed Phase 4 HF KL without changing correctness. Four recipes in pipeline 48953745 were failing the pre-existing tight bounds for this reason; authors opened separate PRs with per-recipe thresholds. Unify the bound at 1e-1 so the whole family passes under one policy.

Observed Phase 4 KLs on the current nightly sqsh (automodel_nightly_21-4-2026.sqsh) for reference:
- gemma_3_270m_squad : 2.91e-2 (was 6e-3)
- gemma_3_270m_squad_peft : 1.68e-2 (was 8e-3)
- qwen3_moe_30b_hellaswag : 2.43e-2 (was 1e-2)
- customizer_nemotron_nano_full_sft_chat: already 1e-1 (was 7e-2)

All comfortably under the new 1e-1 bound (3-4x margin on the tightest).

Supersedes #1932, #1933, #1939, #1942.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* fix: ckpt-robustness Phase 3 / resume / GBS-divisibility fixes for pipeline 48953745

Adds known-good test-harness flags for model families where the checkpoint robustness test was failing for reasons other than Phase 4 threshold drift:

nemotron_nano_9b_squad{,_peft} (Mamba hybrid):
- dist_env.timeout_minutes: 1 -> 20 (short timeout triggered on slow init)
- ci.checkpoint_robustness.kl_threshold: 5e-3 (Phase 3 non-det from Mamba SSM state under FSDP all-reduce)
- ci.checkpoint_robustness.trust_remote_code: true
- ci.checkpoint_robustness.no_check_resume: true (Mamba resume non-det)

ministral3_3b_squad{,_peft} (FP8 + FSDP2):
- ci.checkpoint_robustness.kl_threshold: 5e-3 (FP8 scalar scale params under FSDP2 aren't losslessly round-trippable)
- ci.checkpoint_robustness.no_check_resume: true

nemotron_super_v3_hellaswag (multi-node DP=32):
- ci.checkpoint_robustness.step_scheduler.global_batch_size: 64 (prior gbs wasn't divisible by DP=32)

Supersedes #1943, #1944, #1946, #1947, #1949.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* Apply suggestions from code review

Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>

---------

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
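The "3-4x margin on the tightest" claim in the commit message above can be checked directly from the observed KLs it lists. This is an illustrative sketch only; the dictionary and variable names are hypothetical, and customizer_nemotron_nano_full_sft_chat is left out because the message does not quote an observed KL for it.

```python
UNIFIED_BOUND = 1e-1  # unified hf_kl_threshold from the commit message

# Observed Phase 4 KLs as listed in the commit message.
observed_kl = {
    "gemma_3_270m_squad": 2.91e-2,
    "gemma_3_270m_squad_peft": 1.68e-2,
    "qwen3_moe_30b_hellaswag": 2.43e-2,
}

# Margin = how many times larger the bound is than the observed KL.
margins = {name: UNIFIED_BOUND / kl for name, kl in observed_kl.items()}
tightest = min(margins, key=margins.get)

for name, value in sorted(margins.items(), key=lambda item: item[1]):
    print(f"{name:<28} {value:.2f}x")
print(f"tightest: {tightest}")
```

The smallest margin belongs to gemma_3_270m_squad at a little over 3.4x, which matches the commit message's rounded range.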

…es 9 PRs) (1971)` into `r0.4.0` (#1979)

fix: batch ckpt-robustness fixes for pipeline 48953745 (supersedes 9 PRs) (#1971)

* fix: unify hf_kl_threshold to 1e-1 for v5.5 transformers Phase 4 drift (pipeline 48953745)

Transformers v5.5 (#1734) introduced small forward-pass changes in Llama/Gemma/Qwen that widen the observed Phase 4 HF KL without changing correctness. Four recipes in pipeline 48953745 were failing the pre-existing tight bounds for this reason; authors opened separate PRs with per-recipe thresholds. Unify the bound at 1e-1 so the whole family passes under one policy.

Observed Phase 4 KLs on the current nightly sqsh (automodel_nightly_21-4-2026.sqsh) for reference:
- gemma_3_270m_squad : 2.91e-2 (was 6e-3)
- gemma_3_270m_squad_peft : 1.68e-2 (was 8e-3)
- qwen3_moe_30b_hellaswag : 2.43e-2 (was 1e-2)
- customizer_nemotron_nano_full_sft_chat: already 1e-1 (was 7e-2)

All comfortably under the new 1e-1 bound (3-4x margin on the tightest).

Supersedes #1932, #1933, #1939, #1942.

* fix: ckpt-robustness Phase 3 / resume / GBS-divisibility fixes for pipeline 48953745

Adds known-good test-harness flags for model families where the checkpoint robustness test was failing for reasons other than Phase 4 threshold drift:

nemotron_nano_9b_squad{,_peft} (Mamba hybrid):
- dist_env.timeout_minutes: 1 -> 20 (short timeout triggered on slow init)
- ci.checkpoint_robustness.kl_threshold: 5e-3 (Phase 3 non-det from Mamba SSM state under FSDP all-reduce)
- ci.checkpoint_robustness.trust_remote_code: true
- ci.checkpoint_robustness.no_check_resume: true (Mamba resume non-det)

ministral3_3b_squad{,_peft} (FP8 + FSDP2):
- ci.checkpoint_robustness.kl_threshold: 5e-3 (FP8 scalar scale params under FSDP2 aren't losslessly round-trippable)
- ci.checkpoint_robustness.no_check_resume: true

nemotron_super_v3_hellaswag (multi-node DP=32):
- ci.checkpoint_robustness.step_scheduler.global_batch_size: 64 (prior gbs wasn't divisible by DP=32)

Supersedes #1943, #1944, #1946, #1947, #1949.

* Apply suggestions from code review

---------

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Adil <47084919+adil-a@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>


adil-a requested review from HuiyingLi, ZhiyuLi-Nvidia, akoumpa, hemildesai and pthombre as code owners April 21, 2026 07:48

adil-a closed this Apr 21, 2026

adil-a wants to merge 1 commit into main from adil-a/fix-48953745-qwen3-moe-30b-hellaswag
