fix: lower qwen3_moe_30b_lora local_batch_size to avoid CI OOM#1948

Closed

adil-a wants to merge 1 commit into main from adil-a/fix-48953745-qwen3-moe-30b-lora

Conversation

adil-a (Collaborator) commented Apr 21, 2026

Summary

  • The CI sft_ckpt_robustness job qwen3_moe_30b_lora (job 301287532, pipeline 48953745) was OOM-ing at step 39 of the finetune phase on rank 0, right after the first val + checkpoint save: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.17 GiB. GPU 0 has a total capacity of 79.11 GiB of which 9.75 GiB is free. The OOM cascaded into DeepEP CPU recv timeouts on the other 7 ranks, and no pytest ran ([finetune] Failed with exit code 1, skipping robustness test).
  • The root cause is activation-memory pressure at local_batch_size: 128 with activation_checkpointing: true on the 30B-A3B MoE + LoRA. Steady-state memory was 57.87 GiB on every rank; the post-save state on rank 0 left enough fragmentation that the next backward allocation (12.17 GiB) OOM'd.
  • Fix: lower local_batch_size: 128 -> 64 to match the passing sibling qwen3_moe_30b_hellaswag (SFT, local_batch_size: 64, mem 60.19 GiB in CI). global_batch_size stays at 1024; gradient accumulation (GA) becomes 2. The change is sketched below this list.
  • The STATUS.md 2026-04-02 note flagging Phase 3 KL=0.84 - broken PEFT checkpoint reload for Qwen3-MoE is stale: Phase 3 is now bit-exact (max KL = 0), so no adapter save/reload code change is needed here.
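
For concreteness, a minimal sketch of the change, assuming the keys live under the same step_scheduler block that the CLI overrides in the test plan target (the recipe's exact file path is not shown here):

```yaml
# qwen3_moe_30b_lora recipe (illustrative excerpt; surrounding keys elided)
step_scheduler:
  global_batch_size: 1024   # unchanged
  local_batch_size: 64      # was 128; gradient accumulation rises from 1 to 2
```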

Test plan

  • Byte-exact OOM repro on cw-dfw 8xH100 (transformers==5.5.4): [rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.17 GiB at step 39, matching CI.
  • Post-fix, the finetune phase runs cleanly through CI's max_steps=50 and beyond (steady mem 34.9 GiB, step 38 val+save OK, step 39 backward OK, second val+save at step 77 OK). Log: /tmp/verify_finetune.log on cw-dfw.
  • Checkpoint robustness pytest with CI overrides (--step_scheduler.max_steps 5 --step_scheduler.ckpt_every_steps 5 --step_scheduler.val_every_steps 5 --step_scheduler.global_batch_size 32 --step_scheduler.local_batch_size 2; the equivalent config is sketched after this list): [Phase 3] Automodel-from-consolidated max KL: 0.000000e+00 (threshold: 0.000000e+00), [Phase 4] HF-loaded max KL: 1.018360e-02 (threshold: 7.000000e-02), 1 passed, 48 warnings in 238.45s. Log: /tmp/verify_robust.log on cw-dfw.
  • Next nightly sft_ckpt_robustness qwen3_moe_30b_lora job passes.
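
Those overrides correspond to the following effective scheduler config, a sketch derived only from the CLI flags above (all other recipe keys are untouched):

```yaml
step_scheduler:
  max_steps: 5
  ckpt_every_steps: 5
  val_every_steps: 5
  global_batch_size: 32
  local_batch_size: 2
```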

🤖 Generated with Claude Code

The CI sft_ckpt_robustness finetune phase was running the config as-is
with local_batch_size: 128 and OOM-ing at step 39 on rank 0, right
after the first validation + checkpoint save (with max_steps=50 and
val_every_steps=50, the val lands at step 38 because the HellaSwag
train split exhausts there). Rank 0 reports `torch.OutOfMemoryError:
CUDA out of memory. Tried to allocate 12.17 GiB. GPU 0 has a total
capacity of 79.11 GiB of which 9.75 GiB is free.` and the failure
cascades into DeepEP CPU recv timeouts on the other seven ranks.
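
The scheduler knobs in play, as a sketch (the two values are from the
CI job; the comment restates the exhaustion behavior described above):

```yaml
step_scheduler:
  max_steps: 50
  val_every_steps: 50   # val would fire at step 50, but the HellaSwag train
                        # split exhausts at step 38 and triggers val + ckpt there
```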

Root cause is activation memory: with activation_checkpointing: true
and a LoRA-frozen base, the backward pass still materializes the full
batch-128 activations during recomputation, peaking at ~58 GiB
mid-step. The post-save state (consolidated safetensors staging on
rank 0) leaves enough fragmentation that the next backward allocation
fails.

Halving local_batch_size (128 -> 64, matching the passing sibling
qwen3_moe_30b_hellaswag SFT) drops steady-state memory from 57.87 GiB
to 34.91 GiB and keeps global_batch_size=1024 via GA=2. No other
knobs change.
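
The batch accounting, assuming the data-parallel degree is 8 across
the 8xH100 node (consistent with GA going from 1 to 2 when the local
batch halves):

```yaml
# global_batch_size = local_batch_size * dp_ranks * grad_accum
#   before: 128 * 8 * 1 = 1024   (~58 GiB steady state, OOM after first save)
#   after:   64 * 8 * 2 = 1024   (~35 GiB steady state)
step_scheduler:
  local_batch_size: 64
  global_batch_size: 1024
```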

Reproduced the original OOM byte-exact on cw-dfw 8xH100 with
transformers==5.5.4 (`[rank0]: torch.OutOfMemoryError: CUDA out of
memory. Tried to allocate 12.17 GiB` at step 39). Post-fix, the finetune
phase runs cleanly through step 50 and beyond (steady mem 34.9 GiB,
second val+save at step 77 survives). Checkpoint robustness pytest
(`--step_scheduler.max_steps 5 --local_batch_size 2 --global_batch_size
32`) reports `[Phase 3] max KL = 0.000000e+00`, `[Phase 4] max KL =
1.018360e-02 (threshold 7.000000e-02)`, `1 passed, 48 warnings in
238.45s`.

The STATUS.md note from 2026-04-02 flagging `Phase 3 KL=0.84 - broken
PEFT checkpoint reload for Qwen3-MoE` is stale: Phase 3 is now
bit-exact (max KL = 0), so no adapter save/reload code change is
needed here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
copy-pr-bot (Bot) commented Apr 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
