fix: lower qwen3_moe_30b_lora local_batch_size to avoid CI OOM#1948

Closed

adil-a wants to merge 1 commit into main from adil-a/fix-48953745-qwen3-moe-30b-lora

Conversation

adil-a (Collaborator) commented Apr 21, 2026

Summary

  • The CI sft_ckpt_robustness job qwen3_moe_30b_lora (job 301287532, pipeline 48953745) was OOM-ing at step 39 of the finetune phase on rank 0, right after the first val + checkpoint save: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.17 GiB. GPU 0 has a total capacity of 79.11 GiB of which 9.75 GiB is free. The OOM cascaded into DeepEP CPU recv timeouts on the other 7 ranks, and no pytest ran ([finetune] Failed with exit code 1, skipping robustness test).
  • The root cause is activation-memory pressure at local_batch_size: 128 with activation_checkpointing: true on the 30B-A3B MoE + LoRA. Steady-state memory was 57.87 GiB on every rank; the post-save state on rank 0 left enough fragmentation that the next backward allocation (12.17 GiB) OOM'd.
  • Fix: lower local_batch_size: 128 -> 64 to match the passing sibling qwen3_moe_30b_hellaswag (SFT, local_batch_size: 64, mem 60.19 GiB in CI). global_batch_size stays at 1024; gradient accumulation (GA) becomes 2. The change is sketched below this list.
  • The STATUS.md 2026-04-02 note flagging Phase 3 KL=0.84 - broken PEFT checkpoint reload for Qwen3-MoE is stale: Phase 3 is now bit-exact (max KL = 0), so no adapter save/reload code change is needed here.
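
For concreteness, a minimal sketch of the change, assuming the keys live under the same step_scheduler block that the CLI overrides in the test plan target (the recipe's exact file path is not shown here):

```yaml
# qwen3_moe_30b_lora recipe (illustrative excerpt; surrounding keys elided)
step_scheduler:
  global_batch_size: 1024   # unchanged
  local_batch_size: 64      # was 128; gradient accumulation rises from 1 to 2
```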

Test plan

  • Byte-exact OOM repro on cw-dfw 8xH100 (transformers==5.5.4): [rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.17 GiB at step 39, matching CI.
  • Post-fix, the finetune phase runs cleanly through CI's max_steps=50 and beyond (steady mem 34.9 GiB, step 38 val+save OK, step 39 backward OK, second val+save at step 77 OK). Log: /tmp/verify_finetune.log on cw-dfw.
  • Checkpoint robustness pytest with CI overrides (--step_scheduler.max_steps 5 --step_scheduler.ckpt_every_steps 5 --step_scheduler.val_every_steps 5 --step_scheduler.global_batch_size 32 --step_scheduler.local_batch_size 2; the equivalent config is sketched after this list): [Phase 3] Automodel-from-consolidated max KL: 0.000000e+00 (threshold: 0.000000e+00), [Phase 4] HF-loaded max KL: 1.018360e-02 (threshold: 7.000000e-02), 1 passed, 48 warnings in 238.45s. Log: /tmp/verify_robust.log on cw-dfw.
  • Next nightly sft_ckpt_robustness qwen3_moe_30b_lora job passes.
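
Those overrides correspond to the following effective scheduler config, a sketch derived only from the CLI flags above (all other recipe keys are untouched):

```yaml
step_scheduler:
  max_steps: 5
  ckpt_every_steps: 5
  val_every_steps: 5
  global_batch_size: 32
  local_batch_size: 2
```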

🤖 Generated with Claude Code

The CI sft_ckpt_robustness finetune phase was running the config as-is
with local_batch_size: 128 and OOM-ing at step 39 on rank 0, right
after the first validation + checkpoint save (with max_steps=50 and
val_every_steps=50, the val lands at step 38 because the HellaSwag
train split exhausts there). Rank 0 reports `torch.OutOfMemoryError:
CUDA out of memory. Tried to allocate 12.17 GiB. GPU 0 has a total
capacity of 79.11 GiB of which 9.75 GiB is free.` and the failure
cascades into DeepEP CPU recv timeouts on the other seven ranks.
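
The scheduler knobs in play, as a sketch (the two values are from the
CI job; the comment restates the exhaustion behavior described above):

```yaml
step_scheduler:
  max_steps: 50
  val_every_steps: 50   # val would fire at step 50, but the HellaSwag train
                        # split exhausts at step 38 and triggers val + ckpt there
```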

Root cause is activation memory: with activation_checkpointing: true
and a LoRA-frozen base, the backward pass still materializes the full
batch-128 activations during recomputation, peaking at ~58 GiB
mid-step. The post-save state (consolidated safetensors staging on
rank 0) leaves enough fragmentation that the next backward allocation
fails.

Halving local_batch_size (128 -> 64, matching the passing sibling
qwen3_moe_30b_hellaswag SFT) drops steady-state memory from 57.87 GiB
to 34.91 GiB and keeps global_batch_size=1024 via GA=2. No other
knobs change.
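
The batch accounting, assuming the data-parallel degree is 8 across
the 8xH100 node (consistent with GA going from 1 to 2 when the local
batch halves):

```yaml
# global_batch_size = local_batch_size * dp_ranks * grad_accum
#   before: 128 * 8 * 1 = 1024   (~58 GiB steady state, OOM after first save)
#   after:   64 * 8 * 2 = 1024   (~35 GiB steady state)
step_scheduler:
  local_batch_size: 64
  global_batch_size: 1024
```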

Reproduced the original OOM byte-exact on cw-dfw 8xH100 with
transformers==5.5.4 (`[rank0]: torch.OutOfMemoryError: CUDA out of
memory. Tried to allocate 12.17 GiB` at step 39). Post-fix, the finetune
phase runs cleanly through step 50 and beyond (steady mem 34.9 GiB,
second val+save at step 77 survives). Checkpoint robustness pytest
(`--step_scheduler.max_steps 5 --local_batch_size 2 --global_batch_size
32`) reports `[Phase 3] max KL = 0.000000e+00`, `[Phase 4] max KL =
1.018360e-02 (threshold 7.000000e-02)`, `1 passed, 48 warnings in
238.45s`.

The STATUS.md note from 2026-04-02 flagging `Phase 3 KL=0.84 - broken
PEFT checkpoint reload for Qwen3-MoE` is stale: Phase 3 is now
bit-exact (max KL = 0), so no adapter save/reload code change is
needed here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
copy-pr-bot (Bot) commented Apr 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
