fix: lower qwen3_moe_30b_lora local_batch_size to avoid CI OOM #1948
Closed
Conversation
The CI sft_ckpt_robustness finetune phase was running the config as-is with local_batch_size: 128 and OOM-ing at step 39 on rank 0, right after the first validation + checkpoint save (max_steps=50, val_every_steps=50; the val lands at step 38 because the HellaSwag train split exhausts there). Rank 0 reports `torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.17 GiB. GPU 0 has a total capacity of 79.11 GiB of which 9.75 GiB is free.` and cascades into DeepEP CPU recv timeouts on the other ranks.

Root cause is activation memory: with activation_checkpointing: true and the base model frozen under LoRA, the backward pass still materializes full-batch-128 activations for recomputation, peaking at ~58 GiB mid-step. The post-save state (consolidated safetensors staging on rank 0) leaves enough fragmentation that the next backward allocation fails.

Halving local_batch_size (128 -> 64, matching the passing sibling qwen3_moe_30b_hellaswag SFT config) drops steady-state memory from 57.87 GiB to 34.91 GiB and keeps global_batch_size=1024 via gradient accumulation (GA=2). No other knobs change.

Reproduced the original OOM, with an identical allocator message, on cw-dfw 8xH100 with transformers==5.5.4 (`[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.17 GiB` at step 39). Post-fix, the finetune phase runs cleanly through step 50 and beyond (steady mem 34.9 GiB; the second val+save at step 77 survives). Checkpoint robustness pytest (`--step_scheduler.max_steps 5 --local_batch_size 2 --global_batch_size 32`) reports `[Phase 3] max KL = 0.000000e+00`, `[Phase 4] max KL = 1.018360e-02 (threshold 7.000000e-02)`, `1 passed, 48 warnings in 238.45s`.

The STATUS.md note from 2026-04-02 flagging `Phase 3 KL=0.84 - broken PEFT checkpoint reload for Qwen3-MoE` is stale: Phase 3 now returns KL = 0 (bit-exact), so no adapter save/reload code change is needed here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
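The root-cause claim in the message above (activation checkpointing does not decouple backward-pass peak memory from batch size) can be demonstrated in isolation. A minimal PyTorch sketch with arbitrary toy shapes, not the project's model code:

```python
import torch
from torch.utils.checkpoint import checkpoint

# An MLP-ish block with frozen weights, mimicking a LoRA-frozen base layer
# (not project code). Freezing weights does not shrink the activations that
# the backward pass needs.
block = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.GELU())
for p in block.parameters():
    p.requires_grad_(False)

x = torch.randn(64, 512, 1024, requires_grad=True)  # (batch, seq, hidden)

# Checkpointing frees the block's inner activations after the forward pass...
y = checkpoint(block, x, use_reentrant=False)

# ...but backward re-runs the block's forward to recompute them, so the
# recomputation peak still scales linearly with the batch dimension.
y.sum().backward()
```

Doubling the batch here doubles the recomputation peak, which is why halving local_batch_size is the direct lever even though the trainable LoRA parameter count is tiny.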
Summary
- The `sft_ckpt_robustness` job `qwen3_moe_30b_lora` (job 301287532, pipeline 48953745) was OOM-ing at step 39 in the finetune phase on rank 0, right after the first val + checkpoint save: `torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.17 GiB. GPU 0 has a total capacity of 79.11 GiB of which 9.75 GiB is free`, which cascaded into DeepEP CPU recv timeouts on the other 7 ranks. No pytest ran (`[finetune] Failed with exit code 1, skipping robustness test`).
- Root cause: `local_batch_size: 128` with `activation_checkpointing: true` on a 30B-A3B MoE + LoRA. Steady-state mem was 57.87 GiB on every rank; the post-save rank-0 state left enough fragmentation that the next backward alloc (12.17 GiB) OOM'd.
- Fix: `local_batch_size: 128 -> 64`, matching the passing sibling `qwen3_moe_30b_hellaswag` (SFT, `local_batch_size: 64`, mem 60.19 GiB in CI). `global_batch_size` stays 1024; GA becomes 2 (the arithmetic is spelled out in the sketch after this list). Steady-state mem drops to 34.91 GiB; the step 38 val+save and step 39 backward both pass.
- The `STATUS.md` 2026-04-02 note flagging `Phase 3 KL=0.84 - broken PEFT checkpoint reload for Qwen3-MoE` is stale: Phase 3 is now bit-exact (max KL = 0), so no adapter save/reload code change is needed here.
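The GA bookkeeping, as a quick sanity check (a sketch, not framework code; the 8-rank world size is the CI node's 8xH100, and the variable names are mine):

```python
# Gradient-accumulation arithmetic for the values quoted above (not project code).
WORLD_SIZE = 8            # 8xH100 CI node
GLOBAL_BATCH_SIZE = 1024  # unchanged by this fix

def grad_accum_steps(local_batch_size: int) -> int:
    """Accumulation steps needed to preserve the global batch size."""
    per_optimizer_step = local_batch_size * WORLD_SIZE
    assert GLOBAL_BATCH_SIZE % per_optimizer_step == 0
    return GLOBAL_BATCH_SIZE // per_optimizer_step

print(grad_accum_steps(128))  # 1 -> old config, ~57.9 GiB steady state, OOMs
print(grad_accum_steps(64))   # 2 -> new config, ~34.9 GiB steady state, passes
```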
Test plan
- Reproduced the original OOM on cw-dfw 8xH100: `[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.17 GiB` at step 39, matching CI.
- Post-fix, the finetune phase runs cleanly through `max_steps=50` and beyond (steady mem 34.9 GiB, step 38 val+save OK, step 39 backward OK, second val+save at step 77 OK). Log: `/tmp/verify_finetune.log` on cw-dfw.
- Checkpoint robustness pytest (`--step_scheduler.max_steps 5 --step_scheduler.ckpt_every_steps 5 --step_scheduler.val_every_steps 5 --step_scheduler.global_batch_size 32 --step_scheduler.local_batch_size 2`): `[Phase 3] Automodel-from-consolidated max KL: 0.000000e+00 (threshold: 0.000000e+00)`, `[Phase 4] HF-loaded max KL: 1.018360e-02 (threshold: 7.000000e-02)`, `1 passed, 48 warnings in 238.45s`. Log: `/tmp/verify_robust.log` on cw-dfw.
- CI: the `sft_ckpt_robustness` `qwen3_moe_30b_lora` job passes.

🤖 Generated with Claude Code