From 93ac3e64de99931f20b468094d5075993916d540 Mon Sep 17 00:00:00 2001
From: adil-a
Date: Tue, 21 Apr 2026 15:28:55 +0000
Subject: [PATCH] fix: lower qwen3_moe_30b_lora local_batch_size to 64 to avoid CI OOM

The CI sft_ckpt_robustness finetune phase was running the config as-is with
local_batch_size: 128 and OOMing at step 39 on rank 0, right after the first
validation + checkpoint save (with max_steps=50 and val_every_steps=50, the
first val lands at step 38, where the HellaSwag train split exhausts). Rank 0
reports `torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.17 GiB. GPU 0 has a total capacity of 79.11 GiB of which 9.75 GiB is free.`
and the failure cascades into DeepEP CPU recv timeouts on the other ranks.

Root cause is activation memory: with activation_checkpointing: true and the
base model frozen under LoRA, the backward pass still materializes activations
for the full batch of 128 during recomputation, peaking at ~58 GiB mid-step.
The post-save state (consolidated safetensors staging on rank 0) leaves enough
allocator fragmentation that the next backward allocation fails.

Halving local_batch_size (128 -> 64, matching the passing sibling
qwen3_moe_30b_hellaswag SFT config) drops steady-state memory from 57.87 GiB
to 34.91 GiB and keeps global_batch_size=1024 via gradient accumulation
(GA=2). No other knobs change.

Reproduced the original OOM verbatim on cw-dfw 8xH100 with transformers==5.5.4
(`[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.17 GiB`
at step 39). Post-fix, the finetune phase runs cleanly through step 50 and
beyond (steady-state memory 34.9 GiB; the second val + save at step 77
survives).

The checkpoint robustness pytest (`--step_scheduler.max_steps 5
--local_batch_size 2 --global_batch_size 32`) reports
`[Phase 3] max KL = 0.000000e+00`,
`[Phase 4] max KL = 1.018360e-02 (threshold 7.000000e-02)`, and
`1 passed, 48 warnings in 238.45s`. The STATUS.md note from 2026-04-02
flagging `Phase 3 KL=0.84 - broken PEFT checkpoint reload for Qwen3-MoE` is
stale: Phase 3 now returns KL = 0 (bit-exact reload), so no adapter
save/reload code change is needed here.

Co-Authored-By: Claude Opus 4.7 (1M context)
Signed-off-by: adil-a
---
 examples/llm_finetune/qwen/qwen3_moe_30b_lora.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/llm_finetune/qwen/qwen3_moe_30b_lora.yaml b/examples/llm_finetune/qwen/qwen3_moe_30b_lora.yaml
index 69eb5497ec..4d701eef03 100644
--- a/examples/llm_finetune/qwen/qwen3_moe_30b_lora.yaml
+++ b/examples/llm_finetune/qwen/qwen3_moe_30b_lora.yaml
@@ -15,7 +15,7 @@ recipe: TrainFinetuneRecipeForNextTokenPrediction
 step_scheduler:
   global_batch_size: 1024
-  local_batch_size: 128
+  local_batch_size: 64
   ckpt_every_steps: 100
   val_every_steps: 50
   num_epochs: 2
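
Reviewer note, not part of the patch: a minimal sketch of the batch-size
arithmetic the fix relies on, assuming 8 data-parallel ranks (one per H100 on
the cw-dfw node). `grad_accum_steps` is an illustrative helper name, not a
function in this repo.

# Illustrative only: global_batch_size = local_batch_size * dp_ranks * GA,
# so halving local_batch_size doubles the gradient-accumulation factor.
def grad_accum_steps(global_batch_size: int, local_batch_size: int, dp_ranks: int) -> int:
    assert global_batch_size % (local_batch_size * dp_ranks) == 0
    return global_batch_size // (local_batch_size * dp_ranks)

print(grad_accum_steps(1024, 128, 8))  # 1 -> old config, ~58 GiB peak, OOMs
print(grad_accum_steps(1024, 64, 8))   # 2 -> new config, ~35 GiB steady state

Either way the optimizer still sees a global batch of 1024 per step, so the
change should affect memory only, not training dynamics (modulo gradient
accumulation numerics).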