From 93ac3e64de99931f20b468094d5075993916d540 Mon Sep 17 00:00:00 2001
From: adil-a
Date: Tue, 21 Apr 2026 15:28:55 +0000
Subject: [PATCH] fix: lower qwen3_moe_30b_lora local_batch_size to 64 to avoid CI OOM

The CI sft_ckpt_robustness finetune phase was running the config as-is with
local_batch_size: 128 and OOMing at step 39 on rank 0, right after the first
validation + checkpoint save (with max_steps=50 and val_every_steps=50, the
first val lands at step 38, where the HellaSwag train split exhausts). Rank 0
reports `torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.17 GiB. GPU 0 has a total capacity of 79.11 GiB of which 9.75 GiB is free.`
and the failure cascades into DeepEP CPU recv timeouts on the other ranks.

Root cause is activation memory: with activation_checkpointing: true and the
base model frozen under LoRA, the backward pass still materializes activations
for the full batch of 128 during recomputation, peaking at ~58 GiB mid-step.
The post-save state (consolidated safetensors staging on rank 0) leaves enough
allocator fragmentation that the next backward allocation fails.

Halving local_batch_size (128 -> 64, matching the passing sibling
qwen3_moe_30b_hellaswag SFT config) drops steady-state memory from 57.87 GiB
to 34.91 GiB and keeps global_batch_size=1024 via gradient accumulation
(GA=2). No other knobs change.

Reproduced the original OOM verbatim on cw-dfw 8xH100 with transformers==5.5.4
(`[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.17 GiB`
at step 39). Post-fix, the finetune phase runs cleanly through step 50 and
beyond (steady-state memory 34.9 GiB; the second val + save at step 77
survives).

The checkpoint robustness pytest (`--step_scheduler.max_steps 5
--local_batch_size 2 --global_batch_size 32`) reports
`[Phase 3] max KL = 0.000000e+00`,
`[Phase 4] max KL = 1.018360e-02 (threshold 7.000000e-02)`, and
`1 passed, 48 warnings in 238.45s`. The STATUS.md note from 2026-04-02
flagging `Phase 3 KL=0.84 - broken PEFT checkpoint reload for Qwen3-MoE` is
stale: Phase 3 now returns KL = 0 (bit-exact reload), so no adapter
save/reload code change is needed here.

Co-Authored-By: Claude Opus 4.7 (1M context)
Signed-off-by: adil-a
---
 examples/llm_finetune/qwen/qwen3_moe_30b_lora.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/llm_finetune/qwen/qwen3_moe_30b_lora.yaml b/examples/llm_finetune/qwen/qwen3_moe_30b_lora.yaml
index 69eb5497ec..4d701eef03 100644
--- a/examples/llm_finetune/qwen/qwen3_moe_30b_lora.yaml
+++ b/examples/llm_finetune/qwen/qwen3_moe_30b_lora.yaml
@@ -15,7 +15,7 @@ recipe: TrainFinetuneRecipeForNextTokenPrediction
 step_scheduler:
   global_batch_size: 1024
-  local_batch_size: 128
+  local_batch_size: 64
   ckpt_every_steps: 100
   val_every_steps: 50
   num_epochs: 2
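
Reviewer note, not part of the patch: a minimal sketch of the batch-size
arithmetic the fix relies on, assuming 8 data-parallel ranks (one per H100 on
the cw-dfw node). `grad_accum_steps` is an illustrative helper name, not a
function in this repo.

# Illustrative only: global_batch_size = local_batch_size * dp_ranks * GA,
# so halving local_batch_size doubles the gradient-accumulation factor.
def grad_accum_steps(global_batch_size: int, local_batch_size: int, dp_ranks: int) -> int:
    assert global_batch_size % (local_batch_size * dp_ranks) == 0
    return global_batch_size // (local_batch_size * dp_ranks)

print(grad_accum_steps(1024, 128, 8))  # 1 -> old config, ~58 GiB peak, OOMs
print(grad_accum_steps(1024, 64, 8))   # 2 -> new config, ~35 GiB steady state

Either way the optimizer still sees a global batch of 1024 per step, so the
change should affect memory only, not training dynamics (modulo gradient
accumulation numerics).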