From c71d1ab8e94199f35195c836301cbc6a1106ad6b Mon Sep 17 00:00:00 2001
From: adil-a <adil.asif2000@hotmail.com>
Date: Tue, 21 Apr 2026 15:38:02 +0000
Subject: [PATCH] fix: nemotron_super_v3_hellaswag checkpoint robustness batch
 size

Phase 1 crashes in StepScheduler.__init__ with
`global_batch_size (32) must be divisible by local_batch_size * dp_size
(2 * 32)`. The CI robustness launcher hardcodes
`--step_scheduler.local_batch_size 2` and `--step_scheduler.global_batch_size
32`, but this config runs on 4 nodes x 8 GPUs (dp_size=32 with fsdp2, tp=1,
cp=1, ep=32), so 32 is not divisible by 2 * 32 = 64.
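
For illustration, the failing check can be reproduced with a minimal
sketch. `grad_accum_steps` is a hypothetical helper, not the actual
StepScheduler code; it only assumes the divisibility rule quoted in the
error message above.

```python
def grad_accum_steps(global_batch_size: int, local_batch_size: int, dp_size: int) -> int:
    """Hypothetical stand-in for the divisibility check described above."""
    samples_per_step = local_batch_size * dp_size
    if global_batch_size % samples_per_step != 0:
        raise ValueError(
            f"global_batch_size ({global_batch_size}) must be divisible by "
            f"local_batch_size * dp_size ({local_batch_size} * {dp_size})"
        )
    # Number of micro-steps accumulated per optimizer step.
    return global_batch_size // samples_per_step


# Launcher defaults on 4 nodes x 8 GPUs: 32 % (2 * 32) != 0, so this raises.
try:
    grad_accum_steps(global_batch_size=32, local_batch_size=2, dp_size=32)
    launcher_defaults_ok = True
except ValueError:
    launcher_defaults_ok = False

# With the override from this patch: 64 % (2 * 32) == 0.
steps_with_override = grad_accum_steps(global_batch_size=64, local_batch_size=2, dp_size=32)
```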

Override `step_scheduler.global_batch_size: 64` in `ci.checkpoint_robustness`.
The test harness appends `ci.checkpoint_robustness` dotted-key entries to
the argv tail after the launcher-provided flags, so the YAML value wins.
With global_batch_size=64, 64 % (2 * 32) == 0, so the check passes with
64 / 64 = 1 gradient-accumulation step.
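
The override order can be sketched as follows. This assumes the argument
parser keeps the last value seen for a repeated key; `parse_overrides`
and the two flag lists are hypothetical and only model the behaviour
described above, not the real harness.

```python
def parse_overrides(argv: list[str]) -> dict[str, str]:
    """Hypothetical last-wins parser for `--dotted.key value` pairs."""
    overrides: dict[str, str] = {}
    tokens = iter(argv)
    for token in tokens:
        if token.startswith("--"):
            # A later occurrence of the same key replaces the earlier one.
            overrides[token[2:]] = next(tokens)
    return overrides


# Flags hardcoded by the CI robustness launcher.
launcher_flags = [
    "--step_scheduler.local_batch_size", "2",
    "--step_scheduler.global_batch_size", "32",
]
# Entries from `ci.checkpoint_robustness`, appended to the argv tail.
yaml_tail = ["--step_scheduler.global_batch_size", "64"]

merged = parse_overrides(launcher_flags + yaml_tail)
```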

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
---
 examples/llm_finetune/nemotron/nemotron_super_v3_hellaswag.yaml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/examples/llm_finetune/nemotron/nemotron_super_v3_hellaswag.yaml b/examples/llm_finetune/nemotron/nemotron_super_v3_hellaswag.yaml
index 3c03bd6351..36ec56283f 100644
--- a/examples/llm_finetune/nemotron/nemotron_super_v3_hellaswag.yaml
+++ b/examples/llm_finetune/nemotron/nemotron_super_v3_hellaswag.yaml
@@ -111,5 +111,6 @@ ci:
     tokenizer_name: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
     hf_device_map_auto: true
     no_check_resume: true
+    step_scheduler.global_batch_size: 64
     dataset.num_samples_limit: 500
     validation_dataset.num_samples_limit: 500