From c71d1ab8e94199f35195c836301cbc6a1106ad6b Mon Sep 17 00:00:00 2001
From: adil-a
Date: Tue, 21 Apr 2026 15:38:02 +0000
Subject: [PATCH] fix: nemotron_super_v3_hellaswag checkpoint robustness batch size

Phase 1 crashes in StepScheduler.__init__ with
`global_batch_size (32) must be divisible by local_batch_size * dp_size (2 * 32)`.

The CI robustness launcher hardcodes `--step_scheduler.local_batch_size 2`
and `--step_scheduler.global_batch_size 32`, but this config runs on
4 nodes x 8 GPUs (dp_size=32 with fsdp2, tp=1, cp=1, ep=32), so 32 is
not divisible by 2 * 32 = 64.

Override `step_scheduler.global_batch_size: 64` in
`ci.checkpoint_robustness`. The test harness appends
`ci.checkpoint_robustness` dotted-key entries to the argv tail after the
launcher-provided flags, so the YAML value wins. With gbs=64,
64 % (2 * 32) == 0 and grad-accum = 1.

Co-Authored-By: Claude Opus 4.7 (1M context)
Signed-off-by: adil-a
---
 examples/llm_finetune/nemotron/nemotron_super_v3_hellaswag.yaml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/examples/llm_finetune/nemotron/nemotron_super_v3_hellaswag.yaml b/examples/llm_finetune/nemotron/nemotron_super_v3_hellaswag.yaml
index 3c03bd6351..36ec56283f 100644
--- a/examples/llm_finetune/nemotron/nemotron_super_v3_hellaswag.yaml
+++ b/examples/llm_finetune/nemotron/nemotron_super_v3_hellaswag.yaml
@@ -111,5 +111,6 @@ ci:
     tokenizer_name: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
     hf_device_map_auto: true
     no_check_resume: true
+    step_scheduler.global_batch_size: 64
     dataset.num_samples_limit: 500
     validation_dataset.num_samples_limit: 500
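
Reviewer note: the batch-size arithmetic in the commit message can be checked with a small sketch. `grad_accum_steps` is a hypothetical helper written only for illustration; it is not the actual `StepScheduler` code. It assumes the check is `global_batch_size % (local_batch_size * dp_size) == 0`, as the error message suggests, and that the number of gradient-accumulation steps is the quotient.

```python
def grad_accum_steps(global_batch_size: int, local_batch_size: int, dp_size: int) -> int:
    """Hypothetical helper: return gradient-accumulation steps, or raise if sizes do not divide."""
    # Samples consumed by one micro-step across all data-parallel ranks.
    samples_per_micro_step = local_batch_size * dp_size
    if global_batch_size % samples_per_micro_step != 0:
        raise ValueError(
            f"global_batch_size ({global_batch_size}) must be divisible by "
            f"local_batch_size * dp_size ({local_batch_size} * {dp_size})"
        )
    return global_batch_size // samples_per_micro_step


# Values taken from the commit message: 4 nodes x 8 GPUs, local batch size 2.
dp_size = 4 * 8

# Launcher default (gbs=32) fails the check; the YAML override (gbs=64) passes.
try:
    grad_accum_steps(32, 2, dp_size)
    launcher_default_ok = True
except ValueError:
    launcher_default_ok = False

override_steps = grad_accum_steps(64, 2, dp_size)
```

With these inputs, `launcher_default_ok` is `False` and `override_steps` is `1`. This matches the crash and the fix described in the commit message.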