fix: make sft dynamic batch step time check more stable by terrykong · Pull Request #1265 · NVIDIA-NeMo/RL

terrykong · 2025-10-03T06:30:10Z

Failure seems to have been there since the beginning, usually the threshold is barely crossed

                                 Metric Checks                                  
┏━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Status ┃ Check                  ┃ Value              ┃ Message               ┃
┡━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━┩
│ PASS   │ data["train/loss"]["1… │ 0.5874428153038025 │                       │
│        │ < 0.6                  │                    │                       │
│ PASS   │ data["train/loss"]["2… │ 0.3165355920791626 │                       │
│        │ < 0.36                 │                    │                       │
│ PASS   │ max(data["ray/node.0.… │ 68.64453125        │                       │
│        │ < 70                   │                    │                       │
│ FAIL   │ mean(data["timing/tra… │ 10.07073718764217  │ mean(data["timing/tr… │
│        │ 2) < 10                │                    │ 2) < 10 (condition    │
│        │                        │                    │ evaluated to False)   │
└────────┴────────────────────────┴────────────────────┴───────────────────────┘

It's biased a little high because it also includes the checkpointing https://wandb.ai/nvidia/nemo-rl/panel/z7zystoxj?nw=rv5zmj1j76g&yAxisMax=13

The performance seems to be pretty high variance between commits, even the initial commit had a step time pretty close to 10

                                 Metric Checks                             
┏━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓┃ Status ┃ Check                                 ┃ Value             ┃ Message ┃┡━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ PASS   │ mean(data["timing/train/total_step_t… │ 9.983369773650265 │         │   │        │ 2) < 10                               │                   │         │   └────────┴───────────────────────────────────────┴───────────────────┴─────────┘

Averaging just the last steps seems to give a larger gap and fewer false positives

Summary by CodeRabbit

Tests
- Updated LLM dynamic-batch test to compute the mean of training total step time over a recent window (indices -6 to -1) instead of a single index.
- Metrics check now aggregates multiple recent samples for timing validation within the test suite.
- No changes to runtime behavior or user-facing functionality outside of the test environment.

Signed-off-by: Terry Kong <terryk@nvidia.com>

coderabbitai · 2025-10-03T06:32:32Z

📝 Walkthrough

Walkthrough

Updated a test script’s metric validation: the mean calculation for timing data now uses a trailing slice (-6 to -1) instead of a single index (2). No other logic or control flow changes.

Changes

Cohort / File(s)	Summary of Changes
LLM test dynamic batch metrics check `tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh`	Changed mean computation from a single-index lookup to a trailing-slice mean for `timing/train/total_step_time` (mean(..., -6, -1) vs mean(..., 2)).

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

fix: lower steps in smolvlm nightly test #1239 — Similar shift from single-index metric checks to trailing-slice mean in test validation logic.

Suggested labels

CI:L0

Suggested reviewers

ashors1
chtruong814

Pre-merge checks and finishing touches

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit's high-level summary is enabled.
Title Check	✅ Passed	The title clearly summarizes the primary change by stating that the SFT dynamic batch step time check is being made more stable, which directly aligns with the PR’s goal of reducing flakiness in the timing metric.
Docstring Coverage	✅ Passed	No functions found in the changes. Docstring coverage check skipped.
Test Results For Major Changes	✅ Passed	The PR only adjusts the averaging window in a test script and does not introduce new features, break existing APIs, or alter model behavior or performance; it is a minor change limited to test stability. According to the custom check, test results are only required for major numeric, convergence, or performance changes, so this adjustment passes as a minor change.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch tk/dynbatch-fix

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 557b7ec and a1cd1f5.

📒 Files selected for processing (1)

tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh (1 hunks)

🧰 Additional context used

📓 Path-based instructions (3)

**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.sh: Follow the Google Shell Style Guide for all shell scripts
Use uv run to execute Python scripts in shell/driver scripts instead of activating virtualenvs and calling python directly
Add the NVIDIA copyright header (with current year) at the top of all shell scripts, excluding tests/ and test-only scripts

Files:

tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh

tests/test_suites/llm/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

LLM driver script filenames must mirror the YAML base name and follow the same pattern with .sh extension

Files:

tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh

tests/test_suites/**

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Place driver shell scripts and common.env under tests/test_suites// and list nightly tests in tests/test_suites/nightly.txt

Files:

tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)

GitHub Check: build-container / main
GitHub Check: sphinx-build / Build docs
GitHub Check: Lint check
GitHub Check: Lint check
GitHub Check: Post submodule check comment / Comment on PR
GitHub Check: Post automodel integration comment / Comment on PR

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…1265) Signed-off-by: Terry Kong <terryk@nvidia.com>

…1265) Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>

fix: make sft dynamic batch step time check more stable

a1cd1f5

Signed-off-by: Terry Kong <terryk@nvidia.com>

terrykong requested a review from a team as a code owner October 3, 2025 06:30

terrykong requested review from ashors1 and chtruong814 and removed request for a team October 3, 2025 06:30

terrykong added the CI:docs Run doctest label Oct 3, 2025

terrykong enabled auto-merge (squash) October 3, 2025 06:31

terrykong temporarily deployed to nemo-ci October 3, 2025 06:31 — with GitHub Actions Inactive

terrykong temporarily deployed to nemo-ci October 3, 2025 06:32 — with GitHub Actions Inactive

terrykong added the r0.4.0 label Oct 5, 2025

chtruong814 approved these changes Oct 5, 2025

View reviewed changes

terrykong added CI:docs Run doctest and removed CI:docs Run doctest labels Oct 5, 2025

terrykong temporarily deployed to nemo-ci October 5, 2025 03:23 — with GitHub Actions Inactive

terrykong temporarily deployed to nemo-ci October 5, 2025 05:17 — with GitHub Actions Inactive

terrykong merged commit c68b4c2 into main Oct 5, 2025
68 of 70 checks passed

terrykong deleted the tk/dynbatch-fix branch October 5, 2025 06:31

coderabbitai Bot mentioned this pull request Oct 8, 2025

fix: more robust fp8 rollout metric check #1307

Merged

4 tasks

PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025

fix: make sft dynamic batch step time check more stable (NVIDIA-NeMo#…

287b45d

…1265) Signed-off-by: Terry Kong <terryk@nvidia.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: make sft dynamic batch step time check more stable#1265

fix: make sft dynamic batch step time check more stable#1265
terrykong merged 1 commit intomainfrom
tk/dynbatch-fix

terrykong commented Oct 3, 2025 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Oct 3, 2025

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

terrykong commented Oct 3, 2025 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Oct 3, 2025

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Pre-merge checks and finishing touches

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

terrykong commented Oct 3, 2025 •

edited by coderabbitai Bot

Loading