fix: loosen sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh step time/loss check by terrykong · Pull Request #1221 · NVIDIA-NeMo/RL

terrykong · 2025-09-28T17:03:00Z

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Step time threshold is a little too strict on this test:

https://wandb.ai/nvidia/nemo-rl/panel/z7zystoxj?nw=5kdn9ueaulf

Even on the first commit that introduced this test, the timing check can fail

The loss issue was bisected in #1223. tldr; the num_workers change affected the dataloading very slightly that caused the strict loss check to fail

https://wandb.ai/nvidia/nemo-rl?nw=fnhia71y43d

Issues

List issues that this PR closes (syntax):

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...

Summary by CodeRabbit

Tests
- Relaxed the performance threshold for LLM training step-time checks to better match observed timing variability, reducing flaky failures and improving CI reliability.
- Maintains performance monitoring while minimizing false negatives across environments.
- No impact on product functionality; end-user features and behavior remain unchanged.

Signed-off-by: Terry Kong <terryk@nvidia.com>

coderabbitai · 2025-09-28T17:06:41Z

📝 Walkthrough

Walkthrough

Updated a performance assertion in a single LLM test script to relax the mean train total_step_time threshold from <0.6 to <0.7, and added a comment documenting observed timing range.

Changes

Cohort / File(s)	Summary
LLM test thresholds `tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh`	Relaxed assertion threshold for mean train total_step_time from <0.6 to <0.7; added comment noting observed 0.6–0.64 timings.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Pre-merge checks and finishing touches

✅ Passed checks (4 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	No functions found in the changes. Docstring coverage check skipped.
Test Results For Major Changes	✅ Passed	Reviewing the PR summary shows it only loosens a numeric threshold and updates a related comment inside a test script, without introducing new features, refactors, or other wide-ranging impacts, so the change qualifies as minor under the custom check criteria. Because minor changes do not require documented test results, the absence of additional testing information in the PR description is acceptable and the check therefore passes.
Title Check	✅ Passed	The title succinctly conveys the primary change—loosening the step time (and loss) threshold in the specified sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh test—making it clear what was fixed and where without unnecessary detail.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

✨ Finishing touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch tk/step-time-tolerance

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Signed-off-by: Terry Kong <terryk@nvidia.com>

…VIDIA-NeMo#1221) Signed-off-by: Terry Kong <terryk@nvidia.com>

…VIDIA-NeMo#1221) Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>

fix: loosen sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh step time check

b9e21d1

Signed-off-by: Terry Kong <terryk@nvidia.com>

terrykong requested a review from ashors1 September 28, 2025 17:03

terrykong requested a review from a team as a code owner September 28, 2025 17:03

terrykong added the CI:docs Run doctest label Sep 28, 2025

terrykong temporarily deployed to nemo-ci September 28, 2025 17:03 — with GitHub Actions Inactive

terrykong enabled auto-merge (squash) September 28, 2025 17:03

terrykong temporarily deployed to nemo-ci September 28, 2025 17:19 — with GitHub Actions Inactive

terrykong disabled auto-merge September 29, 2025 03:24

fix the train loss after num_workers=1 change

32e0890

Signed-off-by: Terry Kong <terryk@nvidia.com>

terrykong changed the title ~~fix: loosen sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh step time check~~ fix: loosen sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh step time/loss check Sep 29, 2025

ashors1 approved these changes Sep 29, 2025

View reviewed changes

terrykong added CI:docs Run doctest and removed CI:docs Run doctest labels Sep 29, 2025

terrykong enabled auto-merge (squash) September 29, 2025 18:19

terrykong temporarily deployed to nemo-ci September 29, 2025 18:19 — with GitHub Actions Inactive

terrykong temporarily deployed to nemo-ci September 29, 2025 18:20 — with GitHub Actions Inactive

terrykong merged commit b445a3a into main Sep 29, 2025
42 of 43 checks passed

terrykong deleted the tk/step-time-tolerance branch September 29, 2025 18:24

terrykong mentioned this pull request Oct 4, 2025

fix: qwen32 nightly metric check more stable #1271

Merged

PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025

fix: loosen sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh step time/loss check (N…

388b732

…VIDIA-NeMo#1221) Signed-off-by: Terry Kong <terryk@nvidia.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: loosen sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh step time/loss check#1221

fix: loosen sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh step time/loss check#1221
terrykong merged 2 commits intomainfrom
tk/step-time-tolerance

terrykong commented Sep 28, 2025 •

edited

Loading

Uh oh!

coderabbitai Bot commented Sep 28, 2025 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

terrykong commented Sep 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do ?

Issues

Usage

Before your PR is "Ready for review"

Additional Information

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Sep 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

terrykong commented Sep 28, 2025 •

edited

Loading

coderabbitai Bot commented Sep 28, 2025 •

edited

Loading