Skip to content

fix: loosen sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh step time/loss check#1221

Merged
terrykong merged 2 commits intomainfrom
tk/step-time-tolerance
Sep 29, 2025
Merged

fix: loosen sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh step time/loss check#1221
terrykong merged 2 commits intomainfrom
tk/step-time-tolerance

Conversation

@terrykong
Copy link
Copy Markdown
Collaborator

@terrykong terrykong commented Sep 28, 2025

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Step time threshold is a little too strict on this test:

image

https://wandb.ai/nvidia/nemo-rl/panel/z7zystoxj?nw=5kdn9ueaulf

Even on the first commit that introduced this test, the timing check can fail


The loss issue was bisected in #1223. tldr; the num_workers change affected the dataloading very slightly that caused the strict loss check to fail

image

https://wandb.ai/nvidia/nemo-rl?nw=fnhia71y43d

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • Tests
    • Relaxed the performance threshold for LLM training step-time checks to better match observed timing variability, reducing flaky failures and improving CI reliability.
    • Maintains performance monitoring while minimizing false negatives across environments.
    • No impact on product functionality; end-user features and behavior remain unchanged.

Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong requested a review from ashors1 September 28, 2025 17:03
@terrykong terrykong requested a review from a team as a code owner September 28, 2025 17:03
@terrykong terrykong added the CI:docs Run doctest label Sep 28, 2025
@terrykong terrykong enabled auto-merge (squash) September 28, 2025 17:03
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Sep 28, 2025

📝 Walkthrough

Walkthrough

Updated a performance assertion in a single LLM test script to relax the mean train total_step_time threshold from <0.6 to <0.7, and added a comment documenting observed timing range.

Changes

Cohort / File(s) Summary
LLM test thresholds
tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh
Relaxed assertion threshold for mean train total_step_time from <0.6 to <0.7; added comment noting observed 0.6–0.64 timings.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Pre-merge checks and finishing touches

✅ Passed checks (4 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
Test Results For Major Changes ✅ Passed Reviewing the PR summary shows it only loosens a numeric threshold and updates a related comment inside a test script, without introducing new features, refactors, or other wide-ranging impacts, so the change qualifies as minor under the custom check criteria. Because minor changes do not require documented test results, the absence of additional testing information in the PR description is acceptable and the check therefore passes.
Title Check ✅ Passed The title succinctly conveys the primary change—loosening the step time (and loss) threshold in the specified sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh test—making it clear what was fixed and where without unnecessary detail.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch tk/step-time-tolerance

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong changed the title fix: loosen sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh step time check fix: loosen sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh step time/loss check Sep 29, 2025
@terrykong terrykong added CI:docs Run doctest and removed CI:docs Run doctest labels Sep 29, 2025
@terrykong terrykong enabled auto-merge (squash) September 29, 2025 18:19
@terrykong terrykong merged commit b445a3a into main Sep 29, 2025
42 of 43 checks passed
@terrykong terrykong deleted the tk/step-time-tolerance branch September 29, 2025 18:24
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
…VIDIA-NeMo#1221)

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:docs Run doctest

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants