Skip to content

fix: make sft dynamic batch step time check more stable#1265

Merged
terrykong merged 1 commit intomainfrom
tk/dynbatch-fix
Oct 5, 2025
Merged

fix: make sft dynamic batch step time check more stable#1265
terrykong merged 1 commit intomainfrom
tk/dynbatch-fix

Conversation

@terrykong
Copy link
Copy Markdown
Collaborator

@terrykong terrykong commented Oct 3, 2025

Failure seems to have been there since the beginning, usually the threshold is barely crossed

                                 Metric Checks                                  
┏━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Status ┃ Check                  ┃ Value              ┃ Message               ┃
┡━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━┩
│ PASS   │ data["train/loss"]["1… │ 0.5874428153038025 │                       │
│        │ < 0.6                  │                    │                       │
│ PASS   │ data["train/loss"]["2… │ 0.3165355920791626 │                       │
│        │ < 0.36                 │                    │                       │
│ PASS   │ max(data["ray/node.0.… │ 68.64453125        │                       │
│        │ < 70                   │                    │                       │
│ FAIL   │ mean(data["timing/tra… │ 10.07073718764217  │ mean(data["timing/tr… │
│        │ 2) < 10                │                    │ 2) < 10 (condition    │
│        │                        │                    │ evaluated to False)   │
└────────┴────────────────────────┴────────────────────┴───────────────────────┘

It's biased a little high because it also includes the checkpointing https://wandb.ai/nvidia/nemo-rl/panel/z7zystoxj?nw=rv5zmj1j76g&yAxisMax=13

image

The performance seems to be pretty high variance between commits, even the initial commit had a step time pretty close to 10

                                 Metric Checks                             
┏━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓┃ Status ┃ Check                                 ┃ Value             ┃ Message ┃┡━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ PASS   │ mean(data["timing/train/total_step_t… │ 9.983369773650265 │         │   │        │ 2) < 10                               │                   │         │   └────────┴───────────────────────────────────────┴───────────────────┴─────────┘

Averaging just the last steps seems to give a larger gap and fewer false positives

Summary by CodeRabbit

  • Tests
    • Updated LLM dynamic-batch test to compute the mean of training total step time over a recent window (indices -6 to -1) instead of a single index.
    • Metrics check now aggregates multiple recent samples for timing validation within the test suite.
    • No changes to runtime behavior or user-facing functionality outside of the test environment.

Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong requested a review from a team as a code owner October 3, 2025 06:30
@terrykong terrykong requested review from ashors1 and chtruong814 and removed request for a team October 3, 2025 06:30
@terrykong terrykong added the CI:docs Run doctest label Oct 3, 2025
@terrykong terrykong enabled auto-merge (squash) October 3, 2025 06:31
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Oct 3, 2025

📝 Walkthrough

Walkthrough

Updated a test script’s metric validation: the mean calculation for timing data now uses a trailing slice (-6 to -1) instead of a single index (2). No other logic or control flow changes.

Changes

Cohort / File(s) Summary of Changes
LLM test dynamic batch metrics check
tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh
Changed mean computation from a single-index lookup to a trailing-slice mean for timing/train/total_step_time (mean(..., -6, -1) vs mean(..., 2)).

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

Suggested labels

CI:L0

Suggested reviewers

  • ashors1
  • chtruong814

Pre-merge checks and finishing touches

✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit's high-level summary is enabled.
Title Check ✅ Passed The title clearly summarizes the primary change by stating that the SFT dynamic batch step time check is being made more stable, which directly aligns with the PR’s goal of reducing flakiness in the timing metric.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
Test Results For Major Changes ✅ Passed The PR only adjusts the averaging window in a test script and does not introduce new features, break existing APIs, or alter model behavior or performance; it is a minor change limited to test stability. According to the custom check, test results are only required for major numeric, convergence, or performance changes, so this adjustment passes as a minor change.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch tk/dynbatch-fix

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 557b7ec and a1cd1f5.

📒 Files selected for processing (1)
  • tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.sh: Follow the Google Shell Style Guide for all shell scripts
Use uv run to execute Python scripts in shell/driver scripts instead of activating virtualenvs and calling python directly
Add the NVIDIA copyright header (with current year) at the top of all shell scripts, excluding tests/ and test-only scripts

Files:

  • tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh
tests/test_suites/llm/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

LLM driver script filenames must mirror the YAML base name and follow the same pattern with .sh extension

Files:

  • tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh
tests/test_suites/**

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Place driver shell scripts and common.env under tests/test_suites// and list nightly tests in tests/test_suites/nightly.txt

Files:

  • tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: build-container / main
  • GitHub Check: sphinx-build / Build docs
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@terrykong terrykong added CI:docs Run doctest and removed CI:docs Run doctest labels Oct 5, 2025
@terrykong terrykong merged commit c68b4c2 into main Oct 5, 2025
68 of 70 checks passed
@terrykong terrykong deleted the tk/dynbatch-fix branch October 5, 2025 06:31
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
…1265)

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants