Skip to content

fix: more robust fp8 rollout metric check#1307

Merged
terrykong merged 7 commits intomainfrom
tk/fp8-rollout-fix
Oct 17, 2025
Merged

fix: more robust fp8 rollout metric check#1307
terrykong merged 7 commits intomainfrom
tk/fp8-rollout-fix

Conversation

@terrykong
Copy link
Copy Markdown
Collaborator

@terrykong terrykong commented Oct 8, 2025

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Low step count sometimes skewed the logprob above 1.1 when there were occasional spikes

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features

    • Added ratio_above metric helper to compute the proportion of values meeting a threshold.
    • Enhanced mean with ignore_top_p to exclude top-percentile outliers for more robust averages.
    • Exposed ratio_above in evaluation checks for immediate use.
  • Documentation

    • Updated examples and help text to include ratio_above and ignore_top_p usage.
  • Tests

    • Added comprehensive unit tests covering mean (with/without outlier filtering), min/max, ratio_above, and evaluation checks.
  • Chores

    • Consolidated a metrics condition to use mean with ignore_top_p for clearer, outlier-resistant checks.

Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong requested a review from a team as a code owner October 8, 2025 05:40
@terrykong terrykong added r0.4.0 CI:L0 Run doctests and unit tests labels Oct 8, 2025
@terrykong terrykong added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Oct 8, 2025
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Oct 8, 2025

📝 Walkthrough

Walkthrough

Introduces ratio_above helper, extends mean with ignore_top_p parameter and validation, updates evaluate_check to expose ratio_above and use builtins for eval safety, and adjusts a GRPO LLM test to use mean(..., ignore_top_p=0.05). Adds comprehensive unit tests covering helpers and evaluate_check scenarios.

Changes

Cohort / File(s) Summary
Metrics utilities
tests/check_metrics.py
Add ratio_above(value, threshold). Extend mean(value, range_start=1, range_end=0, ignore_top_p=0.0) with top-percentile filtering and validation; switch to builtins.max. Update evaluate_check to expose ratio_above and use builtins in eval contexts.
LLM suite check update
tests/test_suites/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v2.sh
Consolidate token_mult_prob_error checks into mean(..., ignore_top_p=0.05) < 1.1; add explanatory comment.
Unit tests
tests/unit/test_check_metrics.py
New tests for mean (incl. ignore_top_p), min/max, ratio_above, and evaluate_check across pass/fail/error and real-world-like datasets.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller
  participant evaluate_check
  participant Context as Local Context (min,max,mean,ratio_above)
  participant Builtins as Safe builtins
  Caller->>evaluate_check: evaluate_check(check_expr, value_expr, data)
  activate evaluate_check
  note over evaluate_check,Context: Build context with helpers and data
  evaluate_check->>Builtins: Use builtins for eval (no __builtins__ leakage)
  evaluate_check->>evaluate_check: eval(value_expr, {__builtins__: builtins}, Context)
  evaluate_check->>evaluate_check: eval(check_expr, {__builtins__: builtins}, Context)
  evaluate_check-->>Caller: {passed, message, value}
  deactivate evaluate_check
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • guyueh1
  • chtruong814

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Test Results For Major Changes ⚠️ Warning The PR introduces notable functional changes to the metrics evaluation utilities—including a new ratio_above helper, an ignore_top_p option on mean, and integration updates affecting numerical behavior—so it qualifies as a major change. The PR description, however, is effectively empty and does not document any tests, convergence checks, or performance validation despite the numerics-impacting modifications. Because the required testing evidence is missing from the description, the custom check fails. Please update the PR description to summarize the executed tests (e.g., relevant unit tests, regression checks, or rollouts) and include any necessary convergence or performance evidence showing no regressions, then re-run this check.
✅ Passed checks (3 passed)
Check name Status Explanation
Title Check ✅ Passed The title succinctly describes that the PR fixes the FP8 rollout metric check to be more robust by ignoring outliers, which aligns with the key changes to the mean function and rollout testing logic.
Docstring Coverage ✅ Passed Docstring coverage is 97.62% which is sufficient. The required threshold is 80.00%.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch tk/fp8-rollout-fix

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread tests/test_suites/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v2.sh Outdated
Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong requested a review from a team as a code owner October 12, 2025 04:42
@guyueh1
Copy link
Copy Markdown
Contributor

guyueh1 commented Oct 14, 2025

@terrykong are we running this in CI? Is 100 minutes enough for 100 steps?

@terrykong
Copy link
Copy Markdown
Collaborator Author

terrykong commented Oct 14, 2025

I cut the generations in half and we should be able to complete in 3 hr. The threshold is still too strict on ratio_above. Am running it once more and will update that threshold

Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Oct 14, 2025
@terrykong terrykong enabled auto-merge (squash) October 14, 2025 23:20
@terrykong terrykong added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Oct 14, 2025
guyueh1
guyueh1 previously approved these changes Oct 15, 2025
Copy link
Copy Markdown
Contributor

@guyueh1 guyueh1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

parthchadha
parthchadha previously approved these changes Oct 15, 2025
Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong dismissed stale reviews from parthchadha and guyueh1 via e0e4ae1 October 17, 2025 01:07
@terrykong terrykong added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Oct 17, 2025
Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Oct 17, 2025
@terrykong terrykong added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Oct 17, 2025
@terrykong terrykong merged commit 85eeb8d into main Oct 17, 2025
51 of 58 checks passed
@terrykong terrykong deleted the tk/fp8-rollout-fix branch October 17, 2025 17:31
chtruong814 pushed a commit that referenced this pull request Oct 17, 2025
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
lbliii pushed a commit that referenced this pull request Nov 3, 2025
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
Signed-off-by: Terry Kong <terryk@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:L0 Run doctests and unit tests r0.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants