cp: fix: more robust fp8 rollout metric check (1307) into r0.4.0 #1386
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
📝 Walkthrough

This PR enhances the test metrics framework with new evaluation functions and upgrades the FP8 rollout test suite from v2 to v3.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Test as Test Suite (v3)
    participant Eval as evaluate_check()
    participant Metrics as Metric Functions
    participant Data as Metric Data
    Test->>Eval: check expression with mean/ratio_above
    Eval->>Metrics: parse & execute mean(..., ignore_top_p=0.05)
    Metrics->>Data: fetch train/token_mult_prob_error values
    Data-->>Metrics: return dict of values
    Metrics->>Metrics: filter top outliers (ignore_top_p)
    Metrics-->>Eval: return filtered mean
    Eval->>Metrics: execute ratio_above(data, 1.1)
    Metrics->>Data: fetch train/token_mult_prob_error values
    Data-->>Metrics: return dict of values
    Metrics->>Metrics: count values >= threshold
    Metrics-->>Eval: return proportion
    Eval->>Eval: compare results to gate condition
    alt Gate passes
        Eval-->>Test: ✓ metrics pass
    else Gate fails
        Eval-->>Test: ✗ metrics fail
    end
```
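To make the flow concrete, here is a minimal, hypothetical Python sketch of the gating logic the diagram describes. The helper names (`mean`, `ratio_above`, `ignore_top_p`) come from this PR, but the bodies below are illustrative stand-ins, not the actual `tests/check_metrics.py` implementation:

```python
import statistics


def mean(values: dict[str, float], ignore_top_p: float = 0.0) -> float:
    """Mean of per-step metric values, optionally dropping the top fraction of outliers."""
    vals = sorted(float(v) for v in values.values())
    if ignore_top_p > 0.0:
        keep = len(vals) - int(len(vals) * ignore_top_p)
        vals = vals[:keep]  # drop the largest ignore_top_p fraction
    return statistics.mean(vals)


def ratio_above(values: dict[str, float], threshold: float) -> float:
    """Fraction of per-step values at or above the threshold (0.0 for empty input)."""
    vals = [float(v) for v in values.values()]
    return sum(v >= threshold for v in vals) / len(vals) if vals else 0.0


# Gate condition in the spirit of the v3 test suite:
data = {"train/token_mult_prob_error": {"1": 1.02, "2": 1.04, "3": 1.05}}
errs = data["train/token_mult_prob_error"]
passed = mean(errs, ignore_top_p=0.05) < 1.1 and ratio_above(errs, 1.1) < 0.1
print("✓ metrics pass" if passed else "✗ metrics fail")
```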
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

The changes span heterogeneous areas: new public API functions with parameter additions (…).
Actionable comments posted: 3
🧹 Nitpick comments (6)
tests/test_suites/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v3.sh (1)
7-10: Config bumps LGTM; SC2034 warnings are expected. NUM_RUNS and NUM_MINUTES are consumed by external tooling/common.env; safe to keep as-is. If you want to silence shellcheck, export them or mark them readonly:
```diff
 NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))  # Round up
-NUM_MINUTES=180
+NUM_MINUTES=180
+# shellcheck disable=SC2034  # Used by launcher/common.env
+readonly NUM_RUNS NUM_MINUTES
```

Based on learnings.
tests/check_metrics.py (4)
78-82: Ensure range filtering is order-robust by iterating steps numerically. Dict insertion order may not match step order; iterate over sorted step keys.

```diff
-    vals = []
-    for step, v in value.items():
-        if range_start <= int(step) and int(step) < range_end:
-            vals.append(float(v))
+    vals = []
+    for step_str in sorted(value.keys(), key=int):
+        step = int(step_str)
+        if range_start <= step < range_end:
+            vals.append(float(value[step_str]))
```
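The subtlety here: metric steps arrive as JSON string keys, and lexicographic order diverges from numeric order once steps reach double digits, so `key=int` matters:

```python
steps = {"9": 0.1, "10": 0.2, "2": 0.3}
print(sorted(steps))           # ['10', '2', '9'] — lexicographic, wrong order
print(sorted(steps, key=int))  # ['2', '9', '10'] — numeric, intended order
```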
83-96: ignore_top_p logic is solid; add an empty-range guard to improve error messaging. statistics.mean([]) raises StatisticsError; raise a clear ValueError instead when no values remain after filtering/ranging.

```diff
-    # Filter out top outliers if requested
+    # Filter out top outliers if requested
     if ignore_top_p > 0.0 and len(vals) > 0:
@@
-    return statistics.mean(vals)
+    if not vals:
+        raise ValueError("No values in selected range after filtering")
+    return statistics.mean(vals)
```
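Without the guard, an empty selection surfaces as an opaque stdlib error rather than one that points at the range/filter arguments:

```python
import statistics

try:
    statistics.mean([])  # what an exhausted range/filter currently triggers
except statistics.StatisticsError as e:
    print(e)  # "mean requires at least one data point"
```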
107-113: Limit eval's builtins to reduce attack surface. These checks run in CI with trusted strings, but it's safer to remove full builtins exposure.

```diff
-    local_context = {
+    local_context = {
         "data": data,
         "min": min,
         "max": max,
         "mean": mean,
         "ratio_above": ratio_above,
     }
```

And in the eval calls:

```diff
-    value = eval(value_expr, {"__builtins__": builtins}, local_context)
+    value = eval(value_expr, {"__builtins__": {}}, local_context)
@@
-    result = eval(check, {"__builtins__": builtins}, local_context)
+    result = eval(check, {"__builtins__": {}}, local_context)
```

This keeps only the whitelisted helpers available. If you need specific safe builtins (e.g., True/False/None), pass them explicitly.
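As a self-contained sketch of the whitelisting idea (the stub `mean` and the variable names here are illustrative, not the PR's code):

```python
data = {"train/loss": {"1": 1.2, "2": 1.1}}


def mean(values):  # stand-in for the real helper in tests/check_metrics.py
    vals = [float(v) for v in values.values()]
    return sum(vals) / len(vals)


safe_globals = {"__builtins__": {}}
local_context = {"data": data, "mean": mean}

# Whitelisted names resolve normally:
print(eval('mean(data["train/loss"]) < 1.5', safe_globals, local_context))  # True

# Everything else, including __import__, is unreachable:
try:
    eval('__import__("os").system("echo pwned")', safe_globals, local_context)
except NameError as e:
    print(e)  # name '__import__' is not defined
```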
155-157: Update usage examples to mirror project tooling. Our shell drivers use uv; reflect that in the examples.

```diff
-    python check_metrics.py results.json "mean(data['loss'], ignore_top_p=0.05) < 1.5"
-    python check_metrics.py results.json "ratio_above(data['error'], 1.05) < 0.02"
+    uv run tests/check_metrics.py results.json "mean(data['loss'], ignore_top_p=0.05) < 1.5"
+    uv run tests/check_metrics.py results.json "ratio_above(data['error'], 1.05) < 0.02"
```

tests/unit/test_check_metrics.py (1)
30-97: Prefer pytest.approx for floats to avoid flakiness. Replace exact float equality with approx where applicable.

```diff
-    assert result == 3.0
+    assert result == pytest.approx(3.0)
@@
-    assert result_no_filter == 22.0  # (1+2+3+4+100)/5
+    assert result_no_filter == pytest.approx(22.0)  # (1+2+3+4+100)/5
@@
-    assert result_with_filter == 2.5  # (1+2+3+4)/4
+    assert result_with_filter == pytest.approx(2.5)  # (1+2+3+4)/4
```

(Apply similarly across other float assertions in this file.)
Also applies to: 112-133, 166-219, 224-307, 313-407
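The motivation is ordinary float rounding; exact equality can fail even when the arithmetic looks trivially right:

```python
import pytest

assert 0.1 + 0.2 != 0.3                  # exact comparison fails: 0.30000000000000004
assert 0.1 + 0.2 == pytest.approx(0.3)   # approx passes with the default relative tolerance
```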
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v3.yaml (0 hunks)
- tests/check_metrics.py (5 hunks)
- tests/test_suites/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v3.sh (2 hunks)
- tests/test_suites/nightly.txt (1 hunks)
- tests/unit/test_check_metrics.py (1 hunks)
💤 Files with no reviewable changes (1)
- examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v3.yaml
🧰 Additional context used
📓 Path-based instructions (5)
tests/test_suites/nightly.txt
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Append the new driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt
Files:
tests/test_suites/nightly.txt
tests/test_suites/**
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Place driver shell scripts and common.env under tests/test_suites// and list nightly tests in tests/test_suites/nightly.txt
Files:
tests/test_suites/nightly.txt
tests/test_suites/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v3.sh
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts
Files:
tests/check_metrics.py
tests/unit/test_check_metrics.py
**/*.sh
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.sh: Follow the Google Shell Style Guide for all shell scripts
Use `uv run` to execute Python scripts in shell/driver scripts instead of activating virtualenvs and calling `python` directly
Add the NVIDIA copyright header (with current year) at the top of all shell scripts, excluding tests/ and test-only scripts
Files:
tests/test_suites/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v3.sh
tests/test_suites/llm/*.sh
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
LLM driver script filenames must mirror the YAML base name and follow the same pattern with .sh extension
Files:
tests/test_suites/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v3.sh
🧠 Learnings (2)
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
PR: NVIDIA-NeMo/RL#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to tests/test_suites/nightly.txt : Append the new driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt
Applied to files:
tests/test_suites/nightly.txt
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
PR: NVIDIA-NeMo/RL#1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.
Applied to files:
tests/test_suites/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v3.sh
🧬 Code graph analysis (1)
tests/unit/test_check_metrics.py (1)
tests/check_metrics.py (5)
evaluate_check (100-133), max (30-32), mean (52-97), min (25-27), ratio_above (35-49)
🪛 Ruff (0.14.0)
tests/check_metrics.py
85-87: Avoid specifying long messages outside the exception class
(TRY003)
120-120: Use of possibly insecure function; consider using ast.literal_eval
(S307)
123-123: Use of possibly insecure function; consider using ast.literal_eval
(S307)
tests/unit/test_check_metrics.py
103-103: Pattern passed to match= contains metacharacters but is neither escaped nor raw
(RUF043)
108-108: Pattern passed to match= contains metacharacters but is neither escaped nor raw
(RUF043)
🪛 Shellcheck (0.11.0)
tests/test_suites/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v3.sh
[warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).
(SC2034)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: Lint check
- GitHub Check: sphinx-build / Build docs
- GitHub Check: Lint check
- GitHub Check: Lint check
- GitHub Check: Lint check
- GitHub Check: Post automodel integration comment / Comment on PR
- GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (1)
tests/check_metrics.py (1)
35-50: New ratio_above helper — concise and correct. Counts values >= threshold; handles the empty dict by returning 0.0. Nice.
```diff
 if [[ $(jq 'to_entries | .[] | select(.key == "train/loss") | .value | keys | map(tonumber) | max' $JSON_METRICS) -ge $MAX_STEPS ]]; then
     # With a few number of steps the logprob can have spikes that can move the average up.
     uv run tests/check_metrics.py $JSON_METRICS \
-        'mean(data["train/token_mult_prob_error"]) < 1.1' \
-        'data["train/token_mult_prob_error"]["40"] < 1.1'
+        'mean(data["train/token_mult_prob_error"], ignore_top_p=0.05) < 1.1' \
+        'ratio_above(data["train/token_mult_prob_error"], 1.1) < 0.1'
+    # ratio_above @ 1.1 was 0.03,0.06,0.05: 3sigma ~=0.1
 fi
```
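As a quick sanity check of the inline "3sigma ~= 0.1" note, one can recompute the threshold from the three historical readings (assuming a sample standard deviation; this is back-of-envelope, not from the PR):

```python
import statistics

observed = [0.03, 0.06, 0.05]       # past ratio_above @ 1.1 readings
mu = statistics.mean(observed)      # ≈ 0.0467
sigma = statistics.stdev(observed)  # ≈ 0.0153
print(round(mu + 3 * sigma, 3))     # ≈ 0.092, so a 0.1 gate sits just above 3 sigma
```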
Harden the step-reached check and quote paths.
Guard against missing keys (jq returning empty) and avoid word-splitting. Suggested patch:

```diff
-# Only run metrics if the target step is reached
-if [[ $(jq 'to_entries | .[] | select(.key == "train/loss") | .value | keys | map(tonumber) | max' $JSON_METRICS) -ge $MAX_STEPS ]]; then
-  # With a few number of steps the logprob can have spikes that can move the average up.
-  uv run tests/check_metrics.py $JSON_METRICS \
-    'mean(data["train/token_mult_prob_error"], ignore_top_p=0.05) < 1.1' \
-    'ratio_above(data["train/token_mult_prob_error"], 1.1) < 0.1'
-  # ratio_above @ 1.1 was 0.03,0.06,0.05: 3sigma ~=0.1
-fi
+# Only run metrics if the target step is reached
+steps_reached="$(jq -r '((.["train/loss"] | keys) // []) | map(tonumber) | (max? // 0)' "$JSON_METRICS")"
+if (( steps_reached >= MAX_STEPS )); then
+  # With few steps the logprob can spike and skew the average.
+  uv run tests/check_metrics.py "$JSON_METRICS" \
+    'mean(data["train/token_mult_prob_error"], ignore_top_p=0.05) < 1.1' \
+    'ratio_above(data["train/token_mult_prob_error"], 1.1) < 0.1'
+  # ratio_above @ 1.1 was 0.03,0.06,0.05: 3sigma ~= 0.1
+fi
```

Additionally, consider enabling strict mode near the top for Google Shell style:

```bash
set -Eeuo pipefail
```

🤖 Prompt for AI Agents
In tests/test_suites/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v3.sh around lines 35-41, the step-reached check can fail when jq returns empty, and the command substitutions are unquoted. Update the check to safely handle missing keys and avoid word-splitting: capture the jq output into a quoted variable with a safe default (e.g., treat empty as 0) or use jq's `//` alternative operator to return a number, then compare that numeric value with MAX_STEPS. Also quote $JSON_METRICS and any other path variables passed to jq or uv, and enable strict shell mode near the top of the script (set -Eeuo pipefail) to catch errors early.
```diff
 # FP8
-tests/test_suites/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v2.sh
+tests/test_suites/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v3.sh
```
🧩 Analysis chain
Nightly entry updated to v3 — looks good; please confirm no residual v2 entry remains.
The path matches the new driver script name. Double‑check that any old v2 entry was removed to avoid duplicate runs.
🏁 Script executed:

```bash
#!/bin/bash
# Verify only v3 is listed and the script exists
set -euo pipefail
rg -n 'grpo-llama3\.1-8b-instruct-1n8g-megatron-fp8-rollouts\.(v2|v3)\.sh' tests/test_suites/nightly.txt
fd -t f 'grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v3.sh' tests/test_suites/llm -x ls -l {}
```

Length of output: 213
Need to verify path format consistency against the coding guidelines. Let me check the structure of other entries in nightly.txt.
🏁 Script executed:

```bash
#!/bin/bash
# Check the format of entries in nightly.txt
head -50 tests/test_suites/nightly.txt
```

Length of output: 1811
v3 entry added and v2 duplication removed — verified. However, the path format must be corrected per the coding guidelines.

The v3 entry is present at line 41, the v2 entry has been removed (no duplicates), and the script file exists. However, the path violates the coding guidelines: it uses the full path `tests/test_suites/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v3.sh` instead of the relative format the guidelines require (`llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v3.sh`). Update the entry at line 41 to use the relative path format.
🤖 Prompt for AI Agents
In tests/test_suites/nightly.txt around line 41, the v3 entry uses an
absolute-ish path that violates guidelines; change the line from
"tests/test_suites/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v3.sh"
to the required relative format
"llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v3.sh" so the file
reference follows the repository relative path convention.
| """Test that invalid ignore_top_p values raise an error.""" | ||
| data = {"1": 1.0, "2": 2.0, "3": 3.0} | ||
|
|
||
| with pytest.raises( | ||
| ValueError, match="ignore_top_p must be between 0.0 and 1.0" | ||
| ): | ||
| mean(data, ignore_top_p=1.5) | ||
|
|
||
| with pytest.raises( | ||
| ValueError, match="ignore_top_p must be between 0.0 and 1.0" | ||
| ): | ||
| mean(data, ignore_top_p=-0.1) | ||
|
|
Fix regex patterns in pytest.raises match=.
Use raw strings or re.escape to avoid unintended regex semantics.
```diff
-    with pytest.raises(
-        ValueError, match="ignore_top_p must be between 0.0 and 1.0"
-    ):
+    with pytest.raises(
+        ValueError, match=r"ignore_top_p must be between 0\.0 and 1\.0"
+    ):
         mean(data, ignore_top_p=1.5)
@@
-    with pytest.raises(
-        ValueError, match="ignore_top_p must be between 0.0 and 1.0"
-    ):
+    with pytest.raises(
+        ValueError, match=r"ignore_top_p must be between 0\.0 and 1\.0"
+    ):
         mean(data, ignore_top_p=-0.1)
```
mean(data, ignore_top_p=-0.1)🧰 Tools
🪛 Ruff (0.14.0)
103-103: Pattern passed to match= contains metacharacters but is neither escaped nor raw
(RUF043)
108-108: Pattern passed to match= contains metacharacters but is neither escaped nor raw
(RUF043)
🤖 Prompt for AI Agents
In tests/unit/test_check_metrics.py around lines 99 to 111, the pytest.raises
match strings are interpreted as regexes and may be mis-parsed; update the match
arguments to use raw string literals (prefix with r) or wrap the message in
re.escape to ensure the message is treated literally (e.g., r"ignore_top_p must
be between 0.0 and 1.0" or re.escape("ignore_top_p must be between 0.0 and
1.0")) so the assertion matches the exact error text without regex side-effects.
beep boop [🤖]: Hi @terrykong 👋,
Summary by CodeRabbit
New Features
Tests