
cp: feat: enhance advantages tracking and normalization stability in GRPO (1423) into r0.4.0 #1516

Merged
terrykong merged 1 commit into r0.4.0 from cherry-pick-1423-r0.4.0 on Nov 13, 2025

Conversation

@chtruong814 (Contributor) commented Nov 13, 2025

beep boop [🤖]: Hi @ffrujeri 👋,

we've cherry-picked #1423 into r0.4.0 for you! 🚀

Please review and approve this cherry-pick at your convenience!

Summary by CodeRabbit

  • New Features

    • Added advantage metrics tracking (mean, max, min) for improved training visibility and diagnostics.
  • Improvements

    • Enhanced numerical stability in advantage normalization with epsilon-based scaling to prevent division-by-zero errors.
    • Improved per-prompt standard deviation computation for more accurate baseline calculations during training.

…#1423)

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@coderabbitai (Bot) commented Nov 13, 2025

📝 Walkthrough

The pull request introduces a new epsilon-based advantage normalization function and refactors the advantage/baseline computation pipeline. Changes include per-prompt standard deviation calculation, updated metric tracking for advantages, and comprehensive test coverage for the modified functions across both GRPO training and utility modules.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **GRPO Training Modifications**<br>`nemo_rl/algorithms/grpo.py` | Added `normalize_advantages_with_epsilon()` helper function for epsilon-stabilized advantage scaling. Replaced per-entry masking logic with epsilon-based normalization in `grpo_train` and `async_grpo_train`. Integrated new advantage metrics (mean, max, min) computed from masked response tokens in training loops. |
| **Baseline/Std Computation**<br>`nemo_rl/algorithms/utils.py` | Refactored to compute per-prompt standard deviation instead of global baseline-based std. Updated device selection to explicitly construct the CUDA device from the rewards tensor. Incorporated a biased-sample correction factor in the per-prompt std calculation and NaN normalization. |
| **GRPO Algorithm Tests**<br>`tests/unit/algorithms/test_grpo.py` | Exposed `normalize_advantages_with_epsilon` in the public API and added comprehensive unit tests covering basic normalization, zero-std handling, and edge cases (all-zero std, variable shapes, negative advantages). |
| **Utility Function Tests**<br>`tests/unit/algorithms/test_utils.py` | Added an extensive test suite for `calculate_baseline_and_std_per_prompt` validating multiple scenarios: basic multi-prompt operation, single generation, identical rewards, mixed prompt sizes, empty inputs, NaN/mask handling, CUDA compatibility, and numerical precision edge cases. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Device handling in utils.py: Verify CUDA device construction logic correctly extracts device ordinal from rewards tensor and maintains consistent device placement throughout the pipeline.
  • Per-prompt std calculation: Confirm the biased sample correction factor (num_valid / (num_valid - 1)) is applied correctly and that NaN normalization doesn't mask numerical instability.
  • Epsilon normalization consistency: Ensure epsilon-based scaling is uniformly applied across synchronous and asynchronous training branches in grpo_train and async_grpo_train.
  • Edge case handling: Review masking logic for empty valid tokens and boundary conditions in advantage metric computation.

Possibly related PRs

Suggested labels

CI:L1, r0.4.0

Suggested reviewers

  • terrykong
  • yuki-97

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Test Results For Major Changes | ⚠️ Warning | PR contains major algorithmic changes affecting advantage normalization and reward scaling, with no test results, performance metrics, or regression validation in the description. | Add test results, convergence curves, performance comparisons, and regression validation demonstrating numeric correctness and no training regression. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly describes the main changes: enhancing advantages tracking and normalization stability in GRPO, with a specific PR reference (1423) and target branch (r0.4.0). |
| Docstring Coverage | ✅ Passed | Docstring coverage is 95.00%, which meets the required threshold of 80.00%. |

@coderabbitai (Bot) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
nemo_rl/algorithms/grpo.py (1)

540-556: LGTM! Clean epsilon-based normalization implementation.

The function provides a simple, stable approach to advantage normalization by adding epsilon to the denominator instead of masking. When std is very small or zero, advantages are divided by epsilon (1e-6), producing large normalized values (up to 1e6x). This behavior upweights samples from prompts with low variance, which may be desirable but could impact numerical stability in extreme cases.

If numerical stability becomes a concern in practice, consider clamping the normalized advantages or using a larger epsilon value. However, the current implementation aligns with the PR objectives and test expectations.
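To make the trade-off concrete, here is a minimal sketch of what an epsilon-stabilized normalization like the one described above could look like. The function name matches the PR, but the exact signature and default epsilon are assumptions for illustration, not taken from the diff:

```python
import torch

def normalize_advantages_with_epsilon(
    advantages: torch.Tensor,
    std: torch.Tensor,
    epsilon: float = 1e-6,
) -> torch.Tensor:
    # Divide by (std + epsilon) instead of masking out zero-std entries.
    # When std == 0 this reduces to advantages / epsilon, so low-variance
    # prompts produce large normalized values (up to ~1e6x).
    return advantages / (std + epsilon)
```

With std = 0 the sample is scaled by 1/epsilon rather than dropped, which is exactly the upweighting behavior the review comment flags.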

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c7c6b1d and 3e04e3b.

📒 Files selected for processing (4)
  • nemo_rl/algorithms/grpo.py (5 hunks)
  • nemo_rl/algorithms/utils.py (2 hunks)
  • tests/unit/algorithms/test_grpo.py (2 hunks)
  • tests/unit/algorithms/test_utils.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
nemo_rl/algorithms/grpo.py (1)
tests/check_metrics.py (3)
  • mean (52-97)
  • max (30-32)
  • min (25-27)
tests/unit/algorithms/test_grpo.py (1)
nemo_rl/algorithms/grpo.py (1)
  • normalize_advantages_with_epsilon (540-556)
tests/unit/algorithms/test_utils.py (1)
nemo_rl/algorithms/utils.py (1)
  • calculate_baseline_and_std_per_prompt (51-128)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Post automodel integration comment / Comment on PR
  • GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (11)
tests/unit/algorithms/test_grpo.py (2)

27-27: LGTM! Correctly imports the new public function.

The import of normalize_advantages_with_epsilon properly exposes it for testing.


1215-1279: Excellent test coverage for the new normalization function.

The test suite comprehensively validates:

  • Basic normalization behavior
  • Zero std handling (dividing by epsilon)
  • All-zero std edge case
  • Various tensor shapes
  • Negative advantages

All test expectations correctly reflect the epsilon-based normalization approach where zero std results in advantage / epsilon.
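A hedged sketch of what one such zero-std test could look like (the epsilon value and the inlined normalization expression are assumptions mirroring the behavior described above, not the actual test code):

```python
import torch

EPSILON = 1e-6  # assumed default, matching the 1e-6 mentioned in the review

def test_zero_std_divides_by_epsilon():
    adv = torch.tensor([1.0, -2.0])
    std = torch.zeros(2)
    # Mirrors the epsilon-based normalization: zero std yields adv / epsilon.
    out = adv / (std + EPSILON)
    assert torch.allclose(out, adv / EPSILON)
```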

nemo_rl/algorithms/utils.py (3)

75-75: LGTM! Proper std tensor initialization.

Initializing the std tensor ensures it exists for all samples, including those where std computation is skipped (e.g., when num_valid <= 1).


80-80: LGTM! More explicit device construction.

Constructing the CUDA device explicitly from the device ordinal is clearer and more maintainable.


119-126: LGTM! Correct per-prompt std computation with Bessel's correction.

The implementation correctly computes unbiased sample standard deviation:

  1. Computes variance: (mean_of_squares - square_of_mean)
  2. Applies Bessel's correction: * (num_valid / (num_valid - 1))
  3. Takes square root and handles NaN

The correction factor properly accounts for the leave-one-out baseline when enabled. Edge case where num_valid <= 1 is handled by lines 95-98, which skip this computation entirely.
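The three steps above can be sketched as follows. This is an illustrative reconstruction under assumed tensor shapes (prompts x generations), not the actual `utils.py` code:

```python
import torch

def per_prompt_std(rewards: torch.Tensor, valid_mask: torch.Tensor) -> torch.Tensor:
    # rewards, valid_mask: (num_prompts, num_generations); mask is 1.0 for valid samples.
    num_valid = valid_mask.sum(dim=-1)
    mean = (rewards * valid_mask).sum(dim=-1) / num_valid
    mean_sq = (rewards.pow(2) * valid_mask).sum(dim=-1) / num_valid
    # Biased variance, then Bessel's correction: num_valid / (num_valid - 1).
    var = (mean_sq - mean.pow(2)) * (num_valid / (num_valid - 1))
    # Clamp guards tiny negative variance from floating-point error;
    # nan_to_num handles the num_valid == 1 degenerate case.
    return torch.nan_to_num(var.clamp(min=0.0).sqrt(), nan=0.0)
```

For rewards [1, 2, 3] in one prompt this yields the unbiased sample std of 1.0.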

tests/unit/algorithms/test_utils.py (2)

22-22: LGTM! Correctly imports the function under test.


404-595: Excellent comprehensive test coverage!

The test suite validates calculate_baseline_and_std_per_prompt across multiple scenarios:

  • Multiple prompts with varying generations
  • Edge cases (single generation, identical rewards, empty input)
  • Masking behavior with valid_mask
  • Cross-device compatibility (CUDA)
  • Numerical precision with extreme values

The test expectations correctly reflect the per-prompt baseline and std computation including Bessel's correction.

nemo_rl/algorithms/grpo.py (4)

1038-1041: LGTM! Correct integration of epsilon-based normalization.

The function is properly called when normalize_rewards is enabled, using the per-prompt std computed earlier.


1153-1177: LGTM! Proper masked advantages metrics tracking.

The implementation correctly:

  1. Extracts flat advantages and token masks from the prepared messages
  2. Filters to only valid response tokens using torch.masked_select
  3. Computes mean/max/min with proper handling for empty tensors (defaults to 0.0)

This provides useful training insights into the advantage distribution over actual response tokens.
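The three-step metric computation described above can be sketched as a small helper. The function name and return keys are hypothetical; only the mechanism (masked_select, then mean/max/min with a 0.0 fallback for empty tensors) follows the review:

```python
import torch

def masked_advantage_metrics(advantages: torch.Tensor, token_mask: torch.Tensor) -> dict:
    # Keep only advantages at valid response-token positions.
    valid = torch.masked_select(advantages, token_mask.bool())
    if valid.numel() == 0:
        # Empty selection defaults every metric to 0.0, as the review notes.
        return {"advantages_mean": 0.0, "advantages_max": 0.0, "advantages_min": 0.0}
    return {
        "advantages_mean": valid.mean().item(),
        "advantages_max": valid.max().item(),
        "advantages_min": valid.min().item(),
    }
```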


1910-1914: LGTM! Consistent normalization in async training path.

The epsilon-based normalization is correctly integrated into the async GRPO training path, matching the sync implementation.


2042-2066: LGTM! Consistent masked advantages metrics in async path.

The masked advantages metrics tracking in the async path correctly mirrors the sync implementation, ensuring consistent observability across both training modes.

@terrykong terrykong enabled auto-merge (squash) November 13, 2025 16:42
@terrykong terrykong added the CI:L1 Run doctests, unit tests, and functional tests label Nov 13, 2025
@terrykong terrykong merged commit b0f3076 into r0.4.0 Nov 13, 2025
64 of 71 checks passed
@terrykong terrykong deleted the cherry-pick-1423-r0.4.0 branch November 13, 2025 21:45

Labels

cherry-pick, CI:L1 (Run doctests, unit tests, and functional tests), Run CICD
