
cp: feat: enhance advantages tracking and normalization stability in GRPO (1423) into r0.4.0 #1516

Merged
terrykong merged 1 commit into r0.4.0 from cherry-pick-1423-r0.4.0 on Nov 13, 2025

Conversation

@chtruong814 (Contributor) commented Nov 13, 2025

beep boop [🤖]: Hi @ffrujeri 👋,

we've cherry-picked #1423 into r0.4.0 for you! 🚀

Please review and approve this cherry-pick at your convenience!

Summary by CodeRabbit

  • New Features

    • Added advantage metrics tracking (mean, max, min) for improved training visibility and diagnostics.
  • Improvements

    • Enhanced numerical stability in advantage normalization with epsilon-based scaling to prevent division-by-zero errors.
    • Improved per-prompt standard deviation computation for more accurate baseline calculations during training.

…#1423)

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@coderabbitai (Bot) commented Nov 13, 2025

📝 Walkthrough

The pull request introduces a new epsilon-based advantage normalization function and refactors the advantage/baseline computation pipeline. Changes include per-prompt standard deviation calculation, updated metric tracking for advantages, and comprehensive test coverage for the modified functions across both GRPO training and utility modules.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **GRPO Training Modifications**<br>`nemo_rl/algorithms/grpo.py` | Added `normalize_advantages_with_epsilon()` helper function for epsilon-stabilized advantage scaling. Replaced per-entry masking logic with epsilon-based normalization in `grpo_train` and `async_grpo_train`. Integrated new advantage metrics (mean, max, min) computed from masked response tokens in training loops. |
| **Baseline/Std Computation**<br>`nemo_rl/algorithms/utils.py` | Refactored to compute per-prompt standard deviation instead of global baseline-based std. Updated device selection to explicitly construct the CUDA device from the rewards tensor. Incorporated a biased-sample correction factor in the per-prompt std calculation and NaN normalization. |
| **GRPO Algorithm Tests**<br>`tests/unit/algorithms/test_grpo.py` | Exposed `normalize_advantages_with_epsilon` in the public API and added comprehensive unit tests covering basic normalization, zero-std handling, and edge cases (all-zero std, variable shapes, negative advantages). |
| **Utility Function Tests**<br>`tests/unit/algorithms/test_utils.py` | Added an extensive test suite for `calculate_baseline_and_std_per_prompt` validating multiple scenarios: basic multi-prompt operation, single generation, identical rewards, mixed prompt sizes, empty inputs, NaN/mask handling, CUDA compatibility, and numerical precision edge cases. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Device handling in utils.py: Verify CUDA device construction logic correctly extracts device ordinal from rewards tensor and maintains consistent device placement throughout the pipeline.
  • Per-prompt std calculation: Confirm the biased sample correction factor (num_valid / (num_valid - 1)) is applied correctly and that NaN normalization doesn't mask numerical instability.
  • Epsilon normalization consistency: Ensure epsilon-based scaling is uniformly applied across synchronous and asynchronous training branches in grpo_train and async_grpo_train.
  • Edge case handling: Review masking logic for empty valid tokens and boundary conditions in advantage metric computation.

Possibly related PRs

Suggested labels

CI:L1, r0.4.0

Suggested reviewers

  • terrykong
  • yuki-97

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Test Results For Major Changes | ⚠️ Warning | PR contains major algorithmic changes affecting advantage normalization and reward scaling, with no test results, performance metrics, or regression validation in the description. | Add test results, convergence curves, performance comparisons, and regression validation demonstrating numeric correctness and no training regression. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly describes the main changes: enhancing advantages tracking and normalization stability in GRPO, with a specific PR reference (1423) and target branch (r0.4.0). |
| Docstring Coverage | ✅ Passed | Docstring coverage is 95.00%, which meets the required threshold of 80.00%. |

@coderabbitai (Bot) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
nemo_rl/algorithms/grpo.py (1)

540-556: LGTM! Clean epsilon-based normalization implementation.

The function provides a simple, stable approach to advantage normalization by adding epsilon to the denominator instead of masking. When std is very small or zero, advantages are divided by epsilon (1e-6), producing large normalized values (up to 1e6x). This behavior upweights samples from prompts with low variance, which may be desirable but could impact numerical stability in extreme cases.

If numerical stability becomes a concern in practice, consider clamping the normalized advantages or using a larger epsilon value. However, the current implementation aligns with the PR objectives and test expectations.
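To make the trade-off concrete, here is a minimal sketch of what an epsilon-stabilized normalization like the one described above could look like. The function name matches the PR, but the exact signature and default epsilon are assumptions for illustration, not taken from the diff:

```python
import torch

def normalize_advantages_with_epsilon(
    advantages: torch.Tensor,
    std: torch.Tensor,
    epsilon: float = 1e-6,
) -> torch.Tensor:
    # Divide by (std + epsilon) instead of masking out zero-std entries.
    # When std == 0 this reduces to advantages / epsilon, so low-variance
    # prompts produce large normalized values (up to ~1e6x).
    return advantages / (std + epsilon)
```

With std = 0 the sample is scaled by 1/epsilon rather than dropped, which is exactly the upweighting behavior the review comment flags.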

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c7c6b1d and 3e04e3b.

📒 Files selected for processing (4)
  • nemo_rl/algorithms/grpo.py (5 hunks)
  • nemo_rl/algorithms/utils.py (2 hunks)
  • tests/unit/algorithms/test_grpo.py (2 hunks)
  • tests/unit/algorithms/test_utils.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
nemo_rl/algorithms/grpo.py (1)
tests/check_metrics.py (3)
  • mean (52-97)
  • max (30-32)
  • min (25-27)
tests/unit/algorithms/test_grpo.py (1)
nemo_rl/algorithms/grpo.py (1)
  • normalize_advantages_with_epsilon (540-556)
tests/unit/algorithms/test_utils.py (1)
nemo_rl/algorithms/utils.py (1)
  • calculate_baseline_and_std_per_prompt (51-128)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Post automodel integration comment / Comment on PR
  • GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (11)
tests/unit/algorithms/test_grpo.py (2)

27-27: LGTM! Correctly imports the new public function.

The import of normalize_advantages_with_epsilon properly exposes it for testing.


1215-1279: Excellent test coverage for the new normalization function.

The test suite comprehensively validates:

  • Basic normalization behavior
  • Zero std handling (dividing by epsilon)
  • All-zero std edge case
  • Various tensor shapes
  • Negative advantages

All test expectations correctly reflect the epsilon-based normalization approach where zero std results in advantage / epsilon.
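A hedged sketch of what one such zero-std test could look like (the epsilon value and the inlined normalization expression are assumptions mirroring the behavior described above, not the actual test code):

```python
import torch

EPSILON = 1e-6  # assumed default, matching the 1e-6 mentioned in the review

def test_zero_std_divides_by_epsilon():
    adv = torch.tensor([1.0, -2.0])
    std = torch.zeros(2)
    # Mirrors the epsilon-based normalization: zero std yields adv / epsilon.
    out = adv / (std + EPSILON)
    assert torch.allclose(out, adv / EPSILON)
```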

nemo_rl/algorithms/utils.py (3)

75-75: LGTM! Proper std tensor initialization.

Initializing the std tensor ensures it exists for all samples, including those where std computation is skipped (e.g., when num_valid <= 1).


80-80: LGTM! More explicit device construction.

Constructing the CUDA device explicitly from the device ordinal is clearer and more maintainable.


119-126: LGTM! Correct per-prompt std computation with Bessel's correction.

The implementation correctly computes unbiased sample standard deviation:

  1. Computes variance: (mean_of_squares - square_of_mean)
  2. Applies Bessel's correction: * (num_valid / (num_valid - 1))
  3. Takes square root and handles NaN

The correction factor properly accounts for the leave-one-out baseline when enabled. Edge case where num_valid <= 1 is handled by lines 95-98, which skip this computation entirely.
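The three steps above can be sketched as follows. This is an illustrative reconstruction under assumed tensor shapes (prompts x generations), not the actual `utils.py` code:

```python
import torch

def per_prompt_std(rewards: torch.Tensor, valid_mask: torch.Tensor) -> torch.Tensor:
    # rewards, valid_mask: (num_prompts, num_generations); mask is 1.0 for valid samples.
    num_valid = valid_mask.sum(dim=-1)
    mean = (rewards * valid_mask).sum(dim=-1) / num_valid
    mean_sq = (rewards.pow(2) * valid_mask).sum(dim=-1) / num_valid
    # Biased variance, then Bessel's correction: num_valid / (num_valid - 1).
    var = (mean_sq - mean.pow(2)) * (num_valid / (num_valid - 1))
    # Clamp guards tiny negative variance from floating-point error;
    # nan_to_num handles the num_valid == 1 degenerate case.
    return torch.nan_to_num(var.clamp(min=0.0).sqrt(), nan=0.0)
```

For rewards [1, 2, 3] in one prompt this yields the unbiased sample std of 1.0.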

tests/unit/algorithms/test_utils.py (2)

22-22: LGTM! Correctly imports the function under test.


404-595: Excellent comprehensive test coverage!

The test suite validates calculate_baseline_and_std_per_prompt across multiple scenarios:

  • Multiple prompts with varying generations
  • Edge cases (single generation, identical rewards, empty input)
  • Masking behavior with valid_mask
  • Cross-device compatibility (CUDA)
  • Numerical precision with extreme values

The test expectations correctly reflect the per-prompt baseline and std computation including Bessel's correction.

nemo_rl/algorithms/grpo.py (4)

1038-1041: LGTM! Correct integration of epsilon-based normalization.

The function is properly called when normalize_rewards is enabled, using the per-prompt std computed earlier.


1153-1177: LGTM! Proper masked advantages metrics tracking.

The implementation correctly:

  1. Extracts flat advantages and token masks from the prepared messages
  2. Filters to only valid response tokens using torch.masked_select
  3. Computes mean/max/min with proper handling for empty tensors (defaults to 0.0)

This provides useful training insights into the advantage distribution over actual response tokens.
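The three-step metric computation described above can be sketched as a small helper. The function name and return keys are hypothetical; only the mechanism (masked_select, then mean/max/min with a 0.0 fallback for empty tensors) follows the review:

```python
import torch

def masked_advantage_metrics(advantages: torch.Tensor, token_mask: torch.Tensor) -> dict:
    # Keep only advantages at valid response-token positions.
    valid = torch.masked_select(advantages, token_mask.bool())
    if valid.numel() == 0:
        # Empty selection defaults every metric to 0.0, as the review notes.
        return {"advantages_mean": 0.0, "advantages_max": 0.0, "advantages_min": 0.0}
    return {
        "advantages_mean": valid.mean().item(),
        "advantages_max": valid.max().item(),
        "advantages_min": valid.min().item(),
    }
```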


1910-1914: LGTM! Consistent normalization in async training path.

The epsilon-based normalization is correctly integrated into the async GRPO training path, matching the sync implementation.


2042-2066: LGTM! Consistent masked advantages metrics in async path.

The masked advantages metrics tracking in the async path correctly mirrors the sync implementation, ensuring consistent observability across both training modes.

@terrykong terrykong enabled auto-merge (squash) November 13, 2025 16:42
@terrykong terrykong added the CI:L1 Run doctests, unit tests, and functional tests label Nov 13, 2025
@terrykong terrykong merged commit b0f3076 into r0.4.0 Nov 13, 2025
64 of 71 checks passed
@terrykong terrykong deleted the cherry-pick-1423-r0.4.0 branch November 13, 2025 21:45

Labels

cherry-pick, CI:L1 (Run doctests, unit tests, and functional tests), Run CICD
