Conversation
…#1423) Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
📝 Walkthrough
The pull request introduces a new epsilon-based advantage normalization function and refactors the advantage/baseline computation pipeline. Changes include per-prompt standard deviation calculation, updated metric tracking for advantages, and comprehensive test coverage for the modified functions across both GRPO training and utility modules.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning) | ✅ Passed checks (3 passed)
Actionable comments posted: 0
🧹 Nitpick comments (1)
nemo_rl/algorithms/grpo.py (1)
540-556: LGTM! Clean epsilon-based normalization implementation.
The function provides a simple, stable approach to advantage normalization by adding epsilon to the denominator instead of masking. When `std` is very small or zero, advantages are divided by `epsilon` (1e-6), producing large normalized values (up to 1e6x). This behavior upweights samples from prompts with low variance, which may be desirable but could impact numerical stability in extreme cases.
If numerical stability becomes a concern in practice, consider clamping the normalized advantages or using a larger epsilon value. However, the current implementation aligns with the PR objectives and test expectations.
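For readers outside the diff, a minimal sketch of what an epsilon-based normalization like this might look like; the actual `normalize_advantages_with_epsilon` in `nemo_rl/algorithms/grpo.py` may differ in signature and details:

```python
import torch

def normalize_advantages_with_epsilon(
    advantages: torch.Tensor,
    std: torch.Tensor,
    epsilon: float = 1e-6,
) -> torch.Tensor:
    """Normalize advantages by the per-prompt std, adding epsilon to the denominator.

    Unlike masking-based approaches, samples with (near-)zero std are kept and
    effectively divided by epsilon, which can scale them by up to ~1e6x.
    """
    return advantages / (std + epsilon)

# Quick check of the zero-std edge case the tests exercise:
adv = torch.tensor([1e-6, -2e-6, 0.0])
std = torch.zeros(3)
print(normalize_advantages_with_epsilon(adv, std))  # ≈ tensor([ 1., -2.,  0.])
```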
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- nemo_rl/algorithms/grpo.py (5 hunks)
- nemo_rl/algorithms/utils.py (2 hunks)
- tests/unit/algorithms/test_grpo.py (2 hunks)
- tests/unit/algorithms/test_utils.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
nemo_rl/algorithms/grpo.py (1)
tests/check_metrics.py (3)
`mean` (52-97), `max` (30-32), `min` (25-27)
tests/unit/algorithms/test_grpo.py (1)
nemo_rl/algorithms/grpo.py (1)
`normalize_advantages_with_epsilon` (540-556)
tests/unit/algorithms/test_utils.py (1)
nemo_rl/algorithms/utils.py (1)
`calculate_baseline_and_std_per_prompt` (51-128)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Lint check
- GitHub Check: Lint check
- GitHub Check: Lint check
- GitHub Check: Post automodel integration comment / Comment on PR
- GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (11)
tests/unit/algorithms/test_grpo.py (2)
27-27: LGTM! Correctly imports the new public function.
The import of `normalize_advantages_with_epsilon` properly exposes it for testing.
1215-1279: Excellent test coverage for the new normalization function.
The test suite comprehensively validates:
- Basic normalization behavior
- Zero std handling (dividing by epsilon)
- All-zero std edge case
- Various tensor shapes
- Negative advantages
All test expectations correctly reflect the epsilon-based normalization approach where zero std results in `advantage / epsilon`.
nemo_rl/algorithms/utils.py (3)
75-75: LGTM! Proper std tensor initialization.
Initializing the std tensor ensures it exists for all samples, including those where std computation is skipped (e.g., when `num_valid <= 1`).
80-80: LGTM! More explicit device construction.
Constructing the CUDA device explicitly from the device ordinal is clearer and more maintainable.
119-126: LGTM! Correct per-prompt std computation with Bessel's correction.
The implementation correctly computes the unbiased sample standard deviation:
- Computes the variance: `mean_of_squares - square_of_mean`
- Applies Bessel's correction: `* (num_valid / (num_valid - 1))`
- Takes the square root and handles NaN
The correction factor properly accounts for the leave-one-out baseline when enabled. The edge case where `num_valid <= 1` is handled by lines 95-98, which skip this computation entirely.
tests/unit/algorithms/test_utils.py (2)
22-22: LGTM! Correctly imports the function under test.
404-595: Excellent comprehensive test coverage!
The test suite validates `calculate_baseline_and_std_per_prompt` across multiple scenarios:
- Multiple prompts with varying generations
- Edge cases (single generation, identical rewards, empty input)
- Masking behavior with `valid_mask`
- Cross-device compatibility (CUDA)
- Numerical precision with extreme values
The test expectations correctly reflect the per-prompt baseline and std computation including Bessel's correction.
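For context, a minimal sketch of the per-prompt baseline and Bessel-corrected std computation reviewed above, assuming a simple grouping by prompt id; the real `calculate_baseline_and_std_per_prompt` (lines 51-128 of `nemo_rl/algorithms/utils.py`) takes different arguments and also supports the leave-one-out baseline, which is not shown here:

```python
import torch

def per_prompt_baseline_and_std(
    prompt_ids: torch.Tensor,   # (N,) integer id of the prompt each sample belongs to
    rewards: torch.Tensor,      # (N,) scalar reward per sample
    valid_mask: torch.Tensor,   # (N,) 1.0 for samples that count, 0.0 otherwise
):
    baseline = torch.zeros_like(rewards)
    std = torch.zeros_like(rewards)
    for pid in prompt_ids.unique():
        idx = (prompt_ids == pid) & (valid_mask > 0)
        num_valid = idx.sum()
        if num_valid <= 1:
            continue  # leave baseline/std at 0 when there is nothing to compare against
        r = rewards[idx]
        mean = r.mean()
        # Biased variance via E[r^2] - (E[r])^2, then Bessel's correction for an unbiased estimate.
        var = (r.pow(2).mean() - mean.pow(2)) * (num_valid / (num_valid - 1))
        baseline[idx] = mean
        std[idx] = var.clamp_min(0.0).sqrt()  # clamp guards against tiny negative values from rounding
    return baseline, std
```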
nemo_rl/algorithms/grpo.py (4)
1038-1041: LGTM! Correct integration of epsilon-based normalization.
The function is properly called when `normalize_rewards` is enabled, using the per-prompt std computed earlier.
1153-1177: LGTM! Proper masked advantages metrics tracking.
The implementation correctly:
- Extracts flat advantages and token masks from the prepared messages
- Filters to only valid response tokens using `torch.masked_select`
- Computes mean/max/min with proper handling for empty tensors (defaults to 0.0)
This provides useful training insights into the advantage distribution over actual response tokens.
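A minimal sketch of the masked metrics computation described above; the helper name and metric keys are illustrative, not taken from `grpo.py`:

```python
import torch

def masked_advantage_metrics(advantages: torch.Tensor, token_mask: torch.Tensor) -> dict:
    """Compute mean/max/min of advantages over valid response tokens only."""
    valid = torch.masked_select(advantages, token_mask.bool())
    if valid.numel() == 0:
        # No valid response tokens: fall back to 0.0 as the review describes.
        return {"advantages_mean": 0.0, "advantages_max": 0.0, "advantages_min": 0.0}
    return {
        "advantages_mean": valid.mean().item(),
        "advantages_max": valid.max().item(),
        "advantages_min": valid.min().item(),
    }
```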
1910-1914: LGTM! Consistent normalization in async training path.
The epsilon-based normalization is correctly integrated into the async GRPO training path, matching the sync implementation.
2042-2066: LGTM! Consistent masked advantages metrics in async path.
The masked advantages metrics tracking in the async path correctly mirrors the sync implementation, ensuring consistent observability across both training modes.
beep boop [🤖]: Hi @ffrujeri 👋,
Summary by CodeRabbit
New Features
Improvements