Conversation
📝 WalkthroughWalkthroughThe PR adds KL divergence metrics (gen_kl, policy_kl, and js_divergence) to track policy distribution divergence. These metrics are implemented in the ClippedPGLossFn loss function and documented with detailed explanations, interpretations, and concrete examples in the guides. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Pre-merge checks and finishing touches✅ Passed checks (4 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
docs/guides/grpo.md(1 hunks)nemo_rl/algorithms/loss_functions.py(2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
docs/**/*.md
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
When a markdown doc under docs/**/*.md is added or renamed, update docs/index.md to include it in the appropriate section
Files:
docs/guides/grpo.md
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts
Files:
nemo_rl/algorithms/loss_functions.py
nemo_rl/**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)
Files:
nemo_rl/algorithms/loss_functions.py
🧬 Code graph analysis (1)
nemo_rl/algorithms/loss_functions.py (1)
nemo_rl/algorithms/utils.py (1)
masked_mean(134-146)
🪛 LanguageTool
docs/guides/grpo.md
[grammar] ~285-~285: Ensure spelling is correct
Context: ...: exp(-5 + 15) - (-5 + 15) - 1 = 22015, servere mismatch dominiting the metrics. * `js_...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~285-~285: Ensure spelling is correct
Context: ...(-5 + 15) - 1 = 22015, servere mismatch dominiting the metrics. * js_divergence: close t...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🪛 markdownlint-cli2 (0.18.1)
docs/guides/grpo.md
276-276: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.14.1)
nemo_rl/algorithms/loss_functions.py
191-191: Comment contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF003)
🔇 Additional comments (3)
nemo_rl/algorithms/loss_functions.py (3)
171-179: LGTM! Correct implementation of gen_kl.The gen_kl metric correctly computes KL(P_gen || P_train) using the Schulman approximation. The formula
exp(log_ratio) - log_ratio - 1withlog_ratio = prev_logprobs - generation_logprobsproperly implements the divergence from the generation distribution to the training distribution.
181-189: LGTM! Correct implementation of policy_kl.The policy_kl metric correctly computes KL(P_train || P_gen) by reversing the log ratio direction. This provides a complementary view to gen_kl, treating the policy distribution as ground truth.
416-418: LGTM! Metrics properly integrated into return dictionary.The three new KL/divergence metrics are correctly added to the metrics dictionary and will be logged alongside existing metrics.
Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>
fa2bff4 to
464934e
Compare
terrykong
left a comment
There was a problem hiding this comment.
small comment. otherwise lgtm
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>
Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>
Signed-off-by: Zhiyu Li <zhiyul@nvidia.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com> Co-authored-by: Terry Kong <terrycurtiskong@gmail.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Signed-off-by: Zhiyu Li <zhiyul@nvidia.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com> Co-authored-by: Terry Kong <terrycurtiskong@gmail.com> Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: Zhiyu Li <zhiyul@nvidia.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com> Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
Signed-off-by: Zhiyu Li <zhiyul@nvidia.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com> Co-authored-by: Terry Kong <terrycurtiskong@gmail.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
What does this PR do ?
Add metrics:


Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use thisBefore your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
Release Notes
New Features
Documentation