feat: Integrate Penguin env logic by bxyu-nvidia · Pull Request #1450 · NVIDIA-NeMo/RL

bxyu-nvidia · 2025-10-31T19:12:38Z

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...

Summary by CodeRabbit

Release Notes

New Features
- Integrated Penguin async rollout pathway with Ray distributed computing support
- Enhanced vLLM server configuration with improved model identification
Bug Fixes
- Fixed TensorBoard logging to filter out non-primitive metric types
- Reduced server log noise by suppressing standard access logs
Dependencies
- Pinned OpenAI to version ≤2.6.1
- Added Ray framework support
Tests
- Added comprehensive test coverage for Penguin async rollouts

Signed-off-by: Brian Yu <bxyu@nvidia.com>

coderabbitai · 2025-10-31T19:19:01Z

📝 Walkthrough

Walkthrough

This PR introduces Penguin-based asynchronous rollout capability integrated with Ray and vLLM, extending the GRPO training pipeline. Changes include Penguin environment initialization with Ray integration, new async rollout functions, vLLM server configuration updates, and corresponding test coverage.

Changes

Cohort / File(s)	Summary
Build and dependencies `.gitignore`, `3rdparty/Penguin-workspace/setup.py`	Added `.cache/` to ignore rules; pinned openai to <=2.6.1 and added ray[default] as runtime dependencies.
GRPO training orchestration `nemo_rl/algorithms/grpo.py`	Added `_should_use_penguin()` helper to validate Penguin usage; integrated Penguin rollout execution prior to async rollouts in training and validation flows, with result merging and metric collection.
Penguin environment `nemo_rl/environments/penguin.py`	Added Ray initialization validation and GCS address extraction in Penguin.init; propagates head node address to global config.
Async rollout pipeline `nemo_rl/experience/rollouts.py`	Introduced `run_async_penguin_rollout()` function with per-row generation parameter updates and per-sample metrics computation; added `AsyncPenguinRolloutResult` dataclass and helper utilities (`_tensorize_by_key()`, `_calculate_single_metric()`); imported GenerationConfig and wandb logging.
vLLM server configuration `nemo_rl/models/generation/vllm/vllm_worker.py`	Added `served_model_name` parameter to vLLM LLM initialization kwargs.
vLLM OpenAI API server `nemo_rl/models/generation/vllm/vllm_worker_async.py`	Added second `BaseModelPath` entry; replaced `template_prefix_token_ids` with `actual_corresponding_token_ids` in chat preprocessing; removed custom `create_tokenize` override; introduced uvicorn access log filtering to suppress 200 OK messages.
Logging filtering `nemo_rl/utils/logger.py`	Updated `TensorboardLogger.log_metrics()` to filter and skip non-primitive metric types, preventing incompatible objects from being logged.
Test configuration and coverage `tests/unit/environments/test_penguin.py`, `tests/unit/experience/test_rollouts.py`, `tests/unit/models/generation/test_vllm_generation.py`	Added `uses_reasoning_parser: true` to Penguin YAML config; added `test_run_async_penguin_rollout()` test with data preparation, collation, and result validation; added HTTP request model field in `test_vllm_http_server()`.

Sequence Diagram(s)

sequenceDiagram
    participant GRPO as GRPO Training
    participant Penguin as Penguin Rollout<br/>(Ray-enabled)
    participant AsyncRL as Async RL Rollout
    participant vLLM as vLLM Engine
    
    rect rgb(200, 240, 255)
    Note over GRPO,vLLM: New Penguin-prioritized flow
    end
    
    GRPO->>GRPO: _should_use_penguin(config)?
    alt Penguin enabled
        GRPO->>Penguin: run_async_penguin_rollout()
        Penguin->>Penguin: Validate Ray initialized<br/>Fetch GCS address
        Penguin->>vLLM: Per-row generation via HTTP
        vLLM-->>Penguin: Token responses
        Penguin->>Penguin: Compute per-sample metrics<br/>(turns, rewards, etc.)
        Penguin-->>GRPO: AsyncPenguinRolloutResult<br/>(input_ids, final_batch, metrics)
        GRPO->>GRPO: Merge Penguin results<br/>into batch
    end
    
    rect rgb(240, 220, 255)
    Note over GRPO,vLLM: Fallback to existing async path
    end
    
    alt Async vLLM enabled (fallback)
        GRPO->>AsyncRL: Existing async rollout
        AsyncRL->>vLLM: Batch generation
        vLLM-->>AsyncRL: Results
        AsyncRL-->>GRPO: Batch
    else Standard multi-turn
        GRPO->>GRPO: Multi-turn rollout
    end
    
    GRPO->>GRPO: Update metrics & continue training

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

nemo_rl/algorithms/grpo.py: Integration of Penguin rollout orchestration into training and validation pipelines introduces conditional branching and result merging logic requiring verification of correct metric aggregation and batch handling.
nemo_rl/experience/rollouts.py: Substantial new run_async_penguin_rollout() function with per-row generation parameter handling, metric computation pipeline, and tensor construction—requires careful review of correctness in metric aggregation, logging format, and AsyncPenguinRolloutResult structure.
nemo_rl/models/generation/vllm/vllm_worker_async.py: Multiple interdependent changes (BaseModelPath duplication, tokenization logic replacement, log filtering)—requires verification that the removed custom tokenization override doesn't break downstream expectations and that token ID alignment is correct.
tests/unit/experience/test_rollouts.py: New test function depends on complex fixture setup and data transformation (penguin_example_to_nemo_rl_datum_spec); requires validation that test data assumptions align with actual Penguin rollout behavior.
nemo_rl/environments/penguin.py: Ray initialization validation adds external runtime dependency; ensure Ray context is correctly captured and propagated through config.

Possibly related PRs

PR feat: Expose async vLLM engine as HTTP server #1110: Modifies async vLLM worker to expose in-process HTTP server and related utilities (server base URLs, tokenization/HTTP bridging), providing the vLLM HTTP infrastructure this PR depends on.
PR feat: Add Penguin env #1327: Adds and modifies Penguin environment integration (penguin.py, rollout paths, tests, and packaging), directly overlapping with core Penguin infrastructure introduced here.
PR feat: Add Penguin stub #1325: Updates Penguin workspace packaging (setup.py) with related dependency pinning and registry support, directly related to dependency management changes in this PR.

Suggested reviewers

terrykong
parthchadha

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 59.09% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.
Test Results For Major Changes	⚠️ Warning	This PR introduces major feature additions by integrating Penguin environment logic throughout the codebase, including new public functions, dataclasses, and significant refactoring of rollout orchestration. However, the PR description consists primarily of template placeholders with no substantive test results, performance metrics, convergence regression information, or documented testing outcomes. While tests have been added to the codebase, the PR description itself lacks the required documentation of test results to verify that these major changes do not introduce regressions or negative effects on performance or training dynamics. The existence of an unresolved critical issue flagged in review comments further suggests insufficient testing verification.	To pass this check, the PR description should be updated to include: (1) documented test results showing that the new Penguin rollout path executes successfully across various batch sizes and configurations, (2) convergence/numeric regression verification demonstrating that enabling Penguin does not degrade training outcomes, (3) performance benchmarks comparing rollout latency and throughput with and without Penguin enabled, and (4) confirmation that all pre-merge checklist items have been addressed and completed. Additionally, the critical bug regarding single-sample batch handling in `_calculate_single_metric()` should be fixed before merge.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title "feat: Integrate Penguin env logic" directly reflects the primary purpose of the changeset, which involves integrating Penguin environment functionality across multiple core modules including algorithms (grpo.py), environments (penguin.py), experience/rollouts (rollouts.py), and supporting infrastructure like vLLM workers and logging. The title uses the conventional "feat:" prefix, is concise and clear, and provides sufficient specificity that a developer scanning the repository history would understand this PR introduces Penguin integration rather than describing the changes in vague or generic terms.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch bxyu/integrate-penguin-env-logic

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

nemo_rl/utils/logger.py (1)
133-137: Consider logging a warning for skipped metrics.

While filtering non-primitive types is correct for TensorBoard compatibility, silently skipping metrics with continue could make debugging difficult if users expect certain metrics to appear. Consider logging a debug or warning message when metrics are skipped.

Apply this diff to add visibility:
             # Penguin will add additional metrics like wandb histograms. However, some people will log to Tensorboard instead which may not be compatible
             # This logic catches non-compatible objects being logged.
             if not isinstance(value, (int, float, bool, str)):
+                logging.getLogger(__name__).debug(
+                    f"Skipping non-primitive metric '{name}' of type {type(value).__name__} for TensorBoard"
+                )
                 continue

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd2e645 and 11bd9a1.

📒 Files selected for processing (11)

.gitignore (1 hunks)
3rdparty/Penguin-workspace/setup.py (2 hunks)
nemo_rl/algorithms/grpo.py (5 hunks)
nemo_rl/environments/penguin.py (1 hunks)
nemo_rl/experience/rollouts.py (3 hunks)
nemo_rl/models/generation/vllm/vllm_worker.py (1 hunks)
nemo_rl/models/generation/vllm/vllm_worker_async.py (5 hunks)
nemo_rl/utils/logger.py (1 hunks)
tests/unit/environments/test_penguin.py (1 hunks)
tests/unit/experience/test_rollouts.py (3 hunks)
tests/unit/models/generation/test_vllm_generation.py (1 hunks)

🧰 Additional context used

📓 Path-based instructions (2)

**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

tests/unit/models/generation/test_vllm_generation.py
tests/unit/environments/test_penguin.py
nemo_rl/models/generation/vllm/vllm_worker.py
nemo_rl/utils/logger.py
nemo_rl/environments/penguin.py
nemo_rl/algorithms/grpo.py
nemo_rl/models/generation/vllm/vllm_worker_async.py
3rdparty/Penguin-workspace/setup.py
nemo_rl/experience/rollouts.py
tests/unit/experience/test_rollouts.py

nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

nemo_rl/models/generation/vllm/vllm_worker.py
nemo_rl/utils/logger.py
nemo_rl/environments/penguin.py
nemo_rl/algorithms/grpo.py
nemo_rl/models/generation/vllm/vllm_worker_async.py
nemo_rl/experience/rollouts.py

🧠 Learnings (4)

📚 Learning: 2025-09-10T05:34:35.406Z

Learnt from: bxyu-nvidia
Repo: NVIDIA-NeMo/RL PR: 1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:346-359
Timestamp: 2025-09-10T05:34:35.406Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server intentionally uses different path structures: `/v1/chat/completions` is under the `/v1` prefix while `/tokenize` is at the root level without the `/v1` prefix. This is the intended design.

Applied to files:

tests/unit/models/generation/test_vllm_generation.py
nemo_rl/models/generation/vllm/vllm_worker_async.py

📚 Learning: 2025-09-10T05:35:59.840Z

Learnt from: bxyu-nvidia
Repo: NVIDIA-NeMo/RL PR: 1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:363-369
Timestamp: 2025-09-10T05:35:59.840Z
Learning: In nemo_rl/models/generation/vllm/vllm_worker_async.py, the HTTP server should explicitly bind to "0.0.0.0" (all interfaces) rather than a specific interface, as confirmed by bxyu-nvidia. This is an intentional design decision for the vLLM HTTP server functionality.

Applied to files:

tests/unit/models/generation/test_vllm_generation.py
nemo_rl/models/generation/vllm/vllm_worker_async.py

📚 Learning: 2025-09-20T14:58:45.492Z

Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to examples/configs/recipes/**/*.yaml : Recipe YAMLs under examples/configs/recipes/** are runnable snapshots and may omit documentation

Applied to files:

.gitignore

📚 Learning: 2025-09-10T05:29:34.349Z

Learnt from: bxyu-nvidia
Repo: NVIDIA-NeMo/RL PR: 1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:98-105
Timestamp: 2025-09-10T05:29:34.349Z
Learning: In the _maybe_correct_merged_tokens function in nemo_rl/models/generation/vllm/vllm_worker_async.py, the loop condition `len(candidate_token_ids) < len(actual_token_ids) - 1` is intentionally designed to prevent accessing the final token in actual_token_ids, likely to handle specific tokenization edge cases in the vLLM HTTP server integration.

Applied to files:

nemo_rl/models/generation/vllm/vllm_worker_async.py

🧬 Code graph analysis (3)

nemo_rl/algorithms/grpo.py (1)

nemo_rl/experience/rollouts.py (1)

run_async_penguin_rollout (965-1155)

nemo_rl/experience/rollouts.py (5)

nemo_rl/models/generation/interfaces.py (2)

GenerationConfig (118-131)

GenerationInterface (215-249)

nemo_rl/distributed/batched_data_dict.py (1)

BatchedDataDict (75-860)

nemo_rl/data/interfaces.py (1)

DatumSpec (32-40)

nemo_rl/environments/penguin.py (1)

run_rollouts (107-115)

nemo_rl/data/llm_message_utils.py (1)

batched_message_log_to_flat_message (233-390)

tests/unit/experience/test_rollouts.py (5)

nemo_rl/data/collate_fn.py (1)

rl_collate_fn (29-73)

nemo_rl/data/interfaces.py (1)

DatumSpec (32-40)

nemo_rl/environments/penguin.py (1)

penguin_example_to_nemo_rl_datum_spec (199-212)

nemo_rl/experience/rollouts.py (2)

run_async_multi_turn_rollout (779-934)

run_async_penguin_rollout (965-1155)

nemo_rl/distributed/batched_data_dict.py (2)

BatchedDataDict (75-860)

get_dict (858-860)

🪛 Ruff (0.14.2)

nemo_rl/experience/rollouts.py

1077-1077: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

tests/unit/experience/test_rollouts.py

46-46: Unused noqa directive (non-enabled: F401)