feat: Overlap param iteration and broadcast in non-colocated refit #1379

terrykong merged 5 commits into NVIDIA-NeMo:main
Conversation
0c96140 to 205d1fb
📝 Walkthrough

Two files are modified: `nemo_rl/utils/packed_tensor.py` and `nemo_rl/models/policy/megatron_policy_worker.py`.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Producer as Packed Producer
    participant BufferLoop as Round-Robin Loop
    participant Stream1 as CUDA Stream (Buffer 1)
    participant Stream2 as CUDA Stream (Buffer N)
    participant Broadcast as Broadcast Op
    participant Consumer as Packed Consumer
    Producer->>Producer: Initialize per-buffer streams<br/>& metadata lists
    loop For each buffer round-robin
        alt Buffer has data
            BufferLoop->>Stream1: Sync stream, pack data
            Stream1->>Stream1: Process buffer data
            Stream1->>Broadcast: Broadcast per-buffer tensor
            Broadcast->>Stream2: Receive broadcast
            Stream2->>Stream2: Unpack to consumer
            Stream2->>Consumer: Yield unpacked data
        else Buffer exhausted
            BufferLoop->>BufferLoop: Handle StopIteration<br/>for this buffer
        end
    end
    Note over Producer,Consumer: Multi-buffer pipelining<br/>with per-buffer synchronization
```
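A minimal sketch of the producer side of this round-robin pattern, assuming a generator `chunk_iter` that yields already-packed flat CUDA tensors; the function name, helper structure, and defaults are illustrative assumptions, not the actual `packed_tensor.py` implementation:

```python
import itertools

import torch
import torch.distributed as dist


def pipelined_broadcast(chunk_iter, num_buffers=2):
    """Round-robin over per-buffer CUDA streams so packing the next chunk
    overlaps with broadcasting the previous one (illustrative sketch only)."""
    streams = [torch.cuda.Stream() for _ in range(num_buffers)]
    pending = [None] * num_buffers  # outstanding async broadcast per buffer slot
    exhausted = False
    for idx in itertools.cycle(range(num_buffers)):
        if exhausted:
            break
        if pending[idx] is not None:
            pending[idx].wait()  # this slot's previous broadcast must finish before reuse
        with torch.cuda.stream(streams[idx]):
            try:
                packed = next(chunk_iter)  # producer packs the next chunk into this slot
            except StopIteration:
                exhausted = True  # no more chunks; drain remaining work below
                continue
            pending[idx] = dist.broadcast(packed, src=0, async_op=True)
        yield packed
    for work in pending:  # drain any broadcasts still in flight
        if work is not None:
            work.wait()
```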
Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

The change introduces moderate complexity through multi-buffer pipelining logic in `nemo_rl/utils/packed_tensor.py`.

Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_rl/utils/packed_tensor.py (1)
25-31: NRL_REFIT_BUFFER_MEMORY_RATIO has inconsistent defaults and memory basis across the codebase; add validation

Your findings are confirmed. The environment variable is used with three different defaults and two different memory bases:

- `packed_tensor.py:25`: default "0.02", calculated from `total_memory` (total GPU capacity)
- `megatron_policy_worker.py:1655`: default "0.2", calculated from `free_memory` (available memory)
- `dtensor_policy_worker_v2.py:1714` & `dtensor_policy_worker.py:1753`: default "0.8", calculated from `free_memory`

This inconsistency can cause unexpected memory allocation behavior. Additionally, the direct `float(memory_ratio)` call at line 30 in packed_tensor.py (and similarly in the policy worker files) lacks error handling: invalid or malformed environment values will raise an uncaught exception.

Recommend:

- Add a validation guard around the float parsing (your suggested try/except is appropriate)
- Document or unify the semantics: decide whether the ratio should apply to total or free memory, and establish a single default across all usages, or clearly document why they differ
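A minimal sketch of such a validation guard, mirroring the lru_cache-based getters already in the file; the function name, bounds check, and error messages here are illustrative assumptions, not the repo's actual code:

```python
import os
from functools import lru_cache


@lru_cache(maxsize=1)
def get_refit_buffer_memory_ratio() -> float:
    """Parse NRL_REFIT_BUFFER_MEMORY_RATIO defensively (illustrative sketch)."""
    raw = os.getenv("NRL_REFIT_BUFFER_MEMORY_RATIO", "0.02")
    try:
        ratio = float(raw)
    except ValueError:
        # Fail loudly with a clear message instead of an uncaught ValueError.
        raise ValueError(
            f"NRL_REFIT_BUFFER_MEMORY_RATIO must be a float, got {raw!r}"
        ) from None
    if not 0.0 < ratio <= 1.0:
        raise ValueError(f"NRL_REFIT_BUFFER_MEMORY_RATIO must be in (0, 1], got {ratio}")
    return ratio
```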
🧹 Nitpick comments (2)
nemo_rl/utils/packed_tensor.py (2)
34-36: Add docstring and clamp for NRL_REFIT_NUM_BUFFERS

Expose intent and prevent invalid config (e.g., 0 or negative) causing modulo/division errors.

Apply:

```diff
 @lru_cache(maxsize=1)
 def get_num_buffers():
-    return int(os.getenv("NRL_REFIT_NUM_BUFFERS", "2"))
+    """Number of pipeline buffers used to overlap pack/broadcast.
+
+    Controlled by env NRL_REFIT_NUM_BUFFERS. Must be >= 1. Default: 2.
+    """
+    try:
+        n = int(os.getenv("NRL_REFIT_NUM_BUFFERS", "2"))
+    except Exception:
+        n = 2
+    return max(1, n)
```
142-152: Same setup on consumer side: consider docstrings for per-buffer state for maintainability

Optional: briefly document `packing_tensor_meta_data`/`offsets` to ease future changes.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- nemo_rl/models/policy/megatron_policy_worker.py (1 hunks)
- nemo_rl/utils/packed_tensor.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts
Files:
- nemo_rl/utils/packed_tensor.py
- nemo_rl/models/policy/megatron_policy_worker.py
nemo_rl/**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)
Files:
- nemo_rl/utils/packed_tensor.py
- nemo_rl/models/policy/megatron_policy_worker.py
🧬 Code graph analysis (1)
nemo_rl/utils/packed_tensor.py (1)
tests/unit/utils/test_packed_tensor.py (3)
broadcast (33-37), broadcast (47-51), post_unpack_func (111-114)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Lint check
- GitHub Check: L0_Unit_Tests_Policy
- GitHub Check: L0_Unit_Tests_Other
- GitHub Check: L0_Unit_Tests_Generation
🔇 Additional comments (1)
nemo_rl/models/policy/megatron_policy_worker.py (1)
1758-1762: The concern assumes `broadcast_weights_for_collective()` could execute with `self.refit_conversion_tasks` uninitialized (None), but this is contradicted by the codebase structure:

- grpo.py line 455: `prepare_refit_info()` is called during setup
- grpo.py line 549: `broadcast_weights_for_collective()` is called later in the training loop
- distillation.py follows the same pattern

Since `prepare_refit_info()` always runs before `broadcast_weights_for_collective()`, `self.refit_conversion_tasks` is guaranteed to be populated (not None) when lines 1758-1762 execute. No guard or fallback is required.

Likely an incorrect or invalid review comment.
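A tiny runnable illustration of the ordering argument above; the stub class is hypothetical, and the real methods live on the Megatron policy worker and are invoked from grpo.py:

```python
class PolicyStub:
    """Stub mimicking the call order guaranteed by grpo.py (illustrative only)."""

    def __init__(self):
        self.refit_conversion_tasks = None

    def prepare_refit_info(self):
        # Called once during setup (grpo.py:455): populates the tasks.
        self.refit_conversion_tasks = ["conversion_task"]

    def broadcast_weights_for_collective(self):
        # Called later in the training loop (grpo.py:549): tasks are already set.
        assert self.refit_conversion_tasks is not None


policy = PolicyStub()
policy.prepare_refit_info()
for _step in range(3):
    policy.broadcast_weights_for_collective()
```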
205d1fb to 218d1cd
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
218d1cd to f2e9023
Hi @terrykong, can I ask for your help with the merge? This is the PR that I shared in the meeting yesterday.
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
7d456ec
What does this PR do?
Overlaps parameter iteration with packing/broadcasting in non-colocated GRPO refit to improve performance.
Performance
Profile
Before
This PR
ENV variables to control
- NRL_REFIT_BUFFER_MEMORY_RATIO: each buffer size will be total_gpu_memory_size x NRL_REFIT_BUFFER_MEMORY_RATIO
- NRL_REFIT_NUM_BUFFERS: degree of parallelism; 2 will be sufficient in most cases.
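A rough sketch of how these two variables translate into staging-buffer sizes; the defaults mirror values quoted in the review above, and the exact formula in nemo_rl/utils/packed_tensor.py may differ:

```python
import os

import torch

# Illustrative defaults only.
ratio = float(os.getenv("NRL_REFIT_BUFFER_MEMORY_RATIO", "0.02"))
num_buffers = int(os.getenv("NRL_REFIT_NUM_BUFFERS", "2"))

# Each buffer is a fixed fraction of total GPU memory.
total_bytes = torch.cuda.get_device_properties(0).total_memory
buffer_bytes = int(total_bytes * ratio)
print(f"{num_buffers} refit buffers of ~{buffer_bytes / 2**30:.2f} GiB each")
```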
Convergence run

https://wandb.ai/nvidia/async-grpo-refit-convergence?nw=nwuseryoungeunk

LLaMa3 8B token_mult_prob_error
opt0: baseline without any optimization (main branch before applying #1264 and #1313)
opt123: applying #1264 and #1313
opt1234: this PR (current main ToT)
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
New Features
Performance