
feat: Overlap param iteration and broadcast in non-colocated refit#1379

Merged
terrykong merged 5 commits into NVIDIA-NeMo:main from youngeunkwon0405:async-refit-overlap
Oct 23, 2025

Conversation

Contributor

@youngeunkwon0405 youngeunkwon0405 commented Oct 16, 2025

What does this PR do ?

Overlaps parameter iteration with packing/broadcasting in the non-colocated GRPO refit path to improve performance.

Performance

  • QWEN3 235B: 13 s --> 5.5 s (2.36x)
  • DSV3: 34 s --> 14 s (2.4x)

Profile

Before

(profiler timeline screenshot)

This PR

(profiler timeline screenshot)

ENV variables to control

  • NRL_REFIT_BUFFER_MEMORY_RATIO: each buffer's size is total_gpu_memory_size × NRL_REFIT_BUFFER_MEMORY_RATIO
  • NRL_REFIT_NUM_BUFFERS: degree of pipelining; 2 is sufficient in most cases
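The buffer sizing described above can be sketched as follows; `compute_buffer_size` is a hypothetical helper (not the actual NeMo-RL function), shown only to illustrate how the ratio maps to bytes:

```python
import os

# Hypothetical illustration of how NRL_REFIT_BUFFER_MEMORY_RATIO maps to a
# per-buffer size in bytes; the real implementation lives in
# nemo_rl/utils/packed_tensor.py and may differ.
def compute_buffer_size(total_gpu_memory_bytes: int) -> int:
    # The default of 0.02 matches the new default introduced by this PR.
    ratio = float(os.getenv("NRL_REFIT_BUFFER_MEMORY_RATIO", "0.02"))
    return int(total_gpu_memory_bytes * ratio)
```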

Convergence run

https://wandb.ai/nvidia/async-grpo-refit-convergence?nw=nwuseryoungeunk
  • LLaMa3 8B token_mult_prob_error
  • opt0: baseline without any optimization (main branch before applying #1264 and #1313)
  • opt123: #1264 and #1313 applied
  • opt1234: this PR (current main ToT)

(convergence plot screenshot)

Issues


Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this
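A minimal sketch of how these knobs could be set before launching a run (the variable names come from this PR; the values and surrounding setup are illustrative):

```python
import os

# Illustrative only: configure the refit pipelining knobs before the
# training entry point is invoked. 2 buffers is the default and is
# usually sufficient; the memory ratio is a fraction of total GPU memory.
os.environ["NRL_REFIT_NUM_BUFFERS"] = "2"
os.environ["NRL_REFIT_BUFFER_MEMORY_RATIO"] = "0.02"
```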

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features

    • Added multi-buffer pipelining for enhanced data throughput and processing efficiency
  • Performance

    • Optimized memory efficiency in tensor packing operations
    • Improved parameter conversion handling during model weight export

@youngeunkwon0405 youngeunkwon0405 self-assigned this Oct 16, 2025
@youngeunkwon0405 youngeunkwon0405 requested review from a team as code owners October 16, 2025 18:50
@youngeunkwon0405 youngeunkwon0405 marked this pull request as draft October 16, 2025 18:50
@youngeunkwon0405 youngeunkwon0405 added the CI:L1 Run doctests, unit tests, and functional tests label Oct 16, 2025
@youngeunkwon0405 youngeunkwon0405 marked this pull request as ready for review October 17, 2025 01:04
@coderabbitai
Contributor

coderabbitai Bot commented Oct 17, 2025

📝 Walkthrough

Walkthrough

Two files are modified: megatron_policy_worker.py receives a keyword argument addition for conversion tasks in the HF weights export call, while packed_tensor.py implements multi-buffer pipelining for packed tensor operations, introduces a new public helper function, and adjusts default memory ratio parameters.

Changes

Cohort / File(s) Summary
Megatron weight export
nemo_rl/models/policy/megatron_policy_worker.py
Added conversion_tasks=self.refit_conversion_tasks parameter to self.megatron_bridge.export_hf_weights() call in broadcast_weights_for_collective method
Packed tensor multi-buffer pipelining
nemo_rl/utils/packed_tensor.py
Introduced get_num_buffers() public helper (cached) to retrieve buffer count from environment variable; changed default memory ratio from 0.01 to 0.02; reworked packed_broadcast_producer and packed_broadcast_consumer to support multi-buffer pipelining with per-buffer streams, packing lists, sizes, and per-buffer state management

Sequence Diagram(s)

sequenceDiagram
    participant Producer as Packed Producer
    participant BufferLoop as Round-Robin Loop
    participant Stream1 as CUDA Stream (Buffer 1)
    participant Stream2 as CUDA Stream (Buffer N)
    participant Broadcast as Broadcast Op
    participant Consumer as Packed Consumer

    Producer->>Producer: Initialize per-buffer streams<br/>& metadata lists
    
    loop For each buffer round-robin
        alt Buffer has data
            BufferLoop->>Stream1: Sync stream, pack data
            Stream1->>Stream1: Process buffer data
            Stream1->>Broadcast: Broadcast per-buffer tensor
            Broadcast->>Stream2: Receive broadcast
            Stream2->>Stream2: Unpack to consumer
            Stream2->>Consumer: Yield unpacked data
        else Buffer exhausted
            BufferLoop->>BufferLoop: Handle StopIteration<br/>for this buffer
        end
    end
    
    Note over Producer,Consumer: Multi-buffer pipelining<br/>with per-buffer synchronization

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

The change introduces moderate complexity through multi-buffer pipelining logic in packed_tensor.py with new per-buffer state management and synchronization patterns, balanced against a minimal one-line modification in megatron_policy_worker.py. Review requires understanding the buffering mechanics, per-buffer stream handling, and control flow changes, though the modifications follow a consistent pattern.
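The round-robin mechanics can be illustrated with a CPU-only toy model; `pipelined_broadcast` is a stand-in, not the PR's code, and a list append takes the place of an asynchronous CUDA broadcast:

```python
# Toy model of the multi-buffer round-robin: parameters are packed into the
# current buffer, and when it fills it is "broadcast" (flushed) while packing
# rotates to the next buffer. The real code overlaps these steps with
# per-buffer CUDA streams; here the flush is just a list append.
def pipelined_broadcast(params, num_buffers=2, capacity=2):
    buffers = [[] for _ in range(num_buffers)]
    broadcasts = []
    idx = 0  # index of the buffer currently being packed
    for p in params:
        buffers[idx].append(p)
        if len(buffers[idx]) == capacity:
            broadcasts.append(list(buffers[idx]))  # stand-in for async broadcast
            buffers[idx].clear()
            idx = (idx + 1) % num_buffers  # rotate to the next buffer
    for b in buffers:  # flush any partially filled buffers at the end
        if b:
            broadcasts.append(list(b))
    return broadcasts
```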

Possibly related PRs

Suggested labels

Performance

Suggested reviewers

  • terrykong
  • yuki-97
  • guyueh1

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Test Results For Major Changes ⚠️ Warning The PR represents a major feature implementing multi-buffer pipelining to overlap parameter iteration and broadcasting for performance optimization. While the PR description includes a convergence run reference (W&B link) with a LLaMa3 8B convergence graph demonstrating that numerics are not regressed, the PR falls short on key testing documentation requirements. Specifically, the PR provides no inline performance metrics showing execution time improvements or memory efficiency gains from the optimization, only qualitative execution flow diagrams. Additionally, all pre-check items in the PR checklist remain unchecked, indicating that formal test results, unit tests, and functional tests have not been documented or confirmed to have been run locally. The convergence reference is helpful for establishing lack of numerical regression, but it does not address the quantified performance impact that the instruction requires for performance-affecting changes. To pass this check, the PR should be updated to include: (1) quantified before-and-after performance metrics (execution time reduction percentage, memory efficiency gains) with the test configuration clearly specified (model size, batch size, number of tokens, etc.); (2) documented unit and/or functional test results confirming the changes work correctly; and (3) completion of the pre-check checklist items showing that contributor guidelines were followed, tests were written and executed locally, and documentation was updated as needed. The current convergence run link helps establish numerical stability but is insufficient without concrete performance numbers demonstrating the claimed optimization benefit.
✅ Passed checks (2 passed)
Check name Status Explanation
Title Check ✅ Passed The PR title "feat: Overlap param iteration and broadcast in non-colocated refit" directly aligns with the main changes in the changeset. The substantial modifications to packed_tensor.py implement multi-buffer pipelining with per-buffer synchronization to achieve exactly this goal of overlapping parameter iteration with packing/broadcasting operations. The supporting change in megatron_policy_worker.py propagates conversion tasks through the export process to enable this functionality. The title is specific about the key optimization (overlap), the affected components (param iteration and broadcast), and the context (non-colocated refit), providing clear understanding of the changeset's primary purpose without unnecessary detail.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@youngeunkwon0405 youngeunkwon0405 added the Performance Related to improving performance label Oct 17, 2025
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_rl/utils/packed_tensor.py (1)

25-31: NRL_REFIT_BUFFER_MEMORY_RATIO has inconsistent defaults and memory basis across codebase; add validation

Your findings are confirmed. The environment variable is used with three different defaults and two different memory bases:

  • packed_tensor.py:25: default "0.02", calculates from total_memory (total GPU capacity)
  • megatron_policy_worker.py:1655: default "0.2", calculates from free_memory (available memory)
  • dtensor_policy_worker_v2.py:1714 & dtensor_policy_worker.py:1753: default "0.8", calculates from free_memory

This inconsistency can cause unexpected memory allocation behavior. Additionally, the direct float(memory_ratio) call at line 30 in packed_tensor.py (and similarly in policy worker files) lacks error handling—invalid or malformed environment values will raise an uncaught exception.

Recommend:

  1. Add validation guard around float parsing (your suggested try/catch is appropriate)
  2. Document or unify the semantics: decide whether the ratio should apply to total or free memory, and establish a single default across all usages, or clearly document why they differ
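The validation suggested in item 1 could look like the following (a sketch, not the committed fix; the function name is illustrative):

```python
import os

# Defensive parsing for NRL_REFIT_BUFFER_MEMORY_RATIO, as suggested above:
# fall back to the default instead of raising an uncaught ValueError on a
# malformed value, and clamp to a sane (0, 1] range.
def get_memory_ratio(default: float = 0.02) -> float:
    raw = os.getenv("NRL_REFIT_BUFFER_MEMORY_RATIO", str(default))
    try:
        ratio = float(raw)
    except ValueError:
        ratio = default
    if not (0.0 < ratio <= 1.0):
        ratio = default
    return ratio
```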
🧹 Nitpick comments (2)
nemo_rl/utils/packed_tensor.py (2)

34-36: Add docstring and clamp for NRL_REFIT_NUM_BUFFERS

Expose intent and prevent invalid config (e.g., 0 or negative) causing modulo/division errors.

Apply:

 @lru_cache(maxsize=1)
 def get_num_buffers():
-    return int(os.getenv("NRL_REFIT_NUM_BUFFERS", "2"))
+    """Number of pipeline buffers used to overlap pack/broadcast.
+
+    Controlled by env NRL_REFIT_NUM_BUFFERS. Must be >= 1. Default: 2.
+    """
+    try:
+        n = int(os.getenv("NRL_REFIT_NUM_BUFFERS", "2"))
+    except Exception:
+        n = 2
+    return max(1, n)

142-152: Same setup on consumer side: consider docstrings for per-buffer state for maintainability

Optional: briefly document packing_tensor_meta_data/offsets to ease future changes.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dee3fd9 and 205d1fb.

📒 Files selected for processing (2)
  • nemo_rl/models/policy/megatron_policy_worker.py (1 hunks)
  • nemo_rl/utils/packed_tensor.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • nemo_rl/utils/packed_tensor.py
  • nemo_rl/models/policy/megatron_policy_worker.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/utils/packed_tensor.py
  • nemo_rl/models/policy/megatron_policy_worker.py
🧬 Code graph analysis (1)
nemo_rl/utils/packed_tensor.py (1)
tests/unit/utils/test_packed_tensor.py (3)
  • broadcast (33-37)
  • broadcast (47-51)
  • post_unpack_func (111-114)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Lint check
  • GitHub Check: L0_Unit_Tests_Policy
  • GitHub Check: L0_Unit_Tests_Other
  • GitHub Check: L0_Unit_Tests_Generation
🔇 Additional comments (1)
nemo_rl/models/policy/megatron_policy_worker.py (1)

1758-1762: No guard needed; self.refit_conversion_tasks is always initialized before use

The concern assumes broadcast_weights_for_collective() could execute with self.refit_conversion_tasks uninitialized (None), but this is contradicted by the codebase structure:

  • grpo.py line 455: prepare_refit_info() is called during setup
  • grpo.py line 549: broadcast_weights_for_collective() is called later in the training loop
  • distillation.py follows the same pattern

Since prepare_refit_info() always runs before broadcast_weights_for_collective(), self.refit_conversion_tasks is guaranteed to be populated (not None) when line 1758-1762 executes. No guard or fallback is required.

Likely an incorrect or invalid review comment.

Comment thread nemo_rl/utils/packed_tensor.py
Comment thread nemo_rl/utils/packed_tensor.py
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
@youngeunkwon0405 youngeunkwon0405 marked this pull request as ready for review October 20, 2025 20:26
@youngeunkwon0405 youngeunkwon0405 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Oct 20, 2025
@youngeunkwon0405
Contributor Author

Hi @guyueh1 and @yuki-97, can I ask for your review, please? This PR passes the CI pipeline, and I checked the token_mult_prob_error from multiple wandb reports; it was near 1 except for DSV3, which has a known issue.

Comment thread nemo_rl/models/policy/megatron_policy_worker.py Outdated
guyueh1
guyueh1 previously approved these changes Oct 21, 2025
Contributor

@guyueh1 guyueh1 left a comment


LGTM

@youngeunkwon0405
Contributor Author

Hi @terrykong, can I ask for your help with the merge? This is the PR that I shared in the meeting yesterday.

yuki-97
yuki-97 previously approved these changes Oct 22, 2025
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
@youngeunkwon0405 youngeunkwon0405 dismissed stale reviews from yuki-97 and guyueh1 via 7d456ec October 22, 2025 18:03
@youngeunkwon0405 youngeunkwon0405 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Oct 22, 2025
@guyueh1 guyueh1 requested a review from yuki-97 October 23, 2025 17:14
@terrykong terrykong merged commit 73e0c09 into NVIDIA-NeMo:main Oct 23, 2025
41 of 42 checks passed
chtruong814 pushed a commit that referenced this pull request Oct 23, 2025
…1379)

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
ZhiyuLi-Nvidia pushed a commit that referenced this pull request Oct 25, 2025
…1379)

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
@guyueh1 guyueh1 linked an issue Oct 28, 2025 that may be closed by this pull request
lbliii pushed a commit that referenced this pull request Nov 3, 2025
…1379)

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
…VIDIA-NeMo#1379)

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>

Labels

CI:L1 Run doctests, unit tests, and functional tests Performance Related to improving performance r0.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Refit speedup for non-colocated (NCCL)

5 participants