
feat: Overlap param iteration and broadcast in non-colocated refit#1379

Merged
terrykong merged 5 commits into NVIDIA-NeMo:main from youngeunkwon0405:async-refit-overlap
Oct 23, 2025

Conversation

Contributor

@youngeunkwon0405 youngeunkwon0405 commented Oct 16, 2025

What does this PR do ?

Overlaps parameter iteration with packing/broadcasting in the non-colocated GRPO refit path to improve performance.

Performance

  • QWEN3 235B: 13 s --> 5.5 s (2.36x)
  • DSV3: 34 s --> 14 s (2.4x)

Profile

Before

(profiler timeline screenshot)

This PR

(profiler timeline screenshot)

ENV variables to control

  • NRL_REFIT_BUFFER_MEMORY_RATIO: each buffer's size is total_gpu_memory_size × NRL_REFIT_BUFFER_MEMORY_RATIO
  • NRL_REFIT_NUM_BUFFERS: degree of pipelining; 2 is sufficient in most cases
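The buffer sizing described above can be sketched as follows; `compute_buffer_size` is a hypothetical helper (not the actual NeMo-RL function), shown only to illustrate how the ratio maps to bytes:

```python
import os

# Hypothetical illustration of how NRL_REFIT_BUFFER_MEMORY_RATIO maps to a
# per-buffer size in bytes; the real implementation lives in
# nemo_rl/utils/packed_tensor.py and may differ.
def compute_buffer_size(total_gpu_memory_bytes: int) -> int:
    # The default of 0.02 matches the new default introduced by this PR.
    ratio = float(os.getenv("NRL_REFIT_BUFFER_MEMORY_RATIO", "0.02"))
    return int(total_gpu_memory_bytes * ratio)
```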

Convergence run

https://wandb.ai/nvidia/async-grpo-refit-convergence?nw=nwuseryoungeunk
  • LLaMa3 8B token_mult_prob_error
  • opt0: baseline without any optimization (main branch before applying #1264 and #1313)
  • opt123: #1264 and #1313 applied
  • opt1234: this PR (current main ToT)

(convergence plot screenshot)

Issues


Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this
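A minimal sketch of how these knobs could be set before launching a run (the variable names come from this PR; the values and surrounding setup are illustrative):

```python
import os

# Illustrative only: configure the refit pipelining knobs before the
# training entry point is invoked. 2 buffers is the default and is
# usually sufficient; the memory ratio is a fraction of total GPU memory.
os.environ["NRL_REFIT_NUM_BUFFERS"] = "2"
os.environ["NRL_REFIT_BUFFER_MEMORY_RATIO"] = "0.02"
```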

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features

    • Added multi-buffer pipelining for enhanced data throughput and processing efficiency
  • Performance

    • Optimized memory efficiency in tensor packing operations
    • Improved parameter conversion handling during model weight export

@youngeunkwon0405 youngeunkwon0405 self-assigned this Oct 16, 2025
@youngeunkwon0405 youngeunkwon0405 requested review from a team as code owners October 16, 2025 18:50
@youngeunkwon0405 youngeunkwon0405 marked this pull request as draft October 16, 2025 18:50
@youngeunkwon0405 youngeunkwon0405 added the CI:L1 Run doctests, unit tests, and functional tests label Oct 16, 2025
@youngeunkwon0405 youngeunkwon0405 marked this pull request as ready for review October 17, 2025 01:04
@coderabbitai
Contributor

coderabbitai Bot commented Oct 17, 2025

📝 Walkthrough

Walkthrough

Two files are modified: megatron_policy_worker.py receives a keyword argument addition for conversion tasks in the HF weights export call, while packed_tensor.py implements multi-buffer pipelining for packed tensor operations, introduces a new public helper function, and adjusts default memory ratio parameters.

Changes

Cohort / File(s) Summary
Megatron weight export
nemo_rl/models/policy/megatron_policy_worker.py
Added conversion_tasks=self.refit_conversion_tasks parameter to self.megatron_bridge.export_hf_weights() call in broadcast_weights_for_collective method
Packed tensor multi-buffer pipelining
nemo_rl/utils/packed_tensor.py
Introduced get_num_buffers() public helper (cached) to retrieve buffer count from environment variable; changed default memory ratio from 0.01 to 0.02; reworked packed_broadcast_producer and packed_broadcast_consumer to support multi-buffer pipelining with per-buffer streams, packing lists, sizes, and per-buffer state management

Sequence Diagram(s)

sequenceDiagram
    participant Producer as Packed Producer
    participant BufferLoop as Round-Robin Loop
    participant Stream1 as CUDA Stream (Buffer 1)
    participant Stream2 as CUDA Stream (Buffer N)
    participant Broadcast as Broadcast Op
    participant Consumer as Packed Consumer

    Producer->>Producer: Initialize per-buffer streams<br/>& metadata lists
    
    loop For each buffer round-robin
        alt Buffer has data
            BufferLoop->>Stream1: Sync stream, pack data
            Stream1->>Stream1: Process buffer data
            Stream1->>Broadcast: Broadcast per-buffer tensor
            Broadcast->>Stream2: Receive broadcast
            Stream2->>Stream2: Unpack to consumer
            Stream2->>Consumer: Yield unpacked data
        else Buffer exhausted
            BufferLoop->>BufferLoop: Handle StopIteration<br/>for this buffer
        end
    end
    
    Note over Producer,Consumer: Multi-buffer pipelining<br/>with per-buffer synchronization

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

The change introduces moderate complexity through multi-buffer pipelining logic in packed_tensor.py with new per-buffer state management and synchronization patterns, balanced against a minimal one-line modification in megatron_policy_worker.py. Review requires understanding the buffering mechanics, per-buffer stream handling, and control flow changes, though the modifications follow a consistent pattern.
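The round-robin mechanics can be illustrated with a CPU-only toy model; `pipelined_broadcast` is a stand-in, not the PR's code, and a list append takes the place of an asynchronous CUDA broadcast:

```python
# Toy model of the multi-buffer round-robin: parameters are packed into the
# current buffer, and when it fills it is "broadcast" (flushed) while packing
# rotates to the next buffer. The real code overlaps these steps with
# per-buffer CUDA streams; here the flush is just a list append.
def pipelined_broadcast(params, num_buffers=2, capacity=2):
    buffers = [[] for _ in range(num_buffers)]
    broadcasts = []
    idx = 0  # index of the buffer currently being packed
    for p in params:
        buffers[idx].append(p)
        if len(buffers[idx]) == capacity:
            broadcasts.append(list(buffers[idx]))  # stand-in for async broadcast
            buffers[idx].clear()
            idx = (idx + 1) % num_buffers  # rotate to the next buffer
    for b in buffers:  # flush any partially filled buffers at the end
        if b:
            broadcasts.append(list(b))
    return broadcasts
```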

Possibly related PRs

Suggested labels

Performance

Suggested reviewers

  • terrykong
  • yuki-97
  • guyueh1

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Test Results For Major Changes ⚠️ Warning The PR represents a major feature implementing multi-buffer pipelining to overlap parameter iteration and broadcasting for performance optimization. While the PR description includes a convergence run reference (W&B link) with a LLaMa3 8B convergence graph demonstrating that numerics are not regressed, the PR falls short on key testing documentation requirements. Specifically, the PR provides no inline performance metrics showing execution time improvements or memory efficiency gains from the optimization, only qualitative execution flow diagrams. Additionally, all pre-check items in the PR checklist remain unchecked, indicating that formal test results, unit tests, and functional tests have not been documented or confirmed to have been run locally. The convergence reference is helpful for establishing lack of numerical regression, but it does not address the quantified performance impact that the instruction requires for performance-affecting changes. To pass this check, the PR should be updated to include: (1) quantified before-and-after performance metrics (execution time reduction percentage, memory efficiency gains) with the test configuration clearly specified (model size, batch size, number of tokens, etc.); (2) documented unit and/or functional test results confirming the changes work correctly; and (3) completion of the pre-check checklist items showing that contributor guidelines were followed, tests were written and executed locally, and documentation was updated as needed. The current convergence run link helps establish numerical stability but is insufficient without concrete performance numbers demonstrating the claimed optimization benefit.
✅ Passed checks (2 passed)
Check name Status Explanation
Title Check ✅ Passed The PR title "feat: Overlap param iteration and broadcast in non-colocated refit" directly aligns with the main changes in the changeset. The substantial modifications to packed_tensor.py implement multi-buffer pipelining with per-buffer synchronization to achieve exactly this goal of overlapping parameter iteration with packing/broadcasting operations. The supporting change in megatron_policy_worker.py propagates conversion tasks through the export process to enable this functionality. The title is specific about the key optimization (overlap), the affected components (param iteration and broadcast), and the context (non-colocated refit), providing clear understanding of the changeset's primary purpose without unnecessary detail.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@youngeunkwon0405 youngeunkwon0405 added the Performance Related to improving performance label Oct 17, 2025
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_rl/utils/packed_tensor.py (1)

25-31: NRL_REFIT_BUFFER_MEMORY_RATIO has inconsistent defaults and memory basis across codebase; add validation

Your findings are confirmed. The environment variable is used with three different defaults and two different memory bases:

  • packed_tensor.py:25: default "0.02", calculates from total_memory (total GPU capacity)
  • megatron_policy_worker.py:1655: default "0.2", calculates from free_memory (available memory)
  • dtensor_policy_worker_v2.py:1714 & dtensor_policy_worker.py:1753: default "0.8", calculates from free_memory

This inconsistency can cause unexpected memory allocation behavior. Additionally, the direct float(memory_ratio) call at line 30 in packed_tensor.py (and similarly in policy worker files) lacks error handling—invalid or malformed environment values will raise an uncaught exception.

Recommend:

  1. Add validation guard around float parsing (your suggested try/catch is appropriate)
  2. Document or unify the semantics: decide whether the ratio should apply to total or free memory, and establish a single default across all usages, or clearly document why they differ
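The validation suggested in item 1 could look like the following (a sketch, not the committed fix; the function name is illustrative):

```python
import os

# Defensive parsing for NRL_REFIT_BUFFER_MEMORY_RATIO, as suggested above:
# fall back to the default instead of raising an uncaught ValueError on a
# malformed value, and clamp to a sane (0, 1] range.
def get_memory_ratio(default: float = 0.02) -> float:
    raw = os.getenv("NRL_REFIT_BUFFER_MEMORY_RATIO", str(default))
    try:
        ratio = float(raw)
    except ValueError:
        ratio = default
    if not (0.0 < ratio <= 1.0):
        ratio = default
    return ratio
```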
🧹 Nitpick comments (2)
nemo_rl/utils/packed_tensor.py (2)

34-36: Add docstring and clamp for NRL_REFIT_NUM_BUFFERS

Expose intent and prevent invalid config (e.g., 0 or negative) causing modulo/division errors.

Apply:

 @lru_cache(maxsize=1)
 def get_num_buffers():
-    return int(os.getenv("NRL_REFIT_NUM_BUFFERS", "2"))
+    """Number of pipeline buffers used to overlap pack/broadcast.
+
+    Controlled by env NRL_REFIT_NUM_BUFFERS. Must be >= 1. Default: 2.
+    """
+    try:
+        n = int(os.getenv("NRL_REFIT_NUM_BUFFERS", "2"))
+    except Exception:
+        n = 2
+    return max(1, n)

142-152: Same setup on consumer side: consider docstrings for per-buffer state for maintainability

Optional: briefly document packing_tensor_meta_data/offsets to ease future changes.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dee3fd9 and 205d1fb.

📒 Files selected for processing (2)
  • nemo_rl/models/policy/megatron_policy_worker.py (1 hunks)
  • nemo_rl/utils/packed_tensor.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • nemo_rl/utils/packed_tensor.py
  • nemo_rl/models/policy/megatron_policy_worker.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/utils/packed_tensor.py
  • nemo_rl/models/policy/megatron_policy_worker.py
🧬 Code graph analysis (1)
nemo_rl/utils/packed_tensor.py (1)
tests/unit/utils/test_packed_tensor.py (3)
  • broadcast (33-37)
  • broadcast (47-51)
  • post_unpack_func (111-114)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Lint check
  • GitHub Check: L0_Unit_Tests_Policy
  • GitHub Check: L0_Unit_Tests_Other
  • GitHub Check: L0_Unit_Tests_Generation
🔇 Additional comments (1)
nemo_rl/models/policy/megatron_policy_worker.py (1)

1758-1762: No guard needed; self.refit_conversion_tasks is always initialized before use

The concern assumes broadcast_weights_for_collective() could execute with self.refit_conversion_tasks uninitialized (None), but this is contradicted by the codebase structure:

  • grpo.py line 455: prepare_refit_info() is called during setup
  • grpo.py line 549: broadcast_weights_for_collective() is called later in the training loop
  • distillation.py follows the same pattern

Since prepare_refit_info() always runs before broadcast_weights_for_collective(), self.refit_conversion_tasks is guaranteed to be populated (not None) when line 1758-1762 executes. No guard or fallback is required.

Likely an incorrect or invalid review comment.

Comment thread nemo_rl/utils/packed_tensor.py
Comment thread nemo_rl/utils/packed_tensor.py
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
@youngeunkwon0405 youngeunkwon0405 marked this pull request as ready for review October 20, 2025 20:26
@youngeunkwon0405 youngeunkwon0405 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Oct 20, 2025
@youngeunkwon0405
Contributor Author

Hi @guyueh1 and @yuki-97, can I ask for your review, please? This PR passes the CI pipeline, and I checked the token_mult_prob_error from multiple wandb reports; it was near 1 except for DSV3, which has a known issue.

Comment thread nemo_rl/models/policy/megatron_policy_worker.py Outdated
guyueh1
guyueh1 previously approved these changes Oct 21, 2025
Contributor

@guyueh1 guyueh1 left a comment


LGTM

@youngeunkwon0405
Contributor Author

Hi @terrykong, can I ask for your help with the merge? This is the PR that I shared in the meeting yesterday.

yuki-97
yuki-97 previously approved these changes Oct 22, 2025
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
@youngeunkwon0405 youngeunkwon0405 dismissed stale reviews from yuki-97 and guyueh1 via 7d456ec October 22, 2025 18:03
@youngeunkwon0405 youngeunkwon0405 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Oct 22, 2025
@guyueh1 guyueh1 requested a review from yuki-97 October 23, 2025 17:14
@terrykong terrykong merged commit 73e0c09 into NVIDIA-NeMo:main Oct 23, 2025
41 of 42 checks passed
chtruong814 pushed a commit that referenced this pull request Oct 23, 2025
…1379)

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
ZhiyuLi-Nvidia pushed a commit that referenced this pull request Oct 25, 2025
…1379)

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
@guyueh1 guyueh1 linked an issue Oct 28, 2025 that may be closed by this pull request
lbliii pushed a commit that referenced this pull request Nov 3, 2025
…1379)

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
…VIDIA-NeMo#1379)

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>

Labels

CI:L1 Run doctests, unit tests, and functional tests Performance Related to improving performance r0.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Refit speedup for non-colocated (NCCL)

5 participants