
feat: Output buffer cache in megatron->hf generator #1417

Closed
guyueh1 wants to merge 1 commit into NVIDIA-NeMo:main from guyueh1:mbridge_export_hf_with_buffer_cache

Conversation

@guyueh1
Contributor

@guyueh1 guyueh1 commented Oct 23, 2025

What does this PR do ?

Output buffer cache in megatron->hf generator.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • Chores
    • Updated Megatron-Bridge dependency to the latest version.
    • Optimized memory buffer management and data transfer operations for improved performance and resource utilization.

Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1 guyueh1 requested review from a team as code owners October 23, 2025 17:42
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 0eba3a9 (PR #1417 from mbridge_export_hf_with_buffer_cache)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@guyueh1 guyueh1 linked an issue Oct 23, 2025 that may be closed by this pull request
@coderabbitai
Contributor

coderabbitai Bot commented Oct 23, 2025

📝 Walkthrough

Walkthrough

This PR updates the Megatron-Bridge submodule reference and modifies buffer handling in the policy worker module. A parameter reuse_buffer_from_cache=True is added to the HF weights export call, and the memory copy operation in tensor packing is changed from non-blocking to blocking mode.

Changes

Cohort / File(s) Summary
Submodule Update
3rdparty/Megatron-Bridge-workspace/Megatron-Bridge
Submodule commit updated from 62f4704 to 94e71fa, enabling cache-support functionality
Buffer Reuse and Memory Copy Optimization
nemo_rl/models/policy/megatron_policy_worker.py, nemo_rl/models/policy/utils.py
Added reuse_buffer_from_cache=True parameter to weights export call; removed non_blocking=True flag from buffer copy operation to change from async to blocking memory transfer
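The output-buffer cache referenced above can be illustrated with a minimal, stdlib-only sketch. This is a hypothetical stand-in for the Megatron-Bridge behavior, not its actual API: the `BufferCache` class and `get_buffer` method are invented names, and byte buffers stand in for CUDA tensors. The idea is simply that repeated exports of the same parameter signature reuse one allocation instead of reallocating each refit step.

```python
# Hedged sketch of an output-buffer cache keyed by (shape, dtype).
# BufferCache / get_buffer are illustrative names, not the real API.

class BufferCache:
    """Caches export buffers so repeated exports of the same parameter
    reuse one allocation instead of allocating fresh memory each step."""

    def __init__(self):
        self._cache = {}

    def get_buffer(self, shape, dtype):
        key = (tuple(shape), dtype)
        buf = self._cache.get(key)
        if buf is None:
            # First request for this signature: allocate and remember it.
            buf = bytearray(_nbytes(shape, dtype))
            self._cache[key] = buf
        return buf

def _nbytes(shape, dtype):
    n = 1
    for d in shape:
        n *= d
    itemsize = {"fp32": 4, "bf16": 2}[dtype]
    return n * itemsize

cache = BufferCache()
a = cache.get_buffer((1024, 1024), "bf16")
b = cache.get_buffer((1024, 1024), "bf16")  # a later export step
print(a is b)  # True: the same buffer is reused, no fresh allocation
```

The trade-off is the one the review flags: cached buffers stay live across steps, so reuse saves allocation churn at the cost of holding memory, and anything writing into a reused buffer must be sure the previous consumer is done with it.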

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

The changes are limited in scope and primarily consist of parameter additions and flag modifications in performance-sensitive buffer-handling code. The submodule update is straightforward, and the two logic changes follow consistent patterns without complex interdependencies.

Possibly related PRs

Suggested labels

CI:L1, r0.4.0

Suggested reviewers

  • terrykong
  • yfw
  • yaoyu-33

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Test Results For Major Changes (⚠️ Warning)

Explanation: This PR introduces changes to the weight streaming mechanism in megatron_policy_worker.py and to memory copy behavior in utils.py that could affect performance. Specifically, removing non_blocking=True from the buffer copy operation changes the memory copy from asynchronous to blocking, which could affect overlap between computation and data transfer. The new reuse_buffer_from_cache=True parameter enables buffer reuse for metadata caching, which is a performance optimization. However, the PR description is minimal and contains no test results, performance benchmarks, before-and-after metrics, or documentation demonstrating that regressions have been tested. The commit message states only "Output buffer cache in megatron->hf generator" with no additional context, and although the PR pre-check boxes indicate tests should have been run, no results are documented in the PR description.

Resolution: The PR should include performance benchmarks or test results in its description demonstrating the impact of these changes. Specifically, document: (1) before-and-after performance numbers for the buffer cache optimization, (2) confirmation that removing non_blocking=True does not cause unintended performance regressions, (3) the configuration and hardware context in which these tests were performed, and (4) evidence that no numerical regressions have occurred. Alternatively, if these changes are minor and do not materially affect performance, document that assessment with supporting evidence.
✅ Passed checks (3 passed)
Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.

Title Check (✅ Passed): The pull request title "feat: Output buffer cache in megatron->hf generator" is clearly related to the main changes in the changeset. The primary modification involves passing reuse_buffer_from_cache=True to the HF weights export call in stream_weights_via_ipc_zmq, which directly enables buffer cache reuse for metadata caching in the megatron-to-huggingface generator conversion. The title accurately summarizes this key feature addition and provides sufficient context (the specific generator being modified) for a reviewer to understand the primary intent. While there are supporting changes such as the non_blocking flag removal in pack_tensor, the title appropriately focuses on the main feature being introduced. The title is concise, specific, and avoids vague terminology.

Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which meets the required threshold of 80.00%.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d843f02 and 0eba3a9.

📒 Files selected for processing (3)
  • 3rdparty/Megatron-Bridge-workspace/Megatron-Bridge (1 hunks)
  • nemo_rl/models/policy/megatron_policy_worker.py (1 hunks)
  • nemo_rl/models/policy/utils.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • nemo_rl/models/policy/megatron_policy_worker.py
  • nemo_rl/models/policy/utils.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/models/policy/megatron_policy_worker.py
  • nemo_rl/models/policy/utils.py
🔇 Additional comments (3)
3rdparty/Megatron-Bridge-workspace/Megatron-Bridge (1)

1-1: Submodule bump to 94e71fa verified—no additional changes needed

The commit is confirmed as legitimate and properly integrated. Commit 94e71fa introduces "Use buffer cache for output of stream_weights_megatron_to_hf" (files: auto_bridge.py, model_bridge.py, param_mapping.py), and the consumer code in nemo_rl/models/policy/megatron_policy_worker.py already uses reuse_buffer_from_cache=True (line 1958). All CI workflows (cicd-main.yml, _submodule_check.yml, test-template/action.yml) are configured with submodules: 'recursive', and the README documents the required git submodule update --init --recursive step. No API changes or code updates are required on this side.

nemo_rl/models/policy/utils.py (1)

324-324: Blocking copy supports buffer reuse but consider performance implications.

Removing non_blocking=True changes the memory copy from asynchronous to blocking (synchronous). This change appears intentional to support the buffer reuse feature introduced in the related PR changes:

  1. Correctness for buffer reuse: Making the copy blocking ensures that tensor data is fully transferred into the buffer before the function returns. This is important for the ping-pong double buffering strategy (lines 328-372), where buffers alternate between being filled and transmitted. With buffer reuse enabled in megatron_policy_worker.py:1958, blocking copies prevent potential race conditions where a buffer might be reused before its data is fully copied.

  2. Interaction with existing synchronization: Note that send_buffer_group_overlap already calls torch.cuda.current_stream().synchronize() at line 299 before getting the IPC handle. This means the stream is synchronized before transmission regardless of the blocking/non-blocking copy. However, the blocking copy provides an additional safety guarantee at the per-tensor level.

  3. Performance consideration: Blocking copies may slightly impact performance compared to non-blocking copies, as they prevent overlap between copy operations and subsequent CPU work. However, given the existing synchronization barrier at line 299, the performance impact is likely minimal.

The change looks correct for the buffer reuse use case. To verify there's no significant performance regression, consider monitoring IPC streaming performance metrics (if available) after this change is deployed.
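The ping-pong double-buffering interaction described above can be sketched in plain Python. This is a conceptual illustration under stated assumptions, not the real pack_tensor/send_buffer_group_overlap code: byte buffers stand in for CUDA tensors, and the function and buffer names are invented. The point it demonstrates is that once buffers are reused, each fill must have fully completed (a blocking copy) before the buffer is handed to the transport, otherwise the next fill could race an in-flight send.

```python
# Hedged sketch of ping-pong double buffering with blocking copies.
# stream_tensors and the two staging buffers are illustrative only.

def stream_tensors(tensors, send):
    buffers = [bytearray(16), bytearray(16)]  # two reusable staging buffers
    for i, data in enumerate(tensors):
        buf = buffers[i % 2]               # alternate ("ping-pong") buffers
        buf[: len(data)] = data            # blocking copy: complete on return
        send(bytes(buf[: len(data)]))      # safe to hand off; data is in place

sent = []
stream_tensors([b"layer0", b"layer1", b"layer2"], sent.append)
print(sent)  # [b'layer0', b'layer1', b'layer2']
```

With a non-blocking (asynchronous) copy, the hand-off to `send` could observe a partially written buffer unless an explicit synchronization sits between copy and transmit, which is exactly the role the stream synchronize at line 299 plays in the real code.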

nemo_rl/models/policy/megatron_policy_worker.py (1)

1957-1958: Verification findings:

  1. Blocking copy behavior: Verified ✓ - The pack_tensor function in utils.py:323-325 uses .data.copy_() (blocking copy), and torch.cuda.current_stream().synchronize() at line 299 ensures data readiness before IPC handle generation.

  2. Inconsistent usage: Confirmed ✓ - reuse_buffer_from_cache=True appears ONLY in the stream_weights_via_ipc_zmq call (line 1954-1958), NOT in prepare_refit_info (line ~1873) or broadcast_weights_for_collective (line ~1973). However, no comment explains this intentional difference.

  3. Megatron-Bridge compatibility: Cannot verify - this depends on external API implementation.


Add a comment explaining why reuse_buffer_from_cache is only enabled for IPC streaming.

The reuse_buffer_from_cache=True parameter is applied only to the IPC streaming weight export (line 1954-1958), but not to the other two export_hf_weights calls at lines 1873 and 1973. This difference requires clarification: either add a comment explaining that buffer reuse only benefits IPC streaming (due to the ping-pong double buffering in stream_weights_via_ipc_zmq_impl), or verify whether it should be enabled consistently across all three calls. The blocking copy semantics are correct—pack_tensor uses blocking .data.copy_() with explicit synchronization, which is compatible with buffer reuse.
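The call-site inconsistency the review raises can be made concrete with a small, hedged sketch. `export_weights` below is a hypothetical stand-in for the export call, not the real export_hf_weights signature; the module-level cache and byte values are illustrative. It shows why enabling reuse only on the frequently repeated streaming path can make sense: a one-shot export gains nothing from caching, while a per-step streaming export reuses the same buffers every call.

```python
# Hedged sketch: buffer reuse enabled only where the export repeats.
# export_weights and _BUFFER_CACHE are illustrative names, not the real API.

_BUFFER_CACHE = {}

def export_weights(weights, reuse_buffer_from_cache=False):
    out = []
    for name, value in weights.items():
        if reuse_buffer_from_cache and name in _BUFFER_CACHE:
            buf = _BUFFER_CACHE[name]      # reuse the cached staging buffer
        else:
            buf = bytearray(len(value))    # fresh allocation
            if reuse_buffer_from_cache:
                _BUFFER_CACHE[name] = buf
        buf[:] = value                     # fill (stands in for the copy)
        out.append((name, buf))
    return out

weights = {"w1": b"abcd", "w2": b"efgh"}
one_shot = export_weights(weights)         # e.g. a one-time export: no reuse
streamed1 = export_weights(weights, reuse_buffer_from_cache=True)
streamed2 = export_weights(weights, reuse_buffer_from_cache=True)
# The repeated streaming path hits the cache on the second call:
print(streamed1[0][1] is streamed2[0][1])  # True
```

Whether the other two export calls in the worker should also enable reuse depends on how often they run and whether their buffers may be consumed asynchronously, which is the clarification the review asks for.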



@guyueh1
Contributor Author

guyueh1 commented Nov 6, 2025

Closing: the performance impact is not significant; this needs further evaluation.

@guyueh1 guyueh1 closed this Nov 6, 2025

Development

Successfully merging this pull request may close these issues.

Refit Performance for Large MoE Models