feat: Using mcore cpu optimizer #1242
Conversation
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
📝 Walkthrough

Introduces runtime guards in `MegatronPolicyWorker` that allow optimizer CPU offload only when `optimizer_offload_fraction` is 1.0, and conditions CUDA transfers of optimizer state on `optimizer_cpu_offload` across initialization, `prepare_for_training`, and the refit/offload paths (sketched in code after the diagram below).

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant Trainer
participant Worker as MegatronPolicyWorker
participant Optim as Optimizer
participant GPU
participant CPU
Trainer->>Worker: __init__(config)
Worker->>Worker: Read optimizer_cpu_offload, optimizer_offload_fraction
alt CPU offload enabled
Worker->>Worker: Assert optimizer_offload_fraction == 1.0
else
Worker->>Worker: Continue
end
Trainer->>Worker: prepare_for_training()
alt optimizer_cpu_offload == True
note over Worker,Optim: Keep optimizer state on CPU<br/>(skip move to CUDA)
Worker-xGPU: Do not transfer optimizer state
Worker->>CPU: Maintain optimizer state
else
note over Worker,Optim: Move optimizer state to CUDA
Worker->>GPU: Transfer optimizer state
end
Trainer->>Worker: offload_before_refit()
alt optimizer_cpu_offload == True
note over Worker: Skip CUDA migration during refit/offload
else
Worker->>GPU: Move optimizer state as needed
end
```
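The two guarded paths can be summarized in a short sketch. This is a hedged illustration, not the PR's actual code: only the config keys (`optimizer_cpu_offload`, `optimizer_offload_fraction`) and the method name `prepare_for_training` come from this PR; every other name is an assumption.

```python
# Minimal sketch of the runtime guard described in the walkthrough.
# Assumption: helper names and class internals are illustrative only.
class MegatronPolicyWorker:
    def __init__(self, cfg: dict):
        self.cfg = cfg
        optim_cfg = cfg["megatron_cfg"]["optimizer"]
        self.optimizer_cpu_offload = optim_cfg.get("optimizer_cpu_offload", False)
        if self.optimizer_cpu_offload:
            # Partial offload is not supported: the whole optimizer state must
            # live on CPU, so the fraction has to be exactly 1.0.
            assert optim_cfg["optimizer_offload_fraction"] == 1.0, (
                "optimizer_cpu_offload=True requires "
                "optimizer_offload_fraction == 1.0"
            )

    def prepare_for_training(self) -> None:
        if not self.optimizer_cpu_offload:
            # Default path: bring optimizer state back onto the GPU.
            self._move_optimizer_state("cuda")  # hypothetical helper
        # With CPU offload enabled, the state stays on CPU and the mcore CPU
        # optimizer updates it in place.

    def _move_optimizer_state(self, device: str) -> None:
        ...  # placeholder for the actual device transfer
```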
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 1
🧹 Nitpick comments (2)
nemo_rl/models/policy/megatron_policy_worker.py (2)
750-752: Consider caching the optimizer_cpu_offload flag for consistency.

The conditional logic correctly prevents moving optimizer state to CUDA when CPU offload is enabled. However, reading `self.cfg['megatron_cfg']['optimizer'].get('optimizer_cpu_offload', False)` on every call to `prepare_for_training` is repetitive. Consider caching this flag as an instance variable (e.g., `self.optimizer_cpu_offload`) in `__init__` after line 616 for consistency and clarity.

Example:
```python
# In __init__ after line 616:
self.optimizer_cpu_offload = optimizer_cpu_offload

# In prepare_for_training:
if (
    hasattr(self, "optimizer")
    and self.optimizer is not None
    and not self.optimizer_cpu_offload
):
    ...  # existing logic that moves optimizer state to CUDA
```
779-781: Logic is correct; caching suggestion applies here too.

The conditional logic correctly prevents moving optimizer state to CPU when CPU offload is already enabled. The same caching suggestion from `prepare_for_training` applies here: consider storing `optimizer_cpu_offload` as an instance variable to avoid repeated config lookups.
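With the cached flag, the refit/offload guard reduces to a few lines. This is a hedged sketch: the method name comes from the sequence diagram above, and the helper is the same hypothetical one used in the earlier sketch.

```python
def offload_before_refit(self) -> None:
    if self.optimizer_cpu_offload:
        return  # state already lives on CPU; nothing to migrate
    self._move_optimizer_state("cpu")  # hypothetical helper
```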
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
nemo_rl/models/policy/megatron_policy_worker.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts
Files:
nemo_rl/models/policy/megatron_policy_worker.py
nemo_rl/**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code (see the sketch after this list)
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)
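For instance, the two guidelines on config optionality and key documentation could look like this in practice. This is a hypothetical sketch; the type and field names below are not taken from this PR's diff.

```python
from typing import NotRequired, TypedDict


class McoreOptimizerConfig(TypedDict):
    """Optimizer section of megatron_cfg (hypothetical sketch)."""

    lr: float
    # Whether to run the mcore CPU optimizer with state kept on CPU.
    # Valid values: True/False. Recommended default (set in YAML): False.
    optimizer_cpu_offload: NotRequired[bool]
    # Fraction of optimizer state to offload; must be 1.0 when
    # optimizer_cpu_offload is True. Recommended default (set in YAML): 1.0.
    optimizer_offload_fraction: NotRequired[float]
```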
Files:
nemo_rl/models/policy/megatron_policy_worker.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Lint check
- GitHub Check: Post automodel integration comment / Comment on PR
- GitHub Check: Post submodule check comment / Comment on PR
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com> Signed-off-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com>
lgtm. @yaoyu-33 to review
@guyueh1 you'll need to lint the PR
Closes #915
I saw here (https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core/optimizer/cpu_offloading#configuration-recommendataions) that it is recommended to set the flag
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Users who want to try this feature can use `optimizer_cpu_offload` under `megatron_cfg.optimizer` (with `optimizer_offload_fraction: 1.0`)
@terrykong this PR needs an approval, and do you think it should be v0.4?
@terrykong this PR is stale so I merged main again, could you approve again?
@terrykong this is ok to merge? Or any TODOs?
@guyueh1 strange. I don't know why the CI was skipped; trying again
Signed-off-by: Guyue Huang <guyueh@nvidia.com> Signed-off-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com> Co-authored-by: Terry Kong <terrycurtiskong@gmail.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com> Signed-off-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com> Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com> Signed-off-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com> Co-authored-by: Terry Kong <terrycurtiskong@gmail.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
What does this PR do?
Adds the necessary support to use the Megatron-core (mcore) CPU optimizer.
Issues
List issues that this PR closes:
Closes #915
Usage
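A hedged example of enabling the feature: the key names and their location under `megatron_cfg.optimizer` come from the review above, while the surrounding config layout is an assumption.

```python
# Relevant slice of the policy config (a sketch; real defaults live in YAML).
policy_cfg = {
    "megatron_cfg": {
        "optimizer": {
            "optimizer_cpu_offload": True,      # keep optimizer state on CPU
            "optimizer_offload_fraction": 1.0,  # must be 1.0 (asserted in this PR)
        }
    }
}
```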
Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit