
fix: Make the optimizer offloading optional#1404

Merged
terrykong merged 8 commits into NVIDIA-NeMo:main from youngeunkwon0405:remove-logprob-loading
Nov 7, 2025

Conversation

@youngeunkwon0405
Contributor

@youngeunkwon0405 youngeunkwon0405 commented Oct 22, 2025

What does this PR do?

This PR removes optimizer state offloading in policy.prepare_for_lp_inference() and corresponding on-loading in policy.prepare_for_training().
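For context, the offload/on-load being removed is essentially a device transfer of every optimizer state tensor. A minimal sketch of that pattern (illustrative helper names, not the actual worker code from nemo_rl):

```python
import torch

def offload_optimizer_state(optimizer: torch.optim.Optimizer) -> None:
    """Move all optimizer state tensors to CPU to free GPU memory."""
    for state in optimizer.state.values():
        for k, v in state.items():
            if torch.is_tensor(v):
                state[k] = v.to("cpu", non_blocking=True)

def onload_optimizer_state(optimizer: torch.optim.Optimizer, device: str = "cuda") -> None:
    """Move optimizer state tensors back to the training device."""
    for state in optimizer.state.values():
        for k, v in state.items():
            if torch.is_tensor(v):
                state[k] = v.to(device, non_blocking=True)
```

For a large or MoE model, the state being shuttled here (e.g. Adam moments) is roughly twice the parameter memory, which is where the multi-second per-step cost comes from.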

Originally, optimizer state offloading could be re-enabled by setting NRL_OFFLOAD_OPTIMIZER_STATE_FOR_LOGPROB=True (default: False).
Following a comment from @guyueh1, this was changed to a YAML config option: set policy.offload_optimizer_states_for_logprob=true to turn it on.

This saves about 3 to 15 s per step, depending on the model size.

I am not sure why the offloading was required in both the colocated and non-colocated cases. It might help increase the logprob batch size and thus MFU, but I'm not sure that is worth the cost.

@yuki-97, do you think this change will cause additional memory overhead?

@parthchadha, this is the topic I mentioned today.
@guyueh1, do you think it should be removed in the colocated path as well?

Performance (logprob_inference_prep + training_prep)

  • QWEN3 235B: 10 s → 1 s
  • QWEN3 30B: 13 s → ~1 s

Nsys report (QWEN3 30B, GBS 64)

Current ToT

[Nsys timeline screenshot]

This PR

[Nsys timeline screenshot]

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this
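A hedged example of enabling the option added in this PR; the surrounding YAML structure is an assumption based on the option path policy.offload_optimizer_states_for_logprob:

```yaml
policy:
  # Re-enable the pre-PR behavior of offloading optimizer state
  # before logprob inference (default: false).
  offload_optimizer_states_for_logprob: true
```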

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • Bug Fixes
    • Refined optimizer state device management. Optimizer state transfers to GPU and CPU offloading are now conditioned on generation configuration, ensuring consistent behavior across distributed training setups.

@youngeunkwon0405 youngeunkwon0405 requested review from a team as code owners October 22, 2025 00:44
@youngeunkwon0405 youngeunkwon0405 self-assigned this Oct 22, 2025
@youngeunkwon0405 youngeunkwon0405 marked this pull request as draft October 22, 2025 00:45
@github-actions

ℹ️ File Consistency Check

Check based on commit: 5416b9a (PR #1404 from remove-logprob-loading)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@coderabbitai
Contributor

coderabbitai Bot commented Oct 22, 2025

📝 Walkthrough

Three policy worker implementations now gate optimizer state movement operations (both device transfers and CPU offloading) to only occur when generation is colocated, by adding self.is_generation_colocated checks to existing conditions in prepare_for_training and offload_before_refit methods.

Changes

Cohort: Policy Worker Optimizer State Gating
Files: nemo_rl/models/policy/dtensor_policy_worker.py, nemo_rl/models/policy/dtensor_policy_worker_v2.py, nemo_rl/models/policy/megatron_policy_worker.py
Summary: Added a self.is_generation_colocated guard condition to optimizer state transfers in the prepare_for_training and offload_before_refit methods, restricting state movement to CUDA and CPU offloading to colocated-generation scenarios.
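The guard amounts to simple boolean gating. A toy sketch of the post-PR behavior (the class is a simplified stand-in; only the attribute and method names follow the summary above):

```python
class WorkerSketch:
    """Toy stand-in for the three policy workers touched by this PR."""

    def __init__(self, is_generation_colocated: bool):
        self.is_generation_colocated = is_generation_colocated
        # Track where the optimizer state currently lives ("cuda" or "cpu").
        self.optimizer_device = "cuda"

    def offload_before_refit(self) -> None:
        # Post-PR: CPU offload happens only when generation is colocated,
        # since only then does freeing GPU memory benefit the generation engine.
        if self.is_generation_colocated:
            self.optimizer_device = "cpu"

    def prepare_for_training(self) -> None:
        # Post-PR: the matching on-load back to CUDA is gated the same way.
        if self.is_generation_colocated:
            self.optimizer_device = "cuda"
```

In the non-colocated case the state simply never leaves the GPU, which is what removes the per-step transfer cost.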

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Homogeneous guard condition additions repeated consistently across three closely related files with minimal logic density and straightforward boolean gating.

Possibly related PRs

Suggested labels

r0.4.0

Suggested reviewers

  • parthchadha

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings, 1 inconclusive)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 50.00%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
  • Test Results For Major Changes ⚠️ Warning: The PR makes significant changes to optimizer state offloading behavior by adding is_generation_colocated guards to critical code paths in prepare_for_training() and offload_before_refit() across three policy worker implementations. The author claims the changes provide 3-15 seconds of per-step performance improvement, which qualifies this as a performance-impacting change that should include test results and performance documentation. However, the contributor checklist items for guidelines, tests, and documentation are listed but not completed, and no concrete before-and-after performance metrics, test execution results, or convergence validation are documented in the provided context. Before merging, the PR should include: (1) documented test results showing that existing test suites pass with the new conditional optimizer offloading logic, (2) before-and-after performance benchmarks with specific model configurations (model size, batch size, hardware, number of GPUs) that validate the claimed improvement, (3) verification that convergence is not regressed when is_generation_colocated is true versus false, and (4) explicit completion of the contributor testing and documentation checklist items.
  • Title Check ❓ Inconclusive: The title "fix: Make the optimizer offloading optional" accurately refers to gating the optimizer offload behavior, but it is generic and does not specify the key condition under which offloading now occurs, which could obscure the intent for someone scanning the history. Consider a more specific title, for example: "fix: Only perform optimizer offloading when generation is colocated."

✅ Passed checks (1 passed)

  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.

@youngeunkwon0405 youngeunkwon0405 changed the title from "Do the optimizer offloading only in colocated case" to "fix: Do the optimizer offloading only in colocated case" Oct 22, 2025
@youngeunkwon0405 youngeunkwon0405 marked this pull request as ready for review October 22, 2025 05:42
@yuki-97
Contributor

yuki-97 commented Oct 22, 2025

I feel it's unrelated to whether it's colocated, because the colocated case frees vLLM's memory before prepare_for_lp_inference, so the two should be in the same situation. (Maybe there is some uncleaned memory in the colocated case, but overall they should be the same.)

I think the reason we offload / on-load is to increase the logprob batch size, as you said. I feel it's better to expose "do not offload the optimizer" as an option. wdyt? @parthchadha @terrykong

@youngeunkwon0405
Contributor Author

I feel it's unrelated to if it's colocated, b/c colocated will free vLLM's memory before prepare_for_lp_inference, so they should be the same situation. (maybe there are some uncleaned memory in colocated case, but overall they should be the same)

I think why we offload / on-load is for increasing logprob batch size as you said. I feel it's better to apply "not offload optimizer" as an option. wdyt? @parthchadha @terrykong

I see. If that is the case, I think logprob inference might already be able to use a large enough micro-batch size to saturate GPU MFU, because skipping the backward pass saves memory. Of course, I can make this tunable via NRL_OFFLOAD_OPTIMIZER_STATE_FOR_LOGPROB. I suggest false as the default, since offloading has large overheads for big or MoE models.
I am open to others' opinions.

@github-actions: File Consistency Check based on commit ffd544d (PR #1404 from remove-logprob-loading); same DTensor policy worker synchronization reminder as above.

@github-actions: File Consistency Check based on commit 6c80e8a (PR #1404 from remove-logprob-loading); same DTensor policy worker synchronization reminder as above.

@youngeunkwon0405 youngeunkwon0405 added the r0.4.0 and Performance (Related to improving performance) labels Oct 22, 2025
@youngeunkwon0405 youngeunkwon0405 requested a review from a team as a code owner October 22, 2025 21:48
@github-actions: File Consistency Check based on commit ee5e3a6 (PR #1404 from remove-logprob-loading); same DTensor policy worker synchronization reminder as above.

@youngeunkwon0405 youngeunkwon0405 changed the title from "fix: Do the optimizer offloading only in colocated case" to "fix: Make the optimizer offloading optional" Oct 22, 2025
@youngeunkwon0405 youngeunkwon0405 added the CI:L1 Run doctests, unit tests, and functional tests label Oct 22, 2025
@youngeunkwon0405 youngeunkwon0405 requested review from a team as code owners October 22, 2025 23:55
@github-actions: File Consistency Check based on commit 525252d (PR #1404 from remove-logprob-loading); same DTensor policy worker synchronization reminder as above.

@youngeunkwon0405 youngeunkwon0405 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Oct 22, 2025
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
@github-actions Bot commented Nov 5, 2025: File Consistency Check based on commit 5fcd404 (PR #1404 from remove-logprob-loading); same DTensor policy worker synchronization reminder as above.

@youngeunkwon0405 youngeunkwon0405 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Nov 5, 2025
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
@github-actions Bot commented Nov 5, 2025: File Consistency Check based on commit 09a44c3 (PR #1404 from remove-logprob-loading); same DTensor policy worker synchronization reminder as above.

@youngeunkwon0405 youngeunkwon0405 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Nov 5, 2025
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
@github-actions Bot commented Nov 5, 2025: File Consistency Check based on commit 1e5fa20 (PR #1404 from remove-logprob-loading); same DTensor policy worker synchronization reminder as above.

@youngeunkwon0405 youngeunkwon0405 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Nov 5, 2025
@github-actions Bot commented Nov 5, 2025: File Consistency Check based on commit 8660c14 (PR #1404 from remove-logprob-loading); same DTensor policy worker synchronization reminder as above.

guyueh1 previously approved these changes Nov 6, 2025
Contributor

@guyueh1 guyueh1 left a comment


LGTM

@guyueh1
Contributor

guyueh1 commented Nov 6, 2025

@youngeunkwon0405 please fix CI tests

@youngeunkwon0405
Contributor Author

Hi @terrykong, this PR fails in the test_vllm_http_server test. Do you think the CI is broken? I don't see how this failure could be caused by this change.

=================================== FAILURES ===================================
____________________________ test_vllm_http_server _____________________________

cluster = <nemo_rl.distributed.virtual_cluster.RayVirtualCluster object at 0x7f3c12fb3650>
tokenizer = Qwen2TokenizerFast(name_or_path='Qwen/Qwen3-0.6B', vocab_size=151643, model_max_length=131072, is_fast=True, padding_s...),
	151668: AddedToken("</think>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
}
)

    def test_vllm_http_server(cluster, tokenizer):
        """Test that vLLM http server works."""
    
        generation_config = configure_http_server_config(tokenizer)
    
        # Ensure we can get same output
        assert generation_config["model_name"] == "Qwen/Qwen3-0.6B", (
            "Model name should be Qwen/Qwen3-0.6B to get expected output"
        )
        assert generation_config["vllm_cfg"]["tensor_parallel_size"] == 1, (
            "Tensor parallel size should be 1 to get expected output"
        )
    
        # Set to greedy for test reproducibility.
        generation_config["temperature"] = 0.0
    
        # Create vLLM generation
        vllm_generation = VllmGeneration(cluster, generation_config)
    
        # We expect one server per vLLM DP rank.
        base_urls = vllm_generation.dp_openai_server_base_urls
        assert len(base_urls) == cluster.num_gpus_per_node
    
        body = dict(
            messages=[
                {"role": "user", "content": "count to 5"},
            ],
            temperature=generation_config["temperature"],
            top_p=generation_config["top_p"],
            # We want to test the actual train flow and how this is used. So we need to get logprobs here.
            logprobs=True,
            return_tokens_as_token_ids=True,
            max_tokens=1,
        )
    
        _wait_for_vllm_http_server_spinup(base_urls[0])
    
        # Generate and check result
        response = requests.post(url=f"{base_urls[0]}/chat/completions", json=body)
        actual_result = response.json()
    
        # This result assumes this exact model. The expected result here is what the full result looks like before we standardize.
        expected_result = {
            "id": "chatcmpl-7b8c0cdeeab34fd58ad260cf44b1a408",
            "object": "chat.completion",
            "created": 1756421711,
            "model": "Qwen/Qwen3-0.6B",
            "choices": [
                {
                    "index": 0,
                    "message": {
                        "role": "assistant",
                        "content": "<think>",
                        "refusal": None,
                        "annotations": None,
                        "audio": None,
                        "function_call": None,
                        "tool_calls": [],
                        "reasoning_content": None,
                    },
                    "logprobs": {
                        "content": [
                            {
                                "token": "token_id:151667",
                                "logprob": -0.00023779425828251988,
                                "bytes": [60, 116, 104, 105, 110, 107, 62],
                                "top_logprobs": [],
                            }
                        ]
                    },
                    "finish_reason": "length",
                    "stop_reason": None,
                    "token_ids": None,
                }
            ],
            "service_tier": None,
            "system_fingerprint": None,
            "usage": {
                "prompt_tokens": 12,
                "total_tokens": 13,
                "completion_tokens": 1,
                "prompt_tokens_details": None,
            },
            "prompt_logprobs": None,
            "prompt_token_ids": None,
            "kv_transfer_params": None,
        }
    
        def _standardize(d: dict) -> dict:
            d = deepcopy(d)
            d.pop("id")
            d.pop("created")
            # We don't want to implicate log prob accuracy in this test.
            d["choices"][0]["logprobs"]["content"][0].pop("logprob")
    
            return d
    
>       assert _standardize(expected_result) == _standardize(actual_result)
E       AssertionError: assert {'choices': [...pletion', ...} == {'choices': [...pletion', ...}
E         
E         Omitting 8 identical items, use -vv to show
E         Differing items:
E         {'model': 'Qwen/Qwen3-0.6B'} != {'model': '/home/TestData/nemo-rl/hf_home/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca'}
E         Use -v to get more diff

unit/models/generation/test_vllm_generation.py:1189: AssertionError

@terrykong
Collaborator

@youngeunkwon0405 Could you try updating with main and testing just to ensure it's not a transient error?

@youngeunkwon0405
Contributor Author

@youngeunkwon0405 Could you try updating with main and testing just to ensure it's not a transient error?

I already did this once. It was happening repeatedly.
Looks like this PR #1422 is facing the same issue.

@github-actions Bot commented Nov 6, 2025: File Consistency Check based on commit ab089fa (PR #1404 from remove-logprob-loading); same DTensor policy worker synchronization reminder as above.

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
@github-actions Bot commented Nov 6, 2025: File Consistency Check based on commit d80a914 (PR #1404 from remove-logprob-loading); same DTensor policy worker synchronization reminder as above.

@youngeunkwon0405
Contributor Author

Hi @terrykong, now it passes the CI. Could I ask for your final review and possibly a merge as well?


Labels

CI:L1 (Run doctests, unit tests, and functional tests), Performance (Related to improving performance)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants