fix: Re-enable tests/functional/test_converters.sh functional test#2005
Conversation
📝 Walkthrough

This PR modifies Megatron model-parallel initialization to initialize conditionally based on existing state, hardens checkpoint creation with distributed-context wrapping and error handling, and re-enables a previously disabled functional test.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 2
🧹 Nitpick comments (1)
nemo_rl/models/megatron/community_import.py (1)
79-87: Document the caller precondition and guard against parallelism config mismatch in the `else` branch.

Two things worth noting about this block:

- Missing precondition in docstring: the `if` branch calls `model_provider.initialize_model_parallel(seed=0)`, which requires `torch.distributed.init_process_group` to have already been called (per Megatron Core's initialization contract). If a caller invokes `import_model_from_hf_name` without a pre-established distributed context (e.g., no `temporary_distributed_context` wrapper), the path still raises `ValueError` due to missing env vars, the same bug this PR fixes for the test. The docstring should document this precondition so callers know `temporary_distributed_context` (or equivalent) must be active when model parallel is not yet initialized.
- Silent mismatch in the `else` branch: when model parallel is already initialized, the code only re-seeds but does not validate that the existing parallel topology matches `model_provider`'s configured `tensor_model_parallel_size`, `pipeline_model_parallel_size`, etc. (set via `megatron_config`). If there's a mismatch, `provide_distributed_model(wrap_with_ddp=False)` may load the model incorrectly. A guard or at least an assertion would make this failure loud.

💡 Suggested docstring update
```diff
 def import_model_from_hf_name(
     hf_model_name: str,
     output_path: str,
     megatron_config: Optional[MegatronConfig] = None,
     **config_overrides: Any,
 ):
     """Import a Hugging Face model into Megatron checkpoint format and save the Megatron checkpoint to the output path.

     Args:
         hf_model_name: Hugging Face model ID or local path (e.g., 'meta-llama/Llama-3.1-8B-Instruct').
         output_path: Directory to write the Megatron checkpoint (e.g., /tmp/megatron_ckpt).
         megatron_config: Optional megatron config with paralellism settings for distributed megatron model import.
+
+    Note:
+        When model parallel is not yet initialized, this function calls
+        ``model_provider.initialize_model_parallel``, which requires an active
+        ``torch.distributed`` process group. Callers must ensure distributed is
+        already initialized (e.g., via ``temporary_distributed_context``) before
+        invoking this function in a non-torchrun context.
     """
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/models/megatron/community_import.py` around lines 79 - 87, Document that import_model_from_hf_name requires a pre-established torch distributed context (e.g., temporary_distributed_context) when parallel_state.model_parallel_is_initialized() is False because model_provider.initialize_model_parallel(seed=0) depends on torch.distributed.init_process_group; update the docstring to state this precondition. In the else branch (where model_parallel_cuda_manual_seed(0) is called), add an explicit validation/assertion that the current Megatron parallel topology (from parallel_state or megatron_config: tensor_model_parallel_size, pipeline_model_parallel_size, etc.) matches model_provider's configured values and raise a clear error if they differ before calling provide_distributed_model(wrap_with_ddp=False), so topology mismatches fail loudly rather than producing silent incorrect loads.
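As a hedged illustration of the suggested `else`-branch guard: plain dicts stand in for Megatron's `parallel_state` queries and `megatron_config`, and the function name below is an assumption for this sketch, not Megatron's API.

```python
# Hypothetical sketch of the suggested topology guard: fail loudly when the
# already-initialized parallel topology differs from the requested config.
def check_parallel_topology(current: dict, requested: dict) -> None:
    """Raise if any parallelism dimension differs between the live
    parallel state and the requested megatron_config."""
    for key in ("tensor_model_parallel_size", "pipeline_model_parallel_size"):
        if current.get(key) != requested.get(key):
            raise ValueError(
                f"Model parallel already initialized with {key}="
                f"{current.get(key)}, but config requests {requested.get(key)}"
            )

# Matching topologies pass silently; a mismatch raises before model load.
check_parallel_topology(
    {"tensor_model_parallel_size": 2, "pipeline_model_parallel_size": 1},
    {"tensor_model_parallel_size": 2, "pipeline_model_parallel_size": 1},
)
```

In the real code this check would run just before `provide_distributed_model(wrap_with_ddp=False)`, turning a silent mis-load into an immediate, explainable error.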
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/functional/L1_Functional_Tests_GPU.sh`:
- Around line 53-54: Remove the stale comment line that reads "Re-enable once
DTensor v2 converter is fixed." since the converter test is already
unconditionally run via the command invoking
./tests/functional/test_converters.sh; locate the comment above the time uv run
--no-sync bash ./tests/functional/test_converters.sh invocation and delete it
(and any identical TODO/comment duplicates) so the script contains only the
active test invocation.
In `@tests/functional/test_converter_roundtrip.py`:
- Around line 224-229: The except block that re-raises an ImportError loses the
original traceback; capture the original ImportError (e.g., except ImportError
as e:) when importing temporary_distributed_context from
megatron.bridge.training.model_load_save and re-raise with exception chaining
(raise ImportError("megatron.bridge.training is not available.") from e) so
callers can see the underlying cause and which submodule/symbol failed to
import.
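A minimal sketch of the chained re-raise pattern this comment asks for. A stand-in module name is used here so the sketch is self-contained; the real code imports from `megatron.bridge.training.model_load_save`.

```python
def load_temporary_distributed_context():
    try:
        # Stand-in for:
        # from megatron.bridge.training.model_load_save import temporary_distributed_context
        from megatron_bridge_stub.model_load_save import temporary_distributed_context  # type: ignore
    except ImportError as e:
        # `from e` preserves the original traceback, so callers can see
        # which submodule or symbol actually failed to import.
        raise ImportError("megatron.bridge.training is not available.") from e
    return temporary_distributed_context
```

With chaining, the traceback shows both the friendly message and the underlying `ModuleNotFoundError` as its `__cause__`.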
---
Nitpick comments:
In `@nemo_rl/models/megatron/community_import.py`:
- Around line 79-87: Document that import_model_from_hf_name requires a
pre-established torch distributed context (e.g., temporary_distributed_context)
when parallel_state.model_parallel_is_initialized() is False because
model_provider.initialize_model_parallel(seed=0) depends on
torch.distributed.init_process_group; update the docstring to state this
precondition. In the else branch (where model_parallel_cuda_manual_seed(0) is
called), add an explicit validation/assertion that the current Megatron parallel
topology (from parallel_state or megatron_config: tensor_model_parallel_size,
pipeline_model_parallel_size, etc.) matches model_provider's configured values
and raise a clear error if they differ before calling
provide_distributed_model(wrap_with_ddp=False), so topology mismatches fail
loudly rather than producing silent incorrect loads.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)

- nemo_rl/models/megatron/community_import.py
- tests/functional/L1_Functional_Tests_GPU.sh
- tests/functional/test_converter_roundtrip.py
yuki-97
left a comment
thanks for the fix! left some comments.
Signed-off-by: ruit <ruit@nvidia.com>
…nity_import.py to tests/functional/test_converter_roundtrip.py Signed-off-by: ruit <ruit@nvidia.com>
Force-pushed 4b4813c to 83c8d82
Signed-off-by: ruit <ruit@nvidia.com>
Force-pushed 83c8d82 to 28bf7c7
…VIDIA-NeMo#2005) Signed-off-by: ruit <ruit@nvidia.com>
Summary
This PR improves checkpoint conversion reliability in the Megatron/HF roundtrip flow, with a focus on distributed-context correctness and deterministic model-parallel initialization.
Why
The converter import path may run outside of a pre-initialized distributed runtime (e.g., functional tests launched as a regular Python process, not torchrun). In that case, calling `model_provider.initialize_model_parallel(seed=0)` triggers an implicit `torch.distributed.init_process_group("nccl")`, which uses `env://` rendezvous and requires environment variables like `RANK` and `WORLD_SIZE`. Since those variables are not set in this execution mode, conversion fails with:

`ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set.`

This change ensures conversion runs under a controlled temporary distributed context and avoids relying on external launcher-provided env vars.
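A minimal sketch, assuming single-rank conversion, of what such a temporary context must supply: the `env://` rendezvous variables. This is illustrative only, not the actual `temporary_distributed_context` implementation.

```python
import os
from contextlib import contextmanager

@contextmanager
def single_process_distributed_env():
    """Provide the env:// rendezvous variables for a one-process group,
    then restore the environment, so conversion does not depend on a
    torchrun launcher having set RANK/WORLD_SIZE."""
    keys = ("RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")
    saved = {k: os.environ.get(k) for k in keys}
    os.environ.update(
        RANK="0", WORLD_SIZE="1", MASTER_ADDR="127.0.0.1", MASTER_PORT="29500"
    )
    try:
        # Inside this block, torch.distributed.init_process_group(...)
        # can resolve env:// rendezvous without an external launcher.
        yield
    finally:
        for k, v in saved.items():
            if v is None:
                os.environ.pop(k, None)
            else:
                os.environ[k] = v
```

The real helper additionally creates and destroys the process group; the sketch only shows the environment half of the contract.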
Note
Cannot add `temporary_distributed_context` inside `import_model_from_hf_name` directly, as commit c59c8af7 did, because `import_model_from_hf_name` will be called with distributed already initialized in some conditions (e.g. unit/models/policy/test_megatron_worker.py::test_megatron_policy_training[2gpu_dp2_llama]). Then we may hit:

`ValueError("trying to initialize the default process group twice!")`

Related PR
#1883
Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
Bug Fixes
Tests