
fix: Re-enable tests/functional/test_converters.sh functional test #2005

Merged: yuki-97 merged 4 commits into main from ruit/convert_ckpt_func, Feb 25, 2026


Conversation

@RayenTian
Contributor

@RayenTian RayenTian commented Feb 23, 2026

Summary

This PR improves checkpoint conversion reliability in the Megatron/HF roundtrip flow, with a focus on distributed-context correctness and deterministic model-parallel initialization.

Why

The converter import path may run outside of a pre-initialized distributed runtime (e.g., functional tests launched as a regular Python process, not torchrun).

In that case, calling model_provider.initialize_model_parallel(seed=0) triggers an implicit
torch.distributed.init_process_group("nccl"), which uses env:// rendezvous and requires
environment variables like RANK and WORLD_SIZE. Since those variables are not set in this execution mode, conversion fails with:

ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set.

This change ensures conversion runs under a controlled temporary distributed context and avoids relying on external launcher-provided env vars.

Note

temporary_distributed_context cannot be added inside import_model_from_hf_name directly, as commit c59c8af7 did, because import_model_from_hf_name is sometimes called with distributed already initialized (e.g. unit/models/policy/test_megatron_worker.py::test_megatron_policy_training[2gpu_dp2_llama]). In that case we would hit: ValueError: trying to initialize the default process group twice!
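The resulting guard can be sketched as follows. This is a minimal single-process illustration of the idea, not the project's actual helper (the real temporary_distributed_context lives in megatron.bridge; the function name and port below are assumptions for the sketch):

```python
import os
import torch.distributed as dist


def with_distributed(fn, backend="gloo"):
    """Run fn under a usable torch.distributed context.

    If a launcher (torchrun, or an outer test fixture) already created the
    default process group, reuse it: initializing again raises
    "trying to initialize the default process group twice!". Otherwise,
    stand up a throwaway single-process group so no RANK/WORLD_SIZE env
    vars from an external launcher are needed.
    """
    if dist.is_initialized():
        return fn()
    # Rendezvous info for the single-process group; rank/world_size are
    # passed explicitly so env:// does not require RANK/WORLD_SIZE.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29571")
    dist.init_process_group(backend=backend, rank=0, world_size=1)
    try:
        return fn()
    finally:
        dist.destroy_process_group()
```

The gloo backend keeps the temporary group CPU-only, which is enough for checkpoint conversion that never does collective GPU communication.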

Related PR

#1883

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • Bug Fixes

    • Improved model initialization state handling to properly reuse already-initialized parallel contexts.
    • Enhanced checkpoint creation robustness with better error handling and graceful fallback when dependencies are unavailable.
    • Added optimizer offload configuration option for policy testing.
  • Tests

    • Re-enabled converter functional tests in the GPU test workflow.

@RayenTian RayenTian added the CI:L1 Run doctests, unit tests, and functional tests label Feb 23, 2026
@RayenTian RayenTian added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 23, 2026
@RayenTian RayenTian marked this pull request as ready for review February 24, 2026 02:48
@RayenTian RayenTian requested review from a team as code owners February 24, 2026 02:48
@RayenTian RayenTian added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 24, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Feb 24, 2026

📝 Walkthrough

Walkthrough

This PR modifies Megatron model-parallel initialization to initialize conditionally based on existing state, hardens checkpoint creation with distributed-context wrapping and error handling, and re-enables a previously disabled functional test.

Changes

Cohort / File(s): Summary

  • Megatron Model-Parallel Initialization (nemo_rl/models/megatron/community_import.py): Adds conditional logic to check whether model parallelism is already initialized before calling initialize_model_parallel(seed=0). If already initialized, imports model_parallel_cuda_manual_seed and re-seeds with 0 instead.
  • Test Checkpoint Creation & Configuration (tests/functional/test_converter_roundtrip.py): Adds offload_optimizer_for_logprob: False to the policy test config. Wraps the megatron bridge import and checkpoint creation in try/except ImportError and temporary_distributed_context(backend="gloo") for error handling and a controlled execution context.
  • Test Workflow Enablement (tests/functional/L1_Functional_Tests_GPU.sh): Uncomments the line that runs the previously disabled test_converters.sh functional test in the GPU workflow.
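The conditional-initialization logic from the first cohort can be sketched with the Megatron calls injected as callables, since megatron.core is not importable everywhere. The mapping in the docstring names the real symbols per the walkthrough; treat the exact import paths as assumptions:

```python
def ensure_model_parallel(is_initialized, initialize, reseed, seed=0):
    """Initialize model parallelism only when needed; otherwise re-seed.

    Injected callables stand in for (per the walkthrough above):
      is_initialized -> parallel_state.model_parallel_is_initialized
      initialize     -> model_provider.initialize_model_parallel
      reseed         -> model_parallel_cuda_manual_seed
    """
    if is_initialized():
        # Parallel state was already set up by the caller (e.g. an outer
        # test fixture): only re-seed so the import stays deterministic.
        reseed(seed)
        return "reseeded"
    # No existing parallel context: perform the full initialization.
    initialize(seed=seed)
    return "initialized"
```

Injecting the callables keeps the branch structure testable without a GPU or a Megatron install, while the production code calls the real Megatron-Core functions directly.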

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

Suggested reviewers

  • terrykong
  • yfw
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
  • Title check: ✅ Passed. The title directly references the main change: re-enabling the test_converters.sh functional test, which matches the primary objective and is the most visible change in the PR.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; skipping the check.
  • Test Results For Major Changes: ✅ Passed. PR contains minor bug fixes and reliability improvements for the checkpoint converter roundtrip flow without test documentation, which is appropriate for non-major changes.
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
nemo_rl/models/megatron/community_import.py (1)

79-87: Document the caller precondition and guard against parallelism config mismatch in the else branch.

Two things worth noting about this block:

  1. Missing precondition in docstring — the if branch calls model_provider.initialize_model_parallel(seed=0), which requires torch.distributed.init_process_group to have already been called (per Megatron Core's initialization contract). If a caller invokes import_model_from_hf_name without a pre-established distributed context (e.g., no temporary_distributed_context wrapper), the path still raises ValueError due to missing env vars — the same bug this PR fixes for the test. The docstring should document this precondition so callers know temporary_distributed_context (or equivalent) must be active when model parallel is not yet initialized.

  2. Silent mismatch in the else branch — when model parallel is already initialized, the code only re-seeds but does not validate that the existing parallel topology matches model_provider's configured tensor_model_parallel_size, pipeline_model_parallel_size, etc. (set via megatron_config). If there's a mismatch, provide_distributed_model(wrap_with_ddp=False) may load the model incorrectly. A guard or at least an assertion would make this failure loud.

💡 Suggested docstring update
 def import_model_from_hf_name(
     hf_model_name: str,
     output_path: str,
     megatron_config: Optional[MegatronConfig] = None,
     **config_overrides: Any,
 ):
     """Import a Hugging Face model into Megatron checkpoint format and save the Megatron checkpoint to the output path.

     Args:
         hf_model_name: Hugging Face model ID or local path (e.g., 'meta-llama/Llama-3.1-8B-Instruct').
         output_path: Directory to write the Megatron checkpoint (e.g., /tmp/megatron_ckpt).
         megatron_config: Optional megatron config with paralellism settings for distributed megatron model import.
+
+    Note:
+        When model parallel is not yet initialized, this function calls
+        ``model_provider.initialize_model_parallel``, which requires an active
+        ``torch.distributed`` process group. Callers must ensure distributed is
+        already initialized (e.g., via ``temporary_distributed_context``) before
+        invoking this function in a non-torchrun context.
     """
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/models/megatron/community_import.py` around lines 79 - 87, Document
that import_model_from_hf_name requires a pre-established torch distributed
context (e.g., temporary_distributed_context) when
parallel_state.model_parallel_is_initialized() is False because
model_provider.initialize_model_parallel(seed=0) depends on
torch.distributed.init_process_group; update the docstring to state this
precondition. In the else branch (where model_parallel_cuda_manual_seed(0) is
called), add an explicit validation/assertion that the current Megatron parallel
topology (from parallel_state or megatron_config: tensor_model_parallel_size,
pipeline_model_parallel_size, etc.) matches model_provider's configured values
and raise a clear error if they differ before calling
provide_distributed_model(wrap_with_ddp=False), so topology mismatches fail
loudly rather than producing silent incorrect loads.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/functional/L1_Functional_Tests_GPU.sh`:
- Around line 53-54: Remove the stale comment line that reads "Re-enable once
DTensor v2 converter is fixed." since the converter test is already
unconditionally run via the command invoking
./tests/functional/test_converters.sh; locate the comment above the time uv run
--no-sync bash ./tests/functional/test_converters.sh invocation and delete it
(and any identical TODO/comment duplicates) so the script contains only the
active test invocation.

In `@tests/functional/test_converter_roundtrip.py`:
- Around line 224-229: The except block that re-raises an ImportError loses the
original traceback; capture the original ImportError (e.g., except ImportError
as e:) when importing temporary_distributed_context from
megatron.bridge.training.model_load_save and re-raise with exception chaining
(raise ImportError("megatron.bridge.training is not available.") from e) so
callers can see the underlying cause and which submodule/symbol failed to
import.
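The chaining the reviewer asks for is plain Python `raise … from`; a minimal sketch (the module path matches the review comment, and the wrapper function name is illustrative):

```python
def load_temporary_distributed_context():
    """Import the optional helper, chaining the original ImportError so the
    failing submodule/symbol stays visible in the traceback."""
    try:
        from megatron.bridge.training.model_load_save import (
            temporary_distributed_context,
        )
    except ImportError as e:
        # "raise ... from e" preserves the original error as __cause__,
        # instead of discarding which import actually failed.
        raise ImportError("megatron.bridge.training is not available.") from e
    return temporary_distributed_context
```

Callers that catch the wrapper exception can inspect `err.__cause__` (or read the full chained traceback) to see exactly which symbol failed to import.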


ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9148186 and 4b4813c.

📒 Files selected for processing (3)
  • nemo_rl/models/megatron/community_import.py
  • tests/functional/L1_Functional_Tests_GPU.sh
  • tests/functional/test_converter_roundtrip.py

Comment thread tests/functional/L1_Functional_Tests_GPU.sh Outdated
Comment thread tests/functional/test_converter_roundtrip.py
Contributor

@yuki-97 yuki-97 left a comment


thanks for the fix! left some comments.

Comment thread tests/functional/test_converter_roundtrip.py
Comment thread nemo_rl/models/megatron/community_import.py
Comment thread tests/functional/L1_Functional_Tests_GPU.sh Outdated
…nity_import.py to tests/functional/test_converter_roundtrip.py

Signed-off-by: ruit <ruit@nvidia.com>
@RayenTian RayenTian force-pushed the ruit/convert_ckpt_func branch from 4b4813c to 83c8d82 Compare February 24, 2026 08:34
Signed-off-by: ruit <ruit@nvidia.com>
@RayenTian RayenTian force-pushed the ruit/convert_ckpt_func branch from 83c8d82 to 28bf7c7 Compare February 24, 2026 10:00
@RayenTian RayenTian removed the CI:L1 Run doctests, unit tests, and functional tests label Feb 24, 2026
@RayenTian RayenTian added the CI:L1 Run doctests, unit tests, and functional tests label Feb 24, 2026
@yuki-97 yuki-97 enabled auto-merge (squash) February 24, 2026 15:24
@yuki-97 yuki-97 merged commit 6435fe4 into main Feb 25, 2026
38 of 39 checks passed
@yuki-97 yuki-97 deleted the ruit/convert_ckpt_func branch February 25, 2026 10:16
sharonyu-115 pushed a commit to sharonyu-115/RL that referenced this pull request Feb 28, 2026
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
seonjinn pushed a commit that referenced this pull request Mar 9, 2026

Labels

CI:L1 Run doctests, unit tests, and functional tests


3 participants