fix: Re-enable tests/functional/test_converters.sh functional test#2005
Conversation
📝 Walkthrough

This PR modifies Megatron model-parallel initialization to initialize conditionally based on existing state, hardens checkpoint creation with distributed-context wrapping and error handling, and re-enables a previously disabled functional test.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 2
🧹 Nitpick comments (1)
nemo_rl/models/megatron/community_import.py (1)
79-87: Document the caller precondition and guard against parallelism config mismatch in the `else` branch.

Two things worth noting about this block:

- Missing precondition in docstring: the `if` branch calls `model_provider.initialize_model_parallel(seed=0)`, which requires `torch.distributed.init_process_group` to have already been called (per Megatron Core's initialization contract). If a caller invokes `import_model_from_hf_name` without a pre-established distributed context (e.g., no `temporary_distributed_context` wrapper), the path still raises `ValueError` due to missing env vars, the same bug this PR fixes for the test. The docstring should document this precondition so callers know `temporary_distributed_context` (or equivalent) must be active when model parallel is not yet initialized.
- Silent mismatch in the `else` branch: when model parallel is already initialized, the code only re-seeds but does not validate that the existing parallel topology matches `model_provider`'s configured `tensor_model_parallel_size`, `pipeline_model_parallel_size`, etc. (set via `megatron_config`). If there's a mismatch, `provide_distributed_model(wrap_with_ddp=False)` may load the model incorrectly. A guard or at least an assertion would make this failure loud.

💡 Suggested docstring update
```diff
 def import_model_from_hf_name(
     hf_model_name: str,
     output_path: str,
     megatron_config: Optional[MegatronConfig] = None,
     **config_overrides: Any,
 ):
     """Import a Hugging Face model into Megatron checkpoint format and save the Megatron checkpoint to the output path.

     Args:
         hf_model_name: Hugging Face model ID or local path (e.g., 'meta-llama/Llama-3.1-8B-Instruct').
         output_path: Directory to write the Megatron checkpoint (e.g., /tmp/megatron_ckpt).
         megatron_config: Optional megatron config with paralellism settings for distributed megatron model import.
+
+    Note:
+        When model parallel is not yet initialized, this function calls
+        ``model_provider.initialize_model_parallel``, which requires an active
+        ``torch.distributed`` process group. Callers must ensure distributed is
+        already initialized (e.g., via ``temporary_distributed_context``) before
+        invoking this function in a non-torchrun context.
     """
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/models/megatron/community_import.py` around lines 79 - 87, Document that import_model_from_hf_name requires a pre-established torch distributed context (e.g., temporary_distributed_context) when parallel_state.model_parallel_is_initialized() is False because model_provider.initialize_model_parallel(seed=0) depends on torch.distributed.init_process_group; update the docstring to state this precondition. In the else branch (where model_parallel_cuda_manual_seed(0) is called), add an explicit validation/assertion that the current Megatron parallel topology (from parallel_state or megatron_config: tensor_model_parallel_size, pipeline_model_parallel_size, etc.) matches model_provider's configured values and raise a clear error if they differ before calling provide_distributed_model(wrap_with_ddp=False), so topology mismatches fail loudly rather than producing silent incorrect loads.
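As a hedged illustration of the suggested `else`-branch guard: plain dicts stand in for Megatron's `parallel_state` queries and `megatron_config`, and the function name below is an assumption for this sketch, not Megatron's API.

```python
# Hypothetical sketch of the suggested topology guard: fail loudly when the
# already-initialized parallel topology differs from the requested config.
def check_parallel_topology(current: dict, requested: dict) -> None:
    """Raise if any parallelism dimension differs between the live
    parallel state and the requested megatron_config."""
    for key in ("tensor_model_parallel_size", "pipeline_model_parallel_size"):
        if current.get(key) != requested.get(key):
            raise ValueError(
                f"Model parallel already initialized with {key}="
                f"{current.get(key)}, but config requests {requested.get(key)}"
            )

# Matching topologies pass silently; a mismatch raises before model load.
check_parallel_topology(
    {"tensor_model_parallel_size": 2, "pipeline_model_parallel_size": 1},
    {"tensor_model_parallel_size": 2, "pipeline_model_parallel_size": 1},
)
```

In the real code this check would run just before `provide_distributed_model(wrap_with_ddp=False)`, turning a silent mis-load into an immediate, explainable error.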
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/functional/L1_Functional_Tests_GPU.sh`:
- Around line 53-54: Remove the stale comment line that reads "Re-enable once
DTensor v2 converter is fixed." since the converter test is already
unconditionally run via the command invoking
./tests/functional/test_converters.sh; locate the comment above the time uv run
--no-sync bash ./tests/functional/test_converters.sh invocation and delete it
(and any identical TODO/comment duplicates) so the script contains only the
active test invocation.
In `@tests/functional/test_converter_roundtrip.py`:
- Around line 224-229: The except block that re-raises an ImportError loses the
original traceback; capture the original ImportError (e.g., except ImportError
as e:) when importing temporary_distributed_context from
megatron.bridge.training.model_load_save and re-raise with exception chaining
(raise ImportError("megatron.bridge.training is not available.") from e) so
callers can see the underlying cause and which submodule/symbol failed to
import.
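A minimal sketch of the chained re-raise pattern this comment asks for. A stand-in module name is used here so the sketch is self-contained; the real code imports from `megatron.bridge.training.model_load_save`.

```python
def load_temporary_distributed_context():
    try:
        # Stand-in for:
        # from megatron.bridge.training.model_load_save import temporary_distributed_context
        from megatron_bridge_stub.model_load_save import temporary_distributed_context  # type: ignore
    except ImportError as e:
        # `from e` preserves the original traceback, so callers can see
        # which submodule or symbol actually failed to import.
        raise ImportError("megatron.bridge.training is not available.") from e
    return temporary_distributed_context
```

With chaining, the traceback shows both the friendly message and the underlying `ModuleNotFoundError` as its `__cause__`.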
---
Nitpick comments:
In `@nemo_rl/models/megatron/community_import.py`:
- Around line 79-87: Document that import_model_from_hf_name requires a
pre-established torch distributed context (e.g., temporary_distributed_context)
when parallel_state.model_parallel_is_initialized() is False because
model_provider.initialize_model_parallel(seed=0) depends on
torch.distributed.init_process_group; update the docstring to state this
precondition. In the else branch (where model_parallel_cuda_manual_seed(0) is
called), add an explicit validation/assertion that the current Megatron parallel
topology (from parallel_state or megatron_config: tensor_model_parallel_size,
pipeline_model_parallel_size, etc.) matches model_provider's configured values
and raise a clear error if they differ before calling
provide_distributed_model(wrap_with_ddp=False), so topology mismatches fail
loudly rather than producing silent incorrect loads.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)

- nemo_rl/models/megatron/community_import.py
- tests/functional/L1_Functional_Tests_GPU.sh
- tests/functional/test_converter_roundtrip.py
yuki-97
left a comment
thanks for the fix! left some comments.
Signed-off-by: ruit <ruit@nvidia.com>
…nity_import.py to tests/functional/test_converter_roundtrip.py Signed-off-by: ruit <ruit@nvidia.com>
Force-pushed 4b4813c to 83c8d82
Signed-off-by: ruit <ruit@nvidia.com>
Force-pushed 83c8d82 to 28bf7c7
…VIDIA-NeMo#2005) Signed-off-by: ruit <ruit@nvidia.com>
Summary
This PR improves checkpoint conversion reliability in the Megatron/HF roundtrip flow, with a focus on distributed-context correctness and deterministic model-parallel initialization.
Why
The converter import path may run outside of a pre-initialized distributed runtime (e.g., functional tests launched as a regular Python process, not torchrun). In that case, calling `model_provider.initialize_model_parallel(seed=0)` triggers an implicit `torch.distributed.init_process_group("nccl")`, which uses `env://` rendezvous and requires environment variables like `RANK` and `WORLD_SIZE`. Since those variables are not set in this execution mode, conversion fails with:

`ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set.`

This change ensures conversion runs under a controlled temporary distributed context and avoids relying on external launcher-provided env vars.
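A minimal sketch, assuming single-rank conversion, of what such a temporary context must supply: the `env://` rendezvous variables. This is illustrative only, not the actual `temporary_distributed_context` implementation.

```python
import os
from contextlib import contextmanager

@contextmanager
def single_process_distributed_env():
    """Provide the env:// rendezvous variables for a one-process group,
    then restore the environment, so conversion does not depend on a
    torchrun launcher having set RANK/WORLD_SIZE."""
    keys = ("RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")
    saved = {k: os.environ.get(k) for k in keys}
    os.environ.update(
        RANK="0", WORLD_SIZE="1", MASTER_ADDR="127.0.0.1", MASTER_PORT="29500"
    )
    try:
        # Inside this block, torch.distributed.init_process_group(...)
        # can resolve env:// rendezvous without an external launcher.
        yield
    finally:
        for k, v in saved.items():
            if v is None:
                os.environ.pop(k, None)
            else:
                os.environ[k] = v
```

The real helper additionally creates and destroys the process group; the sketch only shows the environment half of the contract.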
Note
Cannot add `temporary_distributed_context` inside `import_model_from_hf_name` directly, as commit c59c8af7 did, because `import_model_from_hf_name` will be called with distributed already initialized in some conditions (e.g. unit/models/policy/test_megatron_worker.py::test_megatron_policy_training[2gpu_dp2_llama]). Then we may hit:

`ValueError("trying to initialize the default process group twice!")`

Related PR
#1883
Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
Bug Fixes
Tests