fix: Fix checkpoint conversion error for qwen 30b-a3b #1335

Merged
terrykong merged 2 commits into main from yifu/ckpt_fix
Oct 13, 2025

Conversation

@yfw
Contributor

@yfw yfw commented Oct 10, 2025

What does this PR do?

Pulls in a change from Megatron-Bridge to fix the checkpoint conversion error below. The cherry-pick was done by @yaoyu-33.

ERROR:megatron.core.dist_checkpointing.validation:Invalid access pattern: 24 ShardedObject are missing. Existing shards: ['decoder.layers.self_attention.core_attention._extra_state/shard_0_48', 'decoder.layers.self_attention.core_attention._extra_state/shard_1_48', 'decoder.layers.self_attention.core_attention._extra_state/shard_2_48', 'decoder.layers.self_attention.core_attention._extra_state/shard_3_48', 'decoder.layers.self_attention.core_attention._extra_state/shard_4_48', 'decoder.layers.self_attention.core_attention._extra_state/shard_5_48', 'decoder.layers.self_attention.core_attention._extra_state/shard_6_48', 'decoder.layers.self_attention.core_attention._extra_state/shard_7_48', 'decoder.layers.self_attention.core_attention._extra_state/shard_8_48', 'decoder.layers.self_attention.core_attention._extra_state/shard_9_48', 'decoder.layers.self_attention.core_attention._extra_state/shard_10_48', 'decoder.layers.self_attention.core_attention._extra_state/shard_11_48', 'decoder.layers.self_attention.core_attention._extra_state/shard_12_48', 'decoder.layers.self_attention.core_attention._extra_state/shard_13_48', 'decoder.layers.self_attention.core_attention._extra_state/shard_14_48', 'decoder.layers.self_attention.core_attention._extra_state/shard_15_48', 'decoder.layers.self_attention.core_attention._extra_state/shard_16_48', 'decoder.layers.self_attention.core_attention._extra_state/shard_17_48', 'decoder.layers.self_attention.core_attention._extra_state/shard_18_48', 'decoder.layers.self_attention.core_attention._extra_state/shard_19_48', 'decoder.layers.self_attention.core_attention._extra_state/shard_20_48', 'decoder.layers.self_attention.core_attention._extra_state/shard_21_48', 'decoder.layers.self_attention.core_attention._extra_state/shard_22_48', 'decoder.layers.self_attention.core_attention._extra_state/shard_23_48']
[rank0]: Traceback (most recent call last):
[rank0]:  File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]:  File "<frozen runpy>", line 88, in _run_code
[rank0]:  File "/nemo_run/code/nemo_skills/training/nemo_rl/convert_megatron_to_hf.py", line 123, in <module>
[rank0]:   main()
[rank0]:  File "/nemo_run/code/nemo_skills/training/nemo_rl/convert_megatron_to_hf.py", line 113, in main
[rank0]:   export_model_from_megatron(
[rank0]:  File "/opt/NeMo-RL/nemo_rl/models/megatron/community_import.py", line 108, in export_model_from_megatron
[rank0]:   megatron_model = bridge.load_megatron_model(input_path)
[rank0]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:  File "/opt/NeMo-RL/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/models/conversion/auto_bridge.py", line 523, in load_megatron_model
[rank0]:   model = load_megatron_model(
[rank0]:       ^^^^^^^^^^^^^^^^^^^^
[rank0]:  File "/opt/NeMo-RL/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/training/model_load_save.py", line 256, in load_megatron_model
[rank0]:   return _load_checkpoint()
[rank0]:      ^^^^^^^^^^^^^^^^^^
[rank0]:  File "/opt/NeMo-RL/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/training/model_load_save.py", line 240, in _load_checkpoint
[rank0]:   maybe_state_dict = _load_model_weights_from_checkpoint(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:  File "/opt/NeMo-RL/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/training/checkpointing.py", line 849, in _load_model_weights_from_checkpoint
[rank0]:   state_dict = dist_checkpointing.load(
[rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:  File "/opt/NeMo-RL/3rdparty/Megatron-LM-workspace/Megatron-LM/megatron/core/dist_checkpointing/serialization.py", line 141, in load
[rank0]:   sharded_state_dict, missing_keys, unexpected_keys = validate_integrity_and_strict_load(
[rank0]:                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:  File "/opt/NeMo-RL/3rdparty/Megatron-LM-workspace/Megatron-LM/megatron/core/dist_checkpointing/validation.py", line 201, in validate_integrity_and_strict_load
[rank0]:   validate_sharding_integrity(global_metadata)
[rank0]:  File "/opt/NeMo-RL/3rdparty/Megatron-LM-workspace/Megatron-LM/megatron/core/dist_checkpointing/validation.py", line 442, in validate_sharding_integrity
[rank0]:   _validate_objects_for_key(shardings)
[rank0]:  File "/opt/NeMo-RL/3rdparty/Megatron-LM-workspace/Megatron-LM/megatron/core/dist_checkpointing/validation.py", line 542, in _validate_objects_for_key
[rank0]:   raise CheckpointingException(err_msg)
[rank0]: megatron.core.dist_checkpointing.core.CheckpointingException: Invalid access pattern: 24 ShardedObject are missing.
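
The existing-shards list in the error can be read mechanically. Below is a minimal sketch, assuming the shard names follow the `<key>/shard_<index>_<total>` pattern seen in the log above. The helper name `missing_shard_indices` is hypothetical and is not part of Megatron-Core. It reports which shard indices the validator did not find:

```python
import re

# Matches names such as
# 'decoder.layers.self_attention.core_attention._extra_state/shard_3_48'.
# Assumption: a single integer index and a single integer total, as in the log.
SHARD_RE = re.compile(r"^(?P<key>.+)/shard_(?P<index>\d+)_(?P<total>\d+)$")


def missing_shard_indices(existing_shards):
    """Return {key: sorted shard indices that are absent from existing_shards}."""
    seen = {}    # key -> set of shard indices that are present
    totals = {}  # key -> expected number of shards
    for name in existing_shards:
        match = SHARD_RE.match(name)
        if match is None:
            raise ValueError(f"unrecognized shard name: {name!r}")
        key = match["key"]
        seen.setdefault(key, set()).add(int(match["index"]))
        totals[key] = int(match["total"])
    return {key: sorted(set(range(totals[key])) - seen[key]) for key in seen}


# Rebuild the 24 shard names listed in the error message.
key = "decoder.layers.self_attention.core_attention._extra_state"
existing = [f"{key}/shard_{i}_48" for i in range(24)]
missing = missing_shard_indices(existing)
print(len(missing[key]), missing[key][0], missing[key][-1])  # prints: 24 24 47
```

For the log above this yields indices 24 through 47, which matches the "24 ShardedObject are missing" count: only the first 24 of the 48 expected entries were found.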

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • Chores
    • Updated the Megatron-Bridge submodule to a newer upstream revision that includes the checkpoint conversion fix.
    • This maintenance update aligns dependencies for improved stability and compatibility.
    • No user-facing changes are expected from this update.

@yfw yfw requested a review from a team as a code owner October 10, 2025 15:48
@yfw yfw added the CI:L1 Run doctests, unit tests, and functional tests label Oct 10, 2025
@yfw yfw requested review from terrykong and yaoyu-33 October 10, 2025 15:48
@yfw yfw changed the title Fix checkpoint conversion error for qwen 30b-a3b fix: Fix checkpoint conversion error for qwen 30b-a3b Oct 10, 2025
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 838475f (PR #1335 from yifu/ckpt_fix)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@coderabbitai
Contributor

coderabbitai Bot commented Oct 10, 2025

Walkthrough

Updated the Git submodule reference for 3rdparty/Megatron-Bridge-workspace/Megatron-Bridge from commit 9d69624 to 62f4704. No other files or configurations changed.

Changes

| Cohort / File(s) | Summary of Changes |
| --- | --- |
| Submodule update: 3rdparty/Megatron-Bridge-workspace/Megatron-Bridge | Submodule pointer updated from 9d69624cb75e46f06ddfadd9a726acecfcf8b064 to 62f4704b8d665ac4a8c318a809a070217caa8901. |
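
After pulling this change, a local checkout can confirm that its submodule matches the new pointer. Below is a minimal sketch, assuming the usual `git submodule status` line format: a one-character status flag, the commit SHA, the path, and an optional description in parentheses. The helper names are hypothetical, and the sample line is constructed for illustration rather than captured from a real run:

```python
EXPECTED_SHA = "62f4704b8d665ac4a8c318a809a070217caa8901"
SUBMODULE_PATH = "3rdparty/Megatron-Bridge-workspace/Megatron-Bridge"


def parse_submodule_status(line):
    """Split one `git submodule status` line into (flag, sha, path).

    Assumption: the path contains no spaces.
    """
    flag, rest = line[:1], line[1:]
    # ' ' = in sync, '-' = not initialized, '+' = checked-out commit differs
    # from the recorded one, 'U' = merge conflicts.
    if flag not in (" ", "-", "+", "U"):
        raise ValueError(f"unexpected status flag in line: {line!r}")
    fields = rest.split()
    if len(fields) < 2:
        raise ValueError(f"expected '<sha> <path>' in line: {line!r}")
    return flag, fields[0], fields[1]


def submodule_is_at(line, path, expected_sha):
    """True if the line describes `path` checked out cleanly at `expected_sha`."""
    flag, sha, found_path = parse_submodule_status(line)
    return flag == " " and found_path == path and sha == expected_sha


# Illustrative line only; the trailing description is made up.
sample = f" {EXPECTED_SHA} {SUBMODULE_PATH} (heads/main)"
print(submodule_is_at(sample, SUBMODULE_PATH, EXPECTED_SHA))  # prints: True
```

In practice the line would come from running `git submodule status` in the repository root. A leading `+` means the checked-out submodule commit differs from the recorded one, in which case `git submodule update --init` usually brings it back in line.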

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Suggested labels

external

Suggested reviewers

  • yaoyu-33
  • terrykong

Pre-merge checks and finishing touches

✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title succinctly identifies that the PR fixes a checkpoint conversion error for the qwen 30b-a3b model, directly reflecting the main change described in the PR without including superfluous details or generic language. |
| Docstring Coverage | ✅ Passed | No functions found in the changes. Docstring coverage check skipped. |
| Test Results For Major Changes | ✅ Passed | The PR only updates a submodule reference to pick up a bug fix in Megatron-Bridge without introducing new features or extensive refactors, so it falls under minor changes and does not require explicit test evidence for this check. |
📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9d44598 and 838475f.

📒 Files selected for processing (1)
  • 3rdparty/Megatron-Bridge-workspace/Megatron-Bridge (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: build-container / main
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Post automodel integration comment / Comment on PR

@terrykong
Collaborator

@yaoyu-33 to review

@terrykong terrykong enabled auto-merge (squash) October 10, 2025 16:22
@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Oct 13, 2025
@terrykong terrykong merged commit eb5bb0f into main Oct 13, 2025
88 of 94 checks passed
@terrykong terrykong deleted the yifu/ckpt_fix branch October 13, 2025 18:57
chtruong814 pushed a commit that referenced this pull request Oct 13, 2025
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
lbliii pushed a commit that referenced this pull request Nov 3, 2025
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>

Labels

CI:L1 Run doctests, unit tests, and functional tests r0.4.0

3 participants