fix: update Megatron-LM submodule to bfd160ba to fix checkpoint SIGSEGV #2070

dafu-wu wants to merge 1 commit into NVIDIA-NeMo:main
Conversation
📝 Walkthrough

Updates the Megatron-LM submodule pointer from commit 193463c4f8414e6906a40dd527a450bca50706b1 to bfd160ba1cb4f6fb11c6cd0b2bdeb85c50fc7f73. No functional or behavioral changes introduced.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~2 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@3rdparty/Megatron-LM-workspace/Megatron-LM`:
- Line 1: Confirm that the three NeMo-RL-specific patches (padding metadata fix, checkpoint loading fix, and any changes related to Ray actor checkpoint saving) are preserved, or are no longer needed, after the Megatron-LM submodule bump:
  - Search the new upstream commit bfd160ba... and the updated Megatron-LM tree for the specific fixes (look for write_preloaded_data_multithread(), padding metadata references, and checkpoint load logic).
  - Verify whether equivalent changes were merged upstream.
  - Check the NeMo-RL repo for re-applied patches or alternate implementations (grep for write_preloaded_data_multithread, checkpoint loading functions, and padding metadata handling).
  - If it is still unclear, run the Ray actor checkpoint save scenario that previously triggered the SIGSEGV to confirm behavior.
  - If the fixes are missing, re-apply the original NeMo-RL patches to the new tree or adapt them to the upstream changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a5d0215a-b0e2-413d-afbf-2a5dfaa39c69
📒 Files selected for processing (1)
3rdparty/Megatron-LM-workspace/Megatron-LM
@@ -1 +1 @@
-Subproject commit 193463c4f8414e6906a40dd527a450bca50706b1
+Subproject commit bfd160ba1cb4f6fb11c6cd0b2bdeb85c50fc7f73
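A gitlink like the one in this diff appears in `git ls-files --stage` output with mode 160000, which is how the review scripts below locate the pinned commit. A minimal, hypothetical parser for that output (illustrative only, not part of the PR):

```python
"""Sketch: find a submodule's pinned commit in `git ls-files --stage` output.

Mode 160000 marks a gitlink (submodule pointer). Illustrative helper only.
"""

def submodule_commit(ls_files_output: str, path: str) -> "str | None":
    """Return the pinned commit for `path`, or None if it is not a gitlink."""
    for line in ls_files_output.splitlines():
        parts = line.split()
        # Expected shape: <mode> <sha> <stage>\t<path>; split() handles the tab.
        if len(parts) >= 4 and parts[0] == "160000" and parts[3] == path:
            return parts[1]
    return None
```

Feeding it the `git ls-files --stage` capture from the scripts below would return the new bfd160ba1... pointer for the Megatron-LM path.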
❓ Verification inconclusive
Script executed:
#!/bin/bash
set -euo pipefail
# First, check if the submodule path exists and is a git submodule
SUBMODULE_PATH="3rdparty/Megatron-LM-workspace/Megatron-LM"
if [ ! -f ".gitmodules" ]; then
echo "ERROR: .gitmodules not found"
exit 1
fi
echo "=== Checking .gitmodules for submodule info ==="
cat .gitmodules | grep -A 3 "Megatron-LM" || echo "Pattern not found directly"
echo ""
echo "=== Checking if submodule directory exists ==="
if [ -d "$SUBMODULE_PATH" ]; then
echo "Submodule directory exists at $SUBMODULE_PATH"
cd "$SUBMODULE_PATH"
echo ""
echo "=== Git config in submodule ==="
git config --local --list | head -20 || echo "No git config"
echo ""
echo "=== Current submodule commit ==="
git rev-parse HEAD
else
echo "Submodule directory does not exist at $SUBMODULE_PATH"
fi

Repository: NVIDIA-NeMo/RL
Exit code: 0
stdout:
=== Checking .gitmodules for submodule info ===
[submodule "3rdparty/Megatron-LM"]
path = 3rdparty/Megatron-LM-workspace/Megatron-LM
url = https://github.com/yaoyu-33/Megatron-LM.git
branch = main
shallow = true
[submodule "3rdparty/Megatron-Bridge"]
=== Checking if submodule directory exists ===
Submodule directory exists at 3rdparty/Megatron-LM-workspace/Megatron-LM
=== Git config in submodule ===
core.repositoryformatversion=1
core.filemode=true
core.bare=false
core.logallrefupdates=true
core.hookspath=.git/hooks
gc.auto=0
fsck.auto=0
remote.origin.url=https://github.com/NVIDIA-NeMo/RL.git
remote.origin.tagopt=--no-tags
remote.origin.fetch=+refs/heads/main:refs/remotes/origin/main
remote.origin.promisor=true
remote.origin.partialclonefilter=blob:none
branch.main.remote=origin
branch.main.merge=refs/heads/main
fetch.recursesubmodules=false
=== Current submodule commit ===
42b8daa33f0846ae0ac00023246a8a04328bf56d
Script executed:
#!/bin/bash
set -euo pipefail
# Check the actual .gitmodules file content for the submodule
echo "=== Full .gitmodules content ==="
cat .gitmodules || echo ".gitmodules not found"
echo ""
echo "=== Check git ls-files for submodule reference ==="
git ls-files --stage | grep -i megatron || echo "No megatron submodule in index"

Repository: NVIDIA-NeMo/RL
Exit code: 0
stdout:
=== Full .gitmodules content ===
[submodule "3rdparty/Megatron-LM"]
path = 3rdparty/Megatron-LM-workspace/Megatron-LM
url = https://github.com/yaoyu-33/Megatron-LM.git
branch = main
shallow = true
[submodule "3rdparty/Megatron-Bridge"]
path = 3rdparty/Megatron-Bridge-workspace/Megatron-Bridge
url = https://github.com/NVIDIA-NeMo/Megatron-Bridge.git
branch = main
shallow = true
[submodule "3rdparty/Automodel-workspace/Automodel"]
path = 3rdparty/Automodel-workspace/Automodel
url = https://github.com/NVIDIA-NeMo/Automodel.git
branch = yifu/bump-torch-and-hf
shallow = true
[submodule "3rdparty/Gym-workspace/Gym"]
path = 3rdparty/Gym-workspace/Gym
url = https://github.com/NVIDIA-NeMo/Gym.git
branch = main
shallow = true
=== Check git ls-files for submodule reference ===
160000 15398e08fc86be3de084c7382116527246ab1852 0 3rdparty/Megatron-Bridge-workspace/Megatron-Bridge
100644 b2ae4cf651e08469651b1acb136c8e774f0fcfe2 0 3rdparty/Megatron-Bridge-workspace/is_megatron_bridge_installed.py
100644 783532c6608da3841808319ad308f3bd8a30b7a5 0 3rdparty/Megatron-Bridge-workspace/pyproject.toml
100644 a0beea944956c7aa3092f41e9dd832ae6b71542d 0 3rdparty/Megatron-Bridge-workspace/setup.py
160000 bfd160ba1cb4f6fb11c6cd0b2bdeb85c50fc7f73 0 3rdparty/Megatron-LM-workspace/Megatron-LM
100644 9a88db404f5ba83dc0b85a932b1a18c47c4b2b29 0 3rdparty/Megatron-LM-workspace/is_megatron_installed.py
100644 4537293a9d698c9c74e3fad6a1c9b888889ac1e4 0 3rdparty/Megatron-LM-workspace/pyproject.toml
100644 fb0a7cf92eeac03df7110cc6e7d072cadd813347 0 3rdparty/Megatron-LM-workspace/setup.py
100644 76151678f1a1f47f5ec571c41915fd49a8adf5a1 0 examples/configs/distillation_math_megatron.yaml
100644 b240c6519caa1b15b6d6892868cf8d0ca2dbeb17 0 examples/configs/grpo_math_1B_megatron.yaml
100644 4d17fdcea30b5320f3e5ebdd2bab4266b0c4e3bc 0 examples/configs/grpo_math_70B_megatron.yaml
100644 df239cd8ff8f2b173eeb02508e064347a8b3b1e0 0 examples/configs/grpo_math_70B_megatron_fp8.yaml
100644 977ab394b51dc5f58ee8a042101b39ec5618b1ea 0 examples/configs/grpo_math_8B_megatron.yaml
100644 ba6ee6e5c84eea2e226d49915dcd5bc5d3ff8125 0 examples/configs/grpo_math_8B_megatron_fp8.yaml
100644 37616e32b0f6d0271ddbf1d0ace7b7cd9f4b4982 0 examples/configs/grpo_math_qwen30ba3b_megatron.yaml
100644 95c9e85573263f7171331858697b3e756bb6c543 0 examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n4g-megatron-tp1pp2cp2-pack.yaml
100644 6fda3fe24ed69784a0e5dc9d874264f34fafa106 0 examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.yaml
100644 8324173dfc2920ad91375f2674e6342f5f1b6182 0 examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n4g-megatrontp1pp2-quick.yaml
100644 8df4bc3fb0fbf848e19f75d257b1bb154a16369a 0 examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.yaml
100644 8b3a43ea28cc842e926a4dd53bd7a5418648911e 0 examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-megatrontp2pp2-quick.yaml
100644 fb4a4bc880d9864f9d60036474ad5b9d8c9a5773 0 examples/configs/recipes/llm/grpo-dapomath17k-dsv3-32n4g-megatron.yaml
100644 8d19757d54fcb33c12b6df20bc03478f7d309f54 0 examples/configs/recipes/llm/grpo-dapomath17k-dsv3-megatron.yaml
100644 c9719f381f19391f5505d1088187ef67be45c30d 0 examples/configs/recipes/llm/grpo-gptoss-20b-8n4g-megatron.yaml
100755 b3dec78e98b1f60a1729d0ef5cab8473c74529f7 0 examples/configs/recipes/llm/grpo-gptoss-20b-8n8g-megatron.yaml
100644 dcd791eee6903de60cb70923a627ac5b8d80e9ee 0 examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v3.yaml
100644 5b348e9c5adc3d9ca2b9c3a2aaed1ec19bfcd558 0 examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-2n8g-megatron-fp8-e2e.yaml
100644 43608a94026cf1fcccb955aeca499606bcc3273c 0 examples/configs/recipes/llm/grpo-llama3.2-1b-instruct-1n4g-megatron.yaml
100644 46c1a31fb518fca7b067b17f469bfcf3becb686f 0 examples/configs/recipes/llm/grpo-llama3.2-1b-instruct-1n4g-megatron_generation.yaml
100755 333a06d98054f3d426c4f521e77096cd20016448 0 examples/configs/recipes/llm/grpo-llama3.2-1b-instruct-1n8g-megatron.yaml
100644 bb641388d87d432e2cc7907654ce05f2f18fffea 0 examples/configs/recipes/llm/grpo-llama3.2-1b-instruct-1n8g-megatron_generation.yaml
100644 92fb87c19699c3e968f5dde94d980101c728649b 0 examples/configs/recipes/llm/grpo-math-qwen3-30ba3b-megatron-tp4-32k.yaml
100644 97d6ffede7a59d69190a2d15d6ffeea3d78288c2 0 examples/configs/recipes/llm/grpo-moonlight-16ba3b-4n4g-megatron.yaml
100644 54b8d6671f8dd4a74f7e3328a0be0d59c0850c07 0 examples/configs/recipes/llm/grpo-moonlight-16ba3b-4n8g-megatron-fp8-e2e.yaml
100644 83ea6128efa20204e7152cb1eac144575f8abbd2 0 examples/configs/recipes/llm/grpo-moonlight-16ba3b-4n8g-megatron.yaml
100644 da8301a19b32da5e63d5991237fa00d6dfba4d08 0 examples/configs/recipes/llm/grpo-nano-v2-12b-1n4g-megatron.yaml
100644 86690abcc216d53ba8b79af777a683da1ecfc1d9 0 examples/configs/recipes/llm/grpo-nano-v2-12b-1n8g-megatron.yaml
100644 b21c9dd51fb0ce954b4e69eb5822b634bbd7cec0 0 examples/configs/recipes/llm/grpo-qwen2.5-7b-instruct-4n4g-megatron.yaml
100755 fd0a48a6636575e42eee551ab43267f31c7c60d5 0 examples/configs/recipes/llm/grpo-qwen2.5-7b-instruct-4n8g-megatron.yaml
100644 79fbda389dd06da23c3c10e55f7becc67e4601bd 0 examples/configs/recipes/llm/grpo-qwen3-30ba3b-8n4g-megatron.yaml
100755 6e0aa5cd81ca5bf6e26d93c52ebf41d7197f7dfc 0 examples/configs/recipes/llm/grpo-qwen3-30ba3b-8n8g-megatron.yaml
100644 69ff4a4229a51ebf6401e23c1b60d0111699828e 0 examples/configs/recipes/llm/grpo-qwen3-8b-base-1n8g-fp8-kvcache-megatron.yaml
100644 77c175fadf147f3e3c32cc07c30c514a92644fc1 0 examples/configs/recipes/llm/sft-llama3.1-70b-8n4g-tp2pp2-long-megatron.yaml
100644 bb439558123c31ee1815ccc6a5bdac147b27aedb 0 examples/configs/recipes/llm/sft-llama3.1-70b-8n8g-tp4pp2-long-megatron.yaml
100644 b2b76c0afd0bb2f139ad2ea02902dcbba9c1fc21 0 examples/configs/recipes/llm/sft-llama3.1-8b-1n8g-megatron-lora.yaml
100644 aa62330e3e93c413b98e9375130069153e2c43f8 0 examples/configs/recipes/llm/sft-llama3.1-8b-1n8g-megatron-seqpack.yaml
100644 7e9452dff7e299bd199da4129d02b173dfa90a8a 0 examples/configs/recipes/llm/sft-llama3.1-8b-1n8g-megatron.yaml
100644 aad3f5c8e08669da1efa90e681598397c86f6766 0 examples/configs/recipes/llm/sft-qwen2.5-math7b-2n4g-megatron.yaml
100644 d3bdd77bb26910ebeb25e295630c76053df26702 0 examples/configs/recipes/llm/sft-qwen2.5-math7b-2n8g-megatron.yaml
100644 fb70eedb20734a53a946888b067bfbf52627a23a 0 examples/configs/recipes/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-megatrontp1.v1.yaml
100644 d81a58980ee95c13e3a0b62e95a6d5b57a1afa6b 0 examples/configs/recipes/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-megatrontp2.v1.yaml
100644 40f62473acda2a46532f2bdd62a258f446a2be99 0 examples/configs/sft_openmathinstruct2_megatron.yaml
100644 a38b6e15a8855505b467d5d8144d734b4954a3d2 0 examples/configs/vlm_grpo_3B_megatron.yaml
100644 1a3bef0bee69689608bdab3f1702fb014ea5cb92 0 examples/converters/convert_megatron_to_hf.py
100644 4fc25d0d3c9856b04e2b1b565777ed4655415b2b 0 nemo_rl/models/megatron/__init__.py
100644 69912f209c570815eabfa3465bbcaa47e146b094 0 nemo_rl/models/megatron/common.py
100644 271cda579cef3df234b4c61898c6b5eb124fbe0b 0 nemo_rl/models/megatron/community_import.py
100644 5838e9d43079f924c8d69e1211481f8fffadb04a 0 nemo_rl/models/megatron/config.py
100644 7c765f19b538da299eec52b22107a029b7e3ff45 0 nemo_rl/models/megatron/data.py
100644 7728f80f65c12f8a62d3e8e37e1270d86fac678c 0 nemo_rl/models/megatron/pipeline_parallel.py
100644 e9fc2da9e16348f03cecb089f87aec7ec8a051dc 0 nemo_rl/models/megatron/setup.py
100644 95ccc3761df7643e5d371fbcb33158ef14a46bdc 0 nemo_rl/models/megatron/train.py
100644 d9a1c3d8a383c1eae550f48c56d76c2437ee1e9a 0 nemo_rl/models/policy/workers/megatron_policy_worker.py
100644 7866e2536595449a00134de9da463548c5b3fbaf 0 tests/functional/distillation_megatron.sh
100755 11d8b7602ad48d2c69da0c49531ac0138a99d0d8 0 tests/functional/dpo_megatron.sh
100755 3bae135a7e14a60f81c88ec8ed1b521838251e59 0 tests/functional/grpo_megatron.sh
100644 c5a82781b3eeb3e6ddfcd807f73c95c02e207054 0 tests/functional/grpo_megatron_generation.sh
100755 dfb7fcfdba7ad541fabc212cfc6ddef833a83679 0 tests/functional/sft_megatron.sh
100755 f5dfc6d34170df2816bb9ebf0816f3c281c362b6 0 tests/functional/sft_megatron_lora.sh
100755 5435b3c0180d877a1d50ab1d728b1d5ffb709677 0 tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n4g-megatron-tp1pp2cp2-pack.sh
100755 e8211372771f47ca02ddfaa58bb4078484fd958e 0 tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh
100755 3db1458c8b422f608715967d0caf7837a3867018 0 tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n4g-megatrontp1pp2-quick.sh
100755 7a98815ec7105fcdef96f3e5b19163910c7a268d 0 tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh
100755 f57f755051212ac5c669a9e5a0b0f162080d5590 0 tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatrontp2pp2-quick.sh
100755 e01290160005b3d785fe0fba1a6299c517419bb9 0 tests/test_suites/llm/grpo-dapomath17k-dsv3-32n4g-megatron.sh
100755 8db1985dfb5c1ef497e3dbf72e99bc104306ab64 0 tests/test_suites/llm/grpo-dapomath17k-dsv3-megatron.sh
100755 38620c0c5b21b125a9b6afc8a5bda09bb97b688d 0 tests/test_suites/llm/grpo-gptoss-20b-8n4g-megatron.sh
100755 346fc8359daf864e20f6fd1145676e49960115e7 0 tests/test_suites/llm/grpo-gptoss-20b-8n8g-megatron.sh
100755 6b3f5673de1a0416bec38cdf53cf4a26243f286c 0 tests/test_suites/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v3.sh
100755 b31c39fe65774af872abbee2fc0c97284a456233 0 tests/test_suites/llm/grpo-llama3.1-8b-instruct-2n8g-megatron-fp8-e2e.sh
100755 29eff0a52a14a7d12f2f05b77ba8a716afc36dda 0 tests/test_suites/llm/grpo-llama3.2-1b-instruct-1n4g-megatron.sh
100755 02434e81069ead9f6861d3cceccf3d21d65651ad 0 tests/test_suites/llm/grpo-llama3.2-1b-instruct-1n4g-megatron_generation.sh
100755 60e61e808880540f21df359357596f715179c381 0 tests/test_suites/llm/grpo-llama3.2-1b-instruct-1n8g-megatron.sh
100755 7806a567472695623bcc6a457ac99d08a780e970 0 tests/test_suites/llm/grpo-llama3.2-1b-instruct-1n8g-megatron_generation.sh
100755 e6d96dc04f522de2a47fdb753da7942bd59fbcf5 0 tests/test_suites/llm/grpo-math-qwen3-30ba3b-megatron-tp4-32k.sh
100755 02b0d2c422d9d76b36636aac46eca2b7142a08aa 0 tests/test_suites/llm/grpo-moonlight-16ba3b-4n4g-megatron.sh
100755 71527d42ef6de0c037ff49a14d1bdb5781045b05 0 tests/test_suites/llm/grpo-moonlight-16ba3b-4n8g-megatron-fp8-e2e.sh
100755 71527d42ef6de0c037ff49a14d1bdb5781045b05 0 tests/test_suites/llm/grpo-moonlight-16ba3b-4n8g-megatron.sh
100755 f746005e04d788ef425d7892be051b919bd7e094 0 tests/test_suites/llm/grpo-nano-v2-12b-1n4g-megatron.sh
100755 05bd0bf3e8a0d4f1c0f6ddf3ea1a23d78235dfe1 0 tests/test_suites/llm/grpo-nano-v2-12b-1n8g-megatron.sh
100755 fc22c6c3dcbf42424ff6d1ff16e8caf4ee6691fe 0 tests/test_suites/llm/grpo-qwen2.5-7b-instruct-4n4g-megatron.sh
100755 391b7f21e9065d1c40b2b1c3292824c5ca08ecc6 0 tests/test_suites/llm/grpo-qwen2.5-7b-instruct-4n8g-megatron.sh
100755 01229e7aaf39435f0ea42fcb86e24092a74b91d3 0 tests/test_suites/llm/grpo-qwen3-30ba3b-8n4g-megatron.sh
100755 ad369c4395a6ac5a82b37d2e7ee6116d943704c6 0 tests/test_suites/llm/grpo-qwen3-30ba3b-8n8g-megatron.sh
100755 86ce085f605ffbef9ecde17e44af7329db5a4ec7 0 tests/test_suites/llm/grpo-qwen3-8b-base-1n8g-fp8-kvcache-megatron.sh
100755 8e0b208e04793a4f164cbb7697da027890f5607c 0 tests/test_suites/llm/sft-llama3.1-70b-8n4g-tp2pp2-long-megatron.sh
100755 99f264e91050cb92c86b7bacf44745a341c80c40 0 tests/test_suites/llm/sft-llama3.1-70b-8n8g-tp4pp2-long-megatron.sh
100755 f2d1672bdbf6e37145693b2d657f3e9e09906088 0 tests/test_suites/llm/sft-llama3.1-8b-1n8g-megatron-lora.sh
100755 89aa5b184a5627614d91874dcf613fe0a014687e 0 tests/test_suites/llm/sft-llama3.1-8b-1n8g-megatron-seqpack.sh
100755 81ea9f2f6b763dd5819529af57fa653b7d113311 0 tests/test_suites/llm/sft-llama3.1-8b-1n8g-megatron.sh
100755 9f6ded976012dd4e8aa3684a7c8529b5e9d32d3b 0 tests/test_suites/llm/sft-qwen2.5-math7b-2n4g-megatron.sh
100755 1ce72032036e0e4c012b8967a26074e428dbd1d8 0 tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh
100755 7b15555457f6acde5aad5c01a6d3afb12c880836 0 tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-megatrontp1.v1.sh
100755 bbd490e4c09fa3af2ef56f5b8834cd935399ec9d 0 tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-megatrontp2.v1.sh
100644 341a77c5bc66dee5d2ba0edf888f91e5bf225e3c 0 tests/unit/models/megatron/__init__.py
100644 6e381d2933402d0c880510bf717fda60d635da94 0 tests/unit/models/megatron/test_megatron_data.py
100644 16d77389a68852591ea1633b83e9644a28b60516 0 tests/unit/models/megatron/test_megatron_setup.py
100644 6a8c2ea7fabaae0744dc4f3aa14f7840cbac8c8e 0 tests/unit/models/megatron/test_moe_metrics.py
100644 cf261c3d75e1ad02a36fc3ecc5ac3f61719f6049 0 tests/unit/models/megatron/test_train.py
100644 7d329ab411da81abe44d9a21ccfb63d9c3cd7bd4 0 tests/unit/models/policy/test_megatron_worker.py
Script executed:
#!/bin/bash
set -euo pipefail
# Verify submodule commit history around old/new pointers and inspect missing fixes.
# Read-only reconnaissance only.
SUBMODULE_PATH="3rdparty/Megatron-LM-workspace/Megatron-LM"
OLD_COMMIT="193463c4f8414e6906a40dd527a450bca50706b1"
NEW_COMMIT="bfd160ba1cb4f6fb11c6cd0b2bdeb85c50fc7f73"
echo "=== Verifying commits exist ==="
if git -C "$SUBMODULE_PATH" rev-parse --verify "$OLD_COMMIT" >/dev/null 2>&1; then
echo "✓ OLD_COMMIT found: $OLD_COMMIT"
else
echo "✗ OLD_COMMIT NOT found: $OLD_COMMIT"
fi
if git -C "$SUBMODULE_PATH" rev-parse --verify "$NEW_COMMIT" >/dev/null 2>&1; then
echo "✓ NEW_COMMIT found: $NEW_COMMIT"
else
echo "✗ NEW_COMMIT NOT found: $NEW_COMMIT"
fi
echo ""
echo "=== Commit range summary (first 120 lines) ==="
git -C "$SUBMODULE_PATH" log --oneline --decorate "${OLD_COMMIT}..${NEW_COMMIT}" 2>/dev/null | sed -n '1,120p' || echo "No commits in range or commits not found"
echo ""
echo "=== Search for likely checkpoint patch topics ==="
for q in "padding metadata" "checkpoint load" "filesystem_async" "write_preloaded_data_multithread" "fork" "thread"; do
echo "--- query: '$q'"
git -C "$SUBMODULE_PATH" log --oneline --grep="$q" "$NEW_COMMIT" -n 5 2>/dev/null || echo " (no matches or query failed)"
done
echo ""
echo "=== Diff hotspot check for NEW_COMMIT (checkpoint/filesystem code) ==="
git -C "$SUBMODULE_PATH" show --name-only --pretty=oneline "$NEW_COMMIT" 2>/dev/null | grep -E "filesystem_async|checkpoint|dist_checkpointing|serialization" || echo "(no matching filenames in $NEW_COMMIT)"
echo ""
echo "=== Check if old commit is reachable from new commit ==="
if git -C "$SUBMODULE_PATH" merge-base --is-ancestor "$OLD_COMMIT" "$NEW_COMMIT"; then
echo "✓ OLD_COMMIT is an ancestor of NEW_COMMIT (linear history or merge)"
else
echo "✗ OLD_COMMIT is NOT an ancestor of NEW_COMMIT (may be on different branch)"
fi

Repository: NVIDIA-NeMo/RL
Exit code: 0
stdout:
=== Verifying commits exist ===
✓ OLD_COMMIT found: 193463c4f8414e6906a40dd527a450bca50706b1
✓ NEW_COMMIT found: bfd160ba1cb4f6fb11c6cd0b2bdeb85c50fc7f73
=== Commit range summary (first 120 lines) ===
No commits in range or commits not found
=== Search for likely checkpoint patch topics ===
--- query: 'padding metadata'
(no matches or query failed)
--- query: 'checkpoint load'
(no matches or query failed)
--- query: 'filesystem_async'
(no matches or query failed)
--- query: 'write_preloaded_data_multithread'
(no matches or query failed)
--- query: 'fork'
(no matches or query failed)
--- query: 'thread'
(no matches or query failed)
=== Diff hotspot check for NEW_COMMIT (checkpoint/filesystem code) ===
(no matching filenames in bfd160ba1cb4f6fb11c6cd0b2bdeb85c50fc7f73)
=== Check if old commit is reachable from new commit ===
✗ OLD_COMMIT is NOT an ancestor of NEW_COMMIT (may be on different branch)
stderr:
fatal: remote error: upload-pack: not our ref 193463c4f8414e6906a40dd527a450bca50706b1
fatal: could not fetch 193463c4f8414e6906a40dd527a450bca50706b1 from promisor remote
Script executed:
#!/bin/bash
set -euo pipefail
SUBMODULE_PATH="3rdparty/Megatron-LM-workspace/Megatron-LM"
OLD_COMMIT="193463c4f8414e6906a40dd527a450bca50706b1"
NEW_COMMIT="bfd160ba1cb4f6fb11c6cd0b2bdeb85c50fc7f73"
echo "=== Check which branches contain each commit ==="
echo "OLD_COMMIT branches:"
git -C "$SUBMODULE_PATH" branch -r --contains "$OLD_COMMIT" 2>/dev/null || echo " (none found or error)"
echo ""
echo "NEW_COMMIT branches:"
git -C "$SUBMODULE_PATH" branch -r --contains "$NEW_COMMIT" 2>/dev/null || echo " (none found or error)"
echo ""
echo "=== Show commit details for OLD_COMMIT ==="
git -C "$SUBMODULE_PATH" show --oneline --format=fuller "$OLD_COMMIT" 2>/dev/null | head -20 || echo " (cannot show)"
echo ""
echo "=== Show commit details for NEW_COMMIT ==="
git -C "$SUBMODULE_PATH" show --oneline --format=fuller "$NEW_COMMIT" 2>/dev/null | head -20 || echo " (cannot show)"
echo ""
echo "=== List all remote branches ==="
git -C "$SUBMODULE_PATH" branch -r 2>/dev/null || echo " (no remote branches)"
echo ""
echo "=== Check git log for both commits ==="
echo "Commits in last 100 on origin/main containing either SHA:"
git -C "$SUBMODULE_PATH" log -100 --oneline origin/main 2>/dev/null | grep -E "193463c4|bfd160ba" || echo " (neither found in recent history)"
echo ""
echo "=== Compute common ancestor (merge-base) ==="
COMMON=$(git -C "$SUBMODULE_PATH" merge-base "$OLD_COMMIT" "$NEW_COMMIT" 2>/dev/null)
echo "Common ancestor: $COMMON"
echo "Common ancestor summary:"
git -C "$SUBMODULE_PATH" show --oneline "$COMMON" 2>/dev/null | head -3

Repository: NVIDIA-NeMo/RL
Exit code: 128
stdout:
=== Check which branches contain each commit ===
OLD_COMMIT branches:
(none found or error)
NEW_COMMIT branches:
(none found or error)
=== Show commit details for OLD_COMMIT ===
(cannot show)
=== Show commit details for NEW_COMMIT ===
(cannot show)
=== List all remote branches ===
origin/HEAD -> origin/main
origin/main
=== Check git log for both commits ===
Commits in last 100 on origin/main containing either SHA:
(neither found in recent history)
=== Compute common ancestor (merge-base) ===
Script executed:
#!/bin/bash
set -euorapipefail
SUBMODULE_PATH="3rdparty/Megatron-LM-workspace/Megatron-LM"
NEW_COMMIT="bfd160ba1cb4f6fb11c6cd0b2bdeb85c50fc7f73"
echo "=== Current HEAD in submodule ==="
git -C "$SUBMODULE_PATH" log --oneline -1 || echo "Error getting HEAD"
echo ""
echo "=== Check if NEW_COMMIT is reachable from origin/main ==="
if git -C "$SUBMODULE_PATH" log --all --oneline | grep -q "^bfd160ba"; then
echo "✓ NEW_COMMIT is reachable from some ref"
else
echo "✗ NEW_COMMIT is NOT reachable from any ref"
fi
echo ""
echo "=== Fetch full history to see if we can get more context ==="
git -C "$SUBMODULE_PATH" rev-list --all --oneline | head -50
echo ""
echo "=== Check NeMo RL repo for references to these commits ==="
git log --all --oneline -- "3rdparty/Megatron-LM-workspace/Megatron-LM" | head -20
echo ""
echo "=== Check git ls-files current submodule pointer ==="
git ls-files --stage | grep "Megatron-LM-workspace"

Repository: NVIDIA-NeMo/RL
Exit code: 1
stdout:
allexport off
braceexpand on
emacs off
errexit on
errtrace off
functrace off
hashall on
histexpand off
history off
ignoreeof off
interactive-comments on
keyword off
monitor off
noclobber off
noexec off
noglob off
nolog off
notify off
nounset on
onecmd off
physical off
pipefail off
posix off
privileged off
verbose off
vi off
xtrace off
stderr:
/bin/bash: line 2: set: -l: invalid option
set: usage: set [-abefhkmnptuvxBCEHPT] [-o option-name] [--] [-] [arg ...]
Script executed:
#!/bin/bash
set -euo pipefail
# Since shallow clone is the issue, let's check what's actually being tracked
SUBMODULE_PATH="3rdparty/Megatron-LM-workspace/Megatron-LM"
echo "=== Submodule HEAD commit ==="
HEAD=$(git -C "$SUBMODULE_PATH" rev-parse HEAD 2>/dev/null)
echo "Current HEAD: $HEAD"
echo ""
echo "=== Parent commits of HEAD ==="
git -C "$SUBMODULE_PATH" log --oneline --graph -n 10 || echo "(error)"
echo ""
echo "=== Check .git/shallow file ==="
if [ -f "$SUBMODULE_PATH/.git/shallow" ]; then
echo "Shallow file exists. Contents:"
cat "$SUBMODULE_PATH/.git/shallow"
else
echo "No .git/shallow file (not a shallow clone)"
fi
echo ""
echo "=== Try to unshallow and fetch ==="
git -C "$SUBMODULE_PATH" fetch --unshallow origin 2>&1 | head -20 || echo "(unshallow failed or already unshallowed)"
echo ""
echo "=== List all objects reachable from refs ==="
git -C "$SUBMODULE_PATH" rev-list --all --oneline | wc -l
echo "Total commits reachable"

Repository: NVIDIA-NeMo/RL
Exit code: 0
stdout:
=== Submodule HEAD commit ===
Current HEAD: 42b8daa33f0846ae0ac00023246a8a04328bf56d
=== Parent commits of HEAD ===
* 42b8daa fix: update Megatron-LM submodule to bfd160ba to fix checkpoint SIGSEGV
=== Check .git/shallow file ===
No .git/shallow file (not a shallow clone)
=== Try to unshallow and fetch ===
=== List all objects reachable from refs ===
777
Total commits reachable
Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Search for PR/commit context in main NeMo-RL repo ==="
git log --all --oneline --grep="Megatron" | head -20
echo ""
echo "=== Search for recent submodule updates ==="
git log --all --oneline -- "3rdparty/Megatron-LM-workspace/Megatron-LM" | head -10
echo ""
echo "=== Show the actual PR commit that updated the submodule ==="
git log --all --format="%h %s" --grep="bfd160ba" | head -5
echo ""
echo "=== Check recent commits for checkpoint-related changes ==="
git log --all --oneline --grep="checkpoint\|SIGSEGV" | head -20
echo ""
echo "=== Search codebase for NeMo-RL checkpoint patches ==="
rg -i "nemo.*rl.*checkpoint|checkpoint.*nemo.*rl" --type py | head -20 || echo "(no matches)"
echo ""
echo "=== Check if there are any NeMo-RL specific filesystem_async or write_preloaded_data changes ==="
rg "write_preloaded_data|filesystem_async" --type py | head -20 || echo "(no matches)"

Repository: NVIDIA-NeMo/RL
Exit code: 141
stdout:
=== Search for PR/commit context in main NeMo-RL repo ===
42b8daa fix: update Megatron-LM submodule to bfd160ba to fix checkpoint SIGSEGV
a426896 feat: Megatron LoRA GRPO w/ Weight Merging (`#1889`)
4809765 feat: Megatron SFT LoRA (`#1629`)
90fb0a8 fix: Fixes to make Megatron backend match dtensor (`#1389`)
73c8725 feat: add Megatron support for on-policy distillation (`#1324`)
8f6e00e fix: Megatron worker to have locked dependencies (`#1315`)
00cb570 fix: parallel state initialization error in Megatron to HF model conversion (`#1120`)
16e08cd feat: FP8 Training in Megatron Path (`#971`)
16d9128 feat: Deepseek migration to Megatron-Bridge + CP support (`#1059`)
c4fd5d3 feat: Migration from NeMo Tron to Megatron Bridge (`#905`)
ac7469f test: Add Megatron tests (`#713`)
00f930a fix: Megatron config updates to avoid OOM (`#687`)
a08829b feat: Megatron MoE Support (`#590`)
cfb803d fix: Megatron config fixes (`#576`)
3d36d0a feat: Enable SFT and DPO with Megatron backend (`#525`)
c0927f0 docs: Fixing some Megatron types and small cleanup (`#526`)
ab622eb feat: Add Megatron-LM based training (`#517`)
54bec9d feat: Moving everything to 'Policy' and lm_policy for Megatron (removing 'hf') (`#511`)
=== Search for recent submodule updates ===
42b8daa fix: update Megatron-LM submodule to bfd160ba to fix checkpoint SIGSEGV
5bb8586 chore: Switch to mcore upstream main (`#1990`)
84bede0 feat: Remove do_not_average_loss (`#1988`)
8ef0de9 chore: bump mcore and mbridge (`#1902`)
a26e53b fix: on GB200 use single-thread checkpoint save to avoid Cpu OOM (`#1703`)
5d04b36 chore: update megatron dev (11/21/2025) / mbridge (11/28/2025) (`#1568`)
2e2c2b3 chore: Update RL to use megatron-bridge tot (`#1358`)
66099f5 fix: A fix in megatron YARN module for memory leak (`#1163`)
16d9128 feat: Deepseek migration to Megatron-Bridge + CP support (`#1059`)
c4fd5d3 feat: Migration from NeMo Tron to Megatron Bridge (`#905`)
Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Show full commit message of the update PR ==="
git show --format=full 42b8daa 2>&1 | head -50
echo ""
echo "=== Check if there are documented NeMo-RL patches in PR description or comments ==="
git log -1 --format=%B 42b8daa
echo ""
echo "=== Search for any TODO/FIXME related to dropped patches ==="
rg "TODO|FIXME" nemo_rl/models/megatron/ --type py -A 2 | head -30 || echo "(none found)"

Repository: NVIDIA-NeMo/RL
Exit code: 0
stdout:
=== Show full commit message of the update PR ===
commit 42b8daa33f0846ae0ac00023246a8a04328bf56d
Author: DafuWu <wuchengyi2006@163.com>
Commit: DafuWu <wuchengyi2006@163.com>
fix: update Megatron-LM submodule to bfd160ba to fix checkpoint SIGSEGV
Update Megatron-LM submodule from 193463c4f to bfd160ba1 (NVIDIA/Megatron-LM main).
This fixes a critical SIGSEGV crash during checkpoint saving in Ray actor
environments (e.g., NeMo RL SFT training on GB200 NVL72 clusters).
Root cause: The old filesystem_async.py used mp.get_context('fork') to create
child processes for parallel checkpoint file writing. In Ray actors, fork()
copies complex multi-threaded state (NCCL CUMEM mappings, CUDA contexts, gRPC
threads) into child processes where these resources are invalid, causing SIGSEGV
in crc32_16bytes() during torch.save().
The upstream fix (already merged in NVIDIA/Megatron-LM) replaces fork-based
multiprocessing with threading.Thread in write_preloaded_data_multithread(),
which is safe because:
- All tensor data is already on CPU (staged by preload_tensors)
- Each thread writes to a different file (no contention)
- File I/O releases the GIL (true parallelism for I/O-bound work)
- No fork means no corrupted NCCL/CUDA/gRPC state
Note: 3 NeMo-RL-specific commits on the old branch (193463c4f) are being
replaced. These patches (padding metadata fix, ckpt loading fix) should be
verified for compatibility with the new Megatron-LM version.
diff --git a/3rdparty/Megatron-LM-workspace/Megatron-LM b/3rdparty/Megatron-LM-workspace/Megatron-LM
index 193463c..bfd160b 160000
--- a/3rdparty/Megatron-LM-workspace/Megatron-LM
+++ b/3rdparty/Megatron-LM-workspace/Megatron-LM
@@ -1 +1 @@
-Subproject commit 193463c4f8414e6906a40dd527a450bca50706b1
+Subproject commit bfd160ba1cb4f6fb11c6cd0b2bdeb85c50fc7f73
Verify NeMo-RL checkpoint patches remain compatible after the Megatron-LM submodule bump.
The commit message acknowledges that 3 NeMo-RL-specific patches (padding metadata fix, checkpoint loading fix) on the old branch are being replaced by this update. The change itself addresses a critical SIGSEGV in Ray actor checkpoint saving by replacing fork-based multiprocessing with threading in write_preloaded_data_multithread(). Before merging, confirm these dropped patches are either:
- Present in the new upstream Megatron-LM version (bfd160ba), or
- Re-applied elsewhere in NeMo-RL, or
- No longer needed due to equivalent upstream changes
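The "equivalent upstream changes" case can be checked mechanically with git cherry, which compares patch-ids rather than commit hashes. In the real submodule that check would be roughly `git cherry -v bfd160ba1 193463c4f`; the sandbox below demonstrates the mechanism with throwaway commits (all names here are stand-ins):

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d); cd "$work"
git init -q -b main r; cd r
echo base > f; git add f; git commit -q -m "base"

# "old": a branch carrying a local patch (stand-in for the NeMo-RL fixes).
git checkout -q -b old
echo fix > patch.txt; git add patch.txt; git commit -q -m "local fix"

# "main": upstream independently lands the same change (same patch-id).
git checkout -q main
echo fix > patch.txt; git add patch.txt; git commit -q -m "upstream equivalent"

# A "-" prefix means the old commit's diff already exists upstream,
# i.e. that patch was not actually dropped by the bump.
git cherry -v main old
```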
Signed-off-by: DafuWu <wuchengyi2006@163.com>
Force-pushed from 42b8daa to b0a1ebd
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
Release Notes
No user-visible changes in this release. This update includes internal infrastructure maintenance with no impact on functionality or behavior.