feat: mbridge test init #47

Merged
pablo-garay merged 15 commits into main from pablo-garay/mbridge-test-init on Nov 17, 2025

Conversation

@pablo-garay
Contributor

No description provided.

@pablo-garay requested a review from a team as a code owner on November 14, 2025 22:26
@copy-pr-bot

copy-pr-bot bot commented Nov 14, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@pablo-garay
Contributor Author

/ok to test e387e66

@pablo-garay
Contributor Author

/ok to test 175b42d

@@ -0,0 +1,107 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
Contributor

Can we please move this to tests/functional_tests/mcore/recipes/test_wan_pretrain.py

Contributor

This

Contributor Author

done

Contributor Author

done

Contributor

@abhinavg4 left a comment

Please see the comments

@@ -1,4 +1,32 @@
# Contributing To NeMo DFM
## 🛠️ Setting Up Your Environment
Contributor

Make this correct. This is updated now.

Contributor Author

@pablo-garay Nov 16, 2025

I've updated. PTAL & lmk

@@ -0,0 +1,107 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
Contributor

This

@pablo-garay force-pushed the pablo-garay/mbridge-test-init branch from cffa69c to dfd8a00 on November 16, 2025 05:44
pablo-garay and others added 9 commits November 15, 2025 22:12

* Explicit mcore path override to use Megatron-Bridge's pinned submodule commit
* Update Megatron-Bridge submodule to latest main with correct Megatron-LM commit (3cbe5c68)
* Add Mcore WAN pretrain mock test to CI/CD
* lintfix
* Fix slow Docker build from Megatron-LM source
* ci: Update gpu runners to use self-hosted-nemo (#48)
* Reapply "Revert GHA changes" (reverts commit fdb911f)
* update path per request
* lintfix

Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
CONTRIBUTING.md Outdated
Contributor

Is this still required? Probably not, right?

Contributor Author

We really need to treat re-reviewing the CONTRIBUTING.md file as a separate follow-up task and get it fully right in a follow-up PR; it'd probably take a few more tweaks. I suggest we merge this PR, since it's already been a lot of work involving several engineers :)

abhinavg4 previously approved these changes Nov 16, 2025
Contributor

@abhinavg4 left a comment

Looks good. Added a small comment.

@abhinavg4
Contributor

/ok to test 04d802e

Collaborator

@chtruong814 left a comment

Had some questions.

CONTRIBUTING.md Outdated
Collaborator

Do we actually need the settings for ipc and ulimit?

Contributor Author

Seems not. Removed

Collaborator

Why is this going back to copying all of the MBridge source code?

Contributor Author

I think this came in during conflict resolution on this PR. Removed.

Contributor

@abhinavg4 left a comment

Thanks

```
docker run --gpus all -v $(pwd):/opt/DFM -it dfm:latest bash
```

### Inside the container
Contributor

Remove this section please

@pablo-garay merged commit 56fdad7 into main on Nov 17, 2025
16 checks passed
@chtruong814 deleted the pablo-garay/mbridge-test-init branch on January 29, 2026 20:26
huvunvidia pushed a commit that referenced this pull request on Feb 12, 2026
* Explicit mcore path override to use Megatron-Bridge's pinned submodule commit

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* Update Megatron-Bridge submodule to latest main with correct Megatron-LM commit (3cbe5c68)

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* Add Mcore WAN pretrain mock test to CI/CD

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* lintfix

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* Fix slow Docker build from Megatron-LM source

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* ci: Update gpu runners to use self-hosted-nemo (#48)

* ci: Update gpu runners to use self-hosted-nemo

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Use uv run in test_mcore_wan_pretrain

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Ensure uv group megatron-bridge is used for test_mcore_wan_pretrain

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
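
For readers reproducing this locally, one plausible shape for that invocation, wrapped in Python to keep the examples in one language; pytest as the runner is an assumption, while the test path is the one requested earlier in this review:

```python
import subprocess

# Run the WAN pretrain test with uv's "megatron-bridge" dependency group
# active, so its extra dependencies are resolved into the environment.
subprocess.run(
    [
        "uv", "run", "--group", "megatron-bridge",
        "pytest", "tests/functional_tests/mcore/recipes/test_wan_pretrain.py",
    ],
    check=True,
)
```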

* Update TRANSFORMERS_OFFLINE environment variable to 0 and increase timeout in test_mcore_wan_pretrain

* Update TRANSFORMERS_OFFLINE environment variable to 0 and increase timeout in test_mcore_wan_pretrain

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
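
As a minimal sketch of what these two settings amount to in a pytest-based test; the decorator assumes the pytest-timeout plugin, and the test name and body are placeholders:

```python
import os

import pytest

# TRANSFORMERS_OFFLINE=0 lets Hugging Face Transformers download models and
# tokenizers from the Hub instead of requiring a pre-populated local cache.
os.environ["TRANSFORMERS_OFFLINE"] = "0"


@pytest.mark.timeout(300)  # 5-minute cap, matching the timeout settled on below
def test_mcore_wan_pretrain():
    ...  # placeholder; the real test drives the Mcore WAN pretrain recipe
```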

* Revert GHA changes

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Move uv run group call to L2_Mcore_Mock_Tests_GPU

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Set test back to 5 minute timeout

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Megatron fixes (#49)

* Enhance DiT and Wan layer specifications

- Updated `get_query_key_value_tensors` method in `dit_attention.py` to include an `output_gate` parameter and set `split_qkv` to default to `True`.
- Modified `WanLayerWithAdaLN` class in `wan_layer_spec.py` to add `rotary_pos_cos_sin` parameter for improved positional encoding handling.
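
Purely as an illustration of the two signature changes described above; everything beyond the parameter names and the `split_qkv=True` default is an assumption, not the actual NeMo DFM source:

```python
class SelfAttentionStub:  # hypothetical stand-in for the dit_attention.py class
    def get_query_key_value_tensors(
        self,
        hidden_states,
        output_gate=None,  # new parameter added by this commit
        split_qkv=True,    # now defaults to True
    ):
        ...


class WanLayerWithAdaLNStub:  # hypothetical stand-in for WanLayerWithAdaLN
    def forward(
        self,
        hidden_states,
        rotary_pos_cos_sin=None,  # new input for positional-encoding handling
    ):
        ...
```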

* Implement ProcessGroupCollection initialization in DiT and Wan models

- Added initialization of `pg_collection` in both `DiTCrossAttentionModel` and `WanModel` to ensure proper handling of process groups.
- This change checks if `pg_collection` exists and is not None before assigning it, enhancing the robustness of the models.
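
The guard described above is a common defensive-initialization pattern; a minimal sketch, assuming `pg_collection` arrives as a constructor argument (only the attribute and class names come from the commit message):

```python
class WanModel:  # hypothetical stand-in for the real model class
    def __init__(self, pg_collection=None):
        # Adopt the passed-in process-group collection only when one was
        # provided and the attribute is not already set.
        if pg_collection is not None and getattr(self, "pg_collection", None) is None:
            self.pg_collection = pg_collection
```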

* Update CONTRIBUTING.md to include detailed setup instructions for development environment and Docker container usage. Added sections for building and running the container, as well as setting the PYTHONPATH for DFM.
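
For the PYTHONPATH step, the equivalent effect from inside Python; /opt/DFM is the container mount point from the `docker run` command quoted earlier in this thread, and treating it as the import root is an assumption:

```python
import sys

# Make the DFM sources importable without installing the package,
# mirroring `export PYTHONPATH=/opt/DFM:$PYTHONPATH` in the container.
sys.path.insert(0, "/opt/DFM")
```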

* Refactor import statements in dit_model.py to streamline dependencies. Removed redundant import of ProcessGroupCollection, enhancing code clarity and maintainability.

* Refactor code style in DiT and Wan models

- Updated string quotes in `dit_model.py` and `wan_model.py` for consistency, changing from single to double quotes.
- Reformatted the `get_query_key_value_tensors` method call in `dit_attention.py` for improved readability by breaking it into multiple lines.

* Revert M4 changes

* Ruff

* Ruff

* Lint

---------

Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>

* Revert "Revert GHA changes"

This reverts commit 1aec54a4d19588a3038da3d922a33779d4c034d2.

* tempfortest: timeout setting

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* workflow dispatch

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* update

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* add logging

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* Update test configuration for Mcore WAN pretraining

- Increased the number of processes per node from 1 to 2 for distributed training.
- Set the number of training iterations to 10 to enhance the training process.
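
A hedged sketch of a launch matching those two settings; the entry-point script and its flags are hypothetical, and only the process count and iteration count come from this commit:

```python
import subprocess

# Distributed mock run: 2 ranks on one node, 10 training iterations.
subprocess.run(
    [
        "torchrun",
        "--nproc_per_node", "2",  # processes per node raised from 1 to 2
        "pretrain_wan.py",        # hypothetical entry point
        "--train-iters", "10",    # short run for the mock test
    ],
    check=True,
)
```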

* More changes

* Lint

---------

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>
Co-authored-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* Reapply "Revert GHA changes"

This reverts commit fdb911f.

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* update path per request

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* lintfix

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* update CONTRIBUTING.md

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* lintfix

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* adjustments

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* lintfix

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

---------

Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Co-authored-by: Charlie Truong <chtruong@nvidia.com>
Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>