feat: mbridge test init #47

Merged
pablo-garay merged 15 commits into main from pablo-garay/mbridge-test-init on Nov 17, 2025

Conversation

@pablo-garay
Contributor

No description provided.

@pablo-garay requested a review from a team as a code owner on November 14, 2025 22:26
@copy-pr-bot

copy-pr-bot bot commented Nov 14, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@pablo-garay
Contributor Author

/ok to test e387e66

@pablo-garay
Contributor Author

/ok to test 175b42d

@@ -0,0 +1,107 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
Contributor

Can we please move this to tests/functional_tests/mcore/recipes/test_wan_pretrain.py

Contributor

This

Contributor Author

done

Contributor Author

done

Contributor

@abhinavg4 left a comment

Please see the comments

@@ -1,4 +1,32 @@
# Contributing To NeMo DFM
## 🛠️ Setting Up Your Environment
Contributor

Make this correct. This is updated now.

Contributor Author

@pablo-garay Nov 16, 2025

I've updated. PTAL & lmk

@@ -0,0 +1,107 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
Contributor

This

@pablo-garay force-pushed the pablo-garay/mbridge-test-init branch from cffa69c to dfd8a00 on November 16, 2025 05:44
pablo-garay and others added 9 commits November 15, 2025 22:12

* Explicit mcore path override to use Megatron-Bridge's pinned submodule commit
* Update Megatron-Bridge submodule to latest main with correct Megatron-LM commit (3cbe5c68)
* Add Mcore WAN pretrain mock test to CI/CD
* lintfix
* Fix slow Docker build from Megatron-LM source
* ci: Update gpu runners to use self-hosted-nemo (#48)
* Reapply "Revert GHA changes" (reverts commit fdb911f)
* update path per request
* lintfix

Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
CONTRIBUTING.md Outdated
Contributor

Is this still required? Probably not, right?

Contributor Author

We really need to treat re-reviewing the CONTRIBUTING.md file as a separate follow-up task and get it fully right in a follow-up PR; it'd probably take a few more tweaks. I suggest we merge this PR, since it's already been a lot of work involving several engineers :)

abhinavg4 previously approved these changes Nov 16, 2025
Contributor

@abhinavg4 left a comment

Looks good. Added a small comment.

@abhinavg4
Contributor

/ok to test 04d802e

Collaborator

@chtruong814 left a comment

Had some questions.

CONTRIBUTING.md Outdated
Collaborator

Do we actually need the settings for ipc and ulimit?

Contributor Author

Seems not. Removed

Collaborator

Why is this going back to copying all of the MBridge source code?

Contributor Author

I think this came in during conflict resolution on this PR. Removed.

Contributor

@abhinavg4 left a comment

Thanks

```
docker run --gpus all -v $(pwd):/opt/DFM -it dfm:latest bash
```

### Inside the container
Contributor

Remove this section please

@pablo-garay merged commit 56fdad7 into main on Nov 17, 2025
16 checks passed
@chtruong814 deleted the pablo-garay/mbridge-test-init branch on January 29, 2026 20:26
huvunvidia pushed a commit that referenced this pull request on Feb 12, 2026
* Explicit mcore path override to use Megatron-Bridge's pinned submodule commit

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* Update Megatron-Bridge submodule to latest main with correct Megatron-LM commit (3cbe5c68)

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* Add Mcore WAN pretrain mock test to CI/CD

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* lintfix

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* Fix slow Docker build from Megatron-LM source

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* ci: Update gpu runners to use self-hosted-nemo (#48)

* ci: Update gpu runners to use self-hosted-nemo

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Use uv run in test_mcore_wan_pretrain

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Ensure uv group megatron-bridge is used for test_mcore_wan_pretrain

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
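
For readers reproducing this locally, one plausible shape for that invocation, wrapped in Python to keep the examples in one language; pytest as the runner is an assumption, while the test path is the one requested earlier in this review:

```python
import subprocess

# Run the WAN pretrain test with uv's "megatron-bridge" dependency group
# active, so its extra dependencies are resolved into the environment.
subprocess.run(
    [
        "uv", "run", "--group", "megatron-bridge",
        "pytest", "tests/functional_tests/mcore/recipes/test_wan_pretrain.py",
    ],
    check=True,
)
```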

* Update TRANSFORMERS_OFFLINE environment variable to 0 and increase timeout in test_mcore_wan_pretrain

* Update TRANSFORMERS_OFFLINE environment variable to 0 and increase timeout in test_mcore_wan_pretrain

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
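
As a minimal sketch of what these two settings amount to in a pytest-based test; the decorator assumes the pytest-timeout plugin, and the test name and body are placeholders:

```python
import os

import pytest

# TRANSFORMERS_OFFLINE=0 lets Hugging Face Transformers download models and
# tokenizers from the Hub instead of requiring a pre-populated local cache.
os.environ["TRANSFORMERS_OFFLINE"] = "0"


@pytest.mark.timeout(300)  # 5-minute cap, matching the timeout settled on below
def test_mcore_wan_pretrain():
    ...  # placeholder; the real test drives the Mcore WAN pretrain recipe
```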

* Revert GHA changes

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Move uv run group call to L2_Mcore_Mock_Tests_GPU

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Set test back to 5 minute timeout

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Megatron fixes (#49)

* Enhance DiT and Wan layer specifications

- Updated `get_query_key_value_tensors` method in `dit_attention.py` to include an `output_gate` parameter and set `split_qkv` to default to `True`.
- Modified `WanLayerWithAdaLN` class in `wan_layer_spec.py` to add `rotary_pos_cos_sin` parameter for improved positional encoding handling.
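
Purely as an illustration of the two signature changes described above; everything beyond the parameter names and the `split_qkv=True` default is an assumption, not the actual NeMo DFM source:

```python
class SelfAttentionStub:  # hypothetical stand-in for the dit_attention.py class
    def get_query_key_value_tensors(
        self,
        hidden_states,
        output_gate=None,  # new parameter added by this commit
        split_qkv=True,    # now defaults to True
    ):
        ...


class WanLayerWithAdaLNStub:  # hypothetical stand-in for WanLayerWithAdaLN
    def forward(
        self,
        hidden_states,
        rotary_pos_cos_sin=None,  # new input for positional-encoding handling
    ):
        ...
```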

* Implement ProcessGroupCollection initialization in DiT and Wan models

- Added initialization of `pg_collection` in both `DiTCrossAttentionModel` and `WanModel` to ensure proper handling of process groups.
- This change checks if `pg_collection` exists and is not None before assigning it, enhancing the robustness of the models.
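
The guard described above is a common defensive-initialization pattern; a minimal sketch, assuming `pg_collection` arrives as a constructor argument (only the attribute and class names come from the commit message):

```python
class WanModel:  # hypothetical stand-in for the real model class
    def __init__(self, pg_collection=None):
        # Adopt the passed-in process-group collection only when one was
        # provided and the attribute is not already set.
        if pg_collection is not None and getattr(self, "pg_collection", None) is None:
            self.pg_collection = pg_collection
```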

* Update CONTRIBUTING.md to include detailed setup instructions for development environment and Docker container usage. Added sections for building and running the container, as well as setting the PYTHONPATH for DFM.
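
For the PYTHONPATH step, the equivalent effect from inside Python; /opt/DFM is the container mount point from the `docker run` command quoted earlier in this thread, and treating it as the import root is an assumption:

```python
import sys

# Make the DFM sources importable without installing the package,
# mirroring `export PYTHONPATH=/opt/DFM:$PYTHONPATH` in the container.
sys.path.insert(0, "/opt/DFM")
```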

* Refactor import statements in dit_model.py to streamline dependencies. Removed redundant import of ProcessGroupCollection, enhancing code clarity and maintainability.

* Refactor code style in DiT and Wan models

- Updated string quotes in `dit_model.py` and `wan_model.py` for consistency, changing from single to double quotes.
- Reformatted the `get_query_key_value_tensors` method call in `dit_attention.py` for improved readability by breaking it into multiple lines.

* Revert M4 changes

* Ruff

* Ruff

* Lint

---------

Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>

* Revert "Revert GHA changes"

This reverts commit 1aec54a4d19588a3038da3d922a33779d4c034d2.

* tempfortest: timeout setting

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* workflow dispatch

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* update

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* add logging

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* Update test configuration for Mcore WAN pretraining

- Increased the number of processes per node from 1 to 2 for distributed training.
- Set the number of training iterations to 10 to enhance the training process.
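
A hedged sketch of a launch matching those two settings; the entry-point script and its flags are hypothetical, and only the process count and iteration count come from this commit:

```python
import subprocess

# Distributed mock run: 2 ranks on one node, 10 training iterations.
subprocess.run(
    [
        "torchrun",
        "--nproc_per_node", "2",  # processes per node raised from 1 to 2
        "pretrain_wan.py",        # hypothetical entry point
        "--train-iters", "10",    # short run for the mock test
    ],
    check=True,
)
```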

* More changes

* Lint

---------

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>
Co-authored-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* Reapply "Revert GHA changes"

This reverts commit fdb911f.

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* update path per request

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* lintfix

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* update CONTRIBUTING.md

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* lintfix

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* adjustments

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* lintfix

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

---------

Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Co-authored-by: Charlie Truong <chtruong@nvidia.com>
Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>