Wan's unit tests #43

Merged
huvunvidia merged 27 commits into main from huvu/mcore_wan_unit_tests
Nov 18, 2025

Conversation

@huvunvidia
Contributor

No description provided.

@copy-pr-bot

copy-pr-bot bot commented Nov 14, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@huvunvidia huvunvidia requested a review from abhinavg4 November 14, 2025 16:04
Contributor

@abhinavg4 abhinavg4 left a comment

Let's wait for @pablo-garay to fix the CI container before merging this.

@linnanwang
Contributor

@huvunvidia would it be possible for you to merge all unit tests into a single file? Right now we have 15 files for wan unit tests. Imagine that in the future we support 100 models; this will blow up the repo.

@huvunvidia
Contributor Author

huvunvidia commented Nov 14, 2025

@huvunvidia would it be possible for you to merge all unit tests into a single file? Right now we have 15 files for wan unit tests. Imagine that in the future we support 100 models; this will blow up the repo.

Hi @linnanwang, I believe this is the usual practice for unit and functional tests. For example: https://github.com/NVIDIA-NeMo/NeMo/tree/main/tests. We want separate files for separate features/components so the tests are easier to read and run.

@huvunvidia
Contributor Author

/ok to test 550f283

pablo-garay and others added 14 commits November 15, 2025 22:12
…e commit

Signed-off-by: Pablo Garay <pagaray@nvidia.com>
…-LM commit (3cbe5c68)

Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* ci: Update gpu runners to use self-hosted-nemo

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Use uv run in test_mcore_wan_pretrain

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Ensure uv group megatron-bridge is used for test_mcore_wan_pretrain

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Update TRANSFORMERS_OFFLINE environment variable to 0 and increase timeout in test_mcore_wan_pretrain

* Update TRANSFORMERS_OFFLINE environment variable to 0 and increase timeout in test_mcore_wan_pretrain

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Revert GHA changes

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Move uv run group call to L2_Mcore_Mock_Tests_GPU

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Set test back to 5 minute timeout

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Megatron fixes (#49)

* Enhance DiT and Wan layer specifications

- Updated `get_query_key_value_tensors` method in `dit_attention.py` to include an `output_gate` parameter and set `split_qkv` to default to `True`.
- Modified `WanLayerWithAdaLN` class in `wan_layer_spec.py` to add `rotary_pos_cos_sin` parameter for improved positional encoding handling.

* Implement ProcessGroupCollection initialization in DiT and Wan models

- Added initialization of `pg_collection` in both `DiTCrossAttentionModel` and `WanModel` to ensure proper handling of process groups.
- This change checks if `pg_collection` exists and is not None before assigning it, enhancing the robustness of the models.

* Update CONTRIBUTING.md to include detailed setup instructions for development environment and Docker container usage. Added sections for building and running the container, as well as setting the PYTHONPATH for DFM.

* Refactor import statements in dit_model.py to streamline dependencies. Removed redundant import of ProcessGroupCollection, enhancing code clarity and maintainability.

* Refactor code style in DiT and Wan models

- Updated string quotes in `dit_model.py` and `wan_model.py` for consistency, changing from single to double quotes.
- Reformatted the `get_query_key_value_tensors` method call in `dit_attention.py` for improved readability by breaking it into multiple lines.

* Revert M4 changes

* Ruff

* Ruff

* Lint

---------

Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>

* Revert "Revert GHA changes"

This reverts commit d7ad1ab.

* tempfortest: timeout setting

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* workflow dispatch

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* update

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* add logging

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* Update test configuration for Mcore WAN pretraining

- Increased the number of processes per node from 1 to 2 for distributed training.
- Set the number of training iterations to 10 to enhance the training process.

* More changes

* Lint

---------

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>
Co-authored-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
This reverts commit fdb911f.

Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
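
The "Megatron fixes (#49)" message above says `pg_collection` is checked for existence and for not being None before it is assigned. The PR page does not show that code, so the following is only a minimal sketch of one possible reading of the guard. The helper `ensure_pg_collection`, the two model classes, and the string placeholders are all hypothetical; nothing here is taken from `DiTCrossAttentionModel`, `WanModel`, or Megatron-Core.

```python
from typing import Any, Callable


def ensure_pg_collection(model: Any, default_factory: Callable[[], Any]) -> Any:
    """Assign model.pg_collection only when it is missing or None.

    Hypothetical helper: the real models may implement the check differently.
    """
    # getattr with a default covers both "attribute missing" and "attribute is None".
    existing = getattr(model, "pg_collection", None)
    if existing is None:
        model.pg_collection = default_factory()
    return model.pg_collection


class BareModel:
    """Hypothetical model that never set pg_collection."""


class PreconfiguredModel:
    """Hypothetical model whose parent class already set pg_collection."""

    def __init__(self) -> None:
        self.pg_collection = "groups-from-parent"  # placeholder value


if __name__ == "__main__":
    bare = BareModel()
    ready = PreconfiguredModel()
    # The placeholder factory stands in for whatever builds the default groups.
    print(ensure_pg_collection(bare, lambda: "default-groups"))
    print(ensure_pg_collection(ready, lambda: "default-groups"))
```

In this sketch an existing, non-None value is left untouched, and the default is built only when it is needed.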
@huvunvidia
Contributor Author

/ok to test 1b8c2d1

@huvunvidia
Contributor Author

/ok to test 166e809

@huvunvidia
Contributor Author

/ok to test c1bde61

Contributor

@abhinavg4 abhinavg4 left a comment

Looks good. Thanks

@huvunvidia
Contributor Author

/ok to test 2de3124

@huvunvidia huvunvidia enabled auto-merge (squash) November 18, 2025 18:20
@huvunvidia huvunvidia merged commit d0dbfaf into main Nov 18, 2025
15 checks passed
lbliii pushed a commit that referenced this pull request Nov 19, 2025
* adding tests

* ruff lint

* ruff lint

* ruff lint

* Explicit mcore path override to use Megatron-Bridge's pinned submodule commit

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* Update Megatron-Bridge submodule to latest main with correct Megatron-LM commit (3cbe5c68)

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* Add Mcore WAN pretrain mock test to CI/CD

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* lintfix

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* Fix slow Docker build from Megatron-LM source

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* ci: Update gpu runners to use self-hosted-nemo (#48)

* ci: Update gpu runners to use self-hosted-nemo

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Use uv run in test_mcore_wan_pretrain

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Ensure uv group megatron-bridge is used for test_mcore_wan_pretrain

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Update TRANSFORMERS_OFFLINE environment variable to 0 and increase timeout in test_mcore_wan_pretrain

* Update TRANSFORMERS_OFFLINE environment variable to 0 and increase timeout in test_mcore_wan_pretrain

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Revert GHA changes

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Move uv run group call to L2_Mcore_Mock_Tests_GPU

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Set test back to 5 minute timeout

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Megatron fixes (#49)

* Enhance DiT and Wan layer specifications

- Updated `get_query_key_value_tensors` method in `dit_attention.py` to include an `output_gate` parameter and set `split_qkv` to default to `True`.
- Modified `WanLayerWithAdaLN` class in `wan_layer_spec.py` to add `rotary_pos_cos_sin` parameter for improved positional encoding handling.

* Implement ProcessGroupCollection initialization in DiT and Wan models

- Added initialization of `pg_collection` in both `DiTCrossAttentionModel` and `WanModel` to ensure proper handling of process groups.
- This change checks if `pg_collection` exists and is not None before assigning it, enhancing the robustness of the models.

* Update CONTRIBUTING.md to include detailed setup instructions for development environment and Docker container usage. Added sections for building and running the container, as well as setting the PYTHONPATH for DFM.

* Refactor import statements in dit_model.py to streamline dependencies. Removed redundant import of ProcessGroupCollection, enhancing code clarity and maintainability.

* Refactor code style in DiT and Wan models

- Updated string quotes in `dit_model.py` and `wan_model.py` for consistency, changing from single to double quotes.
- Reformatted the `get_query_key_value_tensors` method call in `dit_attention.py` for improved readability by breaking it into multiple lines.

* Revert M4 changes

* Ruff

* Ruff

* Lint

---------

Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>

* Revert "Revert GHA changes"

This reverts commit d7ad1ab.

* tempfortest: timeout setting

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* workflow dispatch

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* update

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* add logging

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* Update test configuration for Mcore WAN pretraining

- Increased the number of processes per node from 1 to 2 for distributed training.
- Set the number of training iterations to 10 to enhance the training process.

* More changes

* Lint

---------

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>
Co-authored-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* Reapply "Revert GHA changes"

This reverts commit fdb911f.

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* update path per request

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* lintfix

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* update CONTRIBUTING.md

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* lintfix

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* adding uv run --group megatron-bridge

* update test

* ruff lint

* restore Dockerfile.ci

* update  .github/workflows/cicd-main.yml

---------

Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Co-authored-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
Co-authored-by: Pablo Garay <pagaray@nvidia.com>
Co-authored-by: Charlie Truong <chtruong@nvidia.com>
Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
@chtruong814 chtruong814 deleted the huvu/mcore_wan_unit_tests branch January 29, 2026 20:19
huvunvidia added a commit that referenced this pull request Feb 12, 2026
* adding tests

* ruff lint

* ruff lint

* ruff lint

* Explicit mcore path override to use Megatron-Bridge's pinned submodule commit

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* Update Megatron-Bridge submodule to latest main with correct Megatron-LM commit (3cbe5c68)

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* Add Mcore WAN pretrain mock test to CI/CD

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* lintfix

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* Fix slow Docker build from Megatron-LM source

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* ci: Update gpu runners to use self-hosted-nemo (#48)

* ci: Update gpu runners to use self-hosted-nemo

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Use uv run in test_mcore_wan_pretrain

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Ensure uv group megatron-bridge is used for test_mcore_wan_pretrain

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Update TRANSFORMERS_OFFLINE environment variable to 0 and increase timeout in test_mcore_wan_pretrain

* Update TRANSFORMERS_OFFLINE environment variable to 0 and increase timeout in test_mcore_wan_pretrain

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Revert GHA changes

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Move uv run group call to L2_Mcore_Mock_Tests_GPU

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Set test back to 5 minute timeout

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Megatron fixes (#49)

* Enhance DiT and Wan layer specifications

- Updated `get_query_key_value_tensors` method in `dit_attention.py` to include an `output_gate` parameter and set `split_qkv` to default to `True`.
- Modified `WanLayerWithAdaLN` class in `wan_layer_spec.py` to add `rotary_pos_cos_sin` parameter for improved positional encoding handling.

* Implement ProcessGroupCollection initialization in DiT and Wan models

- Added initialization of `pg_collection` in both `DiTCrossAttentionModel` and `WanModel` to ensure proper handling of process groups.
- This change checks if `pg_collection` exists and is not None before assigning it, enhancing the robustness of the models.

* Update CONTRIBUTING.md to include detailed setup instructions for development environment and Docker container usage. Added sections for building and running the container, as well as setting the PYTHONPATH for DFM.

* Refactor import statements in dit_model.py to streamline dependencies. Removed redundant import of ProcessGroupCollection, enhancing code clarity and maintainability.

* Refactor code style in DiT and Wan models

- Updated string quotes in `dit_model.py` and `wan_model.py` for consistency, changing from single to double quotes.
- Reformatted the `get_query_key_value_tensors` method call in `dit_attention.py` for improved readability by breaking it into multiple lines.

* Revert M4 changes

* Ruff

* Ruff

* Lint

---------

Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>

* Revert "Revert GHA changes"

This reverts commit 1aec54a4d19588a3038da3d922a33779d4c034d2.

* tempfortest: timeout setting

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* workflow dispatch

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* update

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* add logging

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* Update test configuration for Mcore WAN pretraining

- Increased the number of processes per node from 1 to 2 for distributed training.
- Set the number of training iterations to 10 to enhance the training process.

* More changes

* Lint

---------

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>
Co-authored-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* Reapply "Revert GHA changes"

This reverts commit 403efe34db36040b5ac4011f218b63ee723730af.

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* update path per request

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* lintfix

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* update CONTRIBUTING.md

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* lintfix

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* adding uv run --group megatron-bridge

* update test

* ruff lint

* restore Dockerfile.ci

* update  .github/workflows/cicd-main.yml

---------

Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Co-authored-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
Co-authored-by: Pablo Garay <pagaray@nvidia.com>
Co-authored-by: Charlie Truong <chtruong@nvidia.com>
Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>