Conversation
abhinavg4 left a comment
Let's wait for @pablo-garay to fix the CI container before merging this.
linnanwang left a comment
@huvunvidia would it be possible for you to merge all the unit tests into a single file? Right now we have 15 files for WAN unit tests. Imagine that in the future we support 100 models; this will blow up the repo.
huvunvidia left a comment
Hi @linnanwang, I believe this is the usual practice for unit and functional tests. For example: https://github.com/NVIDIA-NeMo/NeMo/tree/main/tests. We want separate files for separate features/components so they are easy to read and test.
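For illustration of that layout, here is a minimal, hypothetical per-component test file in the same spirit; the module path, function name, and tensor shapes are made up and are not the actual DFM/WAN test code:

```python
# tests/unit/wan/test_adaln_modulation.py (hypothetical path, for illustration only)
import pytest

torch = pytest.importorskip("torch")  # skip gracefully where torch is unavailable


def test_adaln_shift_scale_shapes():
    """One narrowly scoped check per file keeps each component's tests easy to find."""
    batch, hidden = 2, 8
    emb = torch.randn(batch, hidden)      # stand-in for a conditioning embedding
    shift, scale = emb.chunk(2, dim=-1)   # split into shift/scale halves
    assert shift.shape == scale.shape == (batch, hidden // 2)
```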
/ok to test 550f283
…e commit Signed-off-by: Pablo Garay <pagaray@nvidia.com>
…-LM commit (3cbe5c68) Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* ci: Update gpu runners to use self-hosted-nemo Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Use uv run in test_mcore_wan_pretrain Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Ensure uv group megatron-bridge is used for test_mcore_wan_pretrain Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Update TRANSFORMERS_OFFLINE environment variable to 0 and increase timeout in test_mcore_wan_pretrain
* Update TRANSFORMERS_OFFLINE environment variable to 0 and increase timeout in test_mcore_wan_pretrain Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Revert GHA changes Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Move uv run group call to L2_Mcore_Mock_Tests_GPU Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Set test back to 5 minute timeout Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Megatron fixes (#49)
* Enhance DiT and Wan layer specifications
  - Updated `get_query_key_value_tensors` method in `dit_attention.py` to include an `output_gate` parameter and set `split_qkv` to default to `True`.
  - Modified `WanLayerWithAdaLN` class in `wan_layer_spec.py` to add `rotary_pos_cos_sin` parameter for improved positional encoding handling.
* Implement ProcessGroupCollection initialization in DiT and Wan models
  - Added initialization of `pg_collection` in both `DiTCrossAttentionModel` and `WanModel` to ensure proper handling of process groups.
  - This change checks if `pg_collection` exists and is not None before assigning it, enhancing the robustness of the models.
* Update CONTRIBUTING.md to include detailed setup instructions for development environment and Docker container usage. Added sections for building and running the container, as well as setting the PYTHONPATH for DFM.
* Refactor import statements in dit_model.py to streamline dependencies. Removed redundant import of ProcessGroupCollection, enhancing code clarity and maintainability.
* Refactor code style in DiT and Wan models
  - Updated string quotes in `dit_model.py` and `wan_model.py` for consistency, changing from single to double quotes.
  - Reformatted the `get_query_key_value_tensors` method call in `dit_attention.py` for improved readability by breaking it into multiple lines.
* Revert M4 changes
* Ruff
* Ruff
* Lint
--------- Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>
* Revert "Revert GHA changes" This reverts commit d7ad1ab.
* tempfortest: timeout setting Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* workflow dispatch Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* update Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* add logging Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* Update test configuration for Mcore WAN pretraining
  - Increased the number of processes per node from 1 to 2 for distributed training.
  - Set the number of training iterations to 10 to enhance the training process.
* More changes
* Lint
--------- Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>
Co-authored-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
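The `pg_collection` bullets above describe a guarded-initialization pattern. A minimal sketch of one plausible reading of that guard, with a placeholder class name standing in for `DiTCrossAttentionModel` / `WanModel` (the real code may differ in detail):

```python
# Illustrative sketch only; class and argument names are placeholders.
class ToyDiffusionModel:
    def __init__(self, config, pg_collection=None):
        self.config = config
        # Guarded assignment: only attach a collection that actually exists and
        # is not None, so an unset argument never clobbers existing model state.
        if getattr(self, "pg_collection", None) is None and pg_collection is not None:
            self.pg_collection = pg_collection
```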
This reverts commit fdb911f. Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
…IDIA-NeMo/DFM into pablo-garay/mbridge-test-init
/ok to test 1b8c2d1
/ok to test 166e809
/ok to test c1bde61
/ok to test 2de3124
* adding tests
* ruff lint
* ruff lint
* ruff lint
* Explicit mcore path override to use Megatron-Bridge's pinned submodule commit Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* Update Megatron-Bridge submodule to latest main with correct Megatron-LM commit (3cbe5c68) Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* Add Mcore WAN pretrain mock test to CI/CD Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* lintfix Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* Fix slow Docker build from Megatron-LM source Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* ci: Update gpu runners to use self-hosted-nemo (#48)
* ci: Update gpu runners to use self-hosted-nemo Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Use uv run in test_mcore_wan_pretrain Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Ensure uv group megatron-bridge is used for test_mcore_wan_pretrain Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Update TRANSFORMERS_OFFLINE environment variable to 0 and increase timeout in test_mcore_wan_pretrain
* Update TRANSFORMERS_OFFLINE environment variable to 0 and increase timeout in test_mcore_wan_pretrain Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Revert GHA changes Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Move uv run group call to L2_Mcore_Mock_Tests_GPU Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Set test back to 5 minute timeout Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Megatron fixes (#49)
* Enhance DiT and Wan layer specifications
  - Updated `get_query_key_value_tensors` method in `dit_attention.py` to include an `output_gate` parameter and set `split_qkv` to default to `True`.
  - Modified `WanLayerWithAdaLN` class in `wan_layer_spec.py` to add `rotary_pos_cos_sin` parameter for improved positional encoding handling.
* Implement ProcessGroupCollection initialization in DiT and Wan models
  - Added initialization of `pg_collection` in both `DiTCrossAttentionModel` and `WanModel` to ensure proper handling of process groups.
  - This change checks if `pg_collection` exists and is not None before assigning it, enhancing the robustness of the models.
* Update CONTRIBUTING.md to include detailed setup instructions for development environment and Docker container usage. Added sections for building and running the container, as well as setting the PYTHONPATH for DFM.
* Refactor import statements in dit_model.py to streamline dependencies. Removed redundant import of ProcessGroupCollection, enhancing code clarity and maintainability.
* Refactor code style in DiT and Wan models
  - Updated string quotes in `dit_model.py` and `wan_model.py` for consistency, changing from single to double quotes.
  - Reformatted the `get_query_key_value_tensors` method call in `dit_attention.py` for improved readability by breaking it into multiple lines.
* Revert M4 changes
* Ruff
* Ruff
* Lint
--------- Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>
* Revert "Revert GHA changes" This reverts commit d7ad1ab.
* tempfortest: timeout setting Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* workflow dispatch Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* update Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* add logging Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* Update test configuration for Mcore WAN pretraining
  - Increased the number of processes per node from 1 to 2 for distributed training.
  - Set the number of training iterations to 10 to enhance the training process.
* More changes
* Lint
--------- Signed-off-by: Charlie Truong <chtruong@nvidia.com> Signed-off-by: Pablo Garay <pagaray@nvidia.com> Co-authored-by: Abhinav Garg <abhinavg@stanford.edu> Co-authored-by: Pablo Garay <pagaray@nvidia.com> Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* Reapply "Revert GHA changes" This reverts commit fdb911f. Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* update path per request Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* lintfix Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* update CONTRIBUTING.md Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* lintfix Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* adding v run --group megatron-bridge
* update test
* ruff lint
* restore Dockerfile.ci
* update .github/workflows/cicd-main.yml
--------- Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Co-authored-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
Co-authored-by: Pablo Garay <pagaray@nvidia.com>
Co-authored-by: Charlie Truong <chtruong@nvidia.com>
Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
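The test-configuration bullets above (2 processes per node, 10 training iterations) could be exercised with a launcher along these lines; the script name and its CLI flags below are placeholders, not the actual `test_mcore_wan_pretrain` entry point:

```python
# Hypothetical launcher sketch; adjust to the real pretrain script and its flags.
import subprocess
import sys


def run_mock_pretrain(script="pretrain_wan_mock.py", nproc_per_node=2, train_iters=10, timeout_s=300):
    cmd = [
        sys.executable, "-m", "torch.distributed.run",  # the module torchrun wraps
        f"--nproc_per_node={nproc_per_node}",           # 2 processes per node
        script,
        f"--train-iters={train_iters}",                 # short 10-iteration smoke run
        "--mock-data",                                  # placeholder flag for mock data
    ]
    # check=True fails the CI step on a non-zero exit; timeout guards against hangs.
    subprocess.run(cmd, check=True, timeout=timeout_s)


if __name__ == "__main__":
    run_mock_pretrain()
```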
No description provided.