ci: Update gpu runners to use self-hosted-nemo #48
Merged
abhinavg4 merged 18 commits into pablo-garay/mbridge-test-init on Nov 16, 2025
Conversation
Update TRANSFORMERS_OFFLINE environment variable to 0 and increase timeout in test_mcore_wan_pretrain
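For context on that commit: TRANSFORMERS_OFFLINE controls whether the `transformers` library may reach the Hugging Face Hub (0 allows network access, 1 forces cache-only mode), and subprocess timeouts are given in seconds. A minimal hedged sketch of what the change amounts to, assuming the test launches the run via subprocess; the command is a placeholder, not the repository's actual code:

```python
import os
import subprocess

# Hedged sketch, not the repository's exact code: pass TRANSFORMERS_OFFLINE=0
# to the child process so `transformers` may download from the Hub instead of
# failing on cache misses. The command below is a stand-in placeholder.
env = {**os.environ, "TRANSFORMERS_OFFLINE": "0"}
result = subprocess.run(
    ["python", "-c", "print('mock pretrain placeholder')"],
    env=env,
    capture_output=True,
    text=True,
    timeout=3000,  # seconds (50 minutes), per the commit above
)
print(result.stdout)
```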
abhinavg4 (Contributor) approved these changes on Nov 15, 2025 and left a comment:
I need to revert my changes.
```python
# Build the command for the mock run
cmd = [
    "uv",
    # ... (diff context truncated)
]
```
Contributor
Small request: can you add this to L2_Function_Tests_GPU_Wan_Mock_Data.sh, please? That way we use uv in one place only and it's not confusing. I verified that it works too.
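A hedged before/after sketch of the requested split, assuming the test drives the run through subprocess; `pretrain_mock.py` is an illustrative name, not the repository's actual file:

```python
import subprocess

# Before: the Python test itself prefixes the command with uv, so uv usage
# lives in more than one place (the test and the CI shell wrapper).
cmd_before = ["uv", "run", "--group", "megatron-bridge", "python", "pretrain_mock.py"]

# After: the test runs the script directly, and the CI wrapper
# (L2_Function_Tests_GPU_Wan_Mock_Data.sh, per the comment) becomes the
# single place that invokes `uv run --group megatron-bridge`.
cmd_after = ["python", "pretrain_mock.py"]

result = subprocess.run(cmd_after, capture_output=True, text=True, timeout=300)
```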
abhinavg4 (Contributor) requested changes on Nov 15, 2025 and left a comment:
Need to revert my changes before merging.
chtruong814 (Collaborator, Author) commented on Nov 15, 2025:
```diff
     capture_output=True,
     text=True,
-    timeout=300,  # 5 minute timeout
+    timeout=3000,  # 5 minute timeout
```
@abhinavg4 why did we need to change this?
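For reference on the diff above: subprocess.run's timeout argument is in seconds, so the comment on the new line is stale; 300 s is 5 minutes, while 3000 s is 50 minutes. A later commit in this PR ("Set test back to 5 minute timeout") restores the original value. A minimal self-contained illustration:

```python
import subprocess

FIVE_MINUTES = 5 * 60    # 300 seconds, the original value
FIFTY_MINUTES = 50 * 60  # 3000 seconds, what the edited line actually allows

# subprocess.run raises subprocess.TimeoutExpired if the child process
# outlives the limit; the placeholder command below finishes immediately.
result = subprocess.run(["echo", "ok"], capture_output=True, text=True, timeout=FIVE_MINUTES)
print(result.stdout.strip())  # -> ok
```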
Update TRANSFORMERS_OFFLINE environment variable to 0 and increase timeout in test_mcore_wan_pretrain
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
…nto chtruong/runner-update
- Increased the number of processes per node from 1 to 2 for distributed training.
- Set the number of training iterations to 10 to enhance the training process.
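A hedged sketch of what that configuration might look like when launching the mock run, assuming a torchrun-style launcher; the script name and the iteration flag are illustrative assumptions rather than the repository's exact interface:

```python
import subprocess

# Two processes per node (was 1) and a 10-iteration run, matching the
# commit message above. "pretrain_wan_mock.py" and "--train-iters" are
# hypothetical stand-ins for the real script and flag.
cmd = [
    "torchrun",
    "--nproc_per_node", "2",
    "pretrain_wan_mock.py",
    "--train-iters", "10",
]
result = subprocess.run(cmd, capture_output=True, text=True, timeout=300)
print(result.returncode)
```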
Contributor
/ok to tests a209623
Contributor
\ok to test a209623
Contributor
/ok to test f2a61c1
abhinavg4 (Contributor) approved these changes on Nov 16, 2025 and left a comment:
Looks good except the commented code, which should be uncommented.
abhinavg4 merged commit 1cb4679 into pablo-garay/mbridge-test-init
10 of 13 checks passed
pablo-garay added a commit that referenced this pull request on Nov 16, 2025
* ci: Update gpu runners to use self-hosted-nemo
* Use uv run in test_mcore_wan_pretrain
* Ensure uv group megatron-bridge is used for test_mcore_wan_pretrain
* Update TRANSFORMERS_OFFLINE environment variable to 0 and increase timeout in test_mcore_wan_pretrain
* Revert GHA changes
* Move uv run group call to L2_Mcore_Mock_Tests_GPU
* Set test back to 5 minute timeout
* Megatron fixes (#49):
  - Enhance DiT and Wan layer specifications: update `get_query_key_value_tensors` in `dit_attention.py` to take an `output_gate` parameter and default `split_qkv` to `True`; add a `rotary_pos_cos_sin` parameter to `WanLayerWithAdaLN` in `wan_layer_spec.py` for improved positional encoding handling.
  - Initialize `pg_collection` in both `DiTCrossAttentionModel` and `WanModel`, checking that it exists and is not None before assigning, for proper process-group handling.
  - Update CONTRIBUTING.md with setup instructions for the development environment and Docker container usage, including building and running the container and setting the PYTHONPATH for DFM.
  - Streamline imports in `dit_model.py` (drop the redundant `ProcessGroupCollection` import) and tidy code style: consistent double quotes in `dit_model.py` and `wan_model.py`, multi-line formatting for the `get_query_key_value_tensors` call in `dit_attention.py`.
  - Revert M4 changes; Ruff and lint passes.
* Revert "Revert GHA changes" (reverts commit d7ad1ab)
* tempfortest: timeout setting
* workflow dispatch
* update
* add logging
* Update test configuration for Mcore WAN pretraining: increase the number of processes per node from 1 to 2 for distributed training; set the number of training iterations to 10.
* More changes
* Lint

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>
Co-authored-by: Pablo Garay <pagaray@nvidia.com>
pablo-garay added a commit that referenced this pull request on Nov 16, 2025
pablo-garay added a commit that referenced this pull request on Nov 17, 2025
* Explicit mcore path override to use Megatron-Bridge's pinned submodule commit
* Update Megatron-Bridge submodule to latest main with correct Megatron-LM commit (3cbe5c68)
* Add Mcore WAN pretrain mock test to CI/CD
* lintfix
* Fix slow Docker build from Megatron-LM source
* ci: Update gpu runners to use self-hosted-nemo (#48)
* Reapply "Revert GHA changes" (reverts commit fdb911f)
* update path per request
* lintfix
* update CONTRIBUTING.md
* lintfix
* adjustments
* lintfix

Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Co-authored-by: Charlie Truong <chtruong@nvidia.com>
Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>
huvunvidia added a commit that referenced this pull request on Nov 18, 2025
* adding tests
* ruff lint
* Explicit mcore path override to use Megatron-Bridge's pinned submodule commit
* Update Megatron-Bridge submodule to latest main with correct Megatron-LM commit (3cbe5c68)
* Add Mcore WAN pretrain mock test to CI/CD
* lintfix
* Fix slow Docker build from Megatron-LM source
* ci: Update gpu runners to use self-hosted-nemo (#48)
* Reapply "Revert GHA changes" (reverts commit fdb911f)
* update path per request
* lintfix
* update CONTRIBUTING.md
* lintfix
* adding uv run --group megatron-bridge
* update test
* ruff lint
* restore Dockerfile.ci
* update .github/workflows/cicd-main.yml

Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Co-authored-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
Co-authored-by: Pablo Garay <pagaray@nvidia.com>
Co-authored-by: Charlie Truong <chtruong@nvidia.com>
Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>
lbliii pushed a commit that referenced this pull request on Nov 19, 2025
huvunvidia pushed a commit that referenced this pull request on Feb 12, 2026
huvunvidia added a commit that referenced this pull request on Feb 12, 2026