Conversation
/ok to test e387e66
/ok to test 175b42d
@@ -0,0 +1,107 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
Can we please move this to tests/functional_tests/mcore/recipes/test_wan_pretrain.py
abhinavg4 left a comment:
Please see the comments
@@ -1,4 +1,32 @@
# Contributing To NeMo DFM
## 🛠️ Setting Up Your Environment
Make this correct. This is updated now.
I've updated. PTAL & lmk
Force-pushed cffa69c to dfd8a00
Explicit mcore path override to use Megatron-Bridge's pinned submodule commit. Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Update Megatron-Bridge submodule to latest main with correct Megatron-LM commit (3cbe5c68). Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* ci: Update gpu runners to use self-hosted-nemo. Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Use uv run in test_mcore_wan_pretrain. Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Ensure uv group megatron-bridge is used for test_mcore_wan_pretrain. Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Update TRANSFORMERS_OFFLINE environment variable to 0 and increase timeout in test_mcore_wan_pretrain
* Update TRANSFORMERS_OFFLINE environment variable to 0 and increase timeout in test_mcore_wan_pretrain. Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Revert GHA changes. Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Move uv run group call to L2_Mcore_Mock_Tests_GPU. Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Set test back to 5 minute timeout. Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Megatron fixes (#49)
  * Enhance DiT and Wan layer specifications
    - Updated `get_query_key_value_tensors` method in `dit_attention.py` to include an `output_gate` parameter and set `split_qkv` to default to `True`.
    - Modified `WanLayerWithAdaLN` class in `wan_layer_spec.py` to add a `rotary_pos_cos_sin` parameter for improved positional encoding handling.
  * Implement ProcessGroupCollection initialization in DiT and Wan models
    - Added initialization of `pg_collection` in both `DiTCrossAttentionModel` and `WanModel` to ensure proper handling of process groups.
    - This change checks that `pg_collection` exists and is not None before assigning it, enhancing the robustness of the models.
  * Update CONTRIBUTING.md to include detailed setup instructions for the development environment and Docker container usage. Added sections for building and running the container, as well as setting the PYTHONPATH for DFM.
  * Refactor import statements in dit_model.py to streamline dependencies. Removed redundant import of ProcessGroupCollection, enhancing code clarity and maintainability.
  * Refactor code style in DiT and Wan models
    - Updated string quotes in `dit_model.py` and `wan_model.py` for consistency, changing from single to double quotes.
    - Reformatted the `get_query_key_value_tensors` method call in `dit_attention.py` for improved readability by breaking it into multiple lines.
  * Revert M4 changes
  * Ruff
  * Ruff
  * Lint
  Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>
* Revert "Revert GHA changes". This reverts commit d7ad1ab.
* tempfortest: timeout setting. Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* workflow dispatch. Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* update. Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* add logging. Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* Update test configuration for Mcore WAN pretraining
  - Increased the number of processes per node from 1 to 2 for distributed training.
  - Set the number of training iterations to 10.
* More changes
* Lint

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>
Co-authored-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
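The "Implement ProcessGroupCollection initialization" item above describes a guard that only assigns `pg_collection` when a collection was actually supplied. A minimal sketch of that pattern follows; the class name, constructor signature, and surrounding code are hypothetical stand-ins, not the actual `DiTCrossAttentionModel`/`WanModel` sources, which are not shown in this thread:

```python
# Hypothetical sketch of the pg_collection guard described in the commit message.
# The class and constructor signature are illustrative, not the actual DFM code.
from typing import Any, Optional


class SketchWanModel:
    def __init__(self, config: Any, pg_collection: Optional[Any] = None) -> None:
        self.config = config
        # Assign the process-group collection only if one was provided and is not None,
        # so callers that never set up distributed process groups are unaffected.
        if pg_collection is not None:
            self.pg_collection = pg_collection
```

The point of such a guard is that single-process or default-initialized models keep working when no process-group collection is passed in.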
Reapply "Revert GHA changes". This reverts commit fdb911f. Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Force-pushed f45d3c9 to d08b5af
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
…IDIA-NeMo/DFM into pablo-garay/mbridge-test-init
CONTRIBUTING.md (Outdated)
Is this still required? Probably not, right?
We really need to treat re-reviewing the CONTRIBUTING.md file as a separate follow-up task and make it right in a follow-up PR; it'd probably take a few tweaks to get it fully right. I suggest we merge this PR since it's already been a lot of work involving several engineers :)
abhinavg4 left a comment:
Looks good. Added a small comment
/ok to test 04d802e
chtruong814 left a comment:
Had some questions.
CONTRIBUTING.md (Outdated)
Do we actually need the settings for ipc and ulimit?
Seems not. Removed
docker/Dockerfile.ci (Outdated)
Why is this going back to copying all of the MBridge source code?
I think this came in as a conflict during PR resolution. Removed.
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
```
docker run --gpus all -v $(pwd):/opt/DFM -it dfm:latest bash
```

### Inside the container
Remove this section please
* Explicit mcore path override to use Megatron-Bridge's pinned submodule commit. Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* Update Megatron-Bridge submodule to latest main with correct Megatron-LM commit (3cbe5c68). Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* Add Mcore WAN pretrain mock test to CI/CD. Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* lintfix. Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* Fix slow Docker build from Megatron-LM source. Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* ci: Update gpu runners to use self-hosted-nemo (#48)
  * ci: Update gpu runners to use self-hosted-nemo. Signed-off-by: Charlie Truong <chtruong@nvidia.com>
  * Use uv run in test_mcore_wan_pretrain. Signed-off-by: Charlie Truong <chtruong@nvidia.com>
  * Ensure uv group megatron-bridge is used for test_mcore_wan_pretrain. Signed-off-by: Charlie Truong <chtruong@nvidia.com>
  * Update TRANSFORMERS_OFFLINE environment variable to 0 and increase timeout in test_mcore_wan_pretrain
  * Update TRANSFORMERS_OFFLINE environment variable to 0 and increase timeout in test_mcore_wan_pretrain. Signed-off-by: Charlie Truong <chtruong@nvidia.com>
  * Revert GHA changes. Signed-off-by: Charlie Truong <chtruong@nvidia.com>
  * Move uv run group call to L2_Mcore_Mock_Tests_GPU. Signed-off-by: Charlie Truong <chtruong@nvidia.com>
  * Set test back to 5 minute timeout. Signed-off-by: Charlie Truong <chtruong@nvidia.com>
  * Megatron fixes (#49)
    * Enhance DiT and Wan layer specifications
      - Updated `get_query_key_value_tensors` method in `dit_attention.py` to include an `output_gate` parameter and set `split_qkv` to default to `True`.
      - Modified `WanLayerWithAdaLN` class in `wan_layer_spec.py` to add a `rotary_pos_cos_sin` parameter for improved positional encoding handling.
    * Implement ProcessGroupCollection initialization in DiT and Wan models
      - Added initialization of `pg_collection` in both `DiTCrossAttentionModel` and `WanModel` to ensure proper handling of process groups.
      - This change checks that `pg_collection` exists and is not None before assigning it, enhancing the robustness of the models.
    * Update CONTRIBUTING.md to include detailed setup instructions for the development environment and Docker container usage. Added sections for building and running the container, as well as setting the PYTHONPATH for DFM.
    * Refactor import statements in dit_model.py to streamline dependencies. Removed redundant import of ProcessGroupCollection, enhancing code clarity and maintainability.
    * Refactor code style in DiT and Wan models
      - Updated string quotes in `dit_model.py` and `wan_model.py` for consistency, changing from single to double quotes.
      - Reformatted the `get_query_key_value_tensors` method call in `dit_attention.py` for improved readability by breaking it into multiple lines.
    * Revert M4 changes
    * Ruff
    * Ruff
    * Lint
    Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>
  * Revert "Revert GHA changes". This reverts commit 1aec54a4d19588a3038da3d922a33779d4c034d2.
  * tempfortest: timeout setting. Signed-off-by: Pablo Garay <pagaray@nvidia.com>
  * workflow dispatch. Signed-off-by: Pablo Garay <pagaray@nvidia.com>
  * update. Signed-off-by: Pablo Garay <pagaray@nvidia.com>
  * add logging. Signed-off-by: Pablo Garay <pagaray@nvidia.com>
  * Update test configuration for Mcore WAN pretraining
    - Increased the number of processes per node from 1 to 2 for distributed training.
    - Set the number of training iterations to 10.
  * More changes
  * Lint
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
  Signed-off-by: Pablo Garay <pagaray@nvidia.com>
  Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>
  Co-authored-by: Pablo Garay <pagaray@nvidia.com>
  Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* Reapply "Revert GHA changes". This reverts commit fdb911f. Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* update path per request. Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* lintfix. Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* update CONTRIBUTING.md. Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* lintfix. Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* adjustments. Signed-off-by: Pablo Garay <pagaray@nvidia.com>
* lintfix. Signed-off-by: Pablo Garay <pagaray@nvidia.com>

Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Co-authored-by: Charlie Truong <chtruong@nvidia.com>
Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>
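For the `get_query_key_value_tensors` change listed under "Enhance DiT and Wan layer specifications", here is a hedged sketch of what such a signature could look like. Only the method name, the new `output_gate` parameter, and the `split_qkv=True` default come from the commit message; the class, the fused projection, and the return layout are invented for illustration and are not the actual `dit_attention.py` code:

```python
# Illustrative sketch only: per the commit message, get_query_key_value_tensors
# gained an output_gate parameter and split_qkv now defaults to True.
# Everything else here (class, projection, return layout) is a made-up stand-in.
from typing import Optional

import torch


class SketchDiTAttention(torch.nn.Module):
    def __init__(self, hidden_size: int) -> None:
        super().__init__()
        # One fused projection that is later split into query, key, and value.
        self.linear_qkv = torch.nn.Linear(hidden_size, 3 * hidden_size)

    def get_query_key_value_tensors(
        self,
        hidden_states: torch.Tensor,
        output_gate: Optional[torch.Tensor] = None,  # new parameter per the commit message
        split_qkv: bool = True,  # now defaults to True per the commit message
    ):
        mixed = self.linear_qkv(hidden_states)
        if not split_qkv:
            # Return the fused projection untouched when the caller wants to split later.
            return mixed, output_gate
        query, key, value = mixed.chunk(3, dim=-1)
        return query, key, value, output_gate
```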
No description provided.