
DFM Performance Improvements#45

Merged
abhinavg4 merged 92 commits into main from pmannan/dfm_perf on Nov 21, 2025

Conversation

@parthmannan
Contributor

No description provided.

Huy Vu2 and others added 30 commits on October 30, 2025 at 07:32
- Added comments to clarify file purposes in example_commands.sh, inference_wan.py, pretrain_wan.py, wan_provider.py, wan_step.py, and wan.py.
- Introduced EnergonMultiModalDataModule for handling multimodal datasets in nemo_vfm.
- Created SequentialMegatronSampler for efficient sequential sampling in large datasets (see the sketch after this commit message).
- Added new files for DIT attention and base data handling.

This commit enhances documentation and introduces new functionalities for better data management and processing.
Signed-off-by: Parth Mannan <pmannan@nvidia.com>
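
For context, a sequential Megatron-style sampler typically hands each data-parallel rank a contiguous shard of indices instead of shuffling. A minimal sketch of that idea follows; the class name, constructor parameters, and batching details are illustrative assumptions, not the PR's actual SequentialMegatronSampler.

```python
from typing import Iterator

from torch.utils.data import Sampler


class SequentialShardSampler(Sampler[list[int]]):
    """Illustrative sketch: each data-parallel rank reads a contiguous
    shard of the dataset sequentially, yielding micro-batches of indices."""

    def __init__(self, total_samples: int, rank: int, world_size: int, micro_batch_size: int):
        self.total_samples = total_samples
        self.rank = rank
        self.world_size = world_size
        self.micro_batch_size = micro_batch_size

    def __len__(self) -> int:
        shard = self.total_samples // self.world_size
        return shard // self.micro_batch_size

    def __iter__(self) -> Iterator[list[int]]:
        # Contiguous per-rank shard: rank 0 gets [0, shard), rank 1 the next, etc.
        shard = self.total_samples // self.world_size
        start = self.rank * shard
        batch: list[int] = []
        for idx in range(start, start + shard):
            batch.append(idx)
            if len(batch) == self.micro_batch_size:
                yield batch
                batch = []
```

Sequential rather than shuffled access matters for large webdataset-style video shards, where random reads defeat prefetching.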
@parthmannan
Contributor Author

/ok to test 04f6c14

Signed-off-by: Parth Mannan <pmannan@nvidia.com>
@parthmannan
Contributor Author

/ok to test 7247bc7

Signed-off-by: Parth Mannan <pmannan@nvidia.com>
@parthmannan
Contributor Author

/ok to test 7fedda7

@abhinavg4 (Contributor) left a comment


Looks good. Thanks a ton for your help.

@abhinavg4 abhinavg4 merged commit 2eb57c2 into main Nov 21, 2025
16 checks passed
lbliii pushed a commit that referenced this pull request Dec 3, 2025
* first commit

* workable code

* workable thd

* clean up; remove all CP for sbhd, CP is now only for thd

* run outside of Mbridge

* Update example scripts and add new data module for multimodal datasets

- Added comments to clarify file purposes in example_commands.sh, inference_wan.py, pretrain_wan.py, wan_provider.py, wan_step.py, and wan.py.
- Introduced EnergonMultiModalDataModule for handling multimodal datasets in nemo_vfm.
- Created SequentialMegatronSampler for efficient sequential sampling in large datasets.
- Added new files for DIT attention and base data handling.

This commit enhances documentation and introduces new functionalities for better data management and processing.

* workable code before refactoring

* refactor attention submodules + reorder files locations

* update refactor

* update refactor

* reorganize files

* reorganize files

* refactoring code

* add README for perf test

* using vae, t5, scheduler from Diffusers

* update repo, remove Wan's GitHub modules

* fix Ruff

* fix ruff + copyright

* fix Ruff + Lint

* fix Ruff + Lint

* fix Ruff + Lint

* fix Ruff + Lint

* fix Ruff + Lint

* fix Ruff + Lint

* fix Ruff + Lint

* fix Ruff + Lint

* merged main + address comments

* remove example_commands.md, Google waits until mid Nov

* refactor inference_configs + mockdatamodule

* add dit_embeddings.py

* fix lint ruff

* add 'average_gradients_across_tp_domain' to torch.nn for when running sequence_parallelism
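
When sequence parallelism is enabled, parameters that are replicated across the tensor-parallel group (e.g., LayerNorm weights and biases) receive gradients from different token shards on each rank, so those gradients must be averaged across the TP domain before the optimizer step. A minimal sketch of that pattern, assuming a torch.distributed TP process group; the body is illustrative, not the PR's code:

```python
import torch.distributed as dist


def average_gradients_across_tp_domain(params, tp_group) -> None:
    """Average grads of TP-replicated params over the tensor-parallel group."""
    tp_world_size = dist.get_world_size(group=tp_group)
    for param in params:
        if param.grad is not None:
            # Sum the shard-local gradients, then normalize to an average.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, group=tp_group)
            param.grad.div_(tp_world_size)
```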

* add english negative prompt

* fix ruff lint

* Update uv.lock for deps: diffusers==0.35.1, easydict, imageio

* update dfm/src/megatron/data/dit

* change english negative prompt

* seemingly workable seq_packing

* refactor with Sajad's PR - DiT data to common dir

* fix Ruff, lint

* fix Ruff, lint

* fix Ruff, lint

* workable mock datamodule (no dataset path needed); updated training algorithm + hyper-parameters to align with Linnan's; tested training with anime-dataset finetuning

* bring wan_task encoders features to common, sharing with dit

* lint, ruff

* lint, ruff

* lint, ruff

* fix CP error (input of thd_split_inputs_cp to be cu_seqlens_q_padded instead of cu_seqlens_q)
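
To see why the padded offsets matter: context parallelism slices every packed (THD) sequence into equal chunks per CP rank, and those chunks only line up if all ranks slice on the padded boundaries the attention kernel uses. A runnable toy illustration follows; the helper below is a hypothetical stand-in for the real thd_split_inputs_cp, which uses a more elaborate load-balanced split.

```python
import torch

# Three packed sequences of lengths 5, 3, 7, padded to multiples of 4.
cu_seqlens_q = torch.tensor([0, 5, 8, 15], dtype=torch.int32)
cu_seqlens_q_padded = torch.tensor([0, 8, 12, 20], dtype=torch.int32)


def thd_split_inputs_cp_sketch(tokens, cu_seqlens, cp_rank, cp_size):
    """Keep this CP rank's chunk of every packed sequence."""
    chunks = []
    for start, end in zip(cu_seqlens[:-1].tolist(), cu_seqlens[1:].tolist()):
        seq = tokens[start:end]
        per_rank = seq.shape[0] // cp_size
        chunks.append(seq[cp_rank * per_rank:(cp_rank + 1) * per_rank])
    return torch.cat(chunks)


tokens = torch.arange(20)
# Correct: padded offsets, so chunk sizes divide evenly and all ranks agree.
local = thd_split_inputs_cp_sketch(tokens, cu_seqlens_q_padded, cp_rank=0, cp_size=2)
# Buggy (what this commit fixes): passing the unpadded cu_seqlens_q slices
# mid-sequence and disagrees with the padded layout the attention kernel sees.
```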

* update README_perf_test.md

* fix lint, ruff

* update uv.lock, merge main

* uv.lock

* uv.lock

* uv.lock

* update uv.lock [using ci]

* Performance improvements to Wan

* Perf optimizations

* Tiny fix

* Remove CP disable as packed sequences not supported

* Fix comment

* Minor fixes. Revert video_latent comparison

* Fix missed check

* Lint fix

* H100 mock pretraining perf config

* Rename config file

* Lint check

Signed-off-by: Parth Mannan <pmannan@nvidia.com>

* Adding GB200 perf config

Signed-off-by: Parth Mannan <pmannan@nvidia.com>

* GB300 perf config

Signed-off-by: Parth Mannan <pmannan@nvidia.com>

* Refactor Energon data module to return wrapped dataloaders and add EnergonDataloader class for cyclic iteration. Introduce WAN pretrain mock data configuration for testing.
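
A cyclic dataloader wrapper of this kind is typically a thin shim: Megatron-style loops call next() exactly train_iters times rather than iterating over epochs, so the wrapper restarts the underlying loader on StopIteration. A minimal sketch under that assumption; the class name and details are illustrative, not the PR's EnergonDataloader:

```python
from typing import Any, Iterator


class CyclicDataloader:
    """Illustrative wrapper that never raises StopIteration."""

    def __init__(self, dataloader: Any) -> None:
        self._dataloader = dataloader
        self._iterator: Iterator = iter(dataloader)

    def __iter__(self) -> "CyclicDataloader":
        return self

    def __next__(self):
        try:
            return next(self._iterator)
        except StopIteration:
            # Exhausted one pass over the data: restart and keep serving.
            self._iterator = iter(self._dataloader)
            return next(self._iterator)
```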

* Enhance DiffusionTaskEncoder to handle None attributes in stacking and concatenation methods. Add WAN pretrain mock data configuration for testing purposes.
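
Handling None attributes in batching usually means treating an optional per-sample field as absent for the whole micro-batch when any sample lacks it. A small sketch of that convention; the helper name is hypothetical and the real DiffusionTaskEncoder logic may differ:

```python
import torch


def stack_optional(values: list) -> "torch.Tensor | None":
    """Stack a batch field, tolerating None entries for optional modalities."""
    if any(v is None for v in values):
        return None  # drop the optional field for this batch
    return torch.stack(values)


latents = stack_optional([torch.zeros(3), torch.ones(3)])  # -> 2x3 tensor
text_emb = stack_optional([None, torch.ones(3)])           # -> None
```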

* Refactor data processing in dit_data_step to simplify batch retrieval and update WAN pretrain configuration to include train_iters.

* Add op fusions

Signed-off-by: Parth Mannan <pmannan@nvidia.com>
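
"Op fusions" here refers to collapsing chains of elementwise ops into a single kernel. As one hedged illustration of the genre (not necessarily the fusions this commit adds), Megatron-style stacks often JIT-fuse the bias-add with the tanh-approximated GELU:

```python
import torch


@torch.jit.script
def fused_bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # One scripted kernel instead of separate add/mul/tanh launches.
    y = x + bias
    return 0.5 * y * (1.0 + torch.tanh(0.79788456 * (y + 0.044715 * y * y * y)))
```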

* Update H100 config

Signed-off-by: Parth Mannan <pmannan@nvidia.com>

* Fix lint

Signed-off-by: Parth Mannan <pmannan@nvidia.com>

* Resolve conflict

Signed-off-by: Parth Mannan <pmannan@nvidia.com>

* Fix for mock dataloader test

Signed-off-by: Parth Mannan <pmannan@nvidia.com>

* Fix Dummyiter

Signed-off-by: Parth Mannan <pmannan@nvidia.com>

* Fix test

Signed-off-by: Parth Mannan <pmannan@nvidia.com>

* Make RoPE test GPU-only

Signed-off-by: Parth Mannan <pmannan@nvidia.com>
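
Gating a test to GPU-only is conventionally done with a pytest skip marker, so CPU-only CI runners skip it cleanly instead of failing. A minimal sketch; the test body is a placeholder, not the actual RoPE test:

```python
import pytest
import torch

requires_gpu = pytest.mark.skipif(not torch.cuda.is_available(), reason="requires CUDA")


@requires_gpu
def test_rope_on_gpu():
    x = torch.randn(2, 8, 16, device="cuda")
    assert x.is_cuda  # placeholder assertion; the real test checks RoPE outputs
```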

* Rope cuda fix

Signed-off-by: Parth Mannan <pmannan@nvidia.com>

---------

Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Co-authored-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>
Co-authored-by: root <root@eos0025.eos.clusters.nvidia.com>
Co-authored-by: root <root@eos0558.eos.clusters.nvidia.com>
Co-authored-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
@chtruong814 chtruong814 deleted the pmannan/dfm_perf branch January 29, 2026 20:26
huvunvidia pushed a commit that referenced this pull request Feb 12, 2026
