
DFM Performance Improvements#45

Merged
abhinavg4 merged 92 commits into main from pmannan/dfm_perf on Nov 21, 2025

Conversation

@parthmannan
Contributor

No description provided.

Huy Vu2 and others added 30 commits on October 30, 2025 at 07:32
- Added comments to clarify file purposes in example_commands.sh, inference_wan.py, pretrain_wan.py, wan_provider.py, wan_step.py, and wan.py.
- Introduced EnergonMultiModalDataModule for handling multimodal datasets in nemo_vfm.
- Created SequentialMegatronSampler for efficient sequential sampling in large datasets (see the sketch after this commit message).
- Added new files for DIT attention and base data handling.

This commit enhances documentation and introduces new functionalities for better data management and processing.
Signed-off-by: Parth Mannan <pmannan@nvidia.com>
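
For context, a sequential Megatron-style sampler typically hands each data-parallel rank a contiguous shard of indices instead of shuffling. A minimal sketch of that idea follows; the class name, constructor parameters, and batching details are illustrative assumptions, not the PR's actual SequentialMegatronSampler.

```python
from typing import Iterator

from torch.utils.data import Sampler


class SequentialShardSampler(Sampler[list[int]]):
    """Illustrative sketch: each data-parallel rank reads a contiguous
    shard of the dataset sequentially, yielding micro-batches of indices."""

    def __init__(self, total_samples: int, rank: int, world_size: int, micro_batch_size: int):
        self.total_samples = total_samples
        self.rank = rank
        self.world_size = world_size
        self.micro_batch_size = micro_batch_size

    def __len__(self) -> int:
        shard = self.total_samples // self.world_size
        return shard // self.micro_batch_size

    def __iter__(self) -> Iterator[list[int]]:
        # Contiguous per-rank shard: rank 0 gets [0, shard), rank 1 the next, etc.
        shard = self.total_samples // self.world_size
        start = self.rank * shard
        batch: list[int] = []
        for idx in range(start, start + shard):
            batch.append(idx)
            if len(batch) == self.micro_batch_size:
                yield batch
                batch = []
```

Sequential rather than shuffled access matters for large webdataset-style video shards, where random reads defeat prefetching.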
@parthmannan
Contributor Author

/ok to test 04f6c14

Signed-off-by: Parth Mannan <pmannan@nvidia.com>
@parthmannan
Contributor Author

/ok to test 7247bc7

Signed-off-by: Parth Mannan <pmannan@nvidia.com>
@parthmannan
Contributor Author

/ok to test 7fedda7

@abhinavg4 (Contributor) left a comment


Looks good. Thanks a ton for your help.

@abhinavg4 abhinavg4 merged commit 2eb57c2 into main Nov 21, 2025
16 checks passed
lbliii pushed a commit that referenced this pull request Dec 3, 2025
* first commit

* workable code

* workable thd

* clean up; remove all CP for sbhd, CP is now only for thd

* run outside of Mbridge

* Update example scripts and add new data module for multimodal datasets

- Added comments to clarify file purposes in example_commands.sh, inference_wan.py, pretrain_wan.py, wan_provider.py, wan_step.py, and wan.py.
- Introduced EnergonMultiModalDataModule for handling multimodal datasets in nemo_vfm.
- Created SequentialMegatronSampler for efficient sequential sampling in large datasets.
- Added new files for DIT attention and base data handling.

This commit enhances documentation and introduces new functionalities for better data management and processing.

* workable code before refactoring

* refactor attention submodules + reorder files locations

* update refactor

* update refactor

* reorganize files

* reorganize files

* refactoring code

* add README for perf test

* using vae, t5, scheduler from Diffusers

* update repo, remove Wan's GitHub modules

* fix Ruff

* fix ruff + copyright

* fix Ruff + Lint

* fix Ruff + Lint

* fix Ruff + Lint

* fix Ruff + Lint

* fix Ruff + Lint

* fix Ruff + Lint

* fix Ruff + Lint

* fix Ruff + Lint

* merged main + address comments

* remove example_commands.md, Google waits until mid Nov

* refactor inference_configs + mockdatamodule

* add dit_embeddings.py

* fix lint ruff

* add 'average_gradients_across_tp_domain' to torch.nn for when running sequence_parallelism
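
When sequence parallelism is enabled, parameters that are replicated across the tensor-parallel group (e.g., LayerNorm weights and biases) receive gradients from different token shards on each rank, so those gradients must be averaged across the TP domain before the optimizer step. A minimal sketch of that pattern, assuming a torch.distributed TP process group; the body is illustrative, not the PR's code:

```python
import torch.distributed as dist


def average_gradients_across_tp_domain(params, tp_group) -> None:
    """Average grads of TP-replicated params over the tensor-parallel group."""
    tp_world_size = dist.get_world_size(group=tp_group)
    for param in params:
        if param.grad is not None:
            # Sum the shard-local gradients, then normalize to an average.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, group=tp_group)
            param.grad.div_(tp_world_size)
```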

* add english negative prompt

* fix ruff lint

* Update uv.lock for deps: diffusers==0.35.1, easydict, imageio

* update dfm/src/megatron/data/dit

* change english negative prompt

* seemingly workable seq_packing

* refactor with Sajad's PR - DiT data to common dir

* fix Ruff, lint

* fix Ruff, lint

* fix Ruff, lint

* workable mock datamodule (no dataset path needed); updated training algorithm + hyper-parameters to align with Linnan's; tested training with anime-dataset finetuning

* bring wan_task encoders features to common, sharing with dit

* lint, ruff

* lint, ruff

* lint, ruff

* fix CP error (input of thd_split_inputs_cp to be cu_seqlens_q_padded instead of cu_seqlens_q)
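
To see why the padded offsets matter: context parallelism slices every packed (THD) sequence into equal chunks per CP rank, and those chunks only line up if all ranks slice on the padded boundaries the attention kernel uses. A runnable toy illustration follows; the helper below is a hypothetical stand-in for the real thd_split_inputs_cp, which uses a more elaborate load-balanced split.

```python
import torch

# Three packed sequences of lengths 5, 3, 7, padded to multiples of 4.
cu_seqlens_q = torch.tensor([0, 5, 8, 15], dtype=torch.int32)
cu_seqlens_q_padded = torch.tensor([0, 8, 12, 20], dtype=torch.int32)


def thd_split_inputs_cp_sketch(tokens, cu_seqlens, cp_rank, cp_size):
    """Keep this CP rank's chunk of every packed sequence."""
    chunks = []
    for start, end in zip(cu_seqlens[:-1].tolist(), cu_seqlens[1:].tolist()):
        seq = tokens[start:end]
        per_rank = seq.shape[0] // cp_size
        chunks.append(seq[cp_rank * per_rank:(cp_rank + 1) * per_rank])
    return torch.cat(chunks)


tokens = torch.arange(20)
# Correct: padded offsets, so chunk sizes divide evenly and all ranks agree.
local = thd_split_inputs_cp_sketch(tokens, cu_seqlens_q_padded, cp_rank=0, cp_size=2)
# Buggy (what this commit fixes): passing the unpadded cu_seqlens_q slices
# mid-sequence and disagrees with the padded layout the attention kernel sees.
```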

* update README_perf_test.md

* fix lint, ruff

* update uv.lock, merge main

* uv.lock

* uv.lock

* uv.lock

* update uv.lock [using ci]

* Performance improvements to Wan

* Perf optimizations

* Tiny fix

* Remove CP disable as packed sequences not supported

* Fix comment

* Minor fixes. Revert video_latent comparison

* Fix missed check

* Lint fix

* H100 mock pretraining perf config

* Rename config file

* Lint check

Signed-off-by: Parth Mannan <pmannan@nvidia.com>

* Adding GB200 perf config

Signed-off-by: Parth Mannan <pmannan@nvidia.com>

* GB300 perf config

Signed-off-by: Parth Mannan <pmannan@nvidia.com>

* Refactor Energon data module to return wrapped dataloaders and add EnergonDataloader class for cyclic iteration. Introduce WAN pretrain mock data configuration for testing.
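
A cyclic dataloader wrapper of this kind is typically a thin shim: Megatron-style loops call next() exactly train_iters times rather than iterating over epochs, so the wrapper restarts the underlying loader on StopIteration. A minimal sketch under that assumption; the class name and details are illustrative, not the PR's EnergonDataloader:

```python
from typing import Any, Iterator


class CyclicDataloader:
    """Illustrative wrapper that never raises StopIteration."""

    def __init__(self, dataloader: Any) -> None:
        self._dataloader = dataloader
        self._iterator: Iterator = iter(dataloader)

    def __iter__(self) -> "CyclicDataloader":
        return self

    def __next__(self):
        try:
            return next(self._iterator)
        except StopIteration:
            # Exhausted one pass over the data: restart and keep serving.
            self._iterator = iter(self._dataloader)
            return next(self._iterator)
```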

* Enhance DiffusionTaskEncoder to handle None attributes in stacking and concatenation methods. Add WAN pretrain mock data configuration for testing purposes.
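
Handling None attributes in batching usually means treating an optional per-sample field as absent for the whole micro-batch when any sample lacks it. A small sketch of that convention; the helper name is hypothetical and the real DiffusionTaskEncoder logic may differ:

```python
import torch


def stack_optional(values: list) -> "torch.Tensor | None":
    """Stack a batch field, tolerating None entries for optional modalities."""
    if any(v is None for v in values):
        return None  # drop the optional field for this batch
    return torch.stack(values)


latents = stack_optional([torch.zeros(3), torch.ones(3)])  # -> 2x3 tensor
text_emb = stack_optional([None, torch.ones(3)])           # -> None
```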

* Refactor data processing in dit_data_step to simplify batch retrieval and update WAN pretrain configuration to include train_iters.

* Add op fusions

Signed-off-by: Parth Mannan <pmannan@nvidia.com>
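
"Op fusions" here refers to collapsing chains of elementwise ops into a single kernel. As one hedged illustration of the genre (not necessarily the fusions this commit adds), Megatron-style stacks often JIT-fuse the bias-add with the tanh-approximated GELU:

```python
import torch


@torch.jit.script
def fused_bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # One scripted kernel instead of separate add/mul/tanh launches.
    y = x + bias
    return 0.5 * y * (1.0 + torch.tanh(0.79788456 * (y + 0.044715 * y * y * y)))
```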

* Update H100 config

Signed-off-by: Parth Mannan <pmannan@nvidia.com>

* Fix lint

Signed-off-by: Parth Mannan <pmannan@nvidia.com>

* Resolve conflict

Signed-off-by: Parth Mannan <pmannan@nvidia.com>

* Fix for mock dataloader test

Signed-off-by: Parth Mannan <pmannan@nvidia.com>

* Fix Dummyiter

Signed-off-by: Parth Mannan <pmannan@nvidia.com>

* Fix test

Signed-off-by: Parth Mannan <pmannan@nvidia.com>

* Make RoPE test GPU-only

Signed-off-by: Parth Mannan <pmannan@nvidia.com>
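
Gating a test to GPU-only is conventionally done with a pytest skip marker, so CPU-only CI runners skip it cleanly instead of failing. A minimal sketch; the test body is a placeholder, not the actual RoPE test:

```python
import pytest
import torch

requires_gpu = pytest.mark.skipif(not torch.cuda.is_available(), reason="requires CUDA")


@requires_gpu
def test_rope_on_gpu():
    x = torch.randn(2, 8, 16, device="cuda")
    assert x.is_cuda  # placeholder assertion; the real test checks RoPE outputs
```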

* Rope cuda fix

Signed-off-by: Parth Mannan <pmannan@nvidia.com>

---------

Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Co-authored-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>
Co-authored-by: root <root@eos0025.eos.clusters.nvidia.com>
Co-authored-by: root <root@eos0558.eos.clusters.nvidia.com>
Co-authored-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
@chtruong814 chtruong814 deleted the pmannan/dfm_perf branch January 29, 2026 20:26
huvunvidia pushed a commit that referenced this pull request Feb 12, 2026
