From 32660773907a158ec7aeccead9e79884c7e8fffe Mon Sep 17 00:00:00 2001 From: Abhinav Garg Date: Sun, 16 Nov 2025 18:34:11 +0000 Subject: [PATCH 01/28] Initial README commit --- README.md | 226 +++++++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 197 insertions(+), 29 deletions(-) diff --git a/README.md b/README.md index 36a555de..512d42c0 100644 --- a/README.md +++ b/README.md @@ -1,30 +1,198 @@ -# NeMo DFM: Diffusion Foundation Models collection - -NeMo DFM is a state-of-the-art framework for fast, large-scale training and inference of video world models. It unifies the latest diffusion-based and autoregressive techniques, prioritizing efficiency and performance from research prototyping to production deployment. - -## Projects - -This collection consists of 4 projects: -1. [Scalable diffusion training framework](nemo_vfm/diffusion/readme.rst) -2. [Accelerated diffusion world models](nemo_vfm/physicalai/Cosmos/cosmos1/models/diffusion/README.md) -3. [Accelerated autoregressive world models](nemo_vfm/physicalai/Cosmos/cosmos1/models/autoregressive/README.md) -4. [Sparse attention for efficient diffusion inference](nemo_vfm/sparse_attention/README.md) - -## Citations - -If you find our code useful, please consider citing the following papers: -```bibtex -@article{patel2025training, - title={Training Video Foundation Models with NVIDIA NeMo}, - author={Patel, Zeeshan and He, Ethan and Mannan, Parth and Ren, Xiaowei and Wolf, Ryan and Agarwal, Niket and Huffman, Jacob and Wang, Zhuoyao and Wang, Carl and Chang, Jack and others}, - journal={arXiv preprint arXiv:2503.12964}, - year={2025} -} - -@article{agarwal2025cosmos, - title={Cosmos world foundation model platform for physical ai}, - author={Agarwal, Niket and Ali, Arslan and Bala, Maciej and Balaji, Yogesh and Barker, Erik and Cai, Tiffany and Chattopadhyay, Prithvijit and Chen, Yongxin and Cui, Yin and Ding, Yifan and others}, - journal={arXiv preprint arXiv:2501.03575}, - year={2025} -} +
+ +# NeMo DFM: Diffusion Foundation Models + + + + + + +[![CICD NeMo](https://github.com/NVIDIA-NeMo/Megatron-Bridge/actions/workflows/cicd-main.yml/badge.svg)](https://github.com/NVIDIA-NeMo/Megatron-Bridge/actions/workflows/cicd-main.yml) +[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/release/python-3100/) +[![GitHub Stars](https://img.shields.io/github/stars/NVIDIA-NeMo/DFM.svg?style=social&label=Star&cacheSeconds=14400)](https://github.com/NVIDIA-NeMo/DFM/stargazers/) + +**State-of-the-art framework for fast, large-scale training and inference of diffusion models** + +[Documentation](https://github.com/NVIDIA-NeMo/DFM/tree/main/docs) | [Supported Models](#supported-models) | [Examples](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples) | [Contributing](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/CONTRIBUTING.md) + +
+
+## Overview
+
+NeMo DFM (Diffusion Foundation Models) is a comprehensive collection of diffusion models for **Video**, **Image**, and **Text** generation. It unifies cutting-edge diffusion-based architectures and training techniques, prioritizing efficiency and performance from research prototyping to production deployment.
+
+**Dual-Path Architecture**: DFM provides two complementary training paths to maximize flexibility:
+
+- **🌉 Megatron Bridge Path**: Built on [Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) and [Megatron Core](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core) for maximum scalability with tensor, pipeline, and context parallelism
+- **🚀 AutoModel Path**: Built on [NeMo AutoModel](https://github.com/NVIDIA-NeMo/Automodel) for PyTorch DTensor-native SPMD training with seamless 🤗 Hugging Face integration
+
+Choose the path that best fits your workflow—or use both for different stages of development!
+
+
+## 🔧 Installation
+
+### 🐳 Build your own Container
+
+#### 1. Build the container
+```bash
+# Initialize all submodules (Megatron-Bridge, Automodel, and nested Megatron-LM)
+git submodule update --init --recursive
+
+# Build the container
+docker build -f docker/Dockerfile.ci -t dfm:dev .
+```
+
+#### 2. Start the container
+
+```bash
+docker run --rm -it --gpus all \
+    --entrypoint bash \
+    -v $(pwd):/opt/DFM dfm:dev
+```
+
+
+
+### 📦 Using DFM Docker (Coming Soon)
+
+## ⚡ Quickstart
+
+### Megatron Bridge Path
+
+#### Run a Recipe
+You can find all predefined recipes under the [recipes](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/megatron/recipes) directory.
+
+> **Note:** You will need [uv](https://docs.astral.sh/uv/) to run the recipes; pass `--group megatron-bridge`.
+
+
+
+```bash
+uv run --group megatron-bridge python -m torch.distributed.run --nproc_per_node=2 examples/megatron/recipes/wan/pretrain_wan.py model.qkv_format=thd --mock
+```
+
+### AutoModel Path
+
+Train with PyTorch-native DTensor parallelism and direct 🤗 HF integration:
+
+
+```bash
+# TODO
+# Fine-tune a video diffusion model with FSDP2
+uv run torchrun --nproc-per-node=8 \
+  dfm/src/automodel/recipes/finetune.py \
+  --config examples/automodel/wan21_finetune.yaml
+
+# Override parameters via CLI
+# TODO
+uv run torchrun --nproc-per-node=8 \
+  dfm/src/automodel/recipes/finetune.py \
+  --config examples/automodel/wan21_finetune.yaml \
+  --step_scheduler.local_batch_size 4 \
+  --model.pretrained_model_name_or_path "your-model-id"
+```
+
+## 🚀 Key Features
+
+### Dual Training Paths
+
+- **Megatron Bridge Path**
+  - 🔄 Bidirectional HuggingFace ↔ Megatron checkpoint conversion
+  - 🎯 Advanced parallelism: Tensor (TP), Pipeline (PP), Context (CP), Expert (EP)
+  - 📈 Near-linear scalability to thousands of nodes
+  - 🔧 Production-ready recipes with optimized hyperparameters
+
+- **AutoModel Path**
+  - 🌐 PyTorch DTensor-native SPMD training
+  - 🔀 FSDP2-based Hybrid Sharded Data Parallelism (HSDP)
+  - 📦 Sequence packing for efficient training
+  - 🎨 Minimal ceremony with YAML-driven configs
+
+### Shared Capabilities
+
+- **🎥 Multi-Modal Diffusion**: Support for video, image, and text generation
+- **🔬 Advanced Samplers**: EDM, Flow Matching, and custom diffusion schedules
+- **🎭 Flexible Architectures**: DiT (Diffusion Transformers), WAN (World Action Networks)
+- **📊 Efficient Data Loading**: Data pipelines with sequence packing
+- **💾 Distributed Checkpointing**: SafeTensors-based sharded checkpoints
+- **🌟 Memory Optimization**: Gradient checkpointing, mixed precision, efficient attention
+
+## Supported Models
+
+DFM provides out-of-the-box support for state-of-the-art diffusion architectures:
+
+| Model | Type | Megatron Bridge |
AutoModel | Description |
+|-------|------|-----------------|-----------|-------------|
+| **DiT** | Image/Video | [pretrain, finetune](@Sajad) | 🔜 | Diffusion Transformers with scalable architecture |
+| **WAN 2.1** | Video | [inference](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/wan/inference_wan.py), [pretrain, finetune](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/wan/pretrain_wan.py), conversion(@Huy) | @Linnan, @Alex | World Action Networks for video generation |
+
+## Performance Benchmarking
+
+For detailed performance benchmarks including throughput metrics across different GPU systems and model configurations, see the [Performance Summary](https://github.com/NVIDIA-NeMo/DFM/blob/main/docs/performance.md) in our documentation.
+
+## Project Structure
+
+```
+DFM/
+├── dfm/
+│   └── src/
+│       ├── megatron/          # Megatron Bridge path
+│       │   ├── base/          # Base utilities for Megatron
+│       │   ├── data/          # Data loaders and task encoders
+│       │   │   ├── common/    # Shared data utilities
+│       │   │   ├── <model>/   # model-specific data handling
+│       │   ├── model/         # Model implementations
+│       │   │   ├── common/    # Shared model components
+│       │   │   ├── <model>/   # model-specific implementations
+│       │   └── recipes/       # Training recipes
+│       │       ├── <model>/   # model-specific training configs
+│       ├── automodel (@linnan, @alex)/  # AutoModel path (DTensor-native)
+│       │   ├── _diffusers/    # Diffusion pipeline integrations
+│       │   ├── datasets/      # Dataset implementations
+│       │   ├── distributed/   # Parallelization strategies
+│       │   ├── flow_matching/ # Flow matching implementations
+│       │   ├── recipes/       # Training scripts
+│       │   └── utils/         # Utilities and validation
+│       └── common/            # Shared across both paths
+│           ├── data/          # Common data utilities
+│           └── utils/         # Batch ops, video utils, etc.
+β”œβ”€β”€ examples/ # Example scripts and configs +``` + +## 🎯 Choosing Your Path + +| Feature | Megatron Bridge | AutoModel | +|---------|-----------------|-----------| +| **Best For** | Maximum scale (1000+ GPUs) | Flexibility & fast iteration | +| **Parallelism** | TP, PP, CP, EP, VPP | FSDP2, TP, SP, CP | +| **HF Integration** | Via bridge/conversion | PyTorch-native DTensor | +| **Checkpoint Format** | Megatron + HF export | SafeTensors DCP | +| **Learning Curve** | Steeper (more knobs) | Gentler (YAML-driven) | +| **Performance** | Highest at scale | Excellent, pytorch-native | + +**Recommendation**: +- Start with **AutoModel** for quick prototyping and HF model compatibility +- Move to **Megatron Bridge** when scaling to 100+ GPUs or need advanced parallelism +- Use **both**: prototype with AutoModel, scale with Megatron Bridge! + + +## 🀝 Contributing + +We welcome contributions! Please see our Contributing Guide for details on: + +- Setting up your development environment +- Code style and testing guidelines +- Submitting pull requests +- Reporting issues + +For questions or discussions, please open an issue on GitHub. 
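The Flow Matching objective listed above under Shared Capabilities reduces to a simple velocity regression: interpolate between noise and data, then regress the constant velocity of that straight path. The sketch below is a framework-agnostic illustration in pure Python; `flow_matching_loss` and the toy `zero_model` are hypothetical stand-ins, not part of the DFM API.

```python
import random

def flow_matching_loss(model, x1):
    """One flow-matching training term for a single sample.

    x1: a data sample (list of floats); x0 is Gaussian noise.
    The model regresses the constant velocity (x1 - x0) of the
    straight path x_t = (1 - t) * x0 + t * x1 at a random time t.
    """
    dim = len(x1)
    x0 = [random.gauss(0.0, 1.0) for _ in range(dim)]
    t = random.random()
    # Linear interpolation between noise and data
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    # Regression target: the straight-path velocity
    v_target = [b - a for a, b in zip(x0, x1)]
    v_pred = model(xt, t)
    # Mean squared error between predicted and target velocity
    return sum((p - y) ** 2 for p, y in zip(v_pred, v_target)) / dim

# Hypothetical model that always predicts zero velocity
zero_model = lambda xt, t: [0.0] * len(xt)
loss = flow_matching_loss(zero_model, [1.0, -1.0, 0.5, 2.0])
print(loss >= 0.0)  # True
```

A real recipe applies the same loss shape to batched tensors with a DiT or WAN network in place of the toy model.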
+
+## Acknowledgements
+
+NeMo DFM builds upon the excellent work of:
+
+- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) - Advanced model parallelism
+- [Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) - HuggingFace ↔ Megatron bridge
+- [NeMo AutoModel](https://github.com/NVIDIA-NeMo/Automodel) - PyTorch-native SPMD training
+- [PyTorch Distributed](https://pytorch.org/docs/stable/distributed.html) - Foundation for distributed training
+- [Diffusers](https://github.com/huggingface/diffusers) - Diffusion model implementations
+

From 9add86704708a1b37032e68610a2925a29571059 Mon Sep 17 00:00:00 2001
From: Abhinav Garg
Date: Sun, 16 Nov 2025 19:35:40 +0000
Subject: [PATCH 02/28] Update README and add performance summary documentation

- Corrected the link in the README for the performance summary to point to the correct file.
- Introduced a new `performance-summary.md` document detailing performance benchmarks for large language models using DFM, including nomenclature, performance metrics, and system configurations.

---
 README.md                   |  2 +-
 docs/performance-summary.md | 62 +++++++++++++++++++++++++++++++++++++
 2 files changed, 63 insertions(+), 1 deletion(-)
 create mode 100644 docs/performance-summary.md

diff --git a/README.md b/README.md
index 512d42c0..b4a2fc6b 100644
--- a/README.md
+++ b/README.md
@@ -127,7 +127,7 @@ DFM provides out-of-the-box support for state-of-the-art diffusion architectures

 ## Performance Benchmarking

-For detailed performance benchmarks including throughput metrics across different GPU systems and model configurations, see the [Performance Summary](https://github.com/NVIDIA-NeMo/DFM/blob/main/docs/performance.md) in our documentation.
+For detailed performance benchmarks including throughput metrics across different GPU systems and model configurations, see the [Performance Summary](https://github.com/NVIDIA-NeMo/DFM/blob/main/docs/performance-summary.md) in our documentation.
 ## Project Structure

diff --git a/docs/performance-summary.md b/docs/performance-summary.md
new file mode 100644
index 00000000..214f74c0
--- /dev/null
+++ b/docs/performance-summary.md
@@ -0,0 +1,62 @@
+# Performance
+
+As part of the NVIDIA NeMo Framework, DFM provides optimal performance for training advanced generative AI models by incorporating the most recent training techniques, such as model parallelization, optimized attention mechanisms, and more, to achieve high training throughput.
+
+This page provides performance benchmarks for large language models using DFM across different GPU systems and configurations.
+
+## Nomenclature
+
+- **GBS**: Global Batch Size
+- **MBS**: Micro Batch Size
+- **FSDP**: Fully Sharded Data Parallel
+  - FSDP = 1: use FSDP
+  - FSDP = 0: use DDP (Distributed Data Parallel)
+- **TP**: Tensor Parallel Size
+- **PP**: Pipeline Parallel Size
+- **CP**: Context Parallel Size
+- **VP**: Virtual Pipeline Parallel Size
+- **EP**: Expert Parallel Size
+- **GA**: Number of Gradient Accumulations
+
+## Performance Metrics
+
+Performance is measured using:
+- **Tokens/sec/GPU**: Throughput per GPU
+- **Model TFLOP/sec/GPU**: Model floating-point operations per second per GPU
+
+```{contents}
+:local:
+:depth: 2
+```
+
+## Performance Summary
+
+Below are performance benchmarks for various models, organized by release version.
+ +The performance data includes: + +- **Pre-training Performance**: Throughput metrics for various model sizes and architectures +- **System Configurations**: Results across different GPU systems (DGX-GB200, DGX-B200, DGX-H100) + +--- + +## Megatron-Core Pre-Training Performance + +#### System: DGX-GB200 + +| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU | +|-------|--------|-----|-----|-----------------|------|----|----|----|----|----|----|-----------------------|-------------------------| + + +#### System: DGX-B200 + +| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU | +|-------|--------|-----|-----|-----------------|------|----|----|----|----|----|----|-----------------------|-------------------------| + +#### System: DGX-H100 + +| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU | +|-------|--------|-----|-----|-----------------|------|----|----|----|----|----|----|-----------------------|-------------------------| + +## Automodel Pre-Training Performance + From 79f9d264cdf1e940e5dda183ced255e41604f24c Mon Sep 17 00:00:00 2001 From: sajadn Date: Tue, 18 Nov 2025 14:12:54 -0800 Subject: [PATCH 03/28] add DiT megatron links. Signed-off-by: sajadn --- README.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index b4a2fc6b..521796f1 100644 --- a/README.md +++ b/README.md @@ -59,7 +59,7 @@ docker run --rm -it --gpus all \ ### Megatron Bridge Path #### Run a Receipe -You can find all predefined recipes under [recipes](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/megatron/recipes) directory. +You can find all predefined recipes under [recipes](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/megatron/recipes) directory. 
> **Note:** You will have to use [uv](https://docs.astral.sh/uv/) to run the recipes. Please use `--group` as `megatron-bridge`. @@ -122,7 +122,7 @@ DFM provides out-of-the-box support for state-of-the-art diffusion architectures | Model | Type | Megatron Bridge | AutoModel | Description | |-------|------|-----------------|-----------|-------------| -| **DiT** | Image/Video | [pretrain, finetune](@Sajad) | πŸ”œ | Diffusion Transformers with scalable architecture | +| **DiT** | Image/Video | [pretrain](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/dit/pretrain_dit_model.py), [inference](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/dit/inference_dit_model.py) | πŸ”œ | Diffusion Transformers with scalable architecture | | **WAN 2.1** | Video | [inference](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/wan/inference_wan.py), [pretrain, finetune](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/wan/pretrain_wan.py), conversion(@Huy) | @Linnan, @Alex | World Action Networks for video generation | ## Performance Benchmarking @@ -169,7 +169,7 @@ DFM/ | **Learning Curve** | Steeper (more knobs) | Gentler (YAML-driven) | | **Performance** | Highest at scale | Excellent, pytorch-native | -**Recommendation**: +**Recommendation**: - Start with **AutoModel** for quick prototyping and HF model compatibility - Move to **Megatron Bridge** when scaling to 100+ GPUs or need advanced parallelism - Use **both**: prototype with AutoModel, scale with Megatron Bridge! 
@@ -195,4 +195,3 @@ NeMo DFM builds upon the excellent work of:
 - [NeMo AutoModel](https://github.com/NVIDIA-NeMo/Automodel) - PyTorch-native SPMD training
 - [PyTorch Distributed](https://pytorch.org/docs/stable/distributed.html) - Foundation for distributed training
 - [Diffusers](https://github.com/huggingface/diffusers) - Diffusion model implementations
-

From b96cf8ffd4e81987703e691ec6b200aa147b46a1 Mon Sep 17 00:00:00 2001
From: Parth Mannan
Date: Wed, 19 Nov 2025 11:26:10 -0800
Subject: [PATCH 04/28] Performance Docs update

Signed-off-by: Parth Mannan

---
 docs/performance-summary.md | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/docs/performance-summary.md b/docs/performance-summary.md
index 214f74c0..539253dc 100644
--- a/docs/performance-summary.md
+++ b/docs/performance-summary.md
@@ -1,8 +1,8 @@
 # Performance

-As part of the NVIDIA NeMo Framework, DFM provides optimal performance for training advanced generative AI models by incorporating the most recent training techniques, such as model parallelization, optimized attention mechanisms, and more, to achieve high training throughput.
+As part of the NVIDIA NeMo Framework, DFM provides the most recent techniques for training advanced generative AI models, such as model parallelization and optimized attention mechanisms, to achieve high training throughput.

-This page provides performance benchmarks for large language models using DFM across different GPU systems and configurations.
+This page provides current performance benchmarks for models trained with DFM across different GPU systems and configurations, and will be updated as we continue to optimize performance. Please refer to `examples/megatron/recipes/wan/conf` for up-to-date YAML configurations.
## Nomenclature @@ -12,11 +12,11 @@ This page provides performance benchmarks for large language models using DFM ac - FSDP = 1: use FSDP - FSDP = 0: use DDP (Distributed Data Parallel) - **TP**: Tensor Parallel Size +- **SP**: Sequence Parallel - **PP**: Pipeline Parallel Size - **CP**: Context Parallel Size - **VP**: Virtual Pipeline Parallel Size - **EP**: Expert Parallel Size -- **GA**: Number of Gradient Accumulations ## Performance Metrics @@ -36,7 +36,7 @@ Below are performance benchmarks for various large language models organized by The performance data includes: - **Pre-training Performance**: Throughput metrics for various model sizes and architectures -- **System Configurations**: Results across different GPU systems (DGX-GB200, DGX-B200, DGX-H100) +- **System Configurations**: Results across different GPU systems (DGX-GB200, DGX-GB300, DGX-H100) --- @@ -44,19 +44,22 @@ The performance data includes: #### System: DGX-GB200 -| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU | +| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | TP | SP | PP | CP | VP | EP | Tokens / sec / GPU | Model TFLOP / sec / GPU | |-------|--------|-----|-----|-----------------|------|----|----|----|----|----|----|-----------------------|-------------------------| +|Wan 2.1 14B|32|64|1|37440|0|1|1|1|4|0|0|4747.17|787.59| -#### System: DGX-B200 +#### System: DGX-GB300 -| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU | +| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | TP | SP | PP | CP | VP | EP | Tokens / sec / GPU | Model TFLOP / sec / GPU | |-------|--------|-----|-----|-----------------|------|----|----|----|----|----|----|-----------------------|-------------------------| +|Wan 2.1 14B|32|64|1|37440|0|1|1|1|2|0|0|6161.63|1,022.26| #### System: DGX-H100 -| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP 
| TP | PP | CP | VP | EP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU | +| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | TP | SP | PP | CP | VP | EP | Tokens / sec / GPU | Model TFLOP / sec / GPU | |-------|--------|-----|-----|-----------------|------|----|----|----|----|----|----|-----------------------|-------------------------| +|Wan 2.1 14B|64|64|1|37440|0|2|1|1|4|0|0|1866.47|309.66| ## Automodel Pre-Training Performance From 2b00158742c66db61838e6afa6a86a0eb02c70a3 Mon Sep 17 00:00:00 2001 From: Parth Mannan Date: Wed, 19 Nov 2025 11:27:24 -0800 Subject: [PATCH 05/28] Performance Docs update fix Signed-off-by: Parth Mannan --- docs/performance-summary.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/performance-summary.md b/docs/performance-summary.md index 539253dc..068dbf4a 100644 --- a/docs/performance-summary.md +++ b/docs/performance-summary.md @@ -46,14 +46,14 @@ The performance data includes: | Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | TP | SP | PP | CP | VP | EP | Tokens / sec / GPU | Model TFLOP / sec / GPU | |-------|--------|-----|-----|-----------------|------|----|----|----|----|----|----|-----------------------|-------------------------| -|Wan 2.1 14B|32|64|1|37440|0|1|1|1|4|0|0|4747.17|787.59| +|Wan 2.1 14B|32|64|1|37440|0|1|0|1|4|0|0|4747.17|787.59| #### System: DGX-GB300 | Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | TP | SP | PP | CP | VP | EP | Tokens / sec / GPU | Model TFLOP / sec / GPU | |-------|--------|-----|-----|-----------------|------|----|----|----|----|----|----|-----------------------|-------------------------| -|Wan 2.1 14B|32|64|1|37440|0|1|1|1|2|0|0|6161.63|1,022.26| +|Wan 2.1 14B|32|64|1|37440|0|1|0|1|2|0|0|6161.63|1,022.26| #### System: DGX-H100 From 8e471a0a33ce4ae790ca15bd5541b74b67264208 Mon Sep 17 00:00:00 2001 From: Abhinav Garg Date: Thu, 20 Nov 2025 11:29:55 +0000 Subject: [PATCH 06/28] Update README to enhance clarity and accuracy - Removed redundant 
description of the framework. - Clarified the relationship between Megatron Bridge and Megatron Core in the Dual-Path Architecture section. --- README.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/README.md b/README.md index 521796f1..70f626d0 100644 --- a/README.md +++ b/README.md @@ -11,8 +11,6 @@ [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/release/python-3100/) [![GitHub Stars](https://img.shields.io/github/stars/NVIDIA-NeMo/DFM.svg?style=social&label=Star&cacheSeconds=14400)](https://github.com/NVIDIA-NeMo/DFM/stargazers/) -**State-of-the-art framework for fast, large-scale training and inference of diffusion models** - [Documentation](https://github.com/NVIDIA-NeMo/DFM/tree/main/docs) | [Supported Models](#supported-models) | [Examples](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples) | [Contributing](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/CONTRIBUTING.md) @@ -23,7 +21,7 @@ NeMo DFM (Diffusion Foundation Models) is a comprehensive collection of diffusio **Dual-Path Architecture**: DFM provides two complementary training paths to maximize flexibility: -- **πŸŒ‰ Megatron Bridge Path**: Built on [Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) and [Megatron Core](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core) for maximum scalability with tensor, pipeline, and context parallelism +- **πŸŒ‰ Megatron Bridge Path**: Built on [Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) which leverages [Megatron Core](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core) for maximum scalability with 6D parallelism - **πŸš€ AutoModel Path**: Built on [NeMo AutoModel](https://github.com/NVIDIA-NeMo/Automodel) for PyTorch DTensor-native SPMD training with seamless πŸ€— Hugging Face integration Choose the path that best fits your workflowβ€”or use both for different stages of development! 
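As a side note on the 6D parallelism mentioned in the patch above: the available ranks have to factor cleanly across the parallel dimensions. The helper below is a minimal sanity check following the usual Megatron convention (a sketch, not a DFM API; EP partitions experts inside the data-parallel groups and VPP subdivides pipeline stages, so neither consumes extra ranks):

```python
def data_parallel_size(world_size, tp=1, pp=1, cp=1):
    """Data-parallel size implied by a parallelism layout.

    world_size must factor as tp * pp * cp * dp; raises if the
    layout does not divide the rank budget evenly.
    """
    denom = tp * pp * cp
    if world_size % denom:
        raise ValueError(f"{world_size} ranks not divisible by tp*pp*cp={denom}")
    return world_size // denom

# Example: 128 GPUs with TP=2, CP=4 (as in the Wan 2.1 14B
# DGX-H100 entry of the performance tables) -> DP=16
print(data_parallel_size(128, tp=2, cp=4))  # 16
```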
From 6f92b0176c4edeaf44f19723384d12111dcc2cb0 Mon Sep 17 00:00:00 2001 From: Abhinav Garg Date: Thu, 20 Nov 2025 18:56:48 +0000 Subject: [PATCH 07/28] Enhance README with detailed performance optimizations and parallelism descriptions - Updated the Megatron Bridge Path section to include 6D parallelism details. - Added state-of-the-art performance optimizations to the Dual Training Paths section. - Clarified parallelism terminology in the comparison table for better understanding. --- README.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index 70f626d0..d0d9102c 100644 --- a/README.md +++ b/README.md @@ -21,7 +21,7 @@ NeMo DFM (Diffusion Foundation Models) is a comprehensive collection of diffusio **Dual-Path Architecture**: DFM provides two complementary training paths to maximize flexibility: -- **πŸŒ‰ Megatron Bridge Path**: Built on [Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) which leverages [Megatron Core](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core) for maximum scalability with 6D parallelism +- **πŸŒ‰ Megatron Bridge Path**: Built on [Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) which leverages [Megatron Core](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core) for maximum scalability with 6D parallelism (TP, PP, CP, EP, VPP, DP) - **πŸš€ AutoModel Path**: Built on [NeMo AutoModel](https://github.com/NVIDIA-NeMo/Automodel) for PyTorch DTensor-native SPMD training with seamless πŸ€— Hugging Face integration Choose the path that best fits your workflowβ€”or use both for different stages of development! 
@@ -94,8 +94,8 @@ uv run torchrun --nproc-per-node=8 \

 ### Dual Training Paths

 - **Megatron Bridge Path**
-  - 🔄 Bidirectional HuggingFace ↔ Megatron checkpoint conversion
-  - 🎯 Advanced parallelism: Tensor (TP), Pipeline (PP), Context (CP), Expert (EP)
+  - State-of-the-art performance optimizations (TFLOPs)
+  - 🎯 Advanced parallelism: Tensor (TP), Context (CP), Data (DP), etc.
   - 📈 Near-linear scalability to thousands of nodes
   - 🔧 Production-ready recipes with optimized hyperparameters

@@ -113,6 +113,7 @@ uv run torchrun --nproc-per-node=8 \
 - **📊 Efficient Data Loading**: Data pipelines with sequence packing
 - **💾 Distributed Checkpointing**: SafeTensors-based sharded checkpoints
 - **🌟 Memory Optimization**: Gradient checkpointing, mixed precision, efficient attention
+- **🤗 HuggingFace Integration**: Seamless integration with the HF ecosystem

 ## Supported Models

@@ -161,7 +162,7 @@ DFM/
 | Feature | Megatron Bridge | AutoModel |
 |---------|-----------------|-----------|
 | **Best For** | Maximum scale (1000+ GPUs) | Flexibility & fast iteration |
-| **Parallelism** | TP, PP, CP, EP, VPP | FSDP2, TP, SP, CP |
+| **Parallelism** | 6D (TP, CP, DP, etc) | FSDP2, TP, SP, CP |
 | **HF Integration** | Via bridge/conversion | PyTorch-native DTensor |
 | **Checkpoint Format** | Megatron + HF export | SafeTensors DCP |
 | **Learning Curve** | Steeper (more knobs) | Gentler (YAML-driven) |
 | **Performance** | Highest at scale | Excellent, pytorch-native |

From 223381147948833be71379fc089275457de34463 Mon Sep 17 00:00:00 2001
From: Parth Mannan
Date: Thu, 20 Nov 2025 11:27:28 -0800
Subject: [PATCH 08/28] Update perf doc

Signed-off-by: Parth Mannan

---
 docs/performance-summary.md | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/docs/performance-summary.md b/docs/performance-summary.md
index 068dbf4a..65b37c2d 100644
--- a/docs/performance-summary.md
+++ b/docs/performance-summary.md
@@ -44,22 +44,22 @@ The performance data includes:

 #### System: DGX-GB200

-| Model | #-GPUs | GBS | MBS | Sequence
Length | FSDP | TP | SP | PP | CP | VP | EP | Tokens / sec / GPU | Model TFLOP / sec / GPU | -|-------|--------|-----|-----|-----------------|------|----|----|----|----|----|----|-----------------------|-------------------------| -|Wan 2.1 14B|32|64|1|37440|0|1|0|1|4|0|0|4747.17|787.59| +| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | TP | SP | PP | CP | VP | EP | Model TFLOP / sec / GPU | +|-------|--------|-----|-----|-----------------|------|----|----|----|----|----|----|-------------------------| +|Wan 2.1 14B|32|64|1|37440|0|1|0|1|4|0|0|787.59| #### System: DGX-GB300 -| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | TP | SP | PP | CP | VP | EP | Tokens / sec / GPU | Model TFLOP / sec / GPU | -|-------|--------|-----|-----|-----------------|------|----|----|----|----|----|----|-----------------------|-------------------------| -|Wan 2.1 14B|32|64|1|37440|0|1|0|1|2|0|0|6161.63|1,022.26| +| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | TP | SP | PP | CP | VP | EP | Model TFLOP / sec / GPU | +|-------|--------|-----|-----|-----------------|------|----|----|----|----|----|----|-------------------------| +|Wan 2.1 14B|32|64|1|37440|0|1|0|1|2|0|0|1,022.26| #### System: DGX-H100 -| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | TP | SP | PP | CP | VP | EP | Tokens / sec / GPU | Model TFLOP / sec / GPU | -|-------|--------|-----|-----|-----------------|------|----|----|----|----|----|----|-----------------------|-------------------------| -|Wan 2.1 14B|64|64|1|37440|0|2|1|1|4|0|0|1866.47|309.66| +| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | TP | SP | PP | CP | VP | EP | Model TFLOP / sec / GPU | +|-------|--------|-----|-----|-----------------|------|----|----|----|----|----|----|-------------------------| +|Wan 2.1 14B|128|128|1|37440|0|2|1|1|4|0|0|325.77| ## Automodel Pre-Training Performance From 88ddbf1d873bf9933b8d64e96ba8ab3ca6a35974 Mon Sep 17 00:00:00 2001 From: linnan wang Date: Thu, 20 Nov 2025 16:37:29 -0800 Subject: 
[PATCH 09/28] update Signed-off-by: linnan wang --- README.md | 15 ++++++--------- 1 file changed, 6 insertions(+), 9 deletions(-) diff --git a/README.md b/README.md index d0d9102c..d52f2ee2 100644 --- a/README.md +++ b/README.md @@ -78,15 +78,12 @@ Train with PyTorch-native DTensor parallelism and direct πŸ€— HF integration: # Fine-tune a video diffusion model with FSDP2 uv run torchrun --nproc-per-node=8 \ dfm/src/automodel/recipes/finetune.py \ - --config examples/automodel/wan21_finetune.yaml + -c examples/automodel/wan21_finetune.yaml -# Override parameters via CLI -# TODO +# Pre-train a video diffusion model with FSDP2 uv run torchrun --nproc-per-node=8 \ - dfm/src/automodel/recipes/finetune.py \ - --config examples/automodel/wan21_finetune.yaml \ - --step_scheduler.local_batch_size 4 \ - --model.pretrained_model_name_or_path "your-model-id" +examples/automodel/pretrain/pretrain.py \ +-c examples/automodel/pretrain/wan2_1_t2v_flow.yaml ``` ## πŸš€ Key Features @@ -122,7 +119,7 @@ DFM provides out-of-the-box support for state-of-the-art diffusion architectures | Model | Type | Megatron Bridge | AutoModel | Description | |-------|------|-----------------|-----------|-------------| | **DiT** | Image/Video | [pretrain](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/dit/pretrain_dit_model.py), [inference](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/dit/inference_dit_model.py) | πŸ”œ | Diffusion Transformers with scalable architecture | -| **WAN 2.1** | Video | [inference](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/wan/inference_wan.py), [pretrain, finetune](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/wan/pretrain_wan.py), conversion(@Huy) | @Linnan, @Alex | World Action Networks for video generation | +| **WAN 2.1** | Video | [inference](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/wan/inference_wan.py), [pretrain, 
finetune](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/wan/pretrain_wan.py), conversion(@Huy) | [pretrain](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/automodel/pretrain), [finetune](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/automodel/finetune),[inference](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/automodel/generate/wan_validate.py) | World Action Networks for video generation | ## Performance Benchmarking @@ -144,7 +141,7 @@ DFM/ β”‚ β”‚ β”‚ β”œβ”€β”€ / # model-specific implementations β”‚ β”‚ └── recipes/ # Training recipes β”‚ β”‚ β”œβ”€β”€ / # model-specific training configs -β”‚ β”œβ”€β”€ automodel (@linnan, @alex)/ # AutoModel path (DTensor-native) +β”‚ β”œβ”€β”€ automodel # AutoModel path (DTensor-native) β”‚ β”‚ β”œβ”€β”€ _diffusers/ # Diffusion pipeline integrations β”‚ β”‚ β”œβ”€β”€ datasets/ # Dataset implementations β”‚ β”‚ β”œβ”€β”€ distributed/ # Parallelization strategies From 2aaae5eb59d35f0c1ad0bf4e4cae893af0554f30 Mon Sep 17 00:00:00 2001 From: linnan wang Date: Fri, 21 Nov 2025 10:44:42 -0800 Subject: [PATCH 10/28] Update README with fine-tuning command Removed TODO comment and added a command for fine-tuning a video diffusion model. 
--- README.md | 1 - 1 file changed, 1 deletion(-) diff --git a/README.md b/README.md index d52f2ee2..41f3a4f4 100644 --- a/README.md +++ b/README.md @@ -74,7 +74,6 @@ Train with PyTorch-native DTensor parallelism and direct πŸ€— HF integration: ```bash -# TODO # Fine-tune a video diffusion model with FSDP2 uv run torchrun --nproc-per-node=8 \ dfm/src/automodel/recipes/finetune.py \ From 9abba18f76a73ab41b3a8a4d9d9347a891bf18d7 Mon Sep 17 00:00:00 2001 From: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com> Date: Fri, 21 Nov 2025 10:54:56 -0800 Subject: [PATCH 11/28] Apply suggestion from @akoumpa --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 41f3a4f4..d94553f9 100644 --- a/README.md +++ b/README.md @@ -160,7 +160,7 @@ DFM/ | **Best For** | Maximum scale (1000+ GPUs) | Flexibility & fast iteration | | **Parallelism** | 6D (TP, CP, DP, etc) | FSDP2, TP, SP, CP | | **HF Integration** | Via bridge/conversion | PyTorch-native DTensor | -| **Checkpoint Format** | Megatron + HF export | SafeTensors DCP | +| **Checkpoint Format** | Megatron + HF export | HF-native (SafeTensors with DCP) | | **Learning Curve** | Steeper (more knobs) | Gentler (YAML-driven) | | **Performance** | Highest at scale | Excellent, pytorch-native | From 22c67902d4c3b049b5da1000cb65cf7d6242beb5 Mon Sep 17 00:00:00 2001 From: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com> Date: Fri, 21 Nov 2025 10:55:04 -0800 Subject: [PATCH 12/28] Apply suggestion from @akoumpa --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index d94553f9..81413a17 100644 --- a/README.md +++ b/README.md @@ -159,7 +159,7 @@ DFM/ |---------|-----------------|-----------| | **Best For** | Maximum scale (1000+ GPUs) | Flexibility & fast iteration | | **Parallelism** | 6D (TP, CP, DP, etc) | FSDP2, TP, SP, CP | -| **HF Integration** | Via bridge/conversion | PyTorch-native DTensor | 
+| **HF Integration** | Via bridge/conversion | HF-native (via DTensor) | | **Checkpoint Format** | Megatron + HF export | HF-native (SafeTensors with DCP) | | **Learning Curve** | Steeper (more knobs) | Gentler (YAML-driven) | | **Performance** | Highest at scale | Excellent, pytorch-native | From 49c8a24deed68f83db4ef5cdde9a8f86d19a97a1 Mon Sep 17 00:00:00 2001 From: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com> Date: Fri, 21 Nov 2025 10:55:09 -0800 Subject: [PATCH 13/28] Apply suggestion from @akoumpa --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 81413a17..5ae3cf56 100644 --- a/README.md +++ b/README.md @@ -158,7 +158,7 @@ DFM/ | Feature | Megatron Bridge | AutoModel | |---------|-----------------|-----------| | **Best For** | Maximum scale (1000+ GPUs) | Flexibility & fast iteration | -| **Parallelism** | 6D (TP, CP, DP, etc) | FSDP2, TP, SP, CP | +| **Parallelism** | 6D (TP, CP, DP, etc) | FSDP2; (TP, SP, CP available soon) | | **HF Integration** | Via bridge/conversion | HF-native (via DTensor) | | **Checkpoint Format** | Megatron + HF export | HF-native (SafeTensors with DCP) | | **Learning Curve** | Steeper (more knobs) | Gentler (YAML-driven) | From 10433b31998408bd839d42fc18c5086baac2226e Mon Sep 17 00:00:00 2001 From: Huy Vu <86480512+huvunvidia@users.noreply.github.com> Date: Fri, 21 Nov 2025 15:03:34 -0500 Subject: [PATCH 14/28] Update README, Wan-related. Updated command syntax and improved clarity in README. --- README.md | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/README.md b/README.md index 5ae3cf56..35111d63 100644 --- a/README.md +++ b/README.md @@ -61,11 +61,12 @@ You can find all predefined recipes under [recipes](https://github.com/NVIDIA-Ne > **Note:** You will have to use [uv](https://docs.astral.sh/uv/) to run the recipes. Please use `--group` as `megatron-bridge`. 
- - - ```bash -uv run --group megatron-bridge python -m torch.distributed.run --nproc_per_node=2 examples/megatron/recipes/wan/pretrain_wan.py model.qkv_format=thd --mock +uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node $num_gpus \ + examples/megatron/recipes/wan/pretrain_wan.py \ + --config-file examples/megatron/recipes/wan/conf/wan_1_3B.yaml \ + --training-mode pretrain \ + --mock ``` ### AutoModel Path @@ -91,7 +92,7 @@ examples/automodel/pretrain/pretrain.py \ - **Megatron Bridge Path** - State-of-the-art performance optimizations (TFLOPs) - - 🎯 Advanced parallelism: Tensor (TP), Context (CP) Data (DP), etc + - 🎯 Advanced parallelism: Tensor (TP), Context (CP), Data (DP), etc. - πŸ“ˆ Near-linear scalability to thousands of nodes - πŸ”§ Production-ready recipes with optimized hyperparameters @@ -118,7 +119,7 @@ DFM provides out-of-the-box support for state-of-the-art diffusion architectures | Model | Type | Megatron Bridge | AutoModel | Description | |-------|------|-----------------|-----------|-------------| | **DiT** | Image/Video | [pretrain](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/dit/pretrain_dit_model.py), [inference](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/dit/inference_dit_model.py) | πŸ”œ | Diffusion Transformers with scalable architecture | -| **WAN 2.1** | Video | [inference](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/wan/inference_wan.py), [pretrain, finetune](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/wan/pretrain_wan.py), conversion(@Huy) | [pretrain](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/automodel/pretrain), [finetune](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/automodel/finetune),[inference](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/automodel/generate/wan_validate.py) | World Action Networks for video generation | +| **WAN 2.1** | Video | 
[inference](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/wan/inference_wan.py), [pretrain, finetune](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/megatron/recipes/wan/pretrain_wan.py) | [pretrain](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/automodel/pretrain), [finetune](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/automodel/finetune),[inference](https://github.com/NVIDIA-NeMo/DFM/blob/main/examples/automodel/generate/wan_validate.py) | World Action Networks for video generation | ## Performance Benchmarking @@ -158,7 +159,7 @@ DFM/ | Feature | Megatron Bridge | AutoModel | |---------|-----------------|-----------| | **Best For** | Maximum scale (1000+ GPUs) | Flexibility & fast iteration | -| **Parallelism** | 6D (TP, CP, DP, etc) | FSDP2; (TP, SP, CP available soon) | +| **Parallelism** | 6D (TP, CP, DP, etc.) | FSDP2; (TP, SP, CP available soon) | | **HF Integration** | Via bridge/conversion | HF-native (via DTensor) | | **Checkpoint Format** | Megatron + HF export | HF-native (SafeTensors with DCP) | | **Learning Curve** | Steeper (more knobs) | Gentler (YAML-driven) | From 901174e4480d32b4206fd9167289b121798dd314 Mon Sep 17 00:00:00 2001 From: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com> Date: Fri, 21 Nov 2025 13:49:27 -0800 Subject: [PATCH 15/28] Apply suggestion from @akoumpa --- docs/performance-summary.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/performance-summary.md b/docs/performance-summary.md index 65b37c2d..fc2ac8c4 100644 --- a/docs/performance-summary.md +++ b/docs/performance-summary.md @@ -61,5 +61,4 @@ The performance data includes: |-------|--------|-----|-----|-----------------|------|----|----|----|----|----|----|-------------------------| |Wan 2.1 14B|128|128|1|37440|0|2|1|1|4|0|0|325.77| -## Automodel Pre-Training Performance From 03560c75f4257420b69fec050b4a82b6fd5d30df Mon Sep 17 00:00:00 2001 From: Alexandros Koumparoulis 
<153118171+akoumpa@users.noreply.github.com> Date: Fri, 21 Nov 2025 13:58:21 -0800 Subject: [PATCH 16/28] Fixing typo @akoumpa --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 35111d63..c4d30edc 100644 --- a/README.md +++ b/README.md @@ -56,7 +56,7 @@ docker run --rm -it --gpus all \ ### Megatron Bridge Path -#### Run a Receipe +#### Run a Recipe You can find all predefined recipes under [recipes](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/megatron/recipes) directory. > **Note:** You will have to use [uv](https://docs.astral.sh/uv/) to run the recipes. Please use `--group` as `megatron-bridge`. From ca6d9cffea59b608f84e4b726c188dcbe296eef0 Mon Sep 17 00:00:00 2001 From: Alexandros Koumparoulis Date: Fri, 21 Nov 2025 14:00:45 -0800 Subject: [PATCH 17/28] fix automodel section Signed-off-by: Alexandros Koumparoulis --- README.md | 29 +++++++++++++++++++---------- 1 file changed, 19 insertions(+), 10 deletions(-) diff --git a/README.md b/README.md index c4d30edc..8370e978 100644 --- a/README.md +++ b/README.md @@ -73,17 +73,26 @@ uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node Train with PyTorch-native DTensor parallelism and direct πŸ€— HF integration: - +#### Run a Recipe + +You can find pre-configured recipes under [automodel/finetune](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/automodel/finetune) and [automodel/pretrain](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples/automodel/pretrain) directories. + +> Note: AutoModel examples live under `dfm/examples/automodel`. Use [uv](https://docs.astral.sh/uv/) with `--group automodel`. Configs are YAML-driven; pass `-c ` to override the default. + +The fine-tune recipe sets up WAN 2.1 Text-to-Video training with Flow Matching using FSDP2 Hybrid Sharding. +It parallelizes heavy transformer blocks while keeping lightweight modules (e.g., VAE) unsharded for efficiency. 
+Adjust batch sizes, LR, and parallel sizes in `dfm/examples/automodel/finetune/wan2_1_t2v_flow.yaml`. +The generation script demonstrates distributed inference with AutoModel DTensor managers, producing an MP4 on rank 0. You can tweak frame size, frame count, sampling steps, and CFG scale via command-line flags. + ```bash -# Fine-tune a video diffusion model with FSDP2 -uv run torchrun --nproc-per-node=8 \ - dfm/src/automodel/recipes/finetune.py \ - -c examples/automodel/wan21_finetune.yaml - -# Pre-train a video diffusion model with FSDP2 -uv run torchrun --nproc-per-node=8 \ -examples/automodel/pretrain/pretrain.py \ --c examples/automodel/pretrain/wan2_1_t2v_flow.yaml +# Fine-tune WAN 2.1 T2V with FSDP2 (single node, 8 GPUs) +uv run --group automodel torchrun --nproc-per-node=8 \ + dfm/examples/automodel/finetune/finetune.py \ + -c dfm/examples/automodel/finetune/wan2_1_t2v_flow.yaml + +# Generate videos with FSDP2 (distributed inference) +uv run --group automodel torchrun --nproc-per-node=8 \ + dfm/examples/automodel/generate/wan_generate.py ``` ## 🚀 Key Features From 4b38e3da67938fc8d5233123cc84b028822ca950 Mon Sep 17 00:00:00 2001 From: Alexandros Koumparoulis Date: Fri, 21 Nov 2025 14:04:44 -0800 Subject: [PATCH 18/28] fix Signed-off-by: Alexandros Koumparoulis --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 8370e978..03a8baf4 100644 --- a/README.md +++ b/README.md @@ -107,6 +107,7 @@ uv run --group automodel torchrun --nproc-per-node=8 \ - **AutoModel Path** - 🌐 PyTorch DTensor-native SPMD training + - 🚀 Advanced parallelisms (TP, PP, etc.) coming soon!
- 🔀 FSDP2-based Hybrid Sharding Data Parallelism (HSDP) - 📦 Sequence packing for efficient training - 🎨 Minimal ceremony with YAML-driven configs From b628d48ae36c239203191fb2005e8fe45fc8f06d Mon Sep 17 00:00:00 2001 From: Pablo Garay Date: Sun, 23 Nov 2025 23:33:48 -0800 Subject: [PATCH 19/28] update DFM-specific readme Signed-off-by: Pablo Garay --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 03a8baf4..0a4f8558 100644 --- a/README.md +++ b/README.md @@ -7,11 +7,11 @@ -[![CICD NeMo](https://github.com/NVIDIA-NeMo/Megatron-Bridge/actions/workflows/cicd-main.yml/badge.svg)](https://github.com/NVIDIA-NeMo/Megatron-Bridge/actions/workflows/cicd-main.yml) +[![CICD NeMo](https://github.com/NVIDIA-NeMo/DFM/actions/workflows/cicd-main.yml/badge.svg)](https://github.com/NVIDIA-NeMo/DFM/actions/workflows/cicd-main.yml) [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/release/python-3100/) [![GitHub Stars](https://img.shields.io/github/stars/NVIDIA-NeMo/DFM.svg?style=social&label=Star&cacheSeconds=14400)](https://github.com/NVIDIA-NeMo/DFM/stargazers/) -[Documentation](https://github.com/NVIDIA-NeMo/DFM/tree/main/docs) | [Supported Models](#supported-models) | [Examples](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples) | [Contributing](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/CONTRIBUTING.md) +[Documentation](https://github.com/NVIDIA-NeMo/DFM/tree/main/docs) | [Supported Models](#supported-models) | [Examples](https://github.com/NVIDIA-NeMo/DFM/tree/main/examples) | [Contributing](https://github.com/NVIDIA-NeMo/DFM/tree/main/CONTRIBUTING.md) From 48d65a64d6a2674cc14bc1bc40e13f5e5a17555d Mon Sep 17 00:00:00 2001 From: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com> Date: Wed, 26 Nov 2025 15:15:07 -0800 Subject: [PATCH 20/28] Update performance-summary.md Thanks a lot @linnanwang for the bench numbers.
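The benchmark tables added in the patch below report Model TFLOP/sec/GPU. For readers unfamiliar with the metric, the following sketch shows how such a figure is commonly derived; the function name, formula, and numbers are illustrative assumptions, not code or data from this repository:

```python
def tflops_per_sec_per_gpu(model_flops_per_iter: float,
                           iter_time_s: float,
                           num_gpus: int) -> float:
    """Achieved model throughput per GPU, in teraFLOP/s.

    model_flops_per_iter: total model FLOPs one training iteration
    performs (forward + backward) for the whole global batch.
    """
    return model_flops_per_iter / (iter_time_s * num_gpus) / 1e12


# Hypothetical numbers for illustration: 1.8e17 FLOPs per iteration,
# 12 s per iteration, 64 GPUs -> 234.375 TFLOP/s/GPU.
print(tflops_per_sec_per_gpu(1.8e17, 12.0, 64))
```

The exact FLOP accounting (e.g., whether activation recomputation is counted) varies between frameworks, so numbers computed this way are only comparable under the same convention.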
--- docs/performance-summary.md | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/docs/performance-summary.md b/docs/performance-summary.md index fc2ac8c4..ec715336 100644 --- a/docs/performance-summary.md +++ b/docs/performance-summary.md @@ -62,3 +62,14 @@ The performance data includes: |Wan 2.1 14B|128|128|1|37440|0|2|1|1|4|0|0|325.77| +## NeMo Automodel Pre-Training Performance +The following table summarizes the performance leveraging the NeMo Automodel backend. + +#### System: DGX-H100 + +| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | DP | TP | SP | PP | CP | VP | EP | Model TFLOP / sec / GPU | +|-------|--------|-----|-----|-----------------|------|----|----|----|----|----|----|----|-------------------------| +|Wan 2.1 14B|8|8|1|37440|8|1|1|1|1|1|0|0|175.88| +|Wan 2.1 14B|64|64|1|37440|64|1|1|1|1|1|0|0|228.85| + + From fec3b40846053aee678d9e68e7135cb4205b3e49 Mon Sep 17 00:00:00 2001 From: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com> Date: Wed, 26 Nov 2025 15:15:44 -0800 Subject: [PATCH 21/28] Update performance-summary.md --- docs/performance-summary.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/performance-summary.md b/docs/performance-summary.md index ec715336..e33fe10b 100644 --- a/docs/performance-summary.md +++ b/docs/performance-summary.md @@ -69,7 +69,7 @@ The following table summarizes the performance leveraging the NeMo Automodel bac | Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | DP | TP | SP | PP | CP | VP | EP | Model TFLOP / sec / GPU | |-------|--------|-----|-----|-----------------|------|----|----|----|----|----|----|----|-------------------------| -|Wan 2.1 14B|8|8|1|37440|8|1|1|1|1|1|0|0|175.88| -|Wan 2.1 14B|64|64|1|37440|64|1|1|1|1|1|0|0|228.85| +|Wan 2.1 14B|8|8|1|37440|1|8|1|1|1|1|1|0|0|175.88| +|Wan 2.1 14B|64|64|1|37440|1|64|1|1|1|1|1|0|0|228.85| From df982bebca72f7302c37aeae2879a6f5d21be862 Mon Sep 17 00:00:00 2001 From: Alexandros Koumparoulis 
<153118171+akoumpa@users.noreply.github.com> Date: Wed, 26 Nov 2025 15:18:06 -0800 Subject: [PATCH 22/28] Update performance-summary.md --- docs/performance-summary.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/performance-summary.md b/docs/performance-summary.md index e33fe10b..5a56d319 100644 --- a/docs/performance-summary.md +++ b/docs/performance-summary.md @@ -69,7 +69,7 @@ The following table summarizes the performance leveraging the NeMo Automodel bac | Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | DP | TP | SP | PP | CP | VP | EP | Model TFLOP / sec / GPU | |-------|--------|-----|-----|-----------------|------|----|----|----|----|----|----|----|-------------------------| -|Wan 2.1 14B|8|8|1|37440|1|8|1|1|1|1|1|0|0|175.88| -|Wan 2.1 14B|64|64|1|37440|1|64|1|1|1|1|1|0|0|228.85| +|Wan 2.1 14B|8|8|1|37440|1|8|1|1|1|1|0|0|175.88| +|Wan 2.1 14B|64|64|1|37440|1|64|1|1|1|1|0|0|228.85| From 796103efbde7687cdfd39368f25d4a89acc7707d Mon Sep 17 00:00:00 2001 From: Abhinav Garg Date: Mon, 1 Dec 2025 04:31:35 -0800 Subject: [PATCH 23/28] Update README.md Co-authored-by: Wenwen Gao <94138584+snowmanwwg@users.noreply.github.com> --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 0a4f8558..1b05f059 100644 --- a/README.md +++ b/README.md @@ -17,7 +17,7 @@ ## Overview -NeMo DFM (Diffusion Foundation Models) is a comprehensive collection of diffusion models for **Video**, **Image**, and **Text** generation. It unifies cutting-edge diffusion-based architectures and training techniques, prioritizing efficiency and performance from research prototyping to production deployment. +NeMo DFM (Diffusion Foundation Models) is a library under [NeMo Framework](https://github.com/NVIDIA-NeMo), focusing on diffusion models for **Video**, **Image**, and **Text** generation. 
It unifies cutting-edge diffusion-based architectures and training techniques, prioritizing efficiency and performance from research prototyping to production deployment. **Dual-Path Architecture**: DFM provides two complementary training paths to maximize flexibility: From 9ea61167300a6ccf783e480c13507a1df5e192f1 Mon Sep 17 00:00:00 2001 From: Abhinav Garg Date: Mon, 1 Dec 2025 04:31:45 -0800 Subject: [PATCH 24/28] Update README.md Co-authored-by: Wenwen Gao <94138584+snowmanwwg@users.noreply.github.com> --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 1b05f059..f26fb642 100644 --- a/README.md +++ b/README.md @@ -21,7 +21,7 @@ NeMo DFM (Diffusion Foundation Models) is a library under [NeMo Framework](https **Dual-Path Architecture**: DFM provides two complementary training paths to maximize flexibility: -- **🌉 Megatron Bridge Path**: Built on [Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) which leverages [Megatron Core](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core) for maximum scalability with 6D parallelism (TP, PP, CP, EP, VPP, DP) +- **🌉 Megatron Bridge Path**: Built on [NeMo Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) which leverages [Megatron Core](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core) for maximum scalability with n-D parallelism (TP, PP, CP, EP, VPP, DP) - **🚀 AutoModel Path**: Built on [NeMo AutoModel](https://github.com/NVIDIA-NeMo/Automodel) for PyTorch DTensor-native SPMD training with seamless 🤗 Hugging Face integration Choose the path that best fits your workflow—or use both for different stages of development!
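The n-D parallelism sizes listed above (TP, PP, CP, DP, ...) compose multiplicatively: their product must equal the total GPU count, so the data-parallel size is whatever factor remains once the model-parallel dimensions are fixed. A small illustrative sanity check, not code from this repository:

```python
def data_parallel_size(world_size: int, tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """With n-D parallelism, world_size = TP * PP * CP * DP."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel:
        raise ValueError("world size must be divisible by TP * PP * CP")
    return world_size // model_parallel


# E.g., a 128-GPU run with TP=2 and CP=4 leaves a 16-way data-parallel group
print(data_parallel_size(128, tp=2, cp=4))  # -> 16
```

This matches the benchmark rows earlier in the series: for example, 32 GPUs with TP=1 and CP=4 imply an 8-way data-parallel group.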
From ebf00bfa726cad5931e56af2869856ee8bfdb8da Mon Sep 17 00:00:00 2001 From: Abhinav Garg Date: Mon, 1 Dec 2025 04:31:58 -0800 Subject: [PATCH 25/28] Update README.md Co-authored-by: Wenwen Gao <94138584+snowmanwwg@users.noreply.github.com> --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index f26fb642..71a40f61 100644 --- a/README.md +++ b/README.md @@ -22,7 +22,7 @@ NeMo DFM (Diffusion Foundation Models) is a library under [NeMo Framework](https **Dual-Path Architecture**: DFM provides two complementary training paths to maximize flexibility: - **🌉 Megatron Bridge Path**: Built on [NeMo Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) which leverages [Megatron Core](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core) for maximum scalability with n-D parallelism (TP, PP, CP, EP, VPP, DP) -- **🚀 AutoModel Path**: Built on [NeMo AutoModel](https://github.com/NVIDIA-NeMo/Automodel) for PyTorch DTensor-native SPMD training with seamless 🤗 Hugging Face integration +- **🚀 AutoModel Path**: Built on [NeMo AutoModel](https://github.com/NVIDIA-NeMo/Automodel) for PyTorch DTensor-native SPMD training, enabling easy experimentation and Day-0 support for 🤗 Hugging Face models. Choose the path that best fits your workflow—or use both for different stages of development! From 7083f8630f82cee195d6556f33881cac0eef7f52 Mon Sep 17 00:00:00 2001 From: Abhinav Garg Date: Mon, 1 Dec 2025 04:32:24 -0800 Subject: [PATCH 26/28] Update README.md Co-authored-by: Wenwen Gao <94138584+snowmanwwg@users.noreply.github.com> --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 71a40f61..514980a9 100644 --- a/README.md +++ b/README.md @@ -196,7 +196,7 @@ For questions or discussions, please open an issue on GitHub.
NeMo DFM builds upon the excellent work of: -- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) - Advanced model parallelism +- [Megatron-core](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core) - Advanced model parallelism - [Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) - HuggingFace ↔ Megatron bridge - [NeMo AutoModel](https://github.com/NVIDIA-NeMo/Automodel) - PyTorch-native SPMD training - [PyTorch Distributed](https://pytorch.org/docs/stable/distributed.html) - Foundation for distributed training From f6f3a303ed3e58db9958aaf12f9e4890fb3d606a Mon Sep 17 00:00:00 2001 From: Abhinav Garg Date: Mon, 1 Dec 2025 12:53:23 +0000 Subject: [PATCH 27/28] Refactor README.md and performance-summary.md for clarity and conciseness - Simplified descriptions of Megatron Bridge and AutoModel paths in README.md. - Removed outdated comparison table to streamline content. - Updated performance-summary.md to generalize model references and improve clarity. Co-authored-by: Wenwen Gao <94138584+snowmanwwg@users.noreply.github.com> --- README.md | 30 +----------------------------- docs/performance-summary.md | 6 +++--- 2 files changed, 4 insertions(+), 32 deletions(-) diff --git a/README.md b/README.md index 514980a9..a04e5aa3 100644 --- a/README.md +++ b/README.md @@ -99,18 +99,7 @@ uv run --group automodel torchrun --nproc-per-node=8 \ ### Dual Training Paths -- **Megatron Bridge Path** - - State-of-the-art performance optimizations (TFLOPs) - - 🎯 Advanced parallelism: Tensor (TP), Context (CP), Data (DP), etc. - - 📈 Near-linear scalability to thousands of nodes - - 🔧 Production-ready recipes with optimized hyperparameters - -- **AutoModel Path** - - 🌐 PyTorch DTensor-native SPMD training - - 🚀 Advanced parallelisms (TP, PP, etc.) coming soon!
- 🔀 FSDP2-based Hybrid Sharding Data Parallelism (HSDP) - 📦 Sequence packing for efficient training - 🎨 Minimal ceremony with YAML-driven configs +**Megatron Bridge** delivers maximum throughput and scalability with near-linear performance to thousands of nodes. **AutoModel** provides an easy on-ramp for experimentation and research with PyTorch-native SPMD training. ### Shared Capabilities @@ -164,23 +153,6 @@ DFM/ ├── examples/ # Example scripts and configs ``` -## 🎯 Choosing Your Path - -| Feature | Megatron Bridge | AutoModel | -|---------|-----------------|-----------| -| **Best For** | Maximum scale (1000+ GPUs) | Flexibility & fast iteration | -| **Parallelism** | 6D (TP, CP, DP, etc.) | FSDP2; (TP, SP, CP available soon) | -| **HF Integration** | Via bridge/conversion | HF-native (via DTensor) | -| **Checkpoint Format** | Megatron + HF export | HF-native (SafeTensors with DCP) | -| **Learning Curve** | Steeper (more knobs) | Gentler (YAML-driven) | -| **Performance** | Highest at scale | Excellent, pytorch-native | - -**Recommendation**: -- Start with **AutoModel** for quick prototyping and HF model compatibility -- Move to **Megatron Bridge** when scaling to 100+ GPUs or need advanced parallelism -- Use **both**: prototype with AutoModel, scale with Megatron Bridge! - - ## 🤝 Contributing We welcome contributions! Please see our Contributing Guide for details on: diff --git a/docs/performance-summary.md b/docs/performance-summary.md index 5a56d319..3876e485 100644 --- a/docs/performance-summary.md +++ b/docs/performance-summary.md @@ -2,7 +2,7 @@ As part of the NVIDIA NeMo Framework, DFM provides the most recent training techniques for training advanced generative AI models, such as model parallelization, optimized attention mechanisms, and more, to achieve high training throughput.
-This page provides the current performance benchmarks for large language models using DFM across different GPU systems and configurations as we continue to optimize the model for optimal performance. Please refer to `examples/megatron/recipes/wan/conf` for updated YAML configurations. +This page provides the current performance benchmarks for models using DFM across different GPU systems and configurations as we continue to optimize for peak performance. Please refer to `examples/megatron/recipes/wan/conf` for updated YAML configurations. ## Nomenclature @@ -29,9 +29,9 @@ Performance is measured using: :depth: 2 ``` -## Performance Summary for Large Language Models +## Performance Summary for Models -Below are performance benchmarks for various large language models organized by release version. +Below are performance benchmarks for various models using the DFM framework. The performance data includes: From f86c51e24cac37788a1d5f936397e99e8253be1f Mon Sep 17 00:00:00 2001 From: Abhinav Garg Date: Mon, 1 Dec 2025 17:56:23 +0000 Subject: [PATCH 28/28] Fix typo in README.md: changed "Built" to "Build" in the container section header for consistency. --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index a04e5aa3..bc6efef9 100644 --- a/README.md +++ b/README.md @@ -29,7 +29,7 @@ Choose the path that best fits your workflow—or use both for different stages ## 🔧 Installation -### 🐳 Built your own Container +### 🐳 Build your own Container #### 1. Build the container ```bash