From 8f95999aa0c75c39211bbd75ccc0887070d05915 Mon Sep 17 00:00:00 2001 From: Huy Vu2 Date: Thu, 13 Nov 2025 08:44:26 -0800 Subject: [PATCH 1/4] add focs --- examples/megatron/recipes/wan/README.md | 102 ++++++++++++++++++++++++ 1 file changed, 102 insertions(+) create mode 100644 examples/megatron/recipes/wan/README.md diff --git a/examples/megatron/recipes/wan/README.md b/examples/megatron/recipes/wan/README.md new file mode 100644 index 00000000..7bfcdbc4 --- /dev/null +++ b/examples/megatron/recipes/wan/README.md @@ -0,0 +1,102 @@ +## Megatron WAN 2.1 + +### Overview +WAN 2.1 is an open, large-scale video generative model series focused on high-quality text-to-video and text-to-image generation. This recipe re-implements WAN using [Megatron-Core](https://github.com/NVIDIA/Megatron-LM) to improve training efficiency and scalability via advanced parallelism schemes and throughput optimizations, including data/tensor/sequence/context parallelism and fused kernels (e.g., NVTE fused attention). + + +### Dataset Preparation +- This recipe uses NVIDIA's [Megatron-Energon](https://github.com/NVIDIA/Megatron-Energon) as an efficient multi-modal data loader. +- Datasets should be in the WebDataset-compatible format (typically sharded `.tar` archives). Energon efficiently supports large-scale distributed loading, sharding, and sampling for multi-modal pairs (e.g., text-image, text-video). +- Point `dataset.path` to your WebDataset location or shard pattern (e.g., a directory containing shards). See the Megatron-Energon documentation for format details and advanced options. + + +### Training and Finetuning +- Use `--training-mode` to select the correct flow-matching hyper-parameters: + - `pretrain`: default pretraining configuration + - `finetune`: finetuning configuration (uses different flow-matching hyper-parameters) + +Set environment variables like `EXP_NAME` and `CHECKPOINT_DIR` as desired before running. + +#### Example: Pretrain WAN 1.3B +```bash +NVTE_FUSED_ATTN=1 torchrun --nproc_per_node=8 examples/megatron/recipes/wan/pretrain_wan.py \ + --training-mode pretrain \ + model.tensor_model_parallel_size=1 \ + model.pipeline_model_parallel_size=1 \ + model.context_parallel_size=4 \ + model.crossattn_emb_size=1536 \ + model.hidden_size=1536 \ + model.ffn_hidden_size=8960 \ + model.num_attention_heads=12 \ + model.num_layers=30 \ + model.qkv_format=thd \ + dataset.path=/path/to/dataset \ + checkpoint.save=/path/to/checkpoint_dir \ + checkpoint.load=/path/to/checkpoint_dir \ + checkpoint.load_optim=true \ + checkpoint.save_interval=200 \ + optimizer.lr=5e-6 \ + optimizer.min_lr=5e-6 \ + train.eval_iters=0 \ + scheduler.lr_decay_style=constant \ + scheduler.lr_warmup_iters=0 \ + model.seq_length=2048 \ + dataset.seq_length=2048 \ + train.global_batch_size=2 \ + train.micro_batch_size=1 \ + dataset.global_batch_size=2 \ + dataset.micro_batch_size=1 \ + logger.log_interval=1 \ + logger.wandb_project="wan" \ + logger.wandb_exp_name="${EXP_NAME}" \ + logger.wandb_save_dir="${CHECKPOINT_DIR}" +``` + +#### Finetuning +- Switch `--training-mode finetune` to enable the finetuning flow-matching setup. Adjust dataset and optimization parameters (learning rate, warmup steps, etc.) as needed for your task and hardware. 
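For example, a finetuning launch might look like the sketch below. This is illustrative only: it reuses the entry point and override names from the pretraining example above, and the dataset/checkpoint paths and learning-rate values are placeholders to adjust for your task.

```bash
# Illustrative finetuning launch (assumption: keep the same model.* size
# overrides as in the pretraining example above; omitted here for brevity).
NVTE_FUSED_ATTN=1 torchrun --nproc_per_node=8 examples/megatron/recipes/wan/pretrain_wan.py \
  --training-mode finetune \
  model.context_parallel_size=4 \
  model.qkv_format=thd \
  dataset.path=/path/to/finetune_dataset \
  checkpoint.load=/path/to/pretrained_checkpoint_dir \
  checkpoint.save=/path/to/finetune_checkpoint_dir \
  optimizer.lr=1e-6 \
  optimizer.min_lr=1e-6 \
  train.global_batch_size=2 \
  train.micro_batch_size=1 \
  dataset.global_batch_size=2 \
  dataset.micro_batch_size=1
```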
+ +### Inference +```bash +NVTE_FUSED_ATTN=1 torchrun --nproc_per_node=1 examples/megatron/recipes/wan/inference_wan.py \ + --task t2v-1.3B \ + --sizes 480*832 \ + --checkpoint_dir /path/to/checkpoint \ + --checkpoint_step 0 \ + --frame_nums 81 \ + --prompts "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \ + --tensor_parallel_size 1 \ + --context_parallel_size 1 \ + --pipeline_parallel_size 1 \ + --sequence_parallel False \ + --base_seed 42 \ + --sample_steps 50 +``` + +### Parallelism Support +The table below shows current parallelisms support for corresponding Wan model size. + + | Model | Data Parallel | Tensor Parallel | Sequence Parallel | Pipeline Parallel | Context Parallel | FSDP | + |---|---|---|---|---|---|---| + | **1.3B** | ✅ | ✅ | ✅ | | ✅ | | + | **14B** | ✅ | ✅ | ✅ | | ✅ | | + + +### Performance +The table below shows performances of corresponding Wan model size on a variety of Nvidia hardware (measured by TFLOPs/GPU). + + | Model | H100 | GB200 | GB300 | + |---|---|---|---| + | **1.3B** | | | | + | **14B** | 308 | 790 | 1000 | + + +### Citation +```bibtex +@article{wan2.1, + title = {Wan: Open and Advanced Large‐Scale Video Generative Models}, + author = {Wan Team}, + year = {2025}, + note = {Open­source video foundation model series (Wan 2.1), https://github.com/Wan-Video/Wan2.1/} +} +``` + From 9c1f1725a10fa412787b1a6b89078bdaa92d11bb Mon Sep 17 00:00:00 2001 From: Huy Vu2 Date: Fri, 21 Nov 2025 07:04:59 -0800 Subject: [PATCH 2/4] updated README for Wab --- dfm/src/megatron/model/wan/wan_provider.py | 2 +- examples/megatron/recipes/wan/README.md | 201 ++++++++++++------ .../megatron/recipes/wan/conf/wan_14B.yaml | 42 ++++ .../megatron/recipes/wan/conf/wan_1_3B.yaml | 38 ++++ 4 files changed, 219 insertions(+), 64 deletions(-) create mode 100644 examples/megatron/recipes/wan/conf/wan_14B.yaml create mode 100644 examples/megatron/recipes/wan/conf/wan_1_3B.yaml diff --git a/dfm/src/megatron/model/wan/wan_provider.py b/dfm/src/megatron/model/wan/wan_provider.py index 24e8c87d..2d5267b4 100644 --- a/dfm/src/megatron/model/wan/wan_provider.py +++ b/dfm/src/megatron/model/wan/wan_provider.py @@ -50,7 +50,7 @@ class WanModelProvider(TransformerConfig, ModelProviderMixin[VisionModule]): parallel_output: bool = True bf16: bool = False params_dtype: torch.dtype = torch.float32 - qkv_format: str = "sbhd" # "thd". NOTE: if we use context parallelism, we need to use "thd" + qkv_format: str = "thd" # "sbhd". NOTE: if we use context parallelism, we need to use "thd" # these attributes are unused for images/videos, we just set because bridge training requires for LLMs seq_length: int = 1024 share_embeddings_and_output_weights: bool = False diff --git a/examples/megatron/recipes/wan/README.md b/examples/megatron/recipes/wan/README.md index 7bfcdbc4..29d32675 100644 --- a/examples/megatron/recipes/wan/README.md +++ b/examples/megatron/recipes/wan/README.md @@ -1,93 +1,168 @@ -## Megatron WAN 2.1 +## 🚀 Megatron WAN -### Overview -WAN 2.1 is an open, large-scale video generative model series focused on high-quality text-to-video and text-to-image generation. This recipe re-implements WAN using [Megatron-Core](https://github.com/NVIDIA/Megatron-LM) to improve training efficiency and scalability via advanced parallelism schemes and throughput optimizations, including data/tensor/sequence/context parallelism and fused kernels (e.g., NVTE fused attention). 
+### 📋 Overview +An open-source implementation of [WAN 2.1](https://github.com/Wan-Video/Wan2.1) (large-scale text-to-video/image generative models) built on top of [Megatron-Core](https://github.com/NVIDIA/Megatron-LM) and [Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge)for scalable and efficient training. It supports advanced parallelism strategies (data, tensor, sequence, and context parallelism) and optimized kernels (e.g., Transformer Engine fused attention). +--- -### Dataset Preparation -- This recipe uses NVIDIA's [Megatron-Energon](https://github.com/NVIDIA/Megatron-Energon) as an efficient multi-modal data loader. -- Datasets should be in the WebDataset-compatible format (typically sharded `.tar` archives). Energon efficiently supports large-scale distributed loading, sharding, and sampling for multi-modal pairs (e.g., text-image, text-video). -- Point `dataset.path` to your WebDataset location or shard pattern (e.g., a directory containing shards). See the Megatron-Energon documentation for format details and advanced options. +### 📦 Dataset Preparation +This recipe uses NVIDIA's Megatron-Energon as an efficient multi-modal data loader. Datasets should be in the WebDataset-compatible format (typically sharded `.tar` archives). Energon supports large-scale distributed loading, sharding, and sampling for video-text and image-text pairs. +- Set `dataset.path` to your WebDataset directory or shard pattern. +- See Megatron-Energon docs for format details, subflavors, and advanced options. -### Training and Finetuning -- Use `--training-mode` to select the correct flow-matching hyper-parameters: - - `pretrain`: default pretraining configuration - - `finetune`: finetuning configuration (uses different flow-matching hyper-parameters) +If you do not have a dataset yet or only need to validate performance/plumbing, see the "Quick Start with Mock Dataset" section below. -Set environment variables like `EXP_NAME` and `CHECKPOINT_DIR` as desired before running. +--- + +#### 🗂️ Dataset Preparation Example +Starting with a directory containing raw .mp4 videos and their corresponding metadata .json files containing captions, we’ll turn the data into WAN-ready WebDataset shards using our helper script, and then ask Energon to process those shards and create its metadata. After this, you can point `dataset.path` at the output folder and start training. + +```bash +# 1) Define your input (raw videos) and output (WebDataset shards) folders. 
For example:
+DATASET_SRC=/opt/raw_videos        # contains .mp4 and per-video .jsonl captions
+DATASET_PATH=/opt/wan_webdataset   # output WebDataset shards
+
+# 2) (Optional) If your WAN models require auth on first download
+export HF_TOKEN=
+
+# 3) Create WAN shards with latents + text embeddings
+# Wan's VAE encoder and T5 encoder are used to extract videos' latents and caption embeddings
+# --height/--width: arguments control resize target (832x480 is one supported option for 1.3B and 14B model)
+# --center-crop: arguments for center crop to exact target size after resize
+uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node $num_gpus \
+  examples/megatron/recipes/wan/prepare_energon_dataset_wan.py \
+  --video_folder "${DATASET_SRC}" \
+  --output_dir "${DATASET_PATH}" \
+  --model "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"" \
+  --height 480 --width 832 \
+  --center-crop
+
+# 4) Ask Energon to process shards and create its metadata/spec
+energon prepare "${DATASET_PATH}"
+# In the interactive prompts:
+# - Enter a train/val/test split, e.g., "8,1,1"
+# - When asked for the sample type, choose: "Crude sample (plain dict for cooking)"
+```
+
+What gets produced:
+- Each shard contains:
+  - pth: contains WAN video latents
+  - pickle: contains text embeddings
+  - json: contains useful side-info (text caption, sizes, processing choices, etc.)
+- Energon writes a `.nv-meta` directory with dataset info and a `dataset.yaml` you can version-control.
+
+Next steps:
+- Point your WAN config (or CLI overrides) at `dataset.path=${DATASET_PATH}`
+- You're ready to launch pretraining
+
+---
+
+### 🐳 Build Container
+
+Please follow the instructions in the container section of the main README:
+
+- DFM container guide: https://github.com/NVIDIA-NeMo/DFM#-built-your-own-container
+
+---
+
+### 🏋️ Pretraining
+
+This recipe leverages sequence packing to maximize throughput. When a batch contains videos with different shapes or resolutions, naive batching and padding requires a significant number of padded tokens due to the inherent size differences between videos. Sequence packing instead stacks multiple samples (with different resolutions) into a single sequence, so no computation is wasted on padded tokens. When using sequence packing:
+- Set `train.micro_batch_size=1` and `dataset.micro_batch_size=1`
+- Ensure `model.qkv_format=thd` (required with context parallelism and recommended with sequence packing)
+
+Multiple parallelism techniques including tensor, sequence, and context parallelism are supported and configurable per your hardware.
+
+WAN training is driven by `examples/megatron/recipes/wan/pretrain_wan.py`, which supports both a YAML config file and CLI overrides.
+
+The script exposes a `--training-mode` with `pretrain` and `finetune` presets for flow-matching hyperparameters. As a starting point for experiments, these presets specify that pretraining uses noisier, biased sampling (e.g., logit-normal, higher logit_std, lower flow_shift) for stability and broad learning, while finetuning uses uniform, lower-noise settings (e.g., uniform sampling, lower logit_std, higher flow_shift) to refine details and improve quality.
+
+Notes:
+- If you use `logger.wandb_project` and `logger.wandb_exp_name`, export `WANDB_API_KEY`.
+- Checkpointing is controlled via the `checkpoint.*` section. Use the same path for `save` and `load` to resume training.
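To make the resume behavior concrete, here is a minimal sketch (the paths, the `my_wan.yaml` override file introduced in the next section, and the save interval are placeholders): pointing `checkpoint.save` and `checkpoint.load` at the same directory lets a relaunched job continue from the most recent checkpoint.

```bash
# Sketch: write checkpoints every 200 iterations and read them back from the
# same directory, so re-running this exact command resumes training.
uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node $num_gpus \
  examples/megatron/recipes/wan/pretrain_wan.py \
  --training-mode pretrain \
  --config-file examples/megatron/recipes/wan/conf/my_wan.yaml \
  checkpoint.save=/opt/pretrained_checkpoints \
  checkpoint.load=/opt/pretrained_checkpoints \
  checkpoint.save_interval=200
```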
+ +#### Pretraining script example + +We provide example scripts for running 1.3B and 14B model sizes on mock dataset (see `wan_1_3B.yaml` and `wan_14B.yaml` under `examples/megatron/recipes/wan/conf`). From these starting points, users can set their own configuration by copy one of the example override configs and update it with your settings (e.g., with actual processed data, and specific configurations based on available hardware): -#### Example: Pretrain WAN 1.3B ```bash -NVTE_FUSED_ATTN=1 torchrun --nproc_per_node=8 examples/megatron/recipes/wan/pretrain_wan.py \ +cp examples/megatron/recipes/wan/conf/wan_1_3B.yaml examples/megatron/recipes/wan/conf/my_wan.yaml +# Edit my_wan.yaml to set: +# - dataset.path: Path to your WebDataset directory +# - train.global_batch_size/micro_batch_size: Keep micro_batch_size=1 +# - model.tensor_model_parallel_size / model.context_parallel_size: Based on GPUs +# - checkpoint.save and checkpoint.load: Checkpoint directory +``` + +Then run: + +```bash +NVTE_FUSED_ATTN=1 uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node $num_gpus \ + examples/megatron/recipes/wan/pretrain_wan.py \ --training-mode pretrain \ - model.tensor_model_parallel_size=1 \ - model.pipeline_model_parallel_size=1 \ - model.context_parallel_size=4 \ - model.crossattn_emb_size=1536 \ - model.hidden_size=1536 \ - model.ffn_hidden_size=8960 \ - model.num_attention_heads=12 \ - model.num_layers=30 \ - model.qkv_format=thd \ - dataset.path=/path/to/dataset \ - checkpoint.save=/path/to/checkpoint_dir \ - checkpoint.load=/path/to/checkpoint_dir \ - checkpoint.load_optim=true \ - checkpoint.save_interval=200 \ - optimizer.lr=5e-6 \ - optimizer.min_lr=5e-6 \ - train.eval_iters=0 \ - scheduler.lr_decay_style=constant \ - scheduler.lr_warmup_iters=0 \ - model.seq_length=2048 \ - dataset.seq_length=2048 \ - train.global_batch_size=2 \ + --config-file examples/megatron/recipes/wan/conf/my_wan.yaml +``` + +You can also override any config values from the command line. For example: + +```bash +NVTE_FUSED_ATTN=1 uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node $num_gpus \ + examples/megatron/recipes/wan/pretrain_wan.py \ + --config-file examples/megatron/recipes/wan/conf/my_wan.yaml \ + --training-mode pretrain \ + dataset.path=/opt/wan_webdataset \ + train.global_batch_size=8 \ train.micro_batch_size=1 \ - dataset.global_batch_size=2 \ - dataset.micro_batch_size=1 \ - logger.log_interval=1 \ - logger.wandb_project="wan" \ - logger.wandb_exp_name="${EXP_NAME}" \ - logger.wandb_save_dir="${CHECKPOINT_DIR}" + model.tensor_model_parallel_size=2 \ + model.context_parallel_size=4 \ + checkpoint.save=/opt/pretrained_checkpoint \ + checkpoint.load=/opt/pretrained_checkpoint ``` -#### Finetuning -- Switch `--training-mode finetune` to enable the finetuning flow-matching setup. Adjust dataset and optimization parameters (learning rate, warmup steps, etc.) as needed for your task and hardware. 
+#### 🧪 Quick Start with Mock Dataset +If you want to run without a real dataset (for debugging or performance measurement), pass `--mock`: + +```bash +NVTE_FUSED_ATTN=1 uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node $num_gpus \ + examples/megatron/recipes/wan/pretrain_wan.py \ + --config-file examples/megatron/recipes/wan/conf/wan_1_3B.yaml \ + --training-mode pretrain \ + --mock +``` + +You may adjust mock shapes (`F_latents`, `H_latents`, `W_latents`) and packing behavior (`number_packed_samples`) in `WanMockDataModuleConfig` (see `dfm/src/megatron/recipes/wan/wan.py`) to simulate different data scenarios. + +--- + +### 🎬 Inference + +After training, users can run inferencing with `examples/megatron/recipes/wan/inference_wan.py`. Set `--checkpoint_step` to use specific checkpoint for inference. Set `--sizes` and `--frame_nums` to specify video shape (frames, height, width). Set `--sample_steps` (default to 50) for number of noise diffusion steps. -### Inference ```bash -NVTE_FUSED_ATTN=1 torchrun --nproc_per_node=1 examples/megatron/recipes/wan/inference_wan.py \ +NVTE_FUSED_ATTN=1 uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node 1 \ + examples/megatron/recipes/wan/inference_wan.py \ --task t2v-1.3B \ --sizes 480*832 \ --checkpoint_dir /path/to/checkpoint \ --checkpoint_step 0 \ --frame_nums 81 \ --prompts "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \ - --tensor_parallel_size 1 \ - --context_parallel_size 1 \ - --pipeline_parallel_size 1 \ - --sequence_parallel False \ - --base_seed 42 \ --sample_steps 50 ``` -### Parallelism Support -The table below shows current parallelisms support for corresponding Wan model size. +Note: Current inference path is single-GPU. Parallel inference is not yet supported. - | Model | Data Parallel | Tensor Parallel | Sequence Parallel | Pipeline Parallel | Context Parallel | FSDP | - |---|---|---|---|---|---|---| - | **1.3B** | ✅ | ✅ | ✅ | | ✅ | | - | **14B** | ✅ | ✅ | ✅ | | ✅ | | +--- +### ⚡ Parallelism Support -### Performance -The table below shows performances of corresponding Wan model size on a variety of Nvidia hardware (measured by TFLOPs/GPU). +The table below shows current parallelism support for different model sizes: - | Model | H100 | GB200 | GB300 | - |---|---|---|---| - | **1.3B** | | | | - | **14B** | 308 | 790 | 1000 | +| Model | Data Parallel | Tensor Parallel | Sequence Parallel | Context Parallel | +|---|---|---|---|---| +| 1.3B | ✅ | ✅ | ✅ | ✅ | +| 14B | ✅ | ✅ | ✅ | ✅ | ### Citation diff --git a/examples/megatron/recipes/wan/conf/wan_14B.yaml b/examples/megatron/recipes/wan/conf/wan_14B.yaml new file mode 100644 index 00000000..0a1a0149 --- /dev/null +++ b/examples/megatron/recipes/wan/conf/wan_14B.yaml @@ -0,0 +1,42 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +# Example override file + +# To override a parameter, ensure the structure matches the ConfigContainer +# and its sub-configurations (e.g., model, train, etc.) +# Top-level ConfigContainer fields are dataclasses themselves + +model: + + crossattn_emb_size: 5120 + hidden_size: 5120 + ffn_hidden_size: 13824 + num_attention_heads: 40 + num_layers: 40 + tensor_model_parallel_size: 2 + pipeline_model_parallel_size: 1 + context_parallel_size: 4 + sequence_parallel: true + recompute_granularity: full + recompute_method: uniform + recompute_num_layers: 1 + +train: + global_batch_size: 1 + micro_batch_size: 1 + +dataset: + global_batch_size: 1 + micro_batch_size: 1 diff --git a/examples/megatron/recipes/wan/conf/wan_1_3B.yaml b/examples/megatron/recipes/wan/conf/wan_1_3B.yaml new file mode 100644 index 00000000..89a15d4b --- /dev/null +++ b/examples/megatron/recipes/wan/conf/wan_1_3B.yaml @@ -0,0 +1,38 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Example override file + +# To override a parameter, ensure the structure matches the ConfigContainer +# and its sub-configurations (e.g., model, train, etc.) +# Top-level ConfigContainer fields are dataclasses themselves + +model: + crossattn_emb_size: 1536 + hidden_size: 1536 + ffn_hidden_size: 8960 + num_attention_heads: 12 + num_layers: 30 + tensor_model_parallel_size: 1 + pipeline_model_parallel_size: 1 + context_parallel_size: 8 + sequence_parallel: false + +train: + global_batch_size: 2 + micro_batch_size: 1 + +dataset: + global_batch_size: 2 + micro_batch_size: 1 From 62457152060428fa4d36692e9678da70aac959ff Mon Sep 17 00:00:00 2001 From: Huy Vu2 Date: Fri, 21 Nov 2025 07:44:20 -0800 Subject: [PATCH 3/4] update README wan --- examples/megatron/recipes/wan/README.md | 74 ++++++++++--------------- 1 file changed, 30 insertions(+), 44 deletions(-) diff --git a/examples/megatron/recipes/wan/README.md b/examples/megatron/recipes/wan/README.md index 29d32675..45b4b95b 100644 --- a/examples/megatron/recipes/wan/README.md +++ b/examples/megatron/recipes/wan/README.md @@ -1,22 +1,19 @@ ## 🚀 Megatron WAN ### 📋 Overview -An open-source implementation of [WAN 2.1](https://github.com/Wan-Video/Wan2.1) (large-scale text-to-video/image generative models) built on top of [Megatron-Core](https://github.com/NVIDIA/Megatron-LM) and [Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge)for scalable and efficient training. It supports advanced parallelism strategies (data, tensor, sequence, and context parallelism) and optimized kernels (e.g., Transformer Engine fused attention). +An open-source implementation of [WAN 2.1](https://github.com/Wan-Video/Wan2.1) (large-scale text-to-video/image generative models) built on top of [Megatron-Core](https://github.com/NVIDIA/Megatron-LM) and [Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) for scalable and efficient training. 
It supports advanced parallelism strategies (data, tensor, sequence, and context parallelism) and optimized kernels (e.g., Transformer Engine fused attention).
 
 ---
 
 ### 📦 Dataset Preparation
-This recipe uses NVIDIA's Megatron-Energon as an efficient multi-modal data loader. Datasets should be in the WebDataset-compatible format (typically sharded `.tar` archives). Energon supports large-scale distributed loading, sharding, and sampling for video-text and image-text pairs.
-
-- Set `dataset.path` to your WebDataset directory or shard pattern.
-- See Megatron-Energon docs for format details, subflavors, and advanced options.
+This recipe uses NVIDIA's [Megatron-Energon](https://github.com/NVIDIA/Megatron-Energon) as an efficient multi-modal data loader. Datasets should be in the WebDataset-compatible format (typically sharded `.tar` archives). Energon supports large-scale distributed loading, sharding, and sampling for video-text and image-text pairs. Set `dataset.path` to your WebDataset directory or shard pattern. See the Megatron-Energon docs for format details, subflavors, and advanced options.
 
 If you do not have a dataset yet or only need to validate performance/plumbing, see the "Quick Start with Mock Dataset" section below.
 
 ---
 
 #### 🗂️ Dataset Preparation Example
-Starting with a directory containing raw .mp4 videos and their corresponding metadata .json files containing captions, we'll turn the data into WAN-ready WebDataset shards using our helper script, and then ask Energon to process those shards and create its metadata. After this, you can point `dataset.path` at the output folder and start training.
+Starting with a directory of raw .mp4 videos and their corresponding .json metadata files with captions, you can turn the data into WAN-ready WebDataset shards using our helper script. We then use Energon to process those shards and create its metadata. After this, you can set the training script's `dataset.path` argument to the processed output folder and start training.
 
 ```bash
 # 1) Define your input (raw videos) and output (WebDataset shards) folders. For example:
@@ -27,18 +24,18 @@ DATASET_PATH=/opt/wan_webdataset   # output WebDataset shards
 export HF_TOKEN=
 
 # 3) Create WAN shards with latents + text embeddings
-# Wan's VAE encoder and T5 encoder are used to extract videos' latents and caption embeddings
-# --height/--width: arguments control resize target (832x480 is one supported option for 1.3B and 14B model)
-# --center-crop: arguments for center crop to exact target size after resize
-uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node $num_gpus \
+# Wan's VAE encoder and T5 encoder are used to extract videos' latents and caption embeddings offline before training, using the following core arguments:
+# --height/--width: control resize target (832x480 is supported for both the 1.3B and 14B models)
+# --center-crop: run center crop to exact target size after resize
+uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node 1 \
   examples/megatron/recipes/wan/prepare_energon_dataset_wan.py \
   --video_folder "${DATASET_SRC}" \
   --output_dir "${DATASET_PATH}" \
-  --model "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"" \
+  --model "Wan-AI/Wan2.1-T2V-1.3B-Diffusers" \
   --height 480 --width 832 \
   --center-crop
 
-# 4) Ask Energon to process shards and create its metadata/spec
+# 4) Use Energon to process shards and create its metadata/spec
 energon prepare "${DATASET_PATH}"
 # In the interactive prompts:
 # - Enter a train/val/test split, e.g., "8,1,1"
@@ -52,9 +49,7 @@ What gets produced:
 - json: contains useful side-info (text caption, sizes, processing choices, etc.)
 - Energon writes a `.nv-meta` directory with dataset info and a `dataset.yaml` you can version-control.
 
-Next steps:
-- Point your WAN config (or CLI overrides) at `dataset.path=${DATASET_PATH}`
-- You're ready to launch pretraining
+You're ready to launch training. In the training config (or via CLI overrides), set `dataset.path=${DATASET_PATH}` to point at the processed data output directory.
 
 ---
 
@@ -76,7 +71,14 @@ Multiple parallelism techniques including tensor, sequence, and context parallel
 
 WAN training is driven by `examples/megatron/recipes/wan/pretrain_wan.py`, which supports both a YAML config file and CLI overrides.
 
-The script exposes a `--training-mode` with `pretrain` and `finetune` presets for flow-matching hyperparameters. As a starting point for experiments, these presets specify that pretraining uses noisier, biased sampling (e.g., logit-normal, higher logit_std, lower flow_shift) for stability and broad learning, while finetuning uses uniform, lower-noise settings (e.g., uniform sampling, lower logit_std, higher flow_shift) to refine details and improve quality.
+The script exposes a `--training-mode` flag with `pretrain` and `finetune` presets for flow-matching hyperparameters as a starting point for experiments. These presets specify that pretraining uses noisier, biased sampling (e.g., logit-normal, higher logit_std, lower flow_shift) for stability and broad learning, while finetuning uses uniform, lower-noise settings (e.g., uniform sampling, lower logit_std, higher flow_shift) to refine details and improve quality.
 
-Notes:
-- If you use `logger.wandb_project` and `logger.wandb_exp_name`, export `WANDB_API_KEY`.
-- Checkpointing is controlled via the `checkpoint.*` section. Use the same path for `save` and `load` to resume training.
+**Notes**: If you use `logger.wandb_project` and `logger.wandb_exp_name`, export `WANDB_API_KEY`.
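As a concrete illustration of switching presets, here is a sketch of a finetuning launch with W&B logging enabled. This is illustrative only: the actual flow-matching preset values live in `pretrain_wan.py`, `my_wan.yaml` refers to the override config created in the next subsection, and the key and experiment name are placeholders.

```bash
# Pretraining uses `--training-mode pretrain` (logit-normal timestep sampling,
# higher logit_std, lower flow_shift); switching to the finetuning preset
# (uniform sampling, lower logit_std, higher flow_shift) only changes the flag.
# Export WANDB_API_KEY first if the W&B logger options are enabled.
export WANDB_API_KEY=<your_key>
uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node $num_gpus \
  examples/megatron/recipes/wan/pretrain_wan.py \
  --training-mode finetune \
  --config-file examples/megatron/recipes/wan/conf/my_wan.yaml \
  logger.wandb_project="wan" \
  logger.wandb_exp_name="wan_finetune"
```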
 #### Pretraining script example
 
-We provide example scripts for running 1.3B and 14B model sizes on mock dataset (see `wan_1_3B.yaml` and `wan_14B.yaml` under `examples/megatron/recipes/wan/conf`). From these starting points, users can set their own configuration by copy one of the example override configs and update it with your settings (e.g., with actual processed data, and specific configurations based on available hardware):
+We provide example configs for running the 1.3B and 14B model sizes on a mock dataset (see `wan_1_3B.yaml` and `wan_14B.yaml` under `examples/megatron/recipes/wan/conf`). From these starting points, users can set up their own configuration by copying one of the example override configs and updating it with their settings (e.g., the actual processed data path and parallelism settings that match the available hardware). Users can learn more about the arguments in the [Megatron-Bridge docs](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/docs/megatron-lm-to-megatron-bridge.md).
+
 
 ```bash
 cp examples/megatron/recipes/wan/conf/wan_1_3B.yaml examples/megatron/recipes/wan/conf/my_wan.yaml
@@ -98,7 +92,7 @@ cp examples/megatron/recipes/wan/conf/wan_1_3B.yaml examples/megatron/recipes/wa
 Then run:
 
 ```bash
-NVTE_FUSED_ATTN=1 uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node $num_gpus \
+uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node $num_gpus \
   examples/megatron/recipes/wan/pretrain_wan.py \
   --training-mode pretrain \
   --config-file examples/megatron/recipes/wan/conf/my_wan.yaml
@@ -107,7 +101,7 @@ NVTE_FUSED_ATTN=1 uv run --group megatron-bridge python -m torch.distributed.run
 You can also override any config values from the command line. For example:
 
 ```bash
-NVTE_FUSED_ATTN=1 uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node $num_gpus \
+uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node $num_gpus \
   examples/megatron/recipes/wan/pretrain_wan.py \
   --config-file examples/megatron/recipes/wan/conf/my_wan.yaml \
   --training-mode pretrain \
   dataset.path=/opt/wan_webdataset \
   train.global_batch_size=8 \
@@ -116,15 +110,15 @@ NVTE_FUSED_ATTN=1 uv run --group megatron-bridge python -m torch.distributed.run
   train.micro_batch_size=1 \
   model.tensor_model_parallel_size=2 \
   model.context_parallel_size=4 \
-  checkpoint.save=/opt/pretrained_checkpoint \
-  checkpoint.load=/opt/pretrained_checkpoint
+  checkpoint.save=/opt/pretrained_checkpoints \
+  checkpoint.load=/opt/pretrained_checkpoints
 ```
 
 #### 🧪 Quick Start with Mock Dataset
 If you want to run without a real dataset (for debugging or performance measurement), pass `--mock`:
 
 ```bash
-NVTE_FUSED_ATTN=1 uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node $num_gpus \
+uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node $num_gpus \
   examples/megatron/recipes/wan/pretrain_wan.py \
   --config-file examples/megatron/recipes/wan/conf/wan_1_3B.yaml \
   --training-mode pretrain \
@@ -140,18 +134,18 @@ You may adjust mock shapes (`F_latents`, `H_latents`, `W_latents`) and packing b
 After training, users can run inference with `examples/megatron/recipes/wan/inference_wan.py`. Set `--checkpoint_step` to select a specific checkpoint for inference. Set `--sizes` and `--frame_nums` to specify the video shape (frames, height, width). Set `--sample_steps` (defaults to 50) to control the number of diffusion denoising steps.
```bash -NVTE_FUSED_ATTN=1 uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node 1 \ +uv run --group megatron-bridge python -m torch.distributed.run --nproc-per-node 1 \ examples/megatron/recipes/wan/inference_wan.py \ --task t2v-1.3B \ - --sizes 480*832 \ - --checkpoint_dir /path/to/checkpoint \ - --checkpoint_step 0 \ --frame_nums 81 \ + --sizes 480*832 \ + --checkpoint_dir /opt/pretrained_checkpoints \ + --checkpoint_step 10000 \ --prompts "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \ --sample_steps 50 ``` -Note: Current inference path is single-GPU. Parallel inference is not yet supported. +**Note**: Current inference path is single-GPU. Parallel inference is not yet supported. --- @@ -159,19 +153,11 @@ Note: Current inference path is single-GPU. Parallel inference is not yet suppor The table below shows current parallelism support for different model sizes: -| Model | Data Parallel | Tensor Parallel | Sequence Parallel | Context Parallel | -|---|---|---|---|---| -| 1.3B | ✅ | ✅ | ✅ | ✅ | -| 14B | ✅ | ✅ | ✅ | ✅ | +| Model | Data Parallel | Tensor Parallel | Sequence Parallel | Context Parallel | FSDP | +|---|---|---|---|---|---| +| 1.3B | ✅ | ✅ | ✅ | ✅ |Coming Soon| +| 14B | ✅ | ✅ | ✅ | ✅ |Coming Soon| -### Citation -```bibtex -@article{wan2.1, - title = {Wan: Open and Advanced Large‐Scale Video Generative Models}, - author = {Wan Team}, - year = {2025}, - note = {Open­source video foundation model series (Wan 2.1), https://github.com/Wan-Video/Wan2.1/} -} -``` - +### References +Wan Team. (2025). Wan: Open and advanced large-scale video generative models (Wan 2.1). GitHub. https://github.com/Wan-Video/Wan2.1/ \ No newline at end of file From c3b202e3b7767b794c6842ba013ad551a4afe608 Mon Sep 17 00:00:00 2001 From: Huy Vu2 Date: Fri, 21 Nov 2025 12:09:51 -0800 Subject: [PATCH 4/4] relocate teadme --- .../wan/README.md => docs/megatron/recipes/wan/wan2.1.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) rename examples/megatron/recipes/wan/README.md => docs/megatron/recipes/wan/wan2.1.md (99%) diff --git a/examples/megatron/recipes/wan/README.md b/docs/megatron/recipes/wan/wan2.1.md similarity index 99% rename from examples/megatron/recipes/wan/README.md rename to docs/megatron/recipes/wan/wan2.1.md index 45b4b95b..cd84dabe 100644 --- a/examples/megatron/recipes/wan/README.md +++ b/docs/megatron/recipes/wan/wan2.1.md @@ -69,7 +69,7 @@ This recipe leverages sequence packing to maximize throughput. When a batch cont Multiple parallelism techniques including tensor, sequence, and context parallelism are supported and configurable per your hardware. -WAN training is driven by `examples/megatron/recipes/wan/pretrain_wan.py`, which supports both a YAML config file and CLI overrides. +Wan training is driven by `examples/megatron/recipes/wan/pretrain_wan.py`, which supports both a YAML config file and CLI overrides. The script exposes a `--training-mode` with `pretrain` and `finetune` presets for flow-matching hyperparameters as a starting point for experiments. This presets specify that pretraining uses noisier, biased sampling (e.g., logit-normal, higher logit_std, lower flow_shift) for stability and broad learning, while finetuning uses uniform, lower-noise settings (e.g., uniform sampling, lower logit_std, higher flow_shift) to refine details and improve quality. @@ -160,4 +160,4 @@ The table below shows current parallelism support for different model sizes: ### References -Wan Team. (2025). 
Wan: Open and advanced large-scale video generative models (Wan 2.1). GitHub. https://github.com/Wan-Video/Wan2.1/ \ No newline at end of file +Wan Team. (2025). Wan: Open and advanced large-scale video generative models (Wan 2.1). GitHub. https://github.com/Wan-Video/Wan2.1/