
docs: update the finetune guide #1678

Merged: akoumpa merged 1 commit into main from akoumparouli/docs_refine_finetune on Apr 5, 2026

akoumpa (Contributor) commented Apr 3, 2026

… fields more

What does this PR do?

Add a one line overview of what this PR aims to accomplish.

Changelog

  • Add specific line by line info of high level changes in this PR.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

… fields more

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
copy-pr-bot (Bot) commented Apr 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

akoumpa added the docs-only label ("With great power comes great responsibility.") Apr 3, 2026
akoumpa marked this pull request as ready for review Apr 4, 2026 23:37
akoumpa merged commit 0dea775 into main Apr 5, 2026
4 checks passed
akoumpa deleted the akoumparouli/docs_refine_finetune branch Apr 5, 2026 01:24
hemildesai added a commit that referenced this pull request Apr 6, 2026
The docs finetune guide (added in #1678) uses :::{details} which
requires the html_admonition myst extension. Without it, sphinx
--fail-on-warning rejects the unknown directive.

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
hemildesai added a commit that referenced this pull request Apr 7, 2026
#1684)

* fix: resolve PT 2.11 DeviceMesh deprecation warnings and unify EP mesh

Two PyTorch 2.11 deprecation warnings fired on every training run:
1. `_mesh_resources.get_root_mesh()` deprecated in favor of `DeviceMesh._get_root_mesh()`
2. `root_mesh["flattened_dim"]` deprecated for dims created via `_flatten()`

Additionally, the MoE mesh was created by a standalone `init_device_mesh` call,
separate from the main device mesh, which required a redundant global collective
and made TP+EP coexistence impossible.

Changes:
- Add `get_flat_mesh()` and `get_submesh()` utilities in mesh_utils.py that
  access `_flatten()` results directly via `_flatten_mapping`, and construct
  mixed-dim submeshes via `_unflatten()` from a parent flattened mesh
- Replace standalone `_create_moe_mesh()` with `_unflatten()` from the root
  mesh's non-pp dims, deriving EP process groups from the same mesh hierarchy
- EP mesh now spans dp, cp, and tp groups (matching the old standalone mesh
  semantics and enabling future TP+EP support)
- Consolidate `state_dict_utils.get_submesh` as a re-export of the shared
  `mesh_utils.get_submesh`
- Update all callers: base_recipe, parallelizer (FSDP2 + MoE), vlm/finetune,
  optim/utils, mesh.py axis size lookups

Validated with:
- 918 unit tests passing
- Qwen3 MoE 30B EP=8 (full + LoRA), LLaMA 3.1 8B PP=2 end-to-end training
- Multi-process verification that unified EP groups match standalone mesh
- Zero deprecation warnings across all runs

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
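
To make the idea of deriving the EP groups from the root mesh's non-pp dims more concrete, here is a toy model in plain Python. It does not use `DeviceMesh` at all: the function name `flattened_groups` is hypothetical, and the row-major rank layout is an assumption made for illustration.

```python
from itertools import product


def flattened_groups(shape, names, keep):
    """Group the ranks of a row-major mesh by flattening the dims in `keep`.

    Returns one sorted rank list per combination of the remaining dims.
    This is a toy model, not the project's mesh_utils implementation.
    """
    # Row-major strides: the last dim varies fastest.
    strides = []
    stride = 1
    for size in reversed(shape):
        strides.insert(0, stride)
        stride *= size

    groups = {}
    for coord in product(*(range(s) for s in shape)):
        rank = sum(c * st for c, st in zip(coord, strides))
        # Ranks that share coordinates on the dropped dims form one group.
        key = tuple(c for c, n in zip(coord, names) if n not in keep)
        groups.setdefault(key, []).append(rank)
    return [sorted(g) for _, g in sorted(groups.items())]


# Hypothetical 8-rank layout: pp=2, dp=2, cp=1, tp=2.
groups = flattened_groups((2, 2, 1, 2), ("pp", "dp", "cp", "tp"), {"dp", "cp", "tp"})
print(groups)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Each pipeline stage ends up with one group that spans its dp, cp, and tp ranks, which is the set of ranks an EP mesh derived from the same hierarchy would be carved out of.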

* fix: address review comments and linting failures

- Remove unused dp_cp_mesh assignment (ruff F841)
- Fix import ordering in parallelizer.py (ruff I001)
- Keep cross-component imports lazy to satisfy import-linter rules
  (moe -> distributed, optim -> distributed)
- Harden get_submesh size matching with try/except on _unflatten
  to handle ambiguous size collisions
- Consolidate state_dict_utils.get_submesh as thin wrapper delegating
  to mesh_utils.get_submesh

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>

* fix: validate get_submesh parent match with process group check

Address claude-bot review: size-only matching in get_submesh could pick
the wrong parent flattened mesh if two entries have equal total size.
After _unflatten, now validates that process groups for any root-mesh dim
in the result match the root mesh's groups for that dim.

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
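
The size-collision problem described in this commit can be illustrated with a small plain-Python sketch. The names `pick_parent`, `dp_cp`, and `cp_tp` are hypothetical, and rank lists stand in for process groups; the real check compares process groups after `_unflatten`.

```python
def pick_parent(flat_meshes, wanted_size, expected_ranks):
    """Return the name of the flattened mesh whose ranks match, not just its size."""
    for name, ranks in flat_meshes.items():
        if len(ranks) != wanted_size:
            continue  # wrong size, cannot be the parent
        # Size alone is ambiguous, so also compare the member ranks.
        if sorted(ranks) == sorted(expected_ranks):
            return name
    raise KeyError(f"no flattened mesh matches ranks {expected_ranks}")


# Two hypothetical flattened meshes with the same total size (4 ranks each).
flat = {"dp_cp": [0, 2, 4, 6], "cp_tp": [0, 1, 2, 3]}
print(pick_parent(flat, 4, [0, 1, 2, 3]))  # cp_tp
```

A size-only match would have returned `dp_cp`, the first entry with four ranks, which is the failure mode the validation step guards against.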

* fix: resolve import linting violations and restore dp_mesh test assertions

- Revert state_dict_utils.get_submesh to inline impl (avoid cross-component
  import from moe -> distributed that breaks import-linter)
- Revert optim/utils.py to original approach (avoid optim -> distributed import)
- Remove stale lazy import in WanParallelizationStrategy
- Restore exact dp_mesh assertions in Wan strategy tests by monkeypatching
  get_submesh to return a known sentinel object

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
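
The sentinel-based assertion mentioned in the last bullet could look roughly like the sketch below. The `mesh_utils` namespace and `build_strategy_kwargs` are hypothetical stand-ins, not the project's actual modules or test code; only `unittest.mock` is real.

```python
from types import SimpleNamespace
from unittest import mock

# Hypothetical stand-in for the module that owns get_submesh.
mesh_utils = SimpleNamespace(get_submesh=lambda mesh, dims: ("real", dims))


def build_strategy_kwargs(mesh):
    # Code under test: resolves the helper at call time, so patching takes effect.
    return {"dp_mesh": mesh_utils.get_submesh(mesh, ("dp",))}


sentinel = object()
with mock.patch.object(mesh_utils, "get_submesh", return_value=sentinel) as fake:
    kwargs = build_strategy_kwargs(mesh="world")

# An identity check against the sentinel is exact, unlike a loose type check.
assert kwargs["dp_mesh"] is sentinel
fake.assert_called_once_with("world", ("dp",))
```

Returning a known object lets the test assert identity, so it fails if the strategy passes along anything other than what `get_submesh` produced.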

* fix: raise KeyError instead of falling back to deprecated API

Replace fallback paths in get_flat_mesh and get_submesh that would
silently trigger the PT 2.11 deprecation warning with explicit KeyError.
If a dim is not found in mesh_dim_names or _flatten_mapping, that is a
caller error and should not be silently degraded.

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
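
A minimal sketch of the fail-fast lookup described here, using plain dictionaries in place of a real mesh. The function `lookup_dim` and both dictionaries are hypothetical; they only mirror the two lookup sources named in the commit message.

```python
def lookup_dim(mesh_dims, flatten_mapping, name):
    """Resolve a dim name from the regular dims first, then the flattened ones.

    Raises KeyError for unknown names instead of falling back to another API.
    """
    if name in mesh_dims:
        return mesh_dims[name]
    if name in flatten_mapping:
        return flatten_mapping[name]
    known = sorted([*mesh_dims, *flatten_mapping])
    raise KeyError(f"mesh dim {name!r} not found; known dims: {known}")


mesh_dims = {"dp": "dp-mesh", "tp": "tp-mesh"}
flatten_mapping = {"dp_cp": "dp_cp-mesh"}

print(lookup_dim(mesh_dims, flatten_mapping, "dp_cp"))  # dp_cp-mesh
```

Asking for a name that is in neither mapping, such as `"ep"` here, raises `KeyError` with the list of known dims, so the caller sees the mistake immediately.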

* fix: update test mocks with mesh_dim_names and _flatten_mapping for get_submesh

Test mocks for DeviceMesh now include mesh_dim_names, _flatten_mapping,
and _get_root_mesh so that get_flat_mesh/get_submesh can resolve dims.
Also patch dist.get_process_group_ranks in strategy integration tests
for the get_submesh validation step.

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>

* fix: rename misleading root_dim_names variable to mesh_dim_names

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>

* fix: address review — validate all dims in get_submesh and fix indentation

- Validate process groups for ALL requested dims (not just mesh dims)
  by using get_flat_mesh for both mesh and flattened dim lookups
- Fix 2-space indentation in test_strategy_integration.py

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>

* fix: remove unused mesh_dim_names variable

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>

* fix: handle string key lookups in FakeWorldMesh for get_flat_mesh

FakeWorldMesh.__getitem__ now handles both string "dp" and tuple
("dp",) lookups, since get_flat_mesh passes dim names as strings.

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
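
A test double that accepts both key forms might be sketched as follows. `ToyWorldMesh` is a hypothetical class written for illustration; the project's actual `FakeWorldMesh` may be structured differently.

```python
class ToyWorldMesh:
    """Hypothetical test double that maps dim names to stand-in submeshes."""

    def __init__(self, submeshes):
        self._submeshes = dict(submeshes)
        self.mesh_dim_names = tuple(self._submeshes)

    def __getitem__(self, key):
        # Normalise "dp" and ("dp",) to the same lookup key.
        if isinstance(key, tuple) and len(key) == 1:
            key = key[0]
        return self._submeshes[key]


dp_submesh = object()
mesh = ToyWorldMesh({"dp": dp_submesh, "tp": object()})

# Both spellings resolve to the same stand-in object.
assert mesh["dp"] is dp_submesh
assert mesh[("dp",)] is dp_submesh
```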

* fix: enable html_admonition myst extension for details directive

The docs finetune guide (added in #1678) uses :::{details} which
requires the html_admonition myst extension. Without it, sphinx
--fail-on-warning rejects the unknown directive.

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>

* fix: replace unsupported details directive with dropdown from sphinx-design

The {details} directive doesn't exist in myst-parser. Replace with
{dropdown} from sphinx-design (already in extensions) which provides
the same collapsible UI. Also revert the unnecessary html_admonition
extension addition.

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
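
For readers unfamiliar with the setup this fix relies on, a Sphinx `conf.py` excerpt along these lines would make the `{dropdown}` directive available. This is a sketch under assumptions: the project's real `conf.py` is not shown in this thread, and the dropdown title and body text below are invented for illustration.

```python
# Hypothetical excerpt of a Sphinx conf.py; the project's real file may differ.
extensions = [
    "myst_parser",    # Markdown (MyST) support
    "sphinx_design",  # provides the dropdown directive
]

# colon_fence lets MyST parse ::: fences as directives.
myst_enable_extensions = ["colon_fence"]

# What a collapsible section could look like in a MyST page (illustrative text).
DROPDOWN_EXAMPLE = """\
:::{dropdown} YAML field reference
Details shown when the reader expands the section.
:::
"""
```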

---------

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
edjson pushed a commit to edjson/Automodel that referenced this pull request Apr 17, 2026
NVIDIA-NeMo#1684)

edjson pushed a commit to edjson/Automodel that referenced this pull request Apr 18, 2026
NVIDIA-NeMo#1684)

Signed-off-by: Edison <edisonggacc@gmail.com>
linnanwang pushed a commit that referenced this pull request Apr 24, 2026
update the finetune guide to include more information to explain YAML fields more

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
linnanwang pushed a commit that referenced this pull request Apr 24, 2026
#1684)


Labels

docs-only With great power comes great responsibility.


2 participants