
MoE expert parallelism + sequence parallelism #45408

Merged

3outeille merged 10 commits into refactor-tp-dtensor from moe-sequence-parallel on Apr 14, 2026

Conversation

@3outeille
Member

Summary

  • Extends the TPStyle API (from #45028, the TP refactor for FSDP + TP integration) with MoE expert parallelism and sequence parallelism support
  • Adds PackedColwiseParallel, MoEExpertsParallel, PrepareModuleInputOutput, and _AllReduceBackward custom ParallelStyle subclasses
  • Extends TPStyle with moe_experts, packed_colwise, activation, module, and loss_parallel kinds (see the plan sketch after this list)
  • Adds _StridedShard handling in core_model_loading.py for interleaved gate_up_proj weights
  • Adds MoE model configs for mixtral, deepseek_v3, and qwen3 with sequence parallelism plans
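For orientation, here is a rough sketch of how a model's tp_plan might reference the new kinds. The module paths and layer names below are illustrative assumptions, not the exact plans this PR adds for mixtral/deepseek_v3/qwen3:

```python
# Illustrative only: the kind strings follow the PR description, but the module
# paths and the exact per-model plans live in the respective model configs.
base_model_tp_plan = {
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
    # fused gate_up_proj weights are interleaved, hence the packed column-wise kind
    "layers.*.mlp.shared_expert.gate_up_proj": "packed_colwise",
    # shard the expert weights across the expert-parallel mesh dimension
    "layers.*.mlp.experts": "moe_experts",
}
```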

Part of the distributed training API chain: #44989

Chain: main ← #44989 ← #44083 ← #44974 ← #45028 ← this PR ← orchestration+save PR

Review question

Are the custom ParallelStyle subclasses correct for expert sharding + sequence parallelism?

Test plan

  • Verify MoE expert sharding produces correct DTensor placements (see the check sketch after this list)
  • Test sequence parallelism with allgather/split hooks
  • Run existing TP mixin tests to ensure no regression
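As a rough illustration of the first check, a hedged sketch of asserting expert-weight placements. The parameter naming and the shard dimension are assumptions, and the import path assumes a recent PyTorch:

```python
from torch.distributed.tensor import DTensor, Shard


def assert_expert_weights_sharded(model):
    """Check that MoE expert weights were converted to DTensors sharded on the expert dim."""
    for name, param in model.named_parameters():
        if "experts" in name:
            assert isinstance(param, DTensor), f"{name} is not a DTensor"
            # the expert dimension is assumed to be dim 0 of the packed expert weight
            assert any(p == Shard(0) for p in param.placements), (name, param.placements)
```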

- Add PackedColwiseParallel for fused gate_up_proj weights
- Add MoEExpertsParallel with per-expert DTensor sharding
- Add PrepareModuleInputOutput for SP allgather/split hooks
- Add _AllReduceBackward for MoE routing weight gradients (see the sketch after this list)
- Extend TPStyle with moe_experts, packed_colwise, activation, module kinds
- _StridedShard handling in core_model_loading for interleaved weights
- MoE model configs: mixtral, deepseek_v3, qwen3 with SP plans
- DTensor rotary_pos_emb guard for mixtral
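For context on the routing-weight gradient trick, an identity-forward / all-reduce-backward autograd function is the usual pattern. This is a generic minimal sketch, not necessarily how _AllReduceBackward is implemented in this PR:

```python
import torch
import torch.distributed as dist


class AllReduceBackward(torch.autograd.Function):
    """Identity in forward; sums the incoming gradient across the process group in backward."""

    @staticmethod
    def forward(ctx, x, group=None):
        ctx.group = group
        return x

    @staticmethod
    def backward(ctx, grad_output):
        grad = grad_output.contiguous()
        dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=ctx.group)
        return grad, None
```
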
3outeille force-pushed the moe-sequence-parallel branch from 24ca327 to 7f297e0 on April 14, 2026 13:44
3outeille and others added 6 commits April 14, 2026 14:24
# Conflicts:
#	src/transformers/integrations/tensor_parallel.py
# Conflicts:
#	src/transformers/integrations/tensor_parallel.py
The _IdentityOp class (added by PR #44983) was accidentally deleted
during the MoE expert parallelism work. It is needed by
finegrained_fp8.py and metal_quantization.py as a pass-through
reverse_op for dequantize operations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
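For reference, a pass-through reverse_op amounts to returning the tensor unchanged. A minimal sketch of the idea (the restored _IdentityOp may carry extra metadata or hooks):

```python
class IdentityOp:
    """Pass-through conversion op: hands the tensor back unchanged.

    Useful as a reverse_op when a forward op (e.g. dequantize) needs no inverse transform.
    """

    def __call__(self, tensor, *args, **kwargs):
        return tensor
```
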
3outeille force-pushed the moe-sequence-parallel branch from 5031188 to 01866b8 on April 14, 2026 15:37
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: deepseek_v3, dots1, mixtral, nanochat, qwen3, qwen3_5, qwen3_5_moe, qwen3_moe, qwen3_next, qwen3_omni_moe, qwen3_vl, qwen3_vl_moe, youtu

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

* from_pretrained orchestration + save/load

- Add gather_full_state_dict() for DTensor→full tensor saving (see the sketch after this list)
- Add convert_strided_to_shard() / restore_strided_from_shard() for DCP
- Add _redistribute_dtensor() helper
- Full distributed_config integration in from_pretrained/save_pretrained
- Rename apply_fsdp2 → apply_fully_shard_data_parallel
- save_optimizer() / load_optimizer() in distributed/utils
- Trainer integration with distributed_config
- Updated FSDP and TP tests for new orchestration API
- DTensor shard-on-read test updates
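A hedged sketch of the gathering idea behind gather_full_state_dict(), using DTensor.full_tensor(); the actual helper in the PR likely handles more cases (rank-0-only saving, CPU offload, _StridedShard conversion):

```python
from torch.distributed.tensor import DTensor


def gather_full_state_dict(model):
    """Return a state dict where every DTensor is all-gathered into a plain full tensor."""
    full_sd = {}
    for key, value in model.state_dict().items():
        if isinstance(value, DTensor):
            # all-gathers the shards on every rank and returns a regular torch.Tensor
            value = value.full_tensor()
        full_sd[key] = value
    return full_sd
```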

* revert distributed utils

* eaaea

* all tests for core modeling are passing

* populate import from init for tp

* ruff

* ruff
3outeille merged commit 7ca7911 into refactor-tp-dtensor on Apr 14, 2026
12 of 28 checks passed
3outeille deleted the moe-sequence-parallel branch on April 14, 2026 16:12
@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45408&sha=bbf3ab

