MoE expert parallelism + sequence parallelism by 3outeille · Pull Request #45408 · huggingface/transformers

3outeille · 2026-04-13T14:25:08Z

Summary

- Extends the TPStyle API (from #45028, "TP refactor for FSDP + TP integration") with MoE expert parallelism and sequence parallelism support
- Adds custom ParallelStyle subclasses: PackedColwiseParallel, MoEExpertsParallel, PrepareModuleInputOutput, _AllReduceBackward
- Extends TPStyle with the moe_experts, packed_colwise, activation, module, and loss_parallel kinds
- Adds _StridedShard handling in core_model_loading.py for interleaved gate_up_proj weights
- Adds MoE model configs for mixtral, deepseek_v3, and qwen3 with sequence parallelism plans
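Why a fused gate_up_proj weight needs special handling can be shown with a small index calculation. The sketch below is not the PR's implementation: the function names are hypothetical, and it assumes the fused weight stores all gate rows first and all up rows second, with the intermediate size divisible by the number of ranks. Under that assumption, a plain contiguous shard hands rank 0 only gate rows, while a packed (strided) shard hands each rank matching slices of both halves.

```python
def packed_colwise_rows(intermediate_size: int, world_size: int, rank: int) -> list[int]:
    """Rows of a fused [gate; up] weight that one tensor-parallel rank keeps.

    Assumed layout (not taken from the PR): rows 0..I-1 belong to gate_proj,
    rows I..2I-1 belong to up_proj, and I is divisible by world_size.
    """
    if intermediate_size % world_size != 0:
        raise ValueError("intermediate_size must be divisible by world_size")
    chunk = intermediate_size // world_size
    start = rank * chunk
    gate_rows = list(range(start, start + chunk))
    # The matching up_proj rows sit exactly I rows further down.
    up_rows = [intermediate_size + row for row in gate_rows]
    return gate_rows + up_rows


def naive_contiguous_rows(intermediate_size: int, world_size: int, rank: int) -> list[int]:
    """Rows a plain contiguous shard of all 2*I rows would pick, for contrast."""
    chunk = 2 * intermediate_size // world_size
    return list(range(rank * chunk, (rank + 1) * chunk))


if __name__ == "__main__":
    for rank in range(2):
        print(rank, packed_colwise_rows(4, 2, rank), naive_contiguous_rows(4, 2, rank))
```

With an intermediate size of 4 and two ranks, the naive shard gives rank 0 rows 0 to 3 (all gate, no up), whereas the packed shard gives it rows 0, 1, 4, 5, so the local gate and up slices line up for the elementwise activation.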

Part of the distributed training API chain: #44989

Chain: main ← #44989 ← #44083 ← #44974 ← #45028 ← this PR ← orchestration+save PR

Review question

Are the custom ParallelStyle subclasses correct for expert sharding + sequence parallelism?

Test plan

- Verify MoE expert sharding produces correct DTensor placements
- Test sequence parallelism with allgather/split hooks
- Run existing TP mixin tests to ensure no regression
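The allgather/split check in the test plan boils down to a round-trip invariant: splitting along the sequence dimension and gathering again must reproduce the original sequence on every rank. The sketch below simulates that with plain Python lists instead of a real process group; the helper names are hypothetical and it assumes the sequence length is divisible by the number of ranks.

```python
def split_sequence(tokens: list, world_size: int) -> list[list]:
    """Split a sequence into equal contiguous chunks, one per rank."""
    if len(tokens) % world_size != 0:
        raise ValueError("sequence length must be divisible by world_size")
    chunk = len(tokens) // world_size
    return [tokens[rank * chunk:(rank + 1) * chunk] for rank in range(world_size)]


def all_gather_sequence(shards: list[list]) -> list[list]:
    """Simulate an all-gather along the sequence dimension.

    Every rank ends up holding the concatenation of all shards in rank order.
    """
    full = [token for shard in shards for token in shard]
    return [list(full) for _ in shards]


if __name__ == "__main__":
    tokens = list(range(8))
    shards = split_sequence(tokens, world_size=4)
    gathered = all_gather_sequence(shards)
    # Round trip: each rank sees the full sequence after the gather,
    # and splitting that result again gives back the local shard.
    assert all(full == tokens for full in gathered)
    assert split_sequence(gathered[0], 4) == shards
```

A real test would run the same invariant on tensors across actual ranks; the list version only pins down the ordering that the hooks are expected to preserve.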

- Add PackedColwiseParallel for fused gate_up_proj weights
- Add MoEExpertsParallel with per-expert DTensor sharding
- Add PrepareModuleInputOutput for SP allgather/split hooks
- Add _AllReduceBackward for MoE routing weight gradients
- Extend TPStyle with moe_experts, packed_colwise, activation, module kinds
- _StridedShard handling in core_model_loading for interleaved weights
- MoE model configs: mixtral, deepseek_v3, qwen3 with SP plans
- DTensor rotary_pos_emb guard for mixtral
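Expert sharding can be pictured as a mapping from expert ids to ranks. The PR text does not say how MoEExpertsParallel assigns experts, so the sketch below assumes the simplest scheme: contiguous, equally sized blocks of experts per rank. Both function names are hypothetical.

```python
def experts_for_rank(num_experts: int, world_size: int, rank: int) -> range:
    """Expert ids held by one rank under an assumed contiguous block layout."""
    if num_experts % world_size != 0:
        raise ValueError("num_experts must be divisible by world_size")
    per_rank = num_experts // world_size
    return range(rank * per_rank, (rank + 1) * per_rank)


def owner_of_expert(expert_id: int, num_experts: int, world_size: int) -> int:
    """Rank that holds a given expert under the same assumed layout."""
    if num_experts % world_size != 0:
        raise ValueError("num_experts must be divisible by world_size")
    return expert_id // (num_experts // world_size)


if __name__ == "__main__":
    for rank in range(4):
        print(rank, list(experts_for_rank(8, 4, rank)))
```

A placement check in the spirit of the test plan would assert that every expert is owned by exactly one rank and that the two functions agree with each other.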

# Conflicts: # src/transformers/integrations/tensor_parallel.py

The _IdentityOp class (added by PR #44983) was accidentally deleted during the MoE expert parallelism work. It is needed by finegrained_fp8.py and metal_quantization.py as a pass-through reverse_op for dequantize operations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-14T15:43:05Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: deepseek_v3, dots1, mixtral, nanochat, qwen3, qwen3_5, qwen3_5_moe, qwen3_moe, qwen3_next, qwen3_omni_moe, qwen3_vl, qwen3_vl_moe, youtu

HuggingFaceDocBuilderDev · 2026-04-14T15:52:03Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

* from_pretrained orchestration + save/load
  - Add gather_full_state_dict() for DTensor→full tensor saving
  - Add convert_strided_to_shard() / restore_strided_from_shard() for DCP
  - Add _redistribute_dtensor() helper
  - Full distributed_config integration in from_pretrained/save_pretrained
  - Rename apply_fsdp2 → apply_fully_shard_data_parallel
  - save_optimizer() / load_optimizer() in distributed/utils
  - Trainer integration with distributed_config
  - Updated FSDP and TP tests for new orchestration API
  - DTensor shard-on-read test updates
* revert distributed utils
* eaaea
* all tests for core modeling are passing
* populate import from init for tp
* ruff
* ruff

github-actions · 2026-04-14T16:25:21Z

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45408&sha=bbf3ab

3outeille mentioned this pull request Apr 13, 2026

🚨 Distributed training API #44989

Draft

3outeille force-pushed the refactor-tp-dtensor branch from 34a5085 to eb428cc Compare April 14, 2026 09:54

3outeille force-pushed the moe-sequence-parallel branch from e04c7d9 to 24ca327 Compare April 14, 2026 09:54

3outeille force-pushed the refactor-tp-dtensor branch from eb428cc to e0c4e06 Compare April 14, 2026 13:44

Merge branch 'refactor-tp-dtensor' into moe-sequence-parallel

7f297e0

3outeille force-pushed the moe-sequence-parallel branch from 24ca327 to 7f297e0 Compare April 14, 2026 13:44

3outeille and others added 6 commits April 14, 2026 14:24

Fix ruff linting and formatting

4f350d2

Merge branch 'refactor-tp-dtensor' into moe-sequence-parallel

30a1586

# Conflicts: # src/transformers/integrations/tensor_parallel.py

Fix ruff formatting in core_model_loading.py

4459197

Merge branch 'refactor-tp-dtensor' into moe-sequence-parallel

299920b

# Conflicts: # src/transformers/integrations/tensor_parallel.py

Backport new TP/FSDP API + fix DTensor imports in Copied-from models

01866b8

3outeille force-pushed the moe-sequence-parallel branch from 5031188 to 01866b8 Compare April 14, 2026 15:37

Merge branch 'refactor-tp-dtensor' into moe-sequence-parallel

23d3e8c

3outeille merged commit 7ca7911 into refactor-tp-dtensor Apr 14, 2026
12 of 28 checks passed

3outeille deleted the moe-sequence-parallel branch April 14, 2026 16:12

evalstate mentioned this pull request Apr 28, 2026

Cumulative defect fixes from recent Transformers PRs evalstate/transformers#41

Open

3outeille merged 10 commits into refactor-tp-dtensor from moe-sequence-parallel

2 participants
