Add distributed training CI #42765

Draft
3outeille wants to merge 17 commits into main from v5-distributed-training-ci

Conversation

@3outeille (Member) commented Dec 10, 2025

This needs several PRs:

  • Add FSDP v2 (with DTensor for now) natively to Transformers (no longer relying on an external implementation) => needs a correctness comparison against DDP
  • Check save/load for TP, FSDP, and FSDP+TP
  • Distributed training CI (see the sketch after this list):
    • use the gloo backend
    • no subprocess spawning
    • check that training is bitwise convergent in terms of loss & grad_norm
    • add torchtitan FSDP + TP from Transformers
    • fix related bugs
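
A minimal sketch of the CI idea, assuming a single-rank gloo group and a toy linear model (both hypothetical placeholders, not the PR's actual test code): initialize the process group in-process with no subprocess spawning, run the same tiny training loop twice from a fixed seed, and assert bitwise-equal loss and grad_norm.

```python
# Hypothetical sketch of a gloo-backed, subprocess-free bitwise-convergence test.
import os
import torch
import torch.distributed as dist

def run_tiny_training(steps: int = 3):
    # Fixed seed so model init and data are identical across runs.
    torch.manual_seed(0)
    model = torch.nn.Linear(8, 8)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    data = torch.randn(4, 8)
    losses, grad_norms = [], []
    for _ in range(steps):
        opt.zero_grad()
        loss = model(data).pow(2).mean()
        loss.backward()
        grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        opt.step()
        losses.append(loss.item())
        grad_norms.append(grad_norm.item())
    return losses, grad_norms

if __name__ == "__main__":
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    # gloo runs on CPU, so this works on CPU-only CI runners,
    # and a world_size of 1 needs no subprocess spawning.
    dist.init_process_group("gloo", rank=0, world_size=1)
    ref_losses, ref_norms = run_tiny_training()
    new_losses, new_norms = run_tiny_training()
    # Bitwise convergence: exact equality, not torch.allclose.
    assert ref_losses == new_losses and ref_norms == new_norms
    dist.destroy_process_group()
```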

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@github-actions (Contributor)

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=42765&sha=7b744c

3outeille and others added 4 commits February 2, 2026 16:15
- Introduced `TrainingConfigMixin` to share hyperparameters between `TrainingTesterMixin` and `TrainingDistributedTesterMixin`.
- Updated `TrainingDistributedTesterMixin` to inherit from `TrainingConfigMixin` and adjusted training parameters for faster distributed tests.
- Enhanced documentation for clarity on the purpose of each mixin.
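
A minimal sketch of the mixin layering this commit describes; only the class relationships follow the commit message, while the hyperparameter names and values are hypothetical placeholders.

```python
# Hypothetical sketch: shared training config reused by both tester mixins.
class TrainingConfigMixin:
    # Hyperparameters shared between single-process and distributed tests.
    learning_rate = 1e-4
    num_train_steps = 10

class TrainingTesterMixin(TrainingConfigMixin):
    # Single-process training tests reuse the shared config unchanged.
    ...

class TrainingDistributedTesterMixin(TrainingConfigMixin):
    # Distributed tests override selected parameters so they run faster.
    num_train_steps = 2
```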