Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
SunMarc left a comment
Thanks, left a couple of comments!
```
@@ -0,0 +1,861 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
```
Let's create a new folder called tests/training and put those there instead. It will be better, I think.
```bash
#!/bin/bash

# Script to run all FSDP mixin tests for dense models in parallel.
# Works in tandem with a special test_fsdp_mixin.py that batches all 11 distributed tests in a single mp.spawn. (will not be committed)
# Uses concurrency-limited dispatch: multiple models share GPU pairs since test models are tiny (~7 MiB).
```
Even the fsdp folder, we can move that there.
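The script header above only hints at the dispatch scheme, so here is a rough Python rendering of the concurrency-limited idea; the model list, pytest paths, and concurrency limit are placeholders, not the PR's actual script.

```python
# A hypothetical sketch of concurrency-limited dispatch: several tiny-model
# test jobs share a bounded worker pool (and hence the same GPU pair).
import subprocess
from concurrent.futures import ThreadPoolExecutor

MODELS = ["gpt2", "llama", "qwen3"]  # placeholder test targets
MAX_CONCURRENT = 4                   # jobs allowed to share the GPU pair at once

def run_one(model: str) -> int:
    # Placeholder pytest invocation per model; paths differ in the real suite.
    return subprocess.call(["pytest", f"tests/models/{model}", "-k", "fsdp"])

with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
    exit_codes = list(pool.map(run_one, MODELS))
print(exit_codes)
```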
```python
fsdp_plan:
    Explicit FSDP config dict with a required "mode" key.

    Auto mode:
        fsdp_plan = {"mode": "auto"}

    Auto mode with optional policies:
        fsdp_plan = {"mode": "auto", "cpu_offload": False, "mixed_precision": True}

    Manual mode:
        fsdp_plan = {
            "mode": "manual",
            "modules": {
                "model.embed_tokens": ["free_full_weight"],
                "model.layers.0.self_attn": ["free_full_weight", "cpu_offload", "mixed_precision"],
                "model.layers.0.mlp": ["free_full_weight"],
                "model.norm": ["keep_full_weight"],
                "lm_head": ["keep_full_weight"],
            },
        }
"""
```
Maybe it could make sense to have a nice dataclass for fsdp_plan instead of a dict?
Yep, I think the idea was to stay simple like the pp / tp plan. But for fsdp we might want more control.
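For illustration, a minimal sketch of what such a dataclass could look like; the class and field names are assumptions mirroring the dict keys in the docstring above, not anything from the PR.

```python
# Hypothetical FsdpPlan dataclass; field names mirror the docstring's dict keys.
from dataclasses import dataclass, field

@dataclass
class FsdpPlan:
    mode: str = "auto"  # "auto" or "manual"
    cpu_offload: bool = False
    mixed_precision: bool = False
    # Manual mode only: module name/pattern -> list of per-module options.
    modules: dict[str, list[str]] = field(default_factory=dict)

    def to_dict(self) -> dict:
        """Convert back to the plain-dict format the plan consumers expect."""
        if self.mode == "manual":
            return {"mode": "manual", "modules": self.modules}
        return {"mode": "auto", "cpu_offload": self.cpu_offload, "mixed_precision": self.mixed_precision}
```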
```python
def apply_fsdp2(
    model,
    device_mesh,
    fsdp_plan: dict[str, Any] | None,
):
    """
```
It would be nice to check how it integrates with Trainer. We can pass fsdp and fsdp_config in training_args, and we would have to do the mapping to create the correct fsdp_plan; we'd also need to call apply_fsdp2 instead of prepare() on the model.
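To make the suggestion concrete, a hypothetical sketch of that mapping; the helper name and the fsdp_config keys it reads are assumptions, not the Trainer's actual schema.

```python
# Hypothetical mapping from Trainer-style fsdp_config to this PR's fsdp_plan.
# The keys read here ("offload_params", "mixed_precision") are assumed, not
# taken from the real TrainingArguments schema.
def fsdp_config_to_plan(fsdp_config: dict) -> dict:
    plan = {"mode": "auto"}
    if fsdp_config.get("offload_params"):
        plan["cpu_offload"] = True
    if fsdp_config.get("mixed_precision"):
        plan["mixed_precision"] = True
    return plan

# In the Trainer, one would then call apply_fsdp2(model, device_mesh,
# fsdp_config_to_plan(args.fsdp_config)) instead of accelerator.prepare(model).
```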
```python
from packaging import version

# TODO(3outeille): guard against missing import
from torch.distributed.device_mesh import init_device_mesh
```
Unconditional torch imports break non-torch environments
High Severity
Top-level import torch, import torch.distributed, import torch.multiprocessing, and from torch.distributed.device_mesh import init_device_mesh are added unconditionally at module level. This will cause an ImportError for anyone importing from testing_utils in an environment without PyTorch (or with an older PyTorch lacking device_mesh). The existing file already uses is_torch_available() guards for other torch imports — these new ones need the same treatment.
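A minimal sketch of the guarded-import pattern the comment asks for, mirroring the is_torch_available() gating that testing_utils already uses for its other torch imports:

```python
# Guarded imports: only pull in torch (and device_mesh, which requires a
# recent PyTorch) when torch is actually installed.
from transformers.utils import is_torch_available

if is_torch_available():
    import torch
    import torch.distributed
    import torch.multiprocessing

    try:
        from torch.distributed.device_mesh import init_device_mesh
    except ImportError:  # older PyTorch without device_mesh
        init_device_mesh = None
```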
```python
)
from .test_pipeline_mixin import PipelineTesterMixin
from .test_tensor_parallel_mixin import TensorParallelTesterMixin
from .test_fsdp_mixin import FSDPTesterMixin
```
Duplicate FSDPTesterMixin import in causal_lm_tester
Low Severity
from .test_fsdp_mixin import FSDPTesterMixin appears twice — once at line 32 and again at line 43. One of these is redundant and was likely left in by mistake during development.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
ArthurZucker left a comment
Will look at the tests next time, but my main comment:
You are not integrated with the core-model-loading API, no?
My understanding is that you should shard the weights exactly the same way we do for TP, maybe push further since you want to shard ALL layers; then when running the forward, each process materializes the weights locally (GPU 0-8 end up with the same full tensor) and discards it, keeping only its slice.
We need 2 points:
- TENSOR_PARALLEL_LAYERS integration in core_model_loading needs to support fsdp. This is what's going to be responsible for loading the weights.
- distribute_module, which needs to happen before the load, and is responsible for attaching the appropriate hooks for fsdp2.
This also needs to be explained like the above comment: what is fsdp? -> 1. sharding plan, 2. hook plan.
Now if you have to rely on DTensor, you might need to change set_param? Or you just apply the DTensor conversion after all the loading.
The most important is to test, say, Mixtral with the dynamic weight loader. The way I see it, you'll load all weights on all layers, then shard (discard) some of them.
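A minimal sketch of the "materialize the full tensor, keep only a slice" idea using public DTensor APIs (assuming torch >= 2.5 and a launch under torchrun); this illustrates the concept, not the PR's actual loading path.

```python
# Each rank materializes the same full tensor, then keeps only its dim-0 shard.
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

dist.init_process_group("gloo")  # run under torchrun so RANK/WORLD_SIZE are set
mesh = init_device_mesh("cpu", (dist.get_world_size(),), mesh_dim_names=("dp_shard",))

full = torch.randn(1024, 1024)                       # full tensor on every rank
sharded = distribute_tensor(full, mesh, [Shard(0)])  # DTensor holding this rank's slice
del full                                             # the full copy can now be discarded
print(sharded.to_local().shape)                      # (1024 // world_size, 1024)
```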
```python
if not dist.is_initialized():
    try:
        rank = int(os.environ["RANK"])
        local_rank = int(os.environ["LOCAL_RANK"])
        world_size = int(os.environ["WORLD_SIZE"])

        backend_map = {"cuda": "nccl", "cpu": "gloo", "xpu": "xccl", "hpu": "hccl"}
        backend = backend_map.get(device_type)
        if device_type == "cpu" and int(os.environ.get("CCL_WORKER_COUNT", "0")):
            backend = "ccl"
        if device_type == "xpu" and not is_torch_greater_or_equal("2.8", accept_dev=True):
            backend = "ccl"

        dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
        if device_type != "cpu":
            current_device.set_device(local_rank)

    except Exception as e:
        raise OSError(
            "We tried to initialize torch.distributed for you, but it failed. Make "
            "sure you init torch distributed in your script to use `fsdp_plan`."
        ) from e

if device_type != "cpu":
    current_device.set_device(int(os.environ["LOCAL_RANK"]))
    index = current_device.current_device()
    fsdp_device = torch.device(device_type, index)
    device_map = fsdp_device
else:
    fsdp_device = torch.device(device_type)
    device_map = device_type or {}

fsdp_size = dist.get_world_size()
device_mesh = torch.distributed.init_device_mesh(fsdp_device.type, (fsdp_size,), mesh_dim_names=("dp_shard",))
```
Can't we re-use the func we defined in tensor_parallel?
| """ | ||
| Identifies transformer block classes in a model for FSDP wrapping. | ||
| These are typically the repeated layers that benefit from FSDP sharding. | ||
|
|
||
| Returns a set of module classes that should be wrapped with fully_shard(). | ||
| """ |
pretty sure you can just check if the layer is GradientCheckpointingLayer 😉
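A minimal sketch of that suggestion, assuming the model's decoder blocks follow the modern convention of subclassing GradientCheckpointingLayer:

```python
# Collect block classes by type check instead of name-pattern matching.
from transformers.modeling_layers import GradientCheckpointingLayer

block_classes = {
    type(module)
    for module in model.modules()
    if isinstance(module, GradientCheckpointingLayer)
}
```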
```python
        logger.debug(f"Applied fully_shard to {name} ({type(module).__name__})")


def _find_final_norm(model, decoder_layer_names):
```
```python
base_model_pp_plan = {
    "embed_tokens": (["input_ids"], ["inputs_embeds"]),
    "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
    "norm": (["hidden_states"], ["hidden_states"]),
}
```
This looks like something we can define in metadata / take from PP, no?
```python
# Untied: [final_norm, lm_head]
# Tied: [final_norm, embed_tokens] - embed_tokens.weight IS lm_head.weight.
```
Can be taken from:
```python
_pp_plan = {"lm_head": (["hidden_states"], ["logits"])}
```
No?
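Putting the two suggestions together, a hypothetical sketch of deriving both the final norm and the output head from PP-plan metadata; the selection logic is an assumption, and only the plan dicts come from the comments above.

```python
# Plan dicts copied from the review comments above.
base_model_pp_plan = {
    "embed_tokens": (["input_ids"], ["inputs_embeds"]),
    "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
    "norm": (["hidden_states"], ["hidden_states"]),
}
_pp_plan = {"lm_head": (["hidden_states"], ["logits"])}

# Final norm: the only stage whose inputs and outputs are both just hidden_states.
final_norm_name = next(
    name
    for name, (inputs, outputs) in base_model_pp_plan.items()
    if inputs == ["hidden_states"] and outputs == ["hidden_states"]
)
# Output head: the stage that produces logits.
head_name = next(name for name, (_, outputs) in _pp_plan.items() if "logits" in outputs)
print(final_norm_name, head_name)  # norm lm_head
```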
```python
    return strategy != "keep_full_weight", mp_policy, offload_policy


def _iter_manual_plan_targets(model, pattern, name_to_module, already_sharded_names):
```
Can you document what this does? Not entirely sure we need it, but if we do, it's probably something the TP plan, PP plan, or dtype plan are going to be using?
Based on my CP experiments:
```python
# _parse_fsdp_plan_mode could normalize strings:
if isinstance(fsdp_plan, str):
    fsdp_plan = {"mode": fsdp_plan}
```
Users need to do this:
```python
world_mesh = init_device_mesh("cuda", (dp_size, cp_size), mesh_dim_names=("dp", "cp"))
fsdp_mesh = world_mesh["dp", "cp"]._flatten(mesh_dim_name="dp_cp")
model = AutoModelForCausalLM.from_pretrained(..., fsdp_device_mesh=fsdp_mesh, fsdp_plan={"mode": "auto"})
```
The PR's docstring or the docs should mention this.
* feat: from_pretrained distributed refactor (FSDP2 + TP via DistributedConfig)
  - Expand DistributedConfig with tp_size, tp_plan, fsdp_size, fsdp_plan
  - Add init_device_mesh() for building 2D DeviceMesh from DistributedConfig
  - Reuse apply_fsdp2() from PR #44083 for FSDP2 fully_shard wrapping
  - Rewire from_pretrained with two clean separated paths:
    1. distributed_config → native torch.distributed (no accelerate)
    2. Everything else → accelerate (unchanged)
  - Export DistributedConfig from top-level transformers package
  - Add unit tests for DistributedConfig
* Convert DistributedConfig to dict for JSON serialization
* some fixes
* linting
* linting
* freaking linting again
* some fixes for CI
* linting
* fix tests
* linting
* fix tp tests
- Add apply_fully_shard_data_parallel() with auto/manual mode block detection
- FSDP vs DDP loss/grad parity tests
- Distributed test helpers (testing_utils.py)
- is_fsdp_enabled(), is_fsdp_managed_module() utilities
- Minimal FSDP hooks in from_pretrained
- FSDP-aware flash attention check
- train_fsdp_tp.py: minimal FSDP+TP training example
- train_fsdp_tp_torchtitan_style.py: torchtitan-style training example
- verify_loading.py: save/load roundtrip verification
- run_compare.sh: FSDP+TP vs FSDP-only comparison
- run_verify_all.sh: run verification across all modes
- tmp_generate.py: quick generation test
…sformers into distributed_api
[For maintainers] Suggested jobs to run (before merge): run-slow: clap, deit
- Re-export is_fsdp_enabled and is_fsdp_managed_module from integrations/fsdp.py (moved to distributed/utils.py)
- Remove unused # type: ignore comments in generation/utils.py
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=44083&sha=37dcc1


This PR introduces first-class FSDP2 (Fully Sharded Data Parallel v2) support directly in Transformers, bypassing the need for Accelerate's FSDP wrapper. It covers the full lifecycle: model distribution, training, checkpointing, and CI testing across dozens of models.
A standalone script for usage and expected throughput will be available at https://github.com/huggingface/distributed-training-cookbook
Correctness has been tested to match the Torchtitan implementation (cf. https://github.com/huggingface/torchtitan/blob/sanity-check-fsdp/torchtitan/experiments/debug_fsdp/README.md)
NOTE: Transformers modeling takes more memory for some reason; to investigate later.
1. Native FSDP2 Integration (`src/transformers/integrations/fsdp.py`)

The core addition is a new FSDP integration module that provides:

- `initialize_fsdp()` -- Sets up the `DeviceMesh` and process group for FSDP2 (requires PyTorch >= 2.5). Handles automatic backend detection (NCCL, GLOO, XCCL, etc.) and device assignment.
- `apply_fsdp2()` with two modes:
  - Auto mode (`{"mode": "auto"}`) -- Automatically discovers transformer block classes (DecoderLayer, EncoderLayer, etc.), shards input embeddings and all transformer blocks, and groups the final norm + output head together. Supports optional `cpu_offload` and `mixed_precision` policies.
  - Manual mode (`{"mode": "manual", "modules": {...}}`) -- Lets users specify exactly which modules to shard, with per-module options like `"free_full_weight"`, `"keep_full_weight"`, `"cpu_offload"`, and `"mixed_precision"`.
- Smart block detection (`get_transformer_block_classes()`) -- Finds transformer block classes by name pattern and filters out nested blocks (e.g., MoeBlock inside DecoderLayer) to only FSDP-wrap the outermost ones. This enables MoE model support.
- Tied weight handling -- Properly handles weight tying (e.g., `lm_head.weight` == `embed_tokens.weight`) by grouping tied modules and re-tying after `fully_shard` replaces parameters with DTensors.

2. FSDP2-Aware Save/Load via DCP + Safetensors

- `save_fsdp_model()` -- Saves FSDP2 model weights using PyTorch's Distributed Checkpoint (DCP) with `HuggingFaceStorageWriter`, enabling parallel distributed saves with automatic consolidation into standard HF-compatible safetensors files.
- `save_pretrained()` integration -- `PreTrainedModel.save_pretrained()` now detects FSDP2 models (`_is_fsdp_managed_module`) and automatically routes to the DCP save path.
- `from_pretrained()` integration -- Accepts new `fsdp_plan` and `fsdp_device_mesh` kwargs. After loading weights, it applies FSDP2 distribution via `distribute_fsdp_model()`.

3. Comprehensive FSDP Test Suite (`tests/test_fsdp_mixin.py`)

A new `FSDPTesterMixin` class is added to the standard test infrastructure, automatically inherited by all `CausalLMModelTest` classes. It includes 7 batched subtests per model, all run on CPU through the `gloo` backend + `mp.spawn`:

- `sharding_structure_untied` / `sharding_structure_tied`
- `auto_plan_vs_ddp` (untied/tied)
- `manual_plan_vs_ddp` (untied/tied)
- `save_load` -- `save_pretrained` + reload via `from_pretrained` produces bit-exact weights

The tests also validate checkpoint resumability: train for N/2 steps, save a checkpoint (model via DCP+safetensors, optimizer+RNG via distcp), load into a fresh model, continue training, and verify the full trace matches an uninterrupted DDP run.
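For orientation, a minimal sketch (a hypothetical harness, not the PR's test code) of the CPU + gloo + mp.spawn pattern the mixin relies on:

```python
# Spawn N processes, each joining a gloo process group, then run subtests.
import os

import torch.distributed as dist
import torch.multiprocessing as mp


def _worker(rank: int, world_size: int):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    # ... build a tiny model here, run the FSDP-vs-DDP subtests, assert parity ...
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(_worker, args=(world_size,), nprocs=world_size, join=True)
```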
4. CI Test for Broad Model Coverage
Two bash scripts run the FSDP mixin tests across many models in parallel:
- Dense models: Tests ~10 active models (GPT-2, Qwen3, Phi, Llama, ModernBERT-decoder, OLMo3, Phi3, Mistral, LFM2, Qwen3.5) out of 40 total, ranked by HuggingFace Hub downloads.
- MoE models: Tests ~10 active MoE models (GPT-OSS, GLM-MoE-DSA, Qwen3-MoE, GLM4-MoE-Lite, Qwen3.5-MoE, DeepSeek-V2, Qwen3-Next, Mixtral, Qwen2-MoE, PhiMoE) out of 24 total.