
[docs] distributed training #44420

Open

stevhliu wants to merge 5 commits into huggingface:main from stevhliu:distrib-training

Conversation

@stevhliu
Member

@stevhliu stevhliu commented Mar 3, 2026

  • removes the "Number of accelerators" section from the "Accelerator selection" guide since this is probably commonly known
  • adds a new "DDP" guide
  • refactors the "Accelerate" guide with a more focused overview of what it is and how to configure it (it now only covers Trainer instead of a raw PyTorch loop); this guide links out to the more detailed FSDP/DDP guides for the full config-specific settings and also documents some AcceleratorConfig settings
  • refactors the "FSDP" guide with a cleaner table, an auto_wrap explanation, and a detailed explanation of the config settings

@stevhliu stevhliu requested a review from SunMarc March 3, 2026 23:05
@stevhliu stevhliu mentioned this pull request Mar 16, 2026
Member

@SunMarc SunMarc left a comment


Thanks! Left a comment, but overall we will move on to FSDPv2 very soon. I'm thinking about changing the default to that soon, so we should update the docs to reflect the FSDPv2 arguments.

Comment thread docs/source/en/fsdp.md Outdated
Comment thread docs/source/en/fsdp.md Outdated
Comment on lines +51 to +64
## Sharding strategies

Always start by running the [accelerate config](https://hf.co/docs/accelerate/package_reference/cli#accelerate-config) command to help Accelerate set up the correct distributed training environment.

```bash
accelerate config
```

Pass one of the sharding strategies below to [fsdp](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.fsdp).

| strategy | description |
|---|---|
| `full_shard` | shard parameters, gradients, and optimizer states |
| `shard_grad_op` | shard gradients and optimizer states |
| `no_shard` | DDP |
| `hybrid_shard` | full shard within a node, replicate across nodes |
| `hybrid_shard_zero2` | shard gradients and optimizer states within a node, replicate across nodes |
| `offload` | CPU offload (combine with `full_shard` or `shard_grad_op`) |

The section below discusses some of the more important FSDP configuration options. Learn more about other available options in the [fsdp_config](https://hf.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.fsdp_config) parameter.

Always combine a sharding strategy with `auto_wrap` to enable the auto-wrapping policy, like `fsdp="full_shard auto_wrap"`. Without `auto_wrap`, the entire model is one FSDP unit and you lose the memory benefit of sharding.
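For concreteness, a minimal sketch of passing that combination through `TrainingArguments` might look like the following; the output directory and batch size are placeholder values rather than settings from this PR:

```python
from transformers import TrainingArguments

# Minimal sketch: combine a sharding strategy with auto_wrap so each wrapped
# submodule becomes its own FSDP unit instead of wrapping the whole model at once.
# output_dir and the batch size are placeholder values.
training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=1,
    fsdp="full_shard auto_wrap",
)

# The args would then be passed to Trainer as usual, e.g.
# Trainer(model=model, args=training_args, train_dataset=train_dataset).train()
```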

Let's update the docs for fsdpv2 args only. I feel like this could be a nicer transition instead of keeping the old arguments. We don't have this arg anymore. We only have reshard_after_forward now.
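As a rough illustration of that change, the sketch below shows how reshard_after_forward might be passed through the fsdp_config dict; the key name and its placement are assumptions drawn from this thread, not confirmed TrainingArguments API:

```python
from transformers import TrainingArguments

# Hypothetical FSDPv2-style sketch: the reshard_after_forward key below is an
# assumption based on this review thread, not confirmed TrainingArguments API.
training_args = TrainingArguments(
    output_dir="output",
    fsdp="auto_wrap",  # per the comment above, the v1 sharding strategy strings would no longer apply
    fsdp_config={
        # replaces the v1 sharding strategies (full_shard, shard_grad_op, ...)
        "reshard_after_forward": True,
    },
)
```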

Comment thread docs/source/en/accelerate.md Outdated
Comment on lines 30 to 41
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_use_orig_params: true
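For comparison, roughly the same settings can be passed directly through the [fsdp_config](https://hf.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.fsdp_config) parameter as a dict instead of an accelerate config file. This is only a sketch of the pre-FSDPv2 keys, mirroring the YAML above rather than anything added in this PR:

```python
from transformers import TrainingArguments

# Sketch only: mirrors the Accelerate YAML above using the fsdp_config dict form
# of TrainingArguments (keys are given without the fsdp_ prefix).
training_args = TrainingArguments(
    output_dir="output",
    fsdp="full_shard auto_wrap",
    fsdp_config={
        "transformer_layer_cls_to_wrap": ["LlamaDecoderLayer"],
        "backward_prefetch": "backward_pre",
        "forward_prefetch": False,
        "cpu_ram_efficient_loading": True,
        "sync_module_states": True,
        "use_orig_params": True,
    },
)
```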

update this for fsdpv2 spec

Comment thread docs/source/en/fsdp.md
Comment thread docs/source/en/accelerate.md
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

