
[docs] distributed training #44420

Open

stevhliu wants to merge 5 commits into huggingface:main from stevhliu:distrib-training

Conversation

@stevhliu
Member

@stevhliu stevhliu commented Mar 3, 2026

  • removes the "Number of accelerators" section from the "Accelerator selection" guide since this is probably commonly known
  • adds a new "DDP" guide
  • refactors the "Accelerate" guide with a more focused overview of what it is and how to configure it (it now only covers Trainer instead of a raw PyTorch loop); this guide links out to the more detailed FSDP/DDP guides for the full config-specific settings and also documents some AcceleratorConfig settings
  • refactors the "FSDP" guide with a cleaner table, an auto_wrap explanation, and a detailed explanation of the config settings

@stevhliu stevhliu requested a review from SunMarc March 3, 2026 23:05
@stevhliu stevhliu mentioned this pull request Mar 16, 2026
Member

@SunMarc SunMarc left a comment


Thanks! Left a comment, but overall we will move on to FSDPv2 very soon. I'm thinking about changing the default to that soon, so we should update the docs to reflect the FSDPv2 arguments.

Comment thread docs/source/en/fsdp.md Outdated
Comment thread docs/source/en/fsdp.md Outdated
Comment on lines +51 to +64
## Sharding strategies

Always start by running the [accelerate config](https://hf.co/docs/accelerate/package_reference/cli#accelerate-config) command to help Accelerate set up the correct distributed training environment.

```bash
accelerate config
```

Pass one of the sharding strategies below to [fsdp](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.fsdp).

| strategy | description |
|---|---|
| `full_shard` | shard parameters, gradients, and optimizer states |
| `shard_grad_op` | shard gradients and optimizer states |
| `no_shard` | DDP |
| `hybrid_shard` | full shard within a node, replicate across nodes |
| `hybrid_shard_zero2` | shard gradients and optimizer states within a node, replicate across nodes |
| `offload` | CPU offload (combine with `full_shard` or `shard_grad_op`) |

The section below discusses some of the more important FSDP configuration options. Learn more about other available options in the [fsdp_config](https://hf.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.fsdp_config) parameter.

Always combine a sharding strategy with `auto_wrap` to enable the auto-wrapping policy, like `fsdp="full_shard auto_wrap"`. Without `auto_wrap`, the entire model is one FSDP unit and you lose the memory benefit of sharding.
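For concreteness, a minimal sketch of passing that combination through `TrainingArguments` might look like the following; the output directory and batch size are placeholder values rather than settings from this PR:

```python
from transformers import TrainingArguments

# Minimal sketch: combine a sharding strategy with auto_wrap so each wrapped
# submodule becomes its own FSDP unit instead of wrapping the whole model at once.
# output_dir and the batch size are placeholder values.
training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=1,
    fsdp="full_shard auto_wrap",
)

# The args would then be passed to Trainer as usual, e.g.
# Trainer(model=model, args=training_args, train_dataset=train_dataset).train()
```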

Let's update the docs for fsdpv2 args only. I feel like this could be a nicer transition instead of keeping the old arguments. We don't have this arg anymore. We only have reshard_after_forward now.
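As a rough illustration of that change, the sketch below shows how reshard_after_forward might be passed through the fsdp_config dict; the key name and its placement are assumptions drawn from this thread, not confirmed TrainingArguments API:

```python
from transformers import TrainingArguments

# Hypothetical FSDPv2-style sketch: the reshard_after_forward key below is an
# assumption based on this review thread, not confirmed TrainingArguments API.
training_args = TrainingArguments(
    output_dir="output",
    fsdp="auto_wrap",  # per the comment above, the v1 sharding strategy strings would no longer apply
    fsdp_config={
        # replaces the v1 sharding strategies (full_shard, shard_grad_op, ...)
        "reshard_after_forward": True,
    },
)
```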

Comment thread docs/source/en/accelerate.md Outdated
Comment on lines 30 to 41
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_use_orig_params: true
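For comparison, roughly the same settings can be passed directly through the [fsdp_config](https://hf.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.fsdp_config) parameter as a dict instead of an accelerate config file. This is only a sketch of the pre-FSDPv2 keys, mirroring the YAML above rather than anything added in this PR:

```python
from transformers import TrainingArguments

# Sketch only: mirrors the Accelerate YAML above using the fsdp_config dict form
# of TrainingArguments (keys are given without the fsdp_ prefix).
training_args = TrainingArguments(
    output_dir="output",
    fsdp="full_shard auto_wrap",
    fsdp_config={
        "transformer_layer_cls_to_wrap": ["LlamaDecoderLayer"],
        "backward_prefetch": "backward_pre",
        "forward_prefetch": False,
        "cpu_ram_efficient_loading": True,
        "sync_module_states": True,
        "use_orig_params": True,
    },
)
```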

update this for fsdpv2 spec

Comment thread docs/source/en/fsdp.md
Comment thread docs/source/en/accelerate.md
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

