
🚨 Distributed training API #44989

Draft

3outeille wants to merge 8 commits into main from distributed_api

Conversation


@3outeille (Member) commented Mar 25, 2026

Distributed Training API

Goal

Train a model with combined FSDP + TP through a single from_pretrained call. The script below trains a small MoE model on 4 GPUs with a 2x2 (fsdp x tp) device mesh, streams and packs C4, and saves a consolidated checkpoint at the end:

# torchrun --nproc_per_node=4 train_fsdp_tp.py

import os
import torch
from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.distributed import DistributedConfig
from transformers.distributed.utils import save_optimizer

def build_packed_dataset(dataset_name, tokenizer, seq_len, dp_rank, dp_world_size):
    """Stream + tokenize + greedy-pack documents into fixed-length (input, label) windows."""
    ds = load_dataset(dataset_name, name="en", split="train", streaming=True)
    ds = ds.shard(num_shards=dp_world_size, index=dp_rank)
    buf, w = [], seq_len + 1  # each window holds seq_len inputs plus one token for the shifted labels

    def pack(batch):
        for t in batch["text"]:
            buf.extend(tokenizer(t)["input_ids"])
        ids, lbls = [], []
        while len(buf) >= w:
            ids.append(buf[:seq_len])
            lbls.append(buf[1:w])
            del buf[:w]
        return {"input_ids": ids, "labels": lbls}

    ds = ds.map(pack, batched=True, remove_columns=ds.column_names)
    return ds.with_format("torch")

if __name__ == "__main__":

    model_name = "Isotonic/TinyMixtral-4x248M-MoE"
    num_steps, lr = 50, 3e-4
    save_dir = "./checkpoints"

    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    torch.distributed.init_process_group(backend="nccl")

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        distributed_config=DistributedConfig(tp_size=2, fsdp_size=2),
        torch_dtype=torch.bfloat16,
    )

    rank = torch.distributed.get_rank()
    # Data-parallel coordinates, used only for dataset sharding (not device placement).
    dp_rank = model.device_mesh["fsdp"].get_local_rank()
    dp_world_size = model.device_mesh["fsdp"].size()
    
    dataset = build_packed_dataset("allenai/c4", tokenizer, 512, dp_rank=dp_rank, dp_world_size=dp_world_size)
    dataloader = DataLoader(dataset, batch_size=4)

    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    
    model.train()
    for step, batch in enumerate(dataloader):
        if step >= num_steps:
            break
        input_ids = batch["input_ids"].to(f"cuda:{local_rank}")
        # The model shifts labels internally, so pass the unshifted ids as labels;
        # packed windows contain no padding, but mask pad ids defensively anyway.
        labels = input_ids.clone()
        labels[labels == tokenizer.pad_token_id] = -100

        loss = model(input_ids, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if rank == 0 and step % 10 == 0:
            print(f"Step {step:>4d} | Loss: {loss.item():.4f}")

    model.save_pretrained(save_dir)
    save_optimizer(optimizer, os.path.join(save_dir, "optimizer"))
    if rank == 0:
        tokenizer.save_pretrained(save_dir)
        print(f"Saved to {save_dir}")

    torch.distributed.destroy_process_group()
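
After training, the saved directory can be sanity-checked in a single process. A minimal sketch (this mirrors what verify_loading.py and tmp_generate.py from the commit list below presumably do; it assumes save_pretrained wrote a consolidated, unsharded checkpoint):

# Sketch: reload the consolidated checkpoint on one GPU and generate.
# Assumes the distributed save above produced a standard HF checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./checkpoints")
model = AutoModelForCausalLM.from_pretrained(
    "./checkpoints", torch_dtype=torch.bfloat16
).to("cuda")
model.eval()

inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))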

PR Chain

| Review order | PR | Branch | Content |
|---|---|---|---|
| 1st | #45409 | orchestration-save-load → moe-sequence-parallel | from_pretrained orchestration, gather_full_state_dict(), save/load roundtrip |
| 2nd | #45408 | moe-sequence-parallel → refactor-tp-dtensor | PackedColwiseParallel, MoEExpertsParallel, sequence parallelism, MoE configs (mixtral, deepseek_v3, qwen3) |
| 3rd | #45028 | refactor-tp-dtensor → fsdp-core-model-loading | TPStyle API, apply_tensor_parallel(), dense model configs (llama, mistral, qwen2, phi, glm) |
| 4th | #44974 | fsdp-core-model-loading → fsdp-vs-ddp | DistributedConfig, DtensorShardOperation, shard-on-read loading |
| 5th | #44083 | fsdp-vs-ddp → distributed_api | FSDP2 fully_shard integration, auto/manual mode, FSDP vs DDP parity tests |
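
For orientation, here is a rough sketch of how the pieces in this chain compose at the call-site level. The import paths and the gather_full_state_dict() call site are assumptions inferred from the table above, not the final API:

# Hypothetical sketch; names and call sites inferred from the PR chain table.
import torch
from transformers import AutoModelForCausalLM
from transformers.distributed import DistributedConfig

torch.distributed.init_process_group(backend="nccl")

# Shard-on-read loading (4th PR): each rank materializes only its own shard,
# with TP styles (3rd PR) and MoE/sequence parallelism (2nd PR) applied per config.
model = AutoModelForCausalLM.from_pretrained(
    "Isotonic/TinyMixtral-4x248M-MoE",
    distributed_config=DistributedConfig(tp_size=2, fsdp_size=2),
    torch_dtype=torch.bfloat16,
)

# Save/load roundtrip (1st PR): gather sharded DTensors into a full state
# dict before writing (assumed call site for gather_full_state_dict()).
full_state_dict = model.gather_full_state_dict()
if torch.distributed.get_rank() == 0:
    torch.save(full_state_dict, "full_model.pt")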

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@github-actions

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=44989&sha=69bc48

3outeille and others added 4 commits April 13, 2026 16:34
- train_fsdp_tp.py: minimal FSDP+TP training example
- train_fsdp_tp_torchtitan_style.py: torchtitan-style training example
- verify_loading.py: save/load roundtrip verification
- run_compare.sh: FSDP+TP vs FSDP-only comparison (see the sketch after this list)
- run_verify_all.sh: run verification across all modes
- tmp_generate.py: quick generation test
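
For reference, run_compare.sh presumably launches the training script under two mesh layouts. A sketch of the two configurations (the FSDP-only values are assumed; only the 2x2 layout appears in this PR page):

# Sketch: the two layouts run_compare.sh presumably compares on 4 GPUs.
# The FSDP-only values are assumptions.
from transformers.distributed import DistributedConfig

fsdp_tp = DistributedConfig(tp_size=2, fsdp_size=2)    # 2x2 mesh: FSDP x TP
fsdp_only = DistributedConfig(tp_size=1, fsdp_size=4)  # flat mesh: FSDP only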
