Speed up PPO with ZeRO-3 by 10x 🔥 by lewtun · Pull Request #1483 · huggingface/trl

lewtun · 2024-03-26T09:51:09Z

In PPO, text generation is the main bottleneck, especially with ZeRO-3, where the weights are sharded across N devices and must be gathered for each forward pass.

This PR introduces a new context manager called unwrap_model_for_generation() which does a single gather of the model weights to speed up the ppo.py example script by ~10x relative to naive ZeRO-3 inference. Thank you to @pacman100 for showing me this feature of deepspeed 🙏 !

Note: this context manager is entirely general and can be used in other trainers. For now I've focused on PPO, but happy to roll it out to the other parts of the codebase in follow-up PRs.
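To make the "single gather" idea concrete, here is a minimal, pure-Python sketch. It does not call DeepSpeed or TRL: `ToyShardedModel` and `gather_once` are hypothetical names invented for illustration, and the "gather" is only a list concatenation standing in for an all-gather across devices. DeepSpeed does provide `deepspeed.zero.GatheredParameters` for temporarily gathering ZeRO-3 parameters, but this excerpt does not show whether unwrap_model_for_generation() uses exactly that.

```python
from contextlib import contextmanager


class ToyShardedModel:
    """Hypothetical stand-in for a ZeRO-3 model whose weights are split across devices."""

    def __init__(self, shards):
        self.shards = shards      # one list of weights per simulated device
        self.full_weights = None  # populated only while the weights are gathered
        self.gather_calls = 0     # counts simulated all-gather operations

    def _gather(self):
        self.gather_calls += 1
        return [w for shard in self.shards for w in shard]

    def forward(self, x):
        # Naive path: if no gathered copy is available, gather for this forward pass.
        weights = self.full_weights if self.full_weights is not None else self._gather()
        return sum(w * x for w in weights)

    def generate(self, x, max_new_tokens):
        # Autoregressive decoding runs one forward pass per new token.
        return [self.forward(x + step) for step in range(max_new_tokens)]


@contextmanager
def gather_once(model):
    """Toy analogue of the PR's idea: gather the weights a single time for generation."""
    model.full_weights = model._gather()
    try:
        yield model
    finally:
        model.full_weights = None  # drop the full copy so the model is sharded again


naive = ToyShardedModel([[1.0, 2.0], [3.0, 4.0]])
naive_out = naive.generate(1.0, max_new_tokens=32)

fast = ToyShardedModel([[1.0, 2.0], [3.0, 4.0]])
with gather_once(fast) as unwrapped:
    fast_out = unwrapped.generate(1.0, max_new_tokens=32)

print(naive.gather_calls, fast.gather_calls)  # 32 1
```

Both paths produce the same outputs; only the number of gathers differs (one per generated token versus one per generation call). In the toy, that count is the whole cost difference. The real speedup depends on model size, interconnect, and sequence length.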

As they say, a picture is worth a thousand words, so here are the comparisons against DDP / ZeRO-2 and naive ZeRO-3:

Code to test with

I've checked the script below works with DDP, DDP + LoRA, ZeRO-2, and ZeRO-3:

Inference script

"""
TRANSFORMERS_VERBOSITY=info ACCELERATE_LOG_LEVEL=info accelerate launch --config_file=examples/accelerate_configs/deepspeed_zero3.yaml scratch/fast_ppo.py --batch_size=4 --mini_batch_size=1 --gradient_accumulation_steps=4
"""
from dataclasses import dataclass, field
from typing import Optional

import torch
from accelerate import Accelerator
from datasets import load_dataset
from peft import LoraConfig
from tqdm import tqdm
from transformers import AutoTokenizer, HfArgumentParser

from trl import AutoModelForCausalLMWithValueHead, AutoModelForSeq2SeqLMWithValueHead, PPOConfig, PPOTrainer, set_seed
from trl.core import LengthSampler
from trl.import_utils import is_npu_available, is_xpu_available


tqdm.pandas()


@dataclass
class ScriptArguments:
    use_seq2seq: bool = field(default=False, metadata={"help": "whether to use seq2seq"})
    trust_remote_code: bool = field(default=False, metadata={"help": "Enable `trust_remote_code`"})

    # LoraConfig
    use_peft: bool = field(default=False, metadata={"help": "whether to use peft"})
    lora_alpha: Optional[float] = field(default=16, metadata={"help": "the lora alpha parameter"})
    lora_r: Optional[int] = field(default=16, metadata={"help": "the lora r parameter"})


parser = HfArgumentParser((ScriptArguments, PPOConfig))
args, ppo_config = parser.parse_args_into_dataclasses()

trl_model_class = AutoModelForCausalLMWithValueHead if not args.use_seq2seq else AutoModelForSeq2SeqLMWithValueHead

def build_dataset(config, query_dataset, input_min_text_length=2, input_max_text_length=8):
    """
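    Build the dataset for training. This loads the dataset with `load_dataset`; customize
    this function to train the model on your own dataset.

    Args:
        config (`PPOConfig`):
            The PPO configuration; `config.model_name` selects the tokenizer.
        query_dataset (`str`):
            The name of the dataset to be loaded.
        input_min_text_length (`int`):
            Lower bound on the number of tokens kept from each review as the query.
        input_max_text_length (`int`):
            Upper bound on the number of tokens kept from each review as the query.

    Returns:
        ds (`datasets.Dataset`):
            The tokenized dataset, formatted as torch tensors.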
    """
    tokenizer = AutoTokenizer.from_pretrained(config.model_name)
    tokenizer.pad_token = tokenizer.eos_token
    # load imdb with datasets
    ds = load_dataset(query_dataset, split="train")
    ds = ds.rename_columns({"text": "review"})
    ds = ds.filter(lambda x: len(x["review"]) > 200, batched=False)

    input_size = LengthSampler(input_min_text_length, input_max_text_length)

    def tokenize(sample):
        sample["input_ids"] = tokenizer.encode(sample["review"])[: input_size()]
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    ds = ds.map(tokenize, batched=False)
    ds.set_format(type="torch")
    return ds


# We retrieve the dataloader by calling the `build_dataset` function.
dataset = build_dataset(ppo_config, ppo_config.query_dataset)


def collator(data):
    return {key: [d[key] for d in data] for key in data[0]}


# set seed before initializing value head for deterministic eval
set_seed(ppo_config.seed)

# Now let's build the model, the reference model, and the tokenizer.
if not args.use_peft:
    ref_model = trl_model_class.from_pretrained(ppo_config.model_name, trust_remote_code=args.trust_remote_code)
    device_map = None
    peft_config = None
else:
    peft_config = LoraConfig(
        r=args.lora_r,
        lora_alpha=args.lora_alpha,
        bias="none",
        task_type="CAUSAL_LM",
    )
    ref_model = None
    # Copy the model to each device
    device_map = {"": Accelerator().local_process_index}

model = trl_model_class.from_pretrained(
    ppo_config.model_name,
    trust_remote_code=args.trust_remote_code,
    device_map=device_map,
    peft_config=peft_config,
)


tokenizer = AutoTokenizer.from_pretrained(ppo_config.model_name)

tokenizer.pad_token_id = tokenizer.eos_token_id

ppo_trainer = PPOTrainer(ppo_config, model, ref_model, tokenizer, dataset=dataset, data_collator=collator)


device = ppo_trainer.accelerator.device
if ppo_trainer.accelerator.num_processes == 1:
    if is_xpu_available():
        device = "xpu:0"
    elif is_npu_available():
        device = "npu:0"
    else:
        device = 0 if torch.cuda.is_available() else "cpu"  # to avoid a `pipeline` bug

generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
    "max_new_tokens": 32,
}

import time

for _epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    query_tensors = batch["input_ids"]

    # Time one batch of generation from the policy and the reference model
    start_time = time.time()
    response_tensors, ref_response_tensors = ppo_trainer.generate(
        query_tensors, return_prompt=False, generate_ref_response=True, **generation_kwargs
    )
    generation_time = torch.tensor([time.time() - start_time]).to(ppo_trainer.accelerator.device)

    # Only the first batch is needed for this benchmark
    break

generation_time_gather = ppo_trainer.accelerator.gather(generation_time)
if ppo_trainer.accelerator.is_main_process:
    print(f"Generation time: {generation_time_gather.mean().item():.2f} seconds for {len(query_tensors)} generations")

Addresses the speed issue discussed in #1051

HuggingFaceDocBuilderDev · 2024-03-26T09:55:29Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

lewtun · 2024-03-26T10:40:05Z

+                response = unwrapped_model.generate(input_ids=query_tensor.unsqueeze(dim=0), **generation_kwargs)
+
            if generate_ref_response:
-                with self.optional_peft_ctx():


I've moved the logic for disabling the adapter to unwrap_model_for_generation()

Also, why did we do the adapter disabling for the reference model but not the active model above?
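One plausible reading of the question above: in the PEFT branch of the test script, `ref_model` is `None`, so the reference response has to come from the same model object with its adapter switched off, while the active policy needs the adapter left on. The sketch below illustrates that pattern with a toy class. `ToyPeftModel` and `generate_pair` are hypothetical names, not TRL or PEFT code, and the "adapter" is just a number added to a weight.

```python
from contextlib import contextmanager, nullcontext


class ToyPeftModel:
    """Hypothetical stand-in for a frozen base model plus a LoRA-style adapter."""

    def __init__(self, base_weight, adapter_delta):
        self.base_weight = base_weight
        self.adapter_delta = adapter_delta
        self.adapter_enabled = True

    @contextmanager
    def disable_adapter(self):
        # Temporarily fall back to the frozen base weights.
        self.adapter_enabled = False
        try:
            yield
        finally:
            self.adapter_enabled = True

    def generate(self, x):
        delta = self.adapter_delta if self.adapter_enabled else 0.0
        return (self.base_weight + delta) * x


def generate_pair(model, x, is_peft_model):
    # Active policy: the adapter stays on.
    response = model.generate(x)
    # Reference policy: same object, adapter off for the duration of the call.
    ctx = model.disable_adapter() if is_peft_model else nullcontext()
    with ctx:
        ref_response = model.generate(x)
    return response, ref_response


policy = ToyPeftModel(base_weight=2.0, adapter_delta=0.5)
response, ref_response = generate_pair(policy, 4.0, is_peft_model=True)
print(response, ref_response)  # 10.0 8.0
```

Under this reading, disabling the adapter for the active model would make both responses identical, which is why only the reference path gets the context manager.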

lewtun · 2024-03-26T10:41:23Z

+
+
+@contextmanager
+def unwrap_model_for_generation(


I wasn't sure if this belongs in modeling_base.py or as a utility method here - let me know what you prefer!

younesbelkada

Thanks for 10x-ing PPO Zero-3 🚀

* Speed up PPO by 10x 🔥
* Revert
* Clean up
* Use relative import
* Clean
* Fix typing for docs

huggingface/trl#1483

Gives the ability to add and remove the forward hooks in ZeRO 3 by using a context manager. These code changes were taken from a Huggingface [PR](huggingface/trl#1617) and integrated for direct support in DeepSpeed. This is useful in the inference case and the speedup can be observed [here](huggingface/trl#1483).

Co-authored-by: root <root@deepspeed-c000004.2d1icxc5dsxehnpuwt3ifc34ph.gvxx.internal.cloudapp.net>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Heyang Qin <heyangqin@microsoft.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>

lewtun added 2 commits March 26, 2024 09:27

Speed up PPO by 10x 🔥

9a7e216

Revert

fb95861

Clean up

81e1e2f

lewtun requested review from lvwerra, vwxyzjn and younesbelkada March 26, 2024 10:19

lewtun added 2 commits March 26, 2024 10:28

Use relative import

c90cc67

Clean

443702c

lewtun commented Mar 26, 2024

Fix typing for docs

82c42c1

younesbelkada approved these changes Apr 8, 2024

lewtun merged commit f35b68a into main Apr 8, 2024

lewtun deleted the fast-text-gen branch April 8, 2024 12:30

sngdng mentioned this pull request Apr 17, 2024

Speed up ZeRO-3 generation with DPO #1543

Closed

Shiguang-Guo mentioned this pull request May 3, 2024

Have trouble in ppo example #1618

Closed

lapp0 pushed a commit to lapp0/trl that referenced this pull request May 10, 2024

Speed up PPO with ZeRO-3 by 10x 🔥 (huggingface#1483)

854f8c1

hiyouga added a commit to hiyouga/LlamaFactory that referenced this pull request May 28, 2024

10x generate in ppo w/ zero3

65cd8bd

huggingface/trl#1483

jomayeri mentioned this pull request Jun 13, 2024

Add and Remove ZeRO 3 Hooks deepspeedai/DeepSpeed#5658

Merged

SunMarc mentioned this pull request Oct 4, 2024

Default synced_gpus to True when using FullyShardedDataParallel huggingface/transformers#33483

Merged

dawidm mentioned this pull request Jan 10, 2025

🧩 PPO/RLOO/OnlineDPO sequence generation: make deepsped 3 weight gathering optional #2557

Merged

LeonEricsson mentioned this pull request Apr 20, 2025

GRPO trainer misuses model and model_wrapped? #3314

Closed

yxliu-TAMU pushed a commit to mincheolseong/ECEN743-GRPO-Project-Proposal that referenced this pull request Apr 20, 2025

Speed up PPO with ZeRO-3 by 10x 🔥 (huggingface#1483)

1b5c7dc

yoonseok312 pushed a commit to pensieve-ai/LLaMA-Factory-vlm that referenced this pull request Apr 29, 2025

10x generate in ppo w/ zero3

7a86bf1

huggingface/trl#1483 Former-commit-id: 65cd8bd

liu-qingyuan pushed a commit to liu-qingyuan/LLaMA-Factory-Megafake that referenced this pull request Jun 6, 2025

10x generate in ppo w/ zero3

5e2e9b8

huggingface/trl#1483 Former-commit-id: 65cd8bd

zhongwei1968 pushed a commit to zhongwei1968/LLaMA-Factory that referenced this pull request Aug 1, 2025

10x generate in ppo w/ zero3

e1e606a

huggingface/trl#1483

nmh21207525 pushed a commit to nmh21207525/pwd_memory_task that referenced this pull request Jan 3, 2026

10x generate in ppo w/ zero3

351b4ef

huggingface/trl#1483 Former-commit-id: 5dc43ba8b373d8803bc22d88b3d0d95ef8b9c7f8

lewtun merged 6 commits into main from fast-text-gen
