diff --git a/applications/ColossalChat/README.md b/applications/ColossalChat/README.md
index 705da8c0fe91..8c1c3eb3910a 100755
--- a/applications/ColossalChat/README.md
+++ b/applications/ColossalChat/README.md
@@ -7,32 +7,23 @@ ## Table of Contents
- [Table of Contents](#table-of-contents)
-- [What is ColossalChat and Coati ?](#what-is-colossalchat-and-coati-)
+- [What is ColossalChat?](#what-is-colossalchat)
- [Online demo](#online-demo)
- [Install](#install)
  - [Install the environment](#install-the-environment)
  - [Install the Transformers](#install-the-transformers)
-- [How to use?](#how-to-use)
+- [Introduction](#introduction)
  - [Supervised datasets collection](#step-1-data-collection)
  - [RLHF Training Stage1 - Supervised instructs tuning](#rlhf-training-stage1---supervised-instructs-tuning)
  - [RLHF Training Stage2 - Training reward model](#rlhf-training-stage2---training-reward-model)
  - [RLHF Training Stage3 - Training model with reinforcement learning by human feedback](#rlhf-training-stage3---proximal-policy-optimization)
+  - [Alternative Option for RLHF: GRPO](#alternative-option-for-rlhf-group-relative-policy-optimization-grpo)
+  - [Alternative Option For RLHF: DPO](#alternative-option-for-rlhf-direct-preference-optimization)
+  - [Alternative Option For RLHF: SimPO](#alternative-option-for-rlhf-simple-preference-optimization-simpo)
+  - [Alternative Option For RLHF: ORPO](#alternative-option-for-rlhf-odds-ratio-preference-optimization-orpo)
+  - [Alternative Option For RLHF: KTO](#alternative-option-for-rlhf-kahneman-tversky-optimization-kto)
+  - [SFT for DeepSeek V3/R1](#sft-for-deepseek-v3)
  - [Inference Quantization and Serving - After Training](#inference-quantization-and-serving---after-training)
-- [Coati7B examples](#coati7b-examples)
-  - [Generation](#generation)
-  - [Open QA](#open-qa)
-  - [Limitation for LLaMA-finetuned models](#limitation)
-  - [Limitation of dataset](#limitation)
-- [Alternative Option For RLHF: DPO](#alternative-option-for-rlhf-direct-preference-optimization)
-- [Alternative Option For RLHF: SimPO](#alternative-option-for-rlhf-simple-preference-optimization-simpo)
-- [Alternative Option For RLHF: ORPO](#alternative-option-for-rlhf-odds-ratio-preference-optimization-orpo)
-- [Alternative Option For RLHF: KTO](#alternative-option-for-rlhf-kahneman-tversky-optimization-kto)
-- [O1 Journey](#o1-journey)
-  - [Inference with Self-refined MCTS](#inference-with-self-refined-mcts)
-- [SFT for DeepSeek V3/R1](#sft-for-deepseek-v3)
-- [FAQ](#faq)
-  - [How to save/load checkpoint](#faq)
-  - [How to train with limited resources](#faq)
- [Invitation to open-source contribution](#invitation-to-open-source-contribution)
- [Quick Preview](#quick-preview)
- [Authors](#authors)
@@ -41,9 +32,9 @@
---
-## What Is ColossalChat And Coati ?
+## What is ColossalChat?
-[ColossalChat](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat) is the project to implement LLM with RLHF, powered by the [Colossal-AI](https://github.com/hpcaitech/ColossalAI) project.
+[ColossalChat](https://github.com/hpcaitech/ColossalAI/tree/main/applications/ColossalChat) is a project to implement LLMs with RLHF, powered by the [Colossal-AI](https://github.com/hpcaitech/ColossalAI) project.
Coati stands for `ColossalAI Talking Intelligence`. It is the name for the module implemented in this project and is also the name of the large language model developed by the ColossalChat project.
@@ -54,8 +45,6 @@ The Coati package provides a unified large language model framework that has imp
- Supervised instructions fine-tuning
- Training reward model
- Reinforcement learning with human feedback
-- Quantization inference
-- Fast model deploying
- Perfectly integrated with the Hugging Face ecosystem, a high degree of model customization
-#### Step 1: Data Collection
-PPO uses two kind of training data--- the prompt data and the sft data (optional). The first dataset is mandatory, data samples within the prompt dataset ends with a line from "human" and thus the "assistant" needs to generate a response to answer to the "human". Note that you can still use conversation that ends with a line from the "assistant", in that case, the last line will be dropped. Here is an example of the prompt dataset format.
-
-```json
-[
- {"messages":
- [
- {
- "from": "human",
- "content": "what are some pranks with a pen i can do?"
- }
- ]
- },
-]
-```
-
-#### Step 2: Data Preprocessing
-To prepare the prompt dataset for PPO training, simply run [prepare_prompt_dataset.sh](./examples/data_preparation_scripts/prepare_prompt_dataset.sh)
-
-#### Step 3: Training
-You can run the [train_ppo.sh](./examples/training_scripts/train_ppo.sh) to start PPO training. Here are some unique arguments for PPO, please refer to the training configuration section for other training configuration. More detais can be found in [example guideline](./examples/README.md).
-
-```bash
---pretrain $PRETRAINED_MODEL_PATH \
---rm_pretrain $PRETRAINED_MODEL_PATH \ # reward model architectual
---tokenizer_dir $PRETRAINED_TOKENIZER_PATH \
---rm_checkpoint_path $REWARD_MODEL_PATH \ # reward model checkpoint path
---prompt_dataset ${prompt_dataset[@]} \ # List of string, the prompt dataset
---ptx_dataset ${ptx_dataset[@]} \ # List of string, the SFT data used in the SFT stage
---ptx_batch_size 1 \ # batch size for calculate ptx loss
---ptx_coef 0.0 \ # none-zero if ptx loss is enable
---num_episodes 2000 \ # number of episodes to train
---num_collect_steps 1 \
---num_update_steps 1 \
---experience_batch_size 8 \
---train_batch_size 4 \
---accumulation_steps 2
-```
-
-Each episode has two phases, the collect phase and the update phase. During the collect phase, we will collect experiences (answers generated by actor), store those in ExperienceBuffer. Then data in ExperienceBuffer is used during the update phase to update parameter of actor and critic.
-- Without tensor parallelism,
-```
-experience buffer size
-= num_process * num_collect_steps * experience_batch_size
-= train_batch_size * accumulation_steps * num_process
-```
-
-- With tensor parallelism,
-```
-num_tp_group = num_process / tp
-experience buffer size
-= num_tp_group * num_collect_steps * experience_batch_size
-= train_batch_size * accumulation_steps * num_tp_group
-```
-
-## Alternative Option For RLHF: Direct Preference Optimization (DPO)
+### Alternative Option For RLHF: Direct Preference Optimization (DPO)
For those seeking an alternative to Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO) presents a compelling option. As detailed in this [paper](https://arxiv.org/abs/2305.18290), DPO offers a low-cost way to perform RLHF and usually requires fewer computation resources than PPO. Read this [README](./examples/README.md) for more information.
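+
+At its core, DPO replaces the reward model and the PPO loop with a single logistic loss on the difference of implicit rewards between the chosen and rejected responses. The snippet below is only a minimal sketch of that loss (function name, tensor shapes, and the `beta` value are illustrative, not the Coati implementation):
+
+```python
+import torch.nn.functional as F
+
+def dpo_loss(policy_chosen_logps, policy_rejected_logps,
+             ref_chosen_logps, ref_rejected_logps, beta=0.1):
+    # Implicit reward: beta * (log-prob under policy - log-prob under frozen reference).
+    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
+    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
+    # Logistic loss on the reward margin: push chosen above rejected.
+    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
+```
+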
-### DPO Training Stage1 - Supervised Instructs Tuning
-
-Please refer the [sft section](#dpo-training-stage1---supervised-instructs-tuning) in the PPO part.
-
-### DPO Training Stage2 - DPO Training
-#### Step 1: Data Collection & Preparation
-For DPO training, you only need the preference dataset. Please follow the instruction in the [preference dataset preparation section](#rlhf-training-stage2---training-reward-model) to prepare the preference data for DPO training.
-
-#### Step 2: Training
-You can run the [train_dpo.sh](./examples/training_scripts/train_dpo.sh) to start DPO training. More detais can be found in [example guideline](./examples/README.md).
-
-## Alternative Option For RLHF: Simple Preference Optimization (SimPO)
+### Alternative Option For RLHF: Simple Preference Optimization (SimPO)
Simple Preference Optimization (SimPO), from this [paper](https://arxiv.org/pdf/2405.14734), is similar to DPO but drops the reference model, which makes training more efficient. It adds a reward-shaping term called the target reward margin to enhance training stability, and it uses length normalization to better align with the inference process. Read this [README](./examples/README.md) for more information.
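+
+A minimal sketch of the SimPO loss, assuming the summed per-token log-probabilities and response lengths are precomputed (the `beta` and `gamma` defaults below are illustrative, not the values used in this repo):
+
+```python
+import torch.nn.functional as F
+
+def simpo_loss(chosen_logps_sum, rejected_logps_sum,
+               chosen_len, rejected_len, beta=2.0, gamma=0.5):
+    # Length-normalized implicit reward: no reference model is needed.
+    chosen_reward = beta * chosen_logps_sum / chosen_len
+    rejected_reward = beta * rejected_logps_sum / rejected_len
+    # gamma is the target reward margin the chosen response must exceed.
+    return -F.logsigmoid(chosen_reward - rejected_reward - gamma).mean()
+```
+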
-## Alternative Option For RLHF: Odds Ratio Preference Optimization (ORPO)
+### Alternative Option For RLHF: Odds Ratio Preference Optimization (ORPO)
Odds Ratio Preference Optimization (ORPO), from this [paper](https://arxiv.org/pdf/2403.07691), is a reference-model-free alignment method that uses a mixture of the SFT loss and a reinforcement-learning-style loss calculated from an odds-ratio-based implicit reward, which makes training more efficient and stable. Read this [README](./examples/README.md) for more information.
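+
+A minimal sketch of the ORPO objective under simplifying assumptions (average per-token log-probabilities for the chosen and rejected responses, an already-computed SFT loss, and an illustrative weighting factor `lam`):
+
+```python
+import torch
+import torch.nn.functional as F
+
+def orpo_loss(chosen_logps_avg, rejected_logps_avg, sft_loss, lam=0.1):
+    # log odds(y) = log p(y) - log(1 - p(y)), with p(y) = exp(average log-prob).
+    log_odds_chosen = chosen_logps_avg - torch.log1p(-torch.exp(chosen_logps_avg))
+    log_odds_rejected = rejected_logps_avg - torch.log1p(-torch.exp(rejected_logps_avg))
+    # Odds-ratio preference term: favor the chosen response, no reference model.
+    or_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected).mean()
+    return sft_loss + lam * or_loss
+```
+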
-## Alternative Option For RLHF: Kahneman-Tversky Optimization (KTO)
+### Alternative Option For RLHF: Kahneman-Tversky Optimization (KTO)
We support the method introduced in the paper [KTO: Model Alignment as Prospect Theoretic Optimization](https://arxiv.org/pdf/2402.01306) (KTO), an alignment method that directly maximizes the "human utility" of generation results. Read this [README](./examples/README.md) for more information.
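+
+A minimal sketch of the KTO loss under simplifying assumptions (per-sample sequence log-probabilities, a boolean mask marking desirable samples, a precomputed KL reference point, and illustrative `beta`/`lambda` values):
+
+```python
+import torch
+
+def kto_loss(policy_logps, ref_logps, is_desirable, kl_ref,
+             beta=0.1, lambda_d=1.0, lambda_u=1.0):
+    # Implicit reward relative to the frozen reference model.
+    rewards = policy_logps - ref_logps
+    # Prospect-theoretic value: desirable samples are pushed above the KL
+    # reference point, undesirable samples are pushed below it.
+    value_desirable = lambda_d * torch.sigmoid(beta * (rewards - kl_ref))
+    value_undesirable = lambda_u * torch.sigmoid(beta * (kl_ref - rewards))
+    values = torch.where(is_desirable, value_desirable, value_undesirable)
+    return (1.0 - values).mean()
+```
+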
-## Inference Quantization and Serving - After Training
+### Alternative Option For RLHF: Group Relative Policy Optimization (GRPO)
+We support GRPO, the main algorithm used to train the DeepSeek R1 model. GRPO is a variant of Proximal Policy Optimization (PPO) that enhances mathematical reasoning abilities while reducing the memory usage of PPO. Read this [README](./examples/README.md) for more information.
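+
+GRPO removes the learned critic: for each prompt it samples a group of responses, scores them, and uses the group-normalized score as the advantage in a PPO-style clipped objective. Below is a minimal sketch of that idea (shapes and the clipping value are illustrative, not the Coati implementation):
+
+```python
+import torch
+
+def grpo_advantages(rewards):
+    # rewards: (num_prompts, group_size), one scalar reward per sampled response.
+    mean = rewards.mean(dim=-1, keepdim=True)
+    std = rewards.std(dim=-1, keepdim=True)
+    # Group-relative advantage replaces the critic's value estimate.
+    return (rewards - mean) / (std + 1e-8)
+
+def grpo_policy_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
+    # Standard PPO clipped surrogate, driven by group-relative advantages.
+    ratio = torch.exp(log_probs - old_log_probs)
+    surr1 = ratio * advantages
+    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
+    return -torch.min(surr1, surr2).mean()
+```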
+
+### SFT for DeepSeek V3
+We support fine-tuning the DeepSeek V3/R1 model with LoRA. Read this [README](./examples/README.md) for more information.
+
+### Inference Quantization and Serving - After Training
We provide an online inference server and a benchmark. We aim to run inference on a single GPU, so quantization is essential when using large models.
@@ -282,213 +150,7 @@ We support 8-bit quantization (RTN), 4-bit quantization (GPTQ), and FP16 inferen
Online inference server scripts can help you deploy your own services.
For more details, see [`inference/`](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/inference).
-## O1 Journey
-### Inference with Self-refined MCTS
-We provide the implementation of MCT Self-Refine (MCTSr) algorithm, an innovative integration of Large Language Models with Monte Carlo Tree Search.
-You can serve model using vLLM and update the config file in `Qwen32B_prompt_CFG` and then run the following script.
-```python
-from coati.reasoner.guided_search.mcts import MCTS
-from coati.reasoner.guided_search.prompt_store.qwen import Qwen32B_prompt_CFG
-
-problem = "How Many R in 'Strawberry'"
-
-search_tree = MCTS(problem=problem, max_simulations=8, cfg=Qwen32B_prompt_CFG)
-answer = search_tree.simulate()
-print(answer)
-```
-
-## Coati7B examples
-
-### Generation
-
-
-
-
-
-
-
-
+
+
-
-
+
+
-## Hardware Requirements
+### SFT for DeepSeek V3
+We add a script for supervised fine-tuning of the DeepSeek V3/R1 model with LoRA. The script is located at `examples/training_scripts/lora_finetune.py`. It is similar to the SFT script for Coati7B, but with a few differences. This script is compatible with PEFT.
+
+#### Dataset preparation
+
+This script takes a JSONL file as the input dataset; each line of the dataset is a list of chat messages. For example:
+```json
+[{"role": "user", "content": "Hello, how are you?"}, {"role": "assistant", "content": "I'm doing great. How can I help you today?"}]
+```
+```json
+[{"role": "user", "content": "火烧赤壁 曹操为何不拨打119求救?"}, {"role": "assistant", "content": "因为在三国时期,还没有电话和现代的消防系统,所以曹操无法拨打119求救。"}]
+```
+
+The dialogues can span multiple turns and can include a system prompt. For more details, see [chat templating](https://huggingface.co/docs/transformers/main/chat_templating).
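+
+Before launching training, you can sanity-check how one line of the dataset will be rendered by the model's chat template. A minimal sketch (the checkpoint and dataset paths below are placeholders):
+
+```python
+import json
+from transformers import AutoTokenizer
+
+# Placeholder paths: point them at your local bf16 checkpoint and dataset.
+tokenizer = AutoTokenizer.from_pretrained("path-to-DeepSeek-R1-bf16", trust_remote_code=True)
+
+with open("path-to-dataset.jsonl") as f:
+    messages = json.loads(f.readline())  # one line = one list of chat messages
+
+# Render the conversation with the model's built-in chat template.
+print(tokenizer.apply_chat_template(messages, tokenize=False))
+```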
+
+#### Model weights preparation
+
+We use bf16 weights for fine-tuning. If you downloaded fp8 DeepSeek V3/R1 weights, you can use this [script](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py) to convert the weights to bf16 on GPU. For Ascend NPU, you can use this [script](https://gitee.com/ascend/ModelZoo-PyTorch/blob/master/MindIE/LLM/DeepSeek/DeepSeek-V2/NPU_inference/fp8_cast_bf16.py).
+
+#### Usage
+
+After preparing the dataset and model weights, you can run the script with the following command:
+```bash
+colossalai run --hostfile path-to-host-file --nproc_per_node 8 lora_finetune.py --pretrained path-to-DeepSeek-R1-bf16 --dataset path-to-dataset.jsonl --plugin moe --lr 2e-5 --max_length 256 -g --ep 8 --pp 3 --batch_size 24 --lora_rank 8 --lora_alpha 16 --num_epochs 2 --warmup_steps 8 --tensorboard_dir logs --save_dir DeepSeek-R1-bf16-lora
+```
+
+For more details of each argument, you can run `python lora_finetune.py --help`.
+
+The sample command does not use CPU offload, in order to get better throughput. The minimum hardware requirement for the sample command is 32 Ascend 910B NPUs (with `ep=8,pp=4`) or 24 H100/H800 GPUs (with `ep=8,pp=3`). If you enable CPU offload with `--zero_cpu_offload`, the hardware requirement can be further reduced.
+
+## Hardware Requirements
For SFT, we recommend using the zero2 or zero2-cpu plugin for a 7B model, and tensor parallelism (tp) if your model is extra large. We tested the VRAM consumption on a dummy dataset with a sequence length of 2048. In all experiments, we use H800 GPUs with 80GB VRAM and enable gradient checkpointing and flash attention.
- 2 H800 GPU
- zero2-cpu, micro batch size=4, VRAM Usage=22457.98 MB
@@ -942,35 +960,9 @@ For KTO, we recommend using zero2-cpu or zero2 plugin, We tested the VRAM consum
- zero2_cpu, micro batch size=2, VRAM_USAGE=32443.22 MB
- zero2, micro batch size=4, VRAM_USAGE=59307.97 MB
-## List of Supported Models
-
-For SFT, we support the following models/series:
-- Colossal-LLaMA-2
-- ChatGLM2
-- ChatGLM3 (only with zero2, zero2_cpu plugin)
-- Baichuan2
-- LLaMA2
-- Qwen1.5-7B-Chat (with transformers==4.39.1)
-- Yi-1.5
-
-For PPO and DPO, we theoratically support the following models/series (without guarantee):
-- Colossal-LLaMA-2 (tested)
-- ChatGLM2
-- Baichuan2
-- LLaMA2 (tested)
-- Qwen1.5-7B-Chat (with transformers==4.39.1)
-- Yi-1.5
-
-*-* The zero2, zero2_cpu plugin also support a wide range of chat models not listed above.
-
## Inference example
-
-
We support different inference options, including int8 and int4 quantization.
For details, see [`inference/`](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/inference).
-
## Attention
-
-
The examples are demos of the whole training process. You need to tune the hyper-parameters to achieve good performance.
diff --git a/applications/ColossalChat/examples/ray/1mmt_prompt.py b/applications/ColossalChat/examples/ray/1mmt_prompt.py
deleted file mode 100755
index 8de6219ec4e9..000000000000
--- a/applications/ColossalChat/examples/ray/1mmt_prompt.py
+++ /dev/null
@@ -1,181 +0,0 @@
-import argparse
-import os
-import socket
-from functools import partial
-
-import pandas as pd
-import ray
-from coati.quant import llama_load_quant, low_resource_init
-from coati.ray.detached_trainer_ppo import DetachedPPOTrainer
-from coati.ray.experience_maker_holder import ExperienceMakerHolder
-from coati.ray.utils import (
- get_actor_from_args,
- get_critic_from_args,
- get_reward_model_from_args,
- get_strategy_from_args,
- get_tokenizer_from_args,
-)
-from torch.utils.data import DataLoader
-from transformers import AutoConfig
-from transformers.modeling_utils import no_init_weights
-
-
-def get_free_port():
- with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
- s.bind(("", 0))
- return s.getsockname()[1]
-
-
-def get_local_ip():
- with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
- s.connect(("8.8.8.8", 80))
- return s.getsockname()[0]
-
-
-def main(args):
- master_addr = str(get_local_ip())
- # trainer_env_info
- trainer_port = str(get_free_port())
- env_info_trainers = [
- {
- "local_rank": "0",
- "rank": str(rank),
- "world_size": str(args.num_trainers),
- "master_port": trainer_port,
- "master_addr": master_addr,
- }
- for rank in range(args.num_trainers)
- ]
-
- # maker_env_info
- maker_port = str(get_free_port())
- env_info_maker = {
- "local_rank": "0",
- "rank": "0",
- "world_size": "1",
- "master_port": maker_port,
- "master_addr": master_addr,
- }
-
- # configure tokenizer
- tokenizer = get_tokenizer_from_args(args.model)
-
- def trainer_model_fn():
- actor = get_actor_from_args(args.model, args.pretrain).half().cuda()
- critic = get_critic_from_args(args.model, args.critic_pretrain).half().cuda()
- return actor, critic
-
- # configure Trainer
- trainer_refs = [
- DetachedPPOTrainer.options(name=f"trainer{i}", num_gpus=1, max_concurrency=2).remote(
- experience_maker_holder_name_list=["maker1"],
- strategy_fn=partial(get_strategy_from_args, args.trainer_strategy),
- model_fn=trainer_model_fn,
- env_info=env_info_trainer,
- train_batch_size=args.train_batch_size,
- buffer_limit=16,
- eval_performance=True,
- debug=args.debug,
- update_lora_weights=not (args.lora_rank == 0),
- )
- for i, env_info_trainer in enumerate(env_info_trainers)
- ]
-
- def model_fn():
- actor = get_actor_from_args(args.model, args.pretrain).requires_grad_(False).half().cuda()
- critic = get_critic_from_args(args.model, args.critic_pretrain).requires_grad_(False).half().cuda()
- reward_model = get_reward_model_from_args(args.model, args.critic_pretrain).requires_grad_(False).half().cuda()
- if args.initial_model_quant_ckpt is not None and args.model == "llama":
- # quantize initial model
- actor_cfg = AutoConfig.from_pretrained(args.pretrain)
- with low_resource_init(), no_init_weights():
- initial_model = get_actor_from_args(args.model, config=actor_cfg)
- initial_model.model = (
- llama_load_quant(
- initial_model.model, args.initial_model_quant_ckpt, args.quant_bits, args.quant_group_size
- )
- .cuda()
- .requires_grad_(False)
- )
- else:
- initial_model = get_actor_from_args(args.model, args.pretrain).requires_grad_(False).half().cuda()
- return actor, critic, reward_model, initial_model
-
- # configure Experience Maker
- experience_holder_ref = ExperienceMakerHolder.options(name="maker1", num_gpus=1, max_concurrency=2).remote(
- detached_trainer_name_list=[f"trainer{i}" for i in range(args.num_trainers)],
- strategy_fn=partial(get_strategy_from_args, args.maker_strategy),
- model_fn=model_fn,
- env_info=env_info_maker,
- experience_batch_size=args.experience_batch_size,
- kl_coef=0.1,
- debug=args.debug,
- update_lora_weights=not (args.lora_rank == 0),
- # sync_models_from_trainers=True,
- # generation kwargs:
- max_length=512,
- do_sample=True,
- temperature=1.0,
- top_k=50,
- pad_token_id=tokenizer.pad_token_id,
- eos_token_id=tokenizer.eos_token_id,
- eval_performance=True,
- use_cache=True,
- )
-
- # uncomment this function if sync_models_from_trainers is True
- # ray.get([
- # trainer_ref.sync_models_to_remote_makers.remote()
- # for trainer_ref in trainer_refs
- # ])
-
- wait_tasks = []
-
- total_steps = args.experience_batch_size * args.experience_steps // (args.num_trainers * args.train_batch_size)
- for trainer_ref in trainer_refs:
- wait_tasks.append(trainer_ref.fit.remote(total_steps, args.update_steps, args.train_epochs))
-
- dataset_size = args.experience_batch_size * 4
-
- def build_dataloader():
- def tokenize_fn(texts):
- batch = tokenizer(texts, return_tensors="pt", max_length=96, padding="max_length", truncation=True)
- return {k: v.cuda() for k, v in batch.items()}
-
- dataset = pd.read_csv(args.prompt_path)["prompt"]
- dataloader = DataLoader(dataset=dataset, batch_size=dataset_size, shuffle=True, collate_fn=tokenize_fn)
- return dataloader
-
- wait_tasks.append(experience_holder_ref.workingloop.remote(build_dataloader, num_steps=args.experience_steps))
-
- ray.get(wait_tasks)
-
-
-if __name__ == "__main__":
- parser = argparse.ArgumentParser()
- parser.add_argument("--prompt_path", type=str, default=None)
- parser.add_argument("--num_trainers", type=int, default=1)
- parser.add_argument(
- "--trainer_strategy",
- choices=["ddp", "colossalai_gemini", "colossalai_zero2", "colossalai_gemini_cpu", "colossalai_zero2_cpu"],
- default="ddp",
- )
- parser.add_argument("--maker_strategy", choices=["naive"], default="naive")
- parser.add_argument("--model", default="gpt2", choices=["gpt2", "bloom", "opt", "llama"])
- parser.add_argument("--critic_model", default="gpt2", choices=["gpt2", "bloom", "opt", "llama"])
- parser.add_argument("--pretrain", type=str, default=None)
- parser.add_argument("--critic_pretrain", type=str, default=None)
- parser.add_argument("--experience_steps", type=int, default=4)
- parser.add_argument("--experience_batch_size", type=int, default=8)
- parser.add_argument("--train_epochs", type=int, default=1)
- parser.add_argument("--update_steps", type=int, default=2)
- parser.add_argument("--train_batch_size", type=int, default=8)
- parser.add_argument("--lora_rank", type=int, default=0, help="low-rank adaptation matrices rank")
-
- parser.add_argument("--initial_model_quant_ckpt", type=str, default=None)
- parser.add_argument("--quant_bits", type=int, default=4)
- parser.add_argument("--quant_group_size", type=int, default=128)
- parser.add_argument("--debug", action="store_true")
- args = parser.parse_args()
- ray.init(namespace=os.environ["RAY_NAMESPACE"], runtime_env={"env_vars": dict(os.environ)})
- main(args)
diff --git a/applications/ColossalChat/examples/ray/mmmt_prompt.py b/applications/ColossalChat/examples/ray/mmmt_prompt.py
deleted file mode 100755
index 7c03a0468b02..000000000000
--- a/applications/ColossalChat/examples/ray/mmmt_prompt.py
+++ /dev/null
@@ -1,201 +0,0 @@
-import argparse
-import os
-import socket
-from functools import partial
-
-import pandas as pd
-import ray
-from coati.quant import llama_load_quant, low_resource_init
-from coati.ray.detached_trainer_ppo import DetachedPPOTrainer
-from coati.ray.experience_maker_holder import ExperienceMakerHolder
-from coati.ray.utils import (
- get_actor_from_args,
- get_critic_from_args,
- get_receivers_per_sender,
- get_reward_model_from_args,
- get_strategy_from_args,
-)
-from torch.utils.data import DataLoader
-from transformers import AutoConfig, AutoTokenizer
-from transformers.modeling_utils import no_init_weights
-
-
-def get_free_port():
- with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
- s.bind(("", 0))
- return s.getsockname()[1]
-
-
-def get_local_ip():
- with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
- s.connect(("8.8.8.8", 80))
- return s.getsockname()[0]
-
-
-def main(args):
- master_addr = str(get_local_ip())
- # trainer_env_info
- trainer_port = str(get_free_port())
- env_info_trainers = [
- {
- "local_rank": "0",
- "rank": str(rank),
- "world_size": str(args.num_trainers),
- "master_port": trainer_port,
- "master_addr": master_addr,
- }
- for rank in range(args.num_trainers)
- ]
-
- # maker_env_info
- maker_port = str(get_free_port())
- env_info_makers = [
- {
- "local_rank": "0",
- "rank": str(rank),
- "world_size": str(args.num_makers),
- "master_port": maker_port,
- "master_addr": master_addr,
- }
- for rank in range(args.num_makers)
- ]
-
- # configure tokenizer
- tokenizer = AutoTokenizer.from_pretrained(args.pretrain)
- tokenizer.pad_token = tokenizer.eos_token
-
- def model_fn():
- actor = get_actor_from_args(args.model, args.pretrain).requires_grad_(False).half().cuda()
- critic = get_critic_from_args(args.model, args.critic_pretrain).requires_grad_(False).half().cuda()
- reward_model = get_reward_model_from_args(args.model, args.critic_pretrain).requires_grad_(False).half().cuda()
- if args.initial_model_quant_ckpt is not None and args.model == "llama":
- # quantize initial model
- actor_cfg = AutoConfig.from_pretrained(args.pretrain)
- with low_resource_init(), no_init_weights():
- initial_model = get_actor_from_args(args.model, config=actor_cfg)
- initial_model.model = (
- llama_load_quant(
- initial_model.model, args.initial_model_quant_ckpt, args.quant_bits, args.quant_group_size
- )
- .cuda()
- .requires_grad_(False)
- )
- else:
- initial_model = get_actor_from_args(args.model, args.pretrain).requires_grad_(False).half().cuda()
- return actor, critic, reward_model, initial_model
-
- # configure Experience Maker
- experience_holder_refs = [
- ExperienceMakerHolder.options(name=f"maker{i}", num_gpus=1, max_concurrency=2).remote(
- detached_trainer_name_list=[
- f"trainer{x}"
- for x in get_receivers_per_sender(i, args.num_makers, args.num_trainers, allow_idle_sender=False)
- ],
- strategy_fn=partial(get_strategy_from_args, args.maker_strategy),
- model_fn=model_fn,
- env_info=env_info_maker,
- kl_coef=0.1,
- debug=args.debug,
- update_lora_weights=not (args.lora_rank == 0),
- # sync_models_from_trainers=True,
- # generation kwargs:
- max_length=512,
- do_sample=True,
- temperature=1.0,
- top_k=50,
- pad_token_id=tokenizer.pad_token_id,
- eos_token_id=tokenizer.eos_token_id,
- eval_performance=True,
- use_cache=True,
- )
- for i, env_info_maker in enumerate(env_info_makers)
- ]
-
- def trainer_model_fn():
- actor = get_actor_from_args(args.model, args.pretrain, lora_rank=args.lora_rank).half().cuda()
- critic = get_critic_from_args(args.model, args.critic_pretrain, lora_rank=args.lora_rank).half().cuda()
- return actor, critic
-
- # configure Trainer
- trainer_refs = [
- DetachedPPOTrainer.options(name=f"trainer{i}", num_gpus=1, max_concurrency=2).remote(
- experience_maker_holder_name_list=[
- f"maker{x}"
- for x in get_receivers_per_sender(i, args.num_trainers, args.num_makers, allow_idle_sender=True)
- ],
- strategy_fn=partial(get_strategy_from_args, args.trainer_strategy),
- model_fn=trainer_model_fn,
- env_info=env_info_trainer,
- train_batch_size=args.train_batch_size,
- buffer_limit=16,
- eval_performance=True,
- debug=args.debug,
- update_lora_weights=not (args.lora_rank == 0),
- )
- for i, env_info_trainer in enumerate(env_info_trainers)
- ]
-
- dataset_size = args.experience_batch_size * 4
-
- def build_dataloader():
- def tokenize_fn(texts):
- batch = tokenizer(texts, return_tensors="pt", max_length=96, padding="max_length", truncation=True)
- return {k: v.cuda() for k, v in batch.items()}
-
- dataset = pd.read_csv(args.prompt_path)["prompt"]
- dataloader = DataLoader(dataset=dataset, batch_size=dataset_size, shuffle=True, collate_fn=tokenize_fn)
- return dataloader
-
- # uncomment this function if sync_models_from_trainers is True
- # ray.get([
- # trainer_ref.sync_models_to_remote_makers.remote()
- # for trainer_ref in trainer_refs
- # ])
-
- wait_tasks = []
-
- for experience_holder_ref in experience_holder_refs:
- wait_tasks.append(experience_holder_ref.workingloop.remote(build_dataloader, num_steps=args.experience_steps))
-
- total_steps = (
- args.experience_batch_size
- * args.experience_steps
- * args.num_makers
- // (args.num_trainers * args.train_batch_size)
- )
- for trainer_ref in trainer_refs:
- wait_tasks.append(trainer_ref.fit.remote(total_steps, args.update_steps, args.train_epochs))
-
- ray.get(wait_tasks)
-
-
-if __name__ == "__main__":
- parser = argparse.ArgumentParser()
- parser.add_argument("--prompt_path", type=str, default=None)
- parser.add_argument("--num_makers", type=int, default=1)
- parser.add_argument("--num_trainers", type=int, default=1)
- parser.add_argument(
- "--trainer_strategy",
- choices=["ddp", "colossalai_gemini", "colossalai_zero2", "colossalai_gemini_cpu", "colossalai_zero2_cpu"],
- default="ddp",
- )
- parser.add_argument("--maker_strategy", choices=["naive"], default="naive")
- parser.add_argument("--model", default="gpt2", choices=["gpt2", "bloom", "opt", "llama"])
- parser.add_argument("--critic_model", default="gpt2", choices=["gpt2", "bloom", "opt", "llama"])
- parser.add_argument("--pretrain", type=str, default=None)
- parser.add_argument("--critic_pretrain", type=str, default=None)
- parser.add_argument("--experience_steps", type=int, default=4)
- parser.add_argument("--experience_batch_size", type=int, default=8)
- parser.add_argument("--train_epochs", type=int, default=1)
- parser.add_argument("--update_steps", type=int, default=2)
- parser.add_argument("--train_batch_size", type=int, default=8)
- parser.add_argument("--lora_rank", type=int, default=0, help="low-rank adaptation matrices rank")
-
- parser.add_argument("--initial_model_quant_ckpt", type=str, default=None)
- parser.add_argument("--quant_bits", type=int, default=4)
- parser.add_argument("--quant_group_size", type=int, default=128)
- parser.add_argument("--debug", action="store_true")
- args = parser.parse_args()
-
- ray.init(namespace=os.environ["RAY_NAMESPACE"], runtime_env={"env_vars": dict(os.environ)})
- main(args)
diff --git a/applications/ColossalChat/examples/ray/requirements.txt b/applications/ColossalChat/examples/ray/requirements.txt
deleted file mode 100755
index e0275631807f..000000000000
--- a/applications/ColossalChat/examples/ray/requirements.txt
+++ /dev/null
@@ -1 +0,0 @@
-ray
diff --git a/applications/ColossalChat/examples/ray/test_ci.sh b/applications/ColossalChat/examples/ray/test_ci.sh
deleted file mode 100755
index 895f7de0fea9..000000000000
--- a/applications/ColossalChat/examples/ray/test_ci.sh
+++ /dev/null
@@ -1,12 +0,0 @@
-#!/bin/bash
-
-set -xe
-BASE=$(realpath $(dirname $0))
-
-export RAY_NAMESPACE=admin
-export DATA=/data/scratch/chatgpt/prompts.csv
-
-# install requirements
-pip install -r ${BASE}/requirements.txt
-
-python ${BASE}/mmmt_prompt.py --prompt_path $DATA --num_makers 2 --num_trainers 2 --trainer_strategy colossalai_gemini --model opt --critic_model opt --pretrain facebook/opt-350m --critic_pretrain facebook/opt-125m --experience_batch_size 4 --train_batch_size 2