Merged

Co #33

Commits (23)
- `739cfe3` [chat] fix enable single gpu training bug (zhang-yi-chi, Apr 22, 2023)
- `179558a` [devops] fix chat ci (#3628) (ver217, Apr 24, 2023)
- `df309fc` [Chat] Remove duplicate functions (#3625) (ddobokki, Apr 24, 2023)
- `e1b0a78` Merge pull request #3621 from zhang-yi-chi/fix/chat-train-prompts-sin… (TongLi3701, Apr 24, 2023)
- `3837fa2` Merge pull request #28 from jamesthesnake/ra (jamesthesnake, Apr 25, 2023)
- `b622015` Merge pull request #29 from jamesthesnake/l (jamesthesnake, Apr 25, 2023)
- `b9a8dff` [doc] Fix typo under colossalai and doc(#3618) (digger-yu, Apr 26, 2023)
- `4b3240c` [booster] add low level zero plugin (#3594) (ver217, Apr 26, 2023)
- `50793b3` [gemini] accelerate inference (#3641) (ver217, Apr 26, 2023)
- `f828831` [chat] polish performance evaluator (#3647) (ver217, Apr 26, 2023)
- `2a95195` [chat] refactor trainer (#3648) (ver217, Apr 26, 2023)
- `8bccb72` [Doc] enhancement on README.md for chat examples (#3646) (Camille7777, Apr 27, 2023)
- `6ef7011` [chat] remove lm model class (#3653) (ver217, Apr 27, 2023)
- `842768a` [chat] refactor model save/load logic (#3654) (ver217, Apr 27, 2023)
- `a22407c` [zero] Suggests a minor change to confusing variable names in the ZeR… (yhna940, Apr 27, 2023)
- `aa77dda` remove unnecessary step and update readme (TongLi3701, Apr 27, 2023)
- `c419117` update questions and readme (TongLi3701, Apr 27, 2023)
- `ed3eaa6` update documentation (TongLi3701, Apr 28, 2023)
- `c1a3559` update readme (TongLi3701, Apr 28, 2023)
- `268b3cd` [chat] set default zero2 strategy (#3667) (binmakeswell, Apr 28, 2023)
- `816add7` Merge pull request #3656 from TongLi3701/chat/update_eval (TongLi3701, Apr 28, 2023)
- `1a60dc0` [chat] typo accimulation_steps -> accumulation_steps (#3662) (tanitna, Apr 28, 2023)
- `fa5097a` Merge pull request #32 from hpcaitech/main (jamesthesnake, Apr 30, 2023)
4 changes: 4 additions & 0 deletions .github/workflows/run_chatgpt_examples.yml

```diff
@@ -13,6 +13,10 @@ on:
 jobs:
   tests:
     name: Run ChatGPT examples
+    if: |
+      github.event.pull_request.draft == false &&
+      github.base_ref == 'main' &&
+      github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI'
     runs-on: [self-hosted, gpu]
     container:
       image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
```
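Both workflow files in this PR gain the same guard: the job now runs only for non-draft pull requests that target `main` on `hpcaitech/ColossalAI`, which keeps drafts and fork-internal PRs off the self-hosted GPU runners. A minimal standalone sketch of the same gating pattern (hypothetical workflow, not part of this PR; the job body is a placeholder):

```yaml
# Hypothetical example of the PR-gating pattern used above.
name: gated-gpu-ci
on:
  pull_request:
    types: [synchronize, opened, reopened]

jobs:
  tests:
    # Skip draft PRs and PRs whose base is not main on the upstream repo.
    if: |
      github.event.pull_request.draft == false &&
      github.base_ref == 'main' &&
      github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: echo "expensive tests would run here"
```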
22 changes: 13 additions & 9 deletions .github/workflows/run_chatgpt_unit_tests.yml

```diff
@@ -4,16 +4,20 @@ on:
   pull_request:
     types: [synchronize, opened, reopened]
     paths:
-      - 'applications/ChatGPT/chatgpt/**'
-      - 'applications/ChatGPT/requirements.txt'
-      - 'applications/ChatGPT/setup.py'
-      - 'applications/ChatGPT/requirements-test.txt'
-      - 'applications/ChatGPT/tests/**'
-      - 'applications/ChatGPT/pytest.ini'
+      - 'applications/Chat/coati/**'
+      - 'applications/Chat/requirements.txt'
+      - 'applications/Chat/setup.py'
+      - 'applications/Chat/requirements-test.txt'
+      - 'applications/Chat/tests/**'
+      - 'applications/Chat/pytest.ini'
 
 jobs:
   tests:
     name: Run ChatGPT unit tests
+    if: |
+      github.event.pull_request.draft == false &&
+      github.base_ref == 'main' &&
+      github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI'
     runs-on: [self-hosted, gpu]
     container:
       image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
@@ -28,14 +32,14 @@ jobs:
 
       - name: Install ColossalAI and ChatGPT
         run: |
           pip install -v .
-          cd applications/ChatGPT
-          pip install -e .
+          cd applications/Chat
+          pip install -v .
           pip install -r requirements-test.txt
 
       - name: Execute Unit Testing
         run: |
-          cd applications/ChatGPT
+          cd applications/Chat
           rm -rf ~/.cache/colossalai
           pytest tests/
         env:
```
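Besides the `ChatGPT` to `Chat` path rename, the install step switches from an editable install to a regular one. For readers unfamiliar with pip, the difference (standard pip behavior, not specific to this repo):

```shell
pip install -e .   # editable: site-packages links to the working tree; code edits apply without reinstalling
pip install -v .   # regular install: builds and copies the package; -v only makes the output verbose
```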
20 changes: 14 additions & 6 deletions applications/Chat/README.md

````diff
@@ -243,18 +243,23 @@ from coati.trainer import SFTTrainer
 model = LlamaLM(pretrained=args.pretrain)
 tokenizer = AutoTokenizer.from_pretrained(args.pretrain)
 
 (model, optim) = strategy.prepare((model, optim))
 trainer = SFTTrainer(model=model,
                      strategy=strategy,
                      optim=optim,
                      train_dataloader=train_dataloader,
                      eval_dataloader=eval_dataloader,
                      batch_size=args.batch_size,
                      max_epochs=args.max_epochs,
-                     accimulation_steps = args.accimulation_steps
+                     accumulation_steps = args.accumulation_steps
                      )
 
 trainer.fit()
-trainer.save_model(path=args.save_path, only_rank0=True, tokenizer=tokenizer)
+# this saves in pytorch format
+strategy.save_model(model, args.save_path, only_rank0=True)
+
+# this saves in HF format. ColossalAI strategy with stage-3 doesn't support this method
+strategy.save_pretrained(model, args.save_path, only_rank0=True, tokenizer=tokenizer)
 ```
 
 </details>
````
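The renamed `accumulation_steps` argument controls gradient accumulation: gradients from several micro-batches are summed before a single optimizer step, so the effective batch size is `batch_size * accumulation_steps`. A minimal sketch of the pattern (hypothetical loop for illustration, not Coati's actual trainer):

```python
import torch

def train_with_accumulation(model, optimizer, dataloader, accumulation_steps: int = 8):
    """Hypothetical gradient-accumulation loop, for illustration only."""
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(dataloader):
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        # Scale so the accumulated gradient equals the mean over the whole
        # effective batch rather than the sum of micro-batch means.
        (loss / accumulation_steps).backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```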
````diff
@@ -263,7 +268,7 @@ trainer.save_model(path=args.save_path, only_rank0=True, tokenizer=tokenizer)
 
 Here are some examples that can allow you to train a 7B model on a single or multiple consumer-grade GPUs.
 
-If you only have a single 24G GPU, you can use the following script. `batch_size` and `lora_rank` are the most important parameters to successfully train the model.
+If you only have a single 24G GPU, you can use the following script. `batch_size`, `lora_rank` and `grad_checkpoint` are the most important parameters to successfully train the model.
 ```
 torchrun --standalone --nproc_per_node=1 train_sft.py \
     --pretrain "/path/to/LLaMa-7B/" \
@@ -273,11 +278,12 @@ torchrun --standalone --nproc_per_node=1 train_sft.py \
     --save_path /path/to/Coati-7B \
     --dataset /path/to/data.json \
     --batch_size 1 \
-    --accimulation_steps 8 \
+    --accumulation_steps 8 \
     --lr 2e-5 \
     --max_datasets_size 512 \
     --max_epochs 1 \
     --lora_rank 16 \
+    --grad_checkpoint
 ```
 
 `colossalai_gemini` strategy can enable a single 24G GPU to train the whole model without using LoRA if you have sufficient CPU memory. You can use the following script.
@@ -290,10 +296,11 @@ torchrun --standalone --nproc_per_node=1 train_sft.py \
     --save_path /path/to/Coati-7B \
     --dataset /path/to/data.json \
     --batch_size 1 \
-    --accimulation_steps 8 \
+    --accumulation_steps 8 \
     --lr 2e-5 \
     --max_datasets_size 512 \
     --max_epochs 1 \
+    --grad_checkpoint
 ```
 
 If you have 4x32 GB GPUs, you can even train the whole 7B model using our `colossalai_zero2_cpu` strategy! The script is given as follows.
@@ -306,10 +313,11 @@ torchrun --standalone --nproc_per_node=4 train_sft.py \
     --save_path /path/to/Coati-7B \
     --dataset /path/to/data.json \
     --batch_size 1 \
-    --accimulation_steps 8 \
+    --accumulation_steps 8 \
     --lr 2e-5 \
     --max_datasets_size 512 \
     --max_epochs 1 \
+    --grad_checkpoint
 ```
 </details>
````
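All three example scripts now pass `--grad_checkpoint`, which enables gradient (activation) checkpointing: activations are dropped during the forward pass and recomputed during backward, trading roughly one extra forward pass for a large reduction in activation memory. A small self-contained illustration of the underlying PyTorch mechanism (toy modules, not this flag's actual implementation in the repo):

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """Toy residual MLP block standing in for a transformer layer."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        return x + self.ff(x)

blocks = torch.nn.ModuleList(Block() for _ in range(4))
x = torch.randn(2, 16, 512, requires_grad=True)
for block in blocks:
    # Activations inside `block` are not stored; they are recomputed
    # when backward() reaches this segment.
    x = checkpoint(block, x)
x.sum().backward()
```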
85 changes: 13 additions & 72 deletions applications/Chat/benchmarks/README.md

````diff
@@ -1,70 +1,5 @@
 # Benchmarks
 
-## Benchmark GPT on dummy prompt data
-
-We provide various GPT models (string in parentheses is the corresponding model name used in this script):
-
-- GPT2-S (s)
-- GPT2-M (m)
-- GPT2-L (l)
-- GPT2-XL (xl)
-- GPT2-4B (4b)
-- GPT2-6B (6b)
-- GPT2-8B (8b)
-- GPT2-10B (10b)
-- GPT2-12B (12b)
-- GPT2-15B (15b)
-- GPT2-18B (18b)
-- GPT2-20B (20b)
-- GPT2-24B (24b)
-- GPT2-28B (28b)
-- GPT2-32B (32b)
-- GPT2-36B (36b)
-- GPT2-40B (40b)
-- GPT3 (175b)
-
-We also provide various training strategies:
-
-- ddp: torch DDP
-- colossalai_gemini: ColossalAI GeminiDDP with `placement_policy="cuda"`, like zero3
-- colossalai_gemini_cpu: ColossalAI GeminiDDP with `placement_policy="cpu"`, like zero3-offload
-- colossalai_zero2: ColossalAI zero2
-- colossalai_zero2_cpu: ColossalAI zero2-offload
-- colossalai_zero1: ColossalAI zero1
-- colossalai_zero1_cpu: ColossalAI zero1-offload
-
-We only support `torchrun` to launch now. E.g.
-
-```shell
-# run GPT2-S on single-node single-GPU with min batch size
-torchrun --standalone --nproc_per_node 1 benchmark_gpt_dummy.py --model s --strategy ddp --experience_batch_size 1 --train_batch_size 1
-# run GPT2-XL on single-node 4-GPU
-torchrun --standalone --nproc_per_node 4 benchmark_gpt_dummy.py --model xl --strategy colossalai_zero2
-# run GPT3 on 8-node 8-GPU
-torchrun --nnodes 8 --nproc_per_node 8 \
-    --rdzv_id=$JOB_ID --rdzv_backend=c10d --rdzv_endpoint=$HOST_NODE_ADDR \
-    benchmark_gpt_dummy.py --model 175b --strategy colossalai_gemini
-```
-
-> ⚠ Batch sizes in CLI args and outputed throughput/TFLOPS are all values of per GPU.
-
-In this benchmark, we assume the model architectures/sizes of actor and critic are the same for simplicity. But in practice, to reduce training cost, we may use a smaller critic.
-
-We also provide a simple shell script to run a set of benchmarks. But it only supports benchmark on single node. However, it's easy to run on multi-nodes by modifying launch command in this script.
-
-Usage:
-
-```shell
-# run for GPUS=(1 2 4 8) x strategy=("ddp" "colossalai_zero2" "colossalai_gemini" "colossalai_zero2_cpu" "colossalai_gemini_cpu") x model=("s" "m" "l" "xl" "2b" "4b" "6b" "8b" "10b") x batch_size=(1 2 4 8 16 32 64 128 256)
-./benchmark_gpt_dummy.sh
-# run for GPUS=2 x strategy=("ddp" "colossalai_zero2" "colossalai_gemini" "colossalai_zero2_cpu" "colossalai_gemini_cpu") x model=("s" "m" "l" "xl" "2b" "4b" "6b" "8b" "10b") x batch_size=(1 2 4 8 16 32 64 128 256)
-./benchmark_gpt_dummy.sh 2
-# run for GPUS=2 x strategy=ddp x model=("s" "m" "l" "xl" "2b" "4b" "6b" "8b" "10b") x batch_size=(1 2 4 8 16 32 64 128 256)
-./benchmark_gpt_dummy.sh 2 ddp
-# run for GPUS=2 x strategy=ddp x model=l x batch_size=(1 2 4 8 16 32 64 128 256)
-./benchmark_gpt_dummy.sh 2 ddp l
-```
-
 ## Benchmark OPT with LoRA on dummy prompt data
 
 We provide various OPT models (string in parentheses is the corresponding model name used in this script):
@@ -80,15 +15,21 @@ We provide various OPT models (string in parentheses is the corresponding model
 - OPT-10B (10b)
 - OPT-13B (13b)
 
+We also provide various training strategies:
+
+- ddp: torch DDP
+- colossalai_gemini: ColossalAI GeminiDDP with `placement_policy="cuda"`, like zero3
+- colossalai_gemini_cpu: ColossalAI GeminiDDP with `placement_policy="cpu"`, like zero3-offload
+- colossalai_zero2: ColossalAI zero2
+- colossalai_zero2_cpu: ColossalAI zero2-offload
+- colossalai_zero1: ColossalAI zero1
+- colossalai_zero1_cpu: ColossalAI zero1-offload
+
 We only support `torchrun` to launch now. E.g.
 
 ```shell
 # run OPT-125M with no lora (lora_rank=0) on single-node single-GPU with min batch size
-torchrun --standalone --nproc_per_node 1 benchmark_opt_lora_dummy.py --model 125m --strategy ddp --experience_batch_size 1 --train_batch_size 1 --lora_rank 0
-# run OPT-350M with lora_rank=4 on single-node 4-GPU
-torchrun --standalone --nproc_per_node 4 benchmark_opt_lora_dummy.py --model 350m --strategy colossalai_zero2 --lora_rank 4
+torchrun --standalone --nproc_per_node 1 benchmark_opt_lora_dummy.py --model 125m --critic_model 125m --strategy ddp --experience_batch_size 1 --train_batch_size 1 --lora_rank 0
+# run Actor (OPT-1.3B) and Critic (OPT-350M) with lora_rank=4 on single-node 4-GPU
+torchrun --standalone --nproc_per_node 4 benchmark_opt_lora_dummy.py --model 1.3b --critic_model 350m --strategy colossalai_zero2 --lora_rank 4
 ```
 
 > ⚠ Batch sizes in CLI args and outputted throughput/TFLOPS are all per-GPU values.
 
 In this benchmark, we assume the model architectures/sizes of actor and critic are the same for simplicity. But in practice, to reduce training cost, we may use a smaller critic.
````
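The new `--critic_model` flag makes the actor/critic asymmetry from the note above concrete: a PPO critic only regresses a scalar value per input, so it can use a much smaller backbone than the actor. A toy comparison of parameter counts (hypothetical module sizes, not Coati's models):

```python
import torch.nn as nn

def num_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

# Stand-ins for a large actor and a much smaller critic with a scalar head.
actor = nn.Sequential(*(nn.Linear(2048, 2048) for _ in range(24)))
critic = nn.Sequential(*(nn.Linear(512, 512) for _ in range(6)), nn.Linear(512, 1))

print(f"actor params:  {num_params(actor):,}")   # ~100M
print(f"critic params: {num_params(critic):,}")  # ~1.6M
```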