Ra #39 (Merged)

25 commits
fa5097a
Merge pull request #32 from hpcaitech/main
jamesthesnake Apr 30, 2023
e795a9f
Merge pull request #33 from jamesthesnake/co
jamesthesnake Apr 30, 2023
bfbf650
fix spelling error
digger-yu May 4, 2023
8ba7858
Update generate_gpt35_answers.py
digger-yu May 4, 2023
7bd0bee
[chat] add opt attn kernel (#3655)
ver217 May 4, 2023
6650dae
[doc] fix chat spelling error (#3671)
digger-yu May 5, 2023
0f785cb
[chat] PPO stage3 doc enhancement (#3679)
Camille7777 May 5, 2023
307894f
[booster] gemini plugin support shard checkpoint (#3610)
flybird11111 May 5, 2023
b36e67c
Merge pull request #3680 from digger-yu/digger-yu-patch-2
TongLi3701 May 5, 2023
b49020c
[CI] Update test_sharded_optim_with_sync_bn.py (#3688)
digger-yu May 5, 2023
d0915f5
[booster] refactor all dp fashion plugins (#3684)
ver217 May 5, 2023
65bdc31
fix some spelling error with applications/Chat/examples/ (#3692)
digger-yu May 6, 2023
d556648
[example] add finetune bert with booster example (#3693)
ver217 May 6, 2023
2da5d81
[chat] fix train_prompts.py gemini strategy bug (#3666)
zhang-yi-chi May 6, 2023
2629f97
[tensor] Refactor handle_trans_spec in DistSpecManager
yhna940 May 6, 2023
f83ea81
[example] add train resnet/vit with booster example (#3694)
ver217 May 8, 2023
3bf09ef
[booster] update prepare dataloader method for plugin (#3706)
ver217 May 8, 2023
8ce2a71
Merge pull request #36 from hpcaitech/main
jamesthesnake May 8, 2023
6552cbf
[booster] fix no_sync method (#3709)
ver217 May 9, 2023
20068ba
[booster] add tests for ddp and low level zero's checkpointio (#3715)
flybird11111 May 10, 2023
f7361ee
[chat] fix community example ray (#3719)
MisterLin1995 May 10, 2023
b7141c3
[CI] fix some spelling errors (#3707)
digger-yu May 10, 2023
899aa86
[CI] fix typo with tests components (#3695)
digger-yu May 11, 2023
1f73609
[CI] fix typo with tests/ etc. (#3727)
digger-yu May 11, 2023
1b5cd7b
Merge pull request #42 from hpcaitech/main
jamesthesnake May 14, 2023
4 changes: 4 additions & 0 deletions .github/workflows/run_chatgpt_examples.yml
@@ -13,6 +13,10 @@ on:
jobs:
tests:
name: Run ChatGPT examples
if: |
github.event.pull_request.draft == false &&
github.base_ref == 'main' &&
github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI'
runs-on: [self-hosted, gpu]
container:
image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
22 changes: 13 additions & 9 deletions .github/workflows/run_chatgpt_unit_tests.yml
@@ -4,16 +4,20 @@ on:
pull_request:
types: [synchronize, opened, reopened]
paths:
- 'applications/ChatGPT/chatgpt/**'
- 'applications/ChatGPT/requirements.txt'
- 'applications/ChatGPT/setup.py'
- 'applications/ChatGPT/requirements-test.txt'
- 'applications/ChatGPT/tests/**'
- 'applications/ChatGPT/pytest.ini'
- 'applications/Chat/coati/**'
- 'applications/Chat/requirements.txt'
- 'applications/Chat/setup.py'
- 'applications/Chat/requirements-test.txt'
- 'applications/Chat/tests/**'
- 'applications/Chat/pytest.ini'

jobs:
tests:
name: Run ChatGPT unit tests
if: |
github.event.pull_request.draft == false &&
github.base_ref == 'main' &&
github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI'
runs-on: [self-hosted, gpu]
container:
image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
@@ -28,14 +32,14 @@ jobs:

- name: Install ColossalAI and ChatGPT
run: |
pip install -v .
cd applications/ChatGPT
pip install -e .
cd applications/Chat
pip install -v .
pip install -r requirements-test.txt

- name: Execute Unit Testing
run: |
cd applications/ChatGPT
cd applications/Chat
rm -rf ~/.cache/colossalai
pytest tests/
env:
22 changes: 15 additions & 7 deletions applications/Chat/README.md
@@ -59,7 +59,7 @@ The Coati package provides a unified large language model framework that has imp
Image source: https://openai.com/blog/chatgpt
</div>

**As Colossa-AI is undergoing some major updates, this project will be actively maintained to stay in line with the Colossal-AI project.**
**As Colossal-AI is undergoing some major updates, this project will be actively maintained to stay in line with the Colossal-AI project.**


More details can be found in the latest news.
@@ -243,18 +243,23 @@ from coati.trainer import SFTTrainer
model = LlamaLM(pretrained=args.pretrain)
tokenizer = AutoTokenizer.from_pretrained(args.pretrain)

(model, optim) = strategy.prepare((model, optim))
trainer = SFTTrainer(model=model,
strategy=strategy,
optim=optim,
train_dataloader=train_dataloader,
eval_dataloader=eval_dataloader,
batch_size=args.batch_size,
max_epochs=args.max_epochs,
accimulation_steps = args.accimulation_steps
accumulation_steps = args.accumulation_steps
)

trainer.fit()
trainer.save_model(path=args.save_path, only_rank0=True, tokenizer=tokenizer)
# this saves in pytorch format
strategy.save_model(model, args.save_path, only_rank0=True)

# this saves in HF format. ColossalAI strategy with stage-3 doesn't support this method
strategy.save_pretrained(model, args.save_path, only_rank0=True, tokenizer=tokenizer)
```
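The `accumulation_steps` argument (whose spelling this PR fixes from `accimulation_steps`) follows the standard gradient-accumulation pattern. A minimal, framework-free sketch of the idea — all names here are illustrative, not Coati's actual internals:

```python
# Illustrative sketch of gradient accumulation; `train_with_accumulation`
# and `grad_fn` are hypothetical names, not Coati's API.

def train_with_accumulation(batches, grad_fn, accumulation_steps=8):
    """Accumulate scaled gradients over micro-batches, then 'step' once."""
    accum = 0.0
    updates = []
    for step, batch in enumerate(batches, start=1):
        # Divide by accumulation_steps so the accumulated sum matches
        # the gradient of one large batch.
        accum += grad_fn(batch) / accumulation_steps
        if step % accumulation_steps == 0:
            updates.append(accum)  # one optimizer update per accumulation window
            accum = 0.0
    return updates

# 16 micro-batches with accumulation_steps=8 yield 2 optimizer updates.
updates = train_with_accumulation(range(16), grad_fn=float, accumulation_steps=8)
```

This is why a `batch_size` of 1 with `accumulation_steps=8`, as in the scripts below, behaves like an effective batch size of 8 while only ever holding one micro-batch's activations in memory.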

</details>
@@ -263,7 +268,7 @@ trainer.save_model(path=args.save_path, only_rank0=True, tokenizer=tokenizer)

Here are some examples that can allow you to train a 7B model on a single or multiple consumer-grade GPUs.

If you only have a single 24G GPU, you can use the following script. `batch_size` and `lora_rank` are the most important parameters to successfully train the model.
If you only have a single 24G GPU, you can use the following script. `batch_size`, `lora_rank` and `grad_checkpoint` are the most important parameters to successfully train the model.
```
torchrun --standalone --nproc_per_node=1 train_sft.py \
--pretrain "/path/to/LLaMa-7B/" \
@@ -273,11 +278,12 @@ torchrun --standalone --nproc_per_node=1 train_sft.py \
--save_path /path/to/Coati-7B \
--dataset /path/to/data.json \
--batch_size 1 \
--accimulation_steps 8 \
--accumulation_steps 8 \
--lr 2e-5 \
--max_datasets_size 512 \
--max_epochs 1 \
--lora_rank 16 \
--grad_checkpoint
```
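Back-of-envelope arithmetic shows why these flags matter on a single 24 GB card. The numbers below are rough illustrations under simplified assumptions (fp16 weights/grads, fp32 Adam states, activations ignored), not measured figures:

```python
# Rough memory arithmetic for full fine-tuning vs. LoRA on a 7B model.
# Assumptions: fp16 weights (2 B) and grads (2 B), fp32 Adam states
# (param copy + two moments, 12 B). Activations are ignored here.

def full_finetune_state_gb(n_params, bytes_weight=2, bytes_grad=2, bytes_optim=12):
    """Weights + grads + optimizer states, all sized by total params."""
    return n_params * (bytes_weight + bytes_grad + bytes_optim) / 1e9

def lora_state_gb(n_params, n_lora_params, bytes_weight=2, bytes_grad=2, bytes_optim=12):
    """Frozen fp16 base weights; grads/optimizer states only for LoRA params."""
    return (n_params * bytes_weight + n_lora_params * (bytes_grad + bytes_optim)) / 1e9

full = full_finetune_state_gb(7e9)   # ~112 GB: far beyond one 24 GB card
lora = lora_state_gb(7e9, 4e7)       # ~14.6 GB: frozen weights dominate
```

Under these assumptions, LoRA (plus `grad_checkpoint` to keep activations small) is what makes a 7B model fit on 24 GB at all; `batch_size` then controls the remaining activation memory.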

`colossalai_gemini` strategy can enable a single 24G GPU to train the whole model without using LoRA if you have sufficient CPU memory. You can use the following script.
@@ -290,10 +296,11 @@ torchrun --standalone --nproc_per_node=1 train_sft.py \
--save_path /path/to/Coati-7B \
--dataset /path/to/data.json \
--batch_size 1 \
--accimulation_steps 8 \
--accumulation_steps 8 \
--lr 2e-5 \
--max_datasets_size 512 \
--max_epochs 1 \
--grad_checkpoint
```

If you have 4x32 GB GPUs, you can even train the whole 7B model using our `colossalai_zero2_cpu` strategy! The script is given as follows.
@@ -306,10 +313,11 @@ torchrun --standalone --nproc_per_node=4 train_sft.py \
--save_path /path/to/Coati-7B \
--dataset /path/to/data.json \
--batch_size 1 \
--accimulation_steps 8 \
--accumulation_steps 8 \
--lr 2e-5 \
--max_datasets_size 512 \
--max_epochs 1 \
--grad_checkpoint
```
</details>

85 changes: 13 additions & 72 deletions applications/Chat/benchmarks/README.md
@@ -1,70 +1,5 @@
# Benchmarks

## Benchmark GPT on dummy prompt data

We provide various GPT models (string in parentheses is the corresponding model name used in this script):

- GPT2-S (s)
- GPT2-M (m)
- GPT2-L (l)
- GPT2-XL (xl)
- GPT2-4B (4b)
- GPT2-6B (6b)
- GPT2-8B (8b)
- GPT2-10B (10b)
- GPT2-12B (12b)
- GPT2-15B (15b)
- GPT2-18B (18b)
- GPT2-20B (20b)
- GPT2-24B (24b)
- GPT2-28B (28b)
- GPT2-32B (32b)
- GPT2-36B (36b)
- GPT2-40B (40b)
- GPT3 (175b)

We also provide various training strategies:

- ddp: torch DDP
- colossalai_gemini: ColossalAI GeminiDDP with `placement_policy="cuda"`, like zero3
- colossalai_gemini_cpu: ColossalAI GeminiDDP with `placement_policy="cpu"`, like zero3-offload
- colossalai_zero2: ColossalAI zero2
- colossalai_zero2_cpu: ColossalAI zero2-offload
- colossalai_zero1: ColossalAI zero1
- colossalai_zero1_cpu: ColossalAI zero1-offload

We only support `torchrun` to launch now. E.g.

```shell
# run GPT2-S on single-node single-GPU with min batch size
torchrun --standalone --nproc_per_node 1 benchmark_gpt_dummy.py --model s --strategy ddp --experience_batch_size 1 --train_batch_size 1
# run GPT2-XL on single-node 4-GPU
torchrun --standalone --nproc_per_node 4 benchmark_gpt_dummy.py --model xl --strategy colossalai_zero2
# run GPT3 on 8-node 8-GPU
torchrun --nnodes 8 --nproc_per_node 8 \
--rdzv_id=$JOB_ID --rdzv_backend=c10d --rdzv_endpoint=$HOST_NODE_ADDR \
benchmark_gpt_dummy.py --model 175b --strategy colossalai_gemini
```

> ⚠ Batch sizes in CLI args and outputed throughput/TFLOPS are all values of per GPU.

In this benchmark, we assume the model architectures/sizes of actor and critic are the same for simplicity. But in practice, to reduce training cost, we may use a smaller critic.

We also provide a simple shell script to run a set of benchmarks. But it only supports benchmark on single node. However, it's easy to run on multi-nodes by modifying launch command in this script.

Usage:

```shell
# run for GPUS=(1 2 4 8) x strategy=("ddp" "colossalai_zero2" "colossalai_gemini" "colossalai_zero2_cpu" "colossalai_gemini_cpu") x model=("s" "m" "l" "xl" "2b" "4b" "6b" "8b" "10b") x batch_size=(1 2 4 8 16 32 64 128 256)
./benchmark_gpt_dummy.sh
# run for GPUS=2 x strategy=("ddp" "colossalai_zero2" "colossalai_gemini" "colossalai_zero2_cpu" "colossalai_gemini_cpu") x model=("s" "m" "l" "xl" "2b" "4b" "6b" "8b" "10b") x batch_size=(1 2 4 8 16 32 64 128 256)
./benchmark_gpt_dummy.sh 2
# run for GPUS=2 x strategy=ddp x model=("s" "m" "l" "xl" "2b" "4b" "6b" "8b" "10b") x batch_size=(1 2 4 8 16 32 64 128 256)
./benchmark_gpt_dummy.sh 2 ddp
# run for GPUS=2 x strategy=ddp x model=l x batch_size=(1 2 4 8 16 32 64 128 256)
./benchmark_gpt_dummy.sh 2 ddp l
```

## Benchmark OPT with LoRA on dummy prompt data

We provide various OPT models (string in parentheses is the corresponding model name used in this script):
@@ -80,15 +15,21 @@ We provide various OPT models (string in parentheses is the corresponding model
- OPT-10B (10b)
- OPT-13B (13b)

We also provide various training strategies:

- ddp: torch DDP
- colossalai_gemini: ColossalAI GeminiDDP with `placement_policy="cuda"`, like zero3
- colossalai_gemini_cpu: ColossalAI GeminiDDP with `placement_policy="cpu"`, like zero3-offload
- colossalai_zero2: ColossalAI zero2
- colossalai_zero2_cpu: ColossalAI zero2-offload
- colossalai_zero1: ColossalAI zero1
- colossalai_zero1_cpu: ColossalAI zero1-offload

We only support `torchrun` to launch now. E.g.

```shell
# run OPT-125M with no lora (lora_rank=0) on single-node single-GPU with min batch size
torchrun --standalone --nproc_per_node 1 benchmark_opt_lora_dummy.py --model 125m --strategy ddp --experience_batch_size 1 --train_batch_size 1 --lora_rank 0
# run OPT-350M with lora_rank=4 on single-node 4-GPU
torchrun --standalone --nproc_per_node 4 benchmark_opt_lora_dummy.py --model 350m --strategy colossalai_zero2 --lora_rank 4
torchrun --standalone --nproc_per_node 1 benchmark_opt_lora_dummy.py --model 125m --critic_model 125m --strategy ddp --experience_batch_size 1 --train_batch_size 1 --lora_rank 0
# run Actor (OPT-1.3B) and Critic (OPT-350M) with lora_rank=4 on single-node 4-GPU
torchrun --standalone --nproc_per_node 4 benchmark_opt_lora_dummy.py --model 1.3b --critic_model 350m --strategy colossalai_zero2 --lora_rank 4
```
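The `--lora_rank` flag controls the rank of the low-rank update in LoRA. A minimal sketch of the core idea, with illustrative names only (the real implementation lives in Coati's model code):

```python
# Illustrative LoRA sketch: the trainable update is delta_W = B @ A, where
# A is (rank x d_in) and B is (d_out x rank). Only rank*(d_in + d_out)
# parameters are trained instead of d_in*d_out.
import random

def matmul(A, B):
    """Plain list-of-lists matrix multiply."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_delta(rank, d_in, d_out, seed=0):
    rng = random.Random(seed)
    A = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(rank)]
    # B starts at zero, so delta_W is zero and training begins from the
    # frozen base weights unchanged.
    B = [[0.0] * rank for _ in range(d_out)]
    return matmul(B, A)

delta = lora_delta(rank=4, d_in=8, d_out=8)
```

With `--lora_rank 0` in the benchmark above, this update is disabled entirely and the full weights are trained; higher ranks trade a little extra memory for more adaptation capacity.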

> ⚠ Batch sizes in CLI args and the reported throughput/TFLOPS are all per-GPU values.

In this benchmark, we assume the model architectures/sizes of actor and critic are the same for simplicity. But in practice, to reduce training cost, we may use a smaller critic.