jamesthesnake · Apr 30, 2023 · Apr 26, 2023
diff --git a/.github/workflows/run_chatgpt_unit_tests.yml b/.github/workflows/run_chatgpt_unit_tests.yml
@@ -32,14 +32,14 @@ jobs:
 
       - name: Install ColossalAI and ChatGPT
         run: |
-          pip install -v .
-          cd applications/ChatGPT
+          pip install -e .
+          cd applications/Chat
           pip install -v .
           pip install -r requirements-test.txt
 
       - name: Execute Unit Testing
         run: |
-          cd applications/ChatGPT
+          cd applications/Chat
           rm -rf ~/.cache/colossalai
           pytest tests/
         env:

diff --git a/applications/Chat/README.md b/applications/Chat/README.md
@@ -243,18 +243,23 @@ from coati.trainer import SFTTrainer
 model = LlamaLM(pretrained=args.pretrain)
 tokenizer = AutoTokenizer.from_pretrained(args.pretrain)
 
+(model, optim) = strategy.prepare((model, optim))
 trainer = SFTTrainer(model=model,
     strategy=strategy,
     optim=optim,
     train_dataloader=train_dataloader,
     eval_dataloader=eval_dataloader,
     batch_size=args.batch_size,
     max_epochs=args.max_epochs,
-    accimulation_steps = args.accimulation_steps
+    accumulation_steps = args.accumulation_steps
 )
 
 trainer.fit()
-trainer.save_model(path=args.save_path, only_rank0=True, tokenizer=tokenizer)
+# this saves in pytorch format
+strategy.save_model(model, args.save_path, only_rank0=True)
+
+# this saves in HF format. ColossalAI strategy with stage-3 doesn't support this method
+strategy.save_pretrained(model, args.save_path, only_rank0=True, tokenizer=tokenizer)
 ```
 
 </details>
@@ -263,7 +268,7 @@ trainer.save_model(path=args.save_path, only_rank0=True, tokenizer=tokenizer)
 
 Here are some examples that can allow you to train a 7B model on a single or multiple consumer-grade GPUs.
 
-If you only have a single 24G GPU, you can use the following script. `batch_size` and `lora_rank` are the most important parameters to successfully train the model.
+If you only have a single 24G GPU, you can use the following script. `batch_size`, `lora_rank` and `grad_checkpoint` are the most important parameters to successfully train the model.
 ```
 torchrun --standalone --nproc_per_node=1 train_sft.py \
     --pretrain "/path/to/LLaMa-7B/" \
@@ -273,11 +278,12 @@ torchrun --standalone --nproc_per_node=1 train_sft.py \
     --save_path  /path/to/Coati-7B \
     --dataset /path/to/data.json \
     --batch_size 1 \
-    --accimulation_steps 8 \
+    --accumulation_steps 8 \
     --lr 2e-5 \
     --max_datasets_size 512 \
     --max_epochs 1 \
     --lora_rank 16 \
+    --grad_checkpoint
 ```
 
 `colossalai_gemini` strategy can enable a single 24G GPU to train the whole model without using LoRA if you have sufficient CPU memory. You can use the following script.
@@ -290,10 +296,11 @@ torchrun --standalone --nproc_per_node=1 train_sft.py \
     --save_path  /path/to/Coati-7B \
     --dataset /path/to/data.json \
     --batch_size 1 \
-    --accimulation_steps 8 \
+    --accumulation_steps 8 \
     --lr 2e-5 \
     --max_datasets_size 512 \
     --max_epochs 1 \
+    --grad_checkpoint
 ```
 
 If you have 4x32 GB GPUs, you can even train the whole 7B model using our `colossalai_zero2_cpu` strategy! The script is given as follows.
@@ -306,10 +313,11 @@ torchrun --standalone --nproc_per_node=4 train_sft.py \
     --save_path  /path/to/Coati-7B \
     --dataset /path/to/data.json \
     --batch_size 1 \
-    --accimulation_steps 8 \
+    --accumulation_steps 8 \
     --lr 2e-5 \
     --max_datasets_size 512 \
     --max_epochs 1 \
+    --grad_checkpoint
 ```
 </details>
 

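Review note on the `--accumulation_steps` rename in the scripts above: all three examples pair `--batch_size 1` with `--accumulation_steps 8`. The sketch below shows how those flags would combine into an effective batch size per optimizer step. It assumes `batch_size` is a per-process value and that `train_sft.py` uses standard gradient accumulation under data parallelism; neither is confirmed by this diff. `effective_batch_size` is a hypothetical helper written for illustration, not part of `coati`.

```python
# Sketch only. Assumes per-process batch_size and standard gradient
# accumulation; effective_batch_size is a hypothetical helper, not coati API.


def effective_batch_size(batch_size: int,
                         accumulation_steps: int,
                         nproc_per_node: int,
                         nnodes: int = 1) -> int:
    """Number of samples that contribute to one optimizer step."""
    values = (batch_size, accumulation_steps, nproc_per_node, nnodes)
    if any(v < 1 for v in values):
        raise ValueError("all arguments must be positive integers")
    return batch_size * accumulation_steps * nproc_per_node * nnodes


# Single 24G GPU scripts: --nproc_per_node=1 --batch_size 1 --accumulation_steps 8
single_gpu = effective_batch_size(batch_size=1, accumulation_steps=8, nproc_per_node=1)

# 4x32 GB script: --nproc_per_node=4 --batch_size 1 --accumulation_steps 8
four_gpu = effective_batch_size(batch_size=1, accumulation_steps=8, nproc_per_node=4)

print(single_gpu, four_gpu)
```

Under these assumptions the single-GPU and 4-GPU scripts do not train with the same effective batch size, which may matter when comparing runs that share `--lr 2e-5`.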
diff --git a/applications/Chat/benchmarks/README.md b/applications/Chat/benchmarks/README.md
@@ -1,70 +1,5 @@
 # Benchmarks
 
-## Benchmark GPT on dummy prompt data
-
-We provide various GPT models (string in parentheses is the corresponding model name used in this script):
-
-- GPT2-S (s)
-- GPT2-M (m)
-- GPT2-L (l)
-- GPT2-XL (xl)
-- GPT2-4B (4b)
-- GPT2-6B (6b)
-- GPT2-8B (8b)
-- GPT2-10B (10b)
-- GPT2-12B (12b)
-- GPT2-15B (15b)
-- GPT2-18B (18b)
-- GPT2-20B (20b)
-- GPT2-24B (24b)
-- GPT2-28B (28b)
-- GPT2-32B (32b)
-- GPT2-36B (36b)
-- GPT2-40B (40b)
-- GPT3 (175b)
-
-We also provide various training strategies:
-
-- ddp: torch DDP
-- colossalai_gemini: ColossalAI GeminiDDP with `placement_policy="cuda"`, like zero3
-- colossalai_gemini_cpu: ColossalAI GeminiDDP with `placement_policy="cpu"`, like zero3-offload
-- colossalai_zero2: ColossalAI zero2
-- colossalai_zero2_cpu: ColossalAI zero2-offload
-- colossalai_zero1: ColossalAI zero1
-- colossalai_zero1_cpu: ColossalAI zero1-offload
-
-We only support `torchrun` to launch now. E.g.
-
-```shell
-# run GPT2-S on single-node single-GPU with min batch size
-torchrun --standalone --nproc_per_node 1 benchmark_gpt_dummy.py --model s --strategy ddp --experience_batch_size 1 --train_batch_size 1
-# run GPT2-XL on single-node 4-GPU
-torchrun --standalone --nproc_per_node 4 benchmark_gpt_dummy.py --model xl --strategy colossalai_zero2
-# run GPT3 on 8-node 8-GPU
-torchrun --nnodes 8 --nproc_per_node 8 \
- --rdzv_id=$JOB_ID --rdzv_backend=c10d --rdzv_endpoint=$HOST_NODE_ADDR \
- benchmark_gpt_dummy.py --model 175b --strategy colossalai_gemini
-```
-
-> ⚠ Batch sizes in CLI args and outputed throughput/TFLOPS are all values of per GPU.
-
-In this benchmark, we assume the model architectures/sizes of actor and critic are the same for simplicity. But in practice, to reduce training cost, we may use a smaller critic.
-
-We also provide a simple shell script to run a set of benchmarks. But it only supports benchmark on single node. However, it's easy to run on multi-nodes by modifying launch command in this script.
-
-Usage:
-
-```shell
-# run for GPUS=(1 2 4 8) x strategy=("ddp" "colossalai_zero2" "colossalai_gemini" "colossalai_zero2_cpu" "colossalai_gemini_cpu") x model=("s" "m" "l" "xl" "2b" "4b" "6b" "8b" "10b") x batch_size=(1 2 4 8 16 32 64 128 256)
-./benchmark_gpt_dummy.sh
-# run for GPUS=2 x strategy=("ddp" "colossalai_zero2" "colossalai_gemini" "colossalai_zero2_cpu" "colossalai_gemini_cpu") x model=("s" "m" "l" "xl" "2b" "4b" "6b" "8b" "10b") x batch_size=(1 2 4 8 16 32 64 128 256)
-./benchmark_gpt_dummy.sh 2
-# run for GPUS=2 x strategy=ddp x model=("s" "m" "l" "xl" "2b" "4b" "6b" "8b" "10b") x batch_size=(1 2 4 8 16 32 64 128 256)
-./benchmark_gpt_dummy.sh 2 ddp
-# run for GPUS=2 x strategy=ddp x model=l x batch_size=(1 2 4 8 16 32 64 128 256)
-./benchmark_gpt_dummy.sh 2 ddp l
-```
-
 ## Benchmark OPT with LoRA on dummy prompt data
 
 We provide various OPT models (string in parentheses is the corresponding model name used in this script):
@@ -80,15 +15,21 @@ We provide various OPT models (string in parentheses is the corresponding model
 - OPT-10B (10b)
 - OPT-13B (13b)
 
+We also provide various training strategies:
+
+- ddp: torch DDP
+- colossalai_gemini: ColossalAI GeminiDDP with `placement_policy="cuda"`, like zero3
+- colossalai_gemini_cpu: ColossalAI GeminiDDP with `placement_policy="cpu"`, like zero3-offload
+- colossalai_zero2: ColossalAI zero2
+- colossalai_zero2_cpu: ColossalAI zero2-offload
+- colossalai_zero1: ColossalAI zero1
+- colossalai_zero1_cpu: ColossalAI zero1-offload
+
 We only support `torchrun` to launch now. E.g.
 
 ```shell
 # run OPT-125M with no lora (lora_rank=0) on single-node single-GPU with min batch size
-torchrun --standalone --nproc_per_node 1 benchmark_opt_lora_dummy.py --model 125m --strategy ddp --experience_batch_size 1 --train_batch_size 1 --lora_rank 0
-# run OPT-350M with lora_rank=4 on single-node 4-GPU
-torchrun --standalone --nproc_per_node 4 benchmark_opt_lora_dummy.py --model 350m --strategy colossalai_zero2 --lora_rank 4
+torchrun --standalone --nproc_per_node 1 benchmark_opt_lora_dummy.py --model 125m --critic_model 125m --strategy ddp --experience_batch_size 1 --train_batch_size 1 --lora_rank 0
+# run Actor (OPT-1.3B) and Critic (OPT-350M) with lora_rank=4 on single-node 4-GPU
+torchrun --standalone --nproc_per_node 4 benchmark_opt_lora_dummy.py --model 1.3b --critic_model 350m --strategy colossalai_zero2 --lora_rank 4
 ```
-
-> ⚠ Batch sizes in CLI args and outputed throughput/TFLOPS are all values of per GPU.
-
-In this benchmark, we assume the model architectures/sizes of actor and critic are the same for simplicity. But in practice, to reduce training cost, we may use a smaller critic.
diff --git a/applications/Chat/benchmarks/benchmark_gpt_dummy.py b/applications/Chat/benchmarks/benchmark_gpt_dummy.py