diff --git a/applications/Chat/README.md b/applications/Chat/README.md
index e3b605d9b796..0a5f7840d997 100644
--- a/applications/Chat/README.md
+++ b/applications/Chat/README.md
@@ -15,20 +15,18 @@
- [Install the Transformers](#install-the-transformers)
- [How to use?](#how-to-use)
- [Supervised datasets collection](#supervised-datasets-collection)
- - [Stage1 - Supervised instructs tuning](#stage1---supervised-instructs-tuning)
- - [Stage2 - Training reward model](#stage2---training-reward-model)
- - [Stage3 - Training model with reinforcement learning by human feedback](#stage3---training-model-with-reinforcement-learning-by-human-feedback)
- - [Inference - After Training](#inference---after-training)
- - [8-bit setup](#8-bit-setup)
- - [4-bit setup](#4-bit-setup)
+ - [RLHF Training Stage1 - Supervised instructs tuning](#rlhf-training-stage1---supervised-instructs-tuning)
+ - [RLHF Training Stage2 - Training reward model](#rlhf-training-stage2---training-reward-model)
+ - [RLHF Training Stage3 - Training model with reinforcement learning by human feedback](#rlhf-training-stage3---training-model-with-reinforcement-learning-by-human-feedback)
+ - [Inference Quantization and Serving - After Training](#inference-quantization-and-serving---after-training)
- [Coati7B examples](#coati7b-examples)
- [Generation](#generation)
- [Open QA](#open-qa)
- - [Limitation for LLaMA-finetuned models](#limitation-for-llama-finetuned-models)
- - [Limitation of dataset](#limitation-of-dataset)
+ - [Limitation for LLaMA-finetuned models](#limitation)
+ - [Limitation of dataset](#limitation)
- [FAQ](#faq)
- - [How to save/load checkpoint](#how-to-saveload-checkpoint)
- - [How to train with limited resources](#how-to-train-with-limited-resources)
+ - [How to save/load checkpoint](#faq)
+ - [How to train with limited resources](#faq)
- [The Plan](#the-plan)
- [Real-time progress](#real-time-progress)
- [Invitation to open-source contribution](#invitation-to-open-source-contribution)
@@ -107,43 +105,19 @@ Here is how we collected the data
-you can run the `examples/train_prompts.sh` to start training PPO with human feedback
-
-```
-torchrun --standalone --nproc_per_node=4 train_prompts.py \
- --pretrain "/path/to/LLaMa-7B/" \
- --model 'llama' \
- --strategy colossalai_zero2 \
- --prompt_path /path/to/your/prompt_dataset \
- --pretrain_dataset /path/to/your/pretrain_dataset \
- --rm_pretrain /your/pretrain/rm/defination \
- --rm_path /your/rm/model/path
-```
+You can run `examples/train_prompts.sh` to start PPO training with human feedback.
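For reference, the invocation previously documented in this section looked like the following (all paths are placeholders to replace with your own):

```shell
torchrun --standalone --nproc_per_node=4 train_prompts.py \
    --pretrain "/path/to/LLaMa-7B/" \
    --model 'llama' \
    --strategy colossalai_zero2 \
    --prompt_path /path/to/your/prompt_dataset \
    --pretrain_dataset /path/to/your/pretrain_dataset \
    --rm_pretrain /your/pretrain/rm/definition \
    --rm_path /your/rm/model/path
```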
For more details, see [`examples/`](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/examples).
-### Inference - After Training
-#### 8-bit setup
-
-8-bit quantization is originally supported by the latest [transformers](https://github.com/huggingface/transformers). Please install it from source.
-
-Please ensure you have downloaded HF-format model weights of LLaMA models.
+### Inference Quantization and Serving - After Training
-Usage:
+We provide an online inference server and a benchmark. We aim to run inference on a single GPU, so quantization is essential when using large models.
-```python
-from transformers import LlamaForCausalLM
-USE_8BIT = True # use 8-bit quantization; otherwise, use fp16
-model = LlamaForCausalLM.from_pretrained(
- "pretrained/path",
- load_in_8bit=USE_8BIT,
- torch_dtype=torch.float16,
- device_map="auto",
- )
-if not USE_8BIT:
- model.half() # use fp16
-model.eval()
-```
-
-**Troubleshooting**: if you get errors indicating your CUDA-related libraries are not found when loading the 8-bit model, you can check whether your `LD_LIBRARY_PATH` is correct.
-
-E.g. you can set `export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH`.
-
-#### 4-bit setup
-
-Please ensure you have downloaded the HF-format model weights of LLaMA models first.
-
-Then you can follow [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa). This lib provides efficient CUDA kernels and weight conversion scripts.
-
-After installing this lib, we may convert the original HF-format LLaMA model weights to a 4-bit version.
-
-```shell
-CUDA_VISIBLE_DEVICES=0 python llama.py /path/to/pretrained/llama-7b c4 --wbits 4 --groupsize 128 --save llama7b-4bit.pt
-```
-
-Run this command in your cloned `GPTQ-for-LLaMa` directory, then you will get a 4-bit weight file `llama7b-4bit-128g.pt`.
-
-**Troubleshooting**: if you get errors about `position_ids`, you can checkout to commit `50287c3b9ae4a3b66f6b5127c643ec39b769b155`(`GPTQ-for-LLaMa` repo).
+We support 8-bit quantization (RTN), 4-bit quantization (GPTQ), and FP16 inference.
+Online inference server scripts can help you deploy your own services.
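The idea behind 8-bit RTN (round-to-nearest) quantization can be sketched in a few lines: scale a group of weights by its maximum absolute value, then round each weight onto a signed integer grid. This is a minimal illustration only, not the actual inference code; per-channel grouping and zero-points are omitted.

```python
def quantize_rtn(weights, bits=8):
    """Symmetric round-to-nearest quantization of a list of floats."""
    qmax = 2 ** (bits - 1) - 1          # 127 for int8
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.1, -0.4, 0.7, 1.0]
q, scale = quantize_rtn(weights)
restored = dequantize(q, scale)
# each restored value is within half a quantization step of its original
```

GPTQ refines this idea by choosing rounding directions that minimize layer output error rather than rounding each weight independently.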
For more details, see [`inference/`](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/inference).
@@ -283,24 +210,27 @@ For more details, see [`inference/`](https://github.com/hpcaitech/ColossalAI/tre
You can find more examples in this [repo](https://github.com/XueFuzhao/InstructionWild/blob/main/comparison.md).
-### Limitation for LLaMA-finetuned models
+### Limitation
+
+
+