90 changes: 86 additions & 4 deletions applications/Colossal-LLaMA-2/README.md
@@ -11,7 +11,10 @@
- [Performance Evaluation](#performance-evaluation)
- [Examples](#examples)
- [Training Logs](#training-logs)
- [Inference](#inference)
- [Import from HuggingFace](#import-from-huggingface)
- [Import from Modelscope](#import-from-modelscope)
- [Quick Start](#quick-start)
- [Usage](#usage)
- [Install](#install)
- [0. Pre-requisite](#0-pre-requisite)
@@ -21,8 +24,14 @@
- [1. Init Tokenizer Preparation](#1-init-tokenizer-preparation)
- [2. Init Model Preparation](#2-init-model-preparation)
- [3. Data Preparation](#3-data-preparation)
- [3.1 Data for Pretraining](#31-data-for-pretraining)
- [3.2 Data for Supervised Fine-tuning](#32-data-for-supervised-fine-tuning)
- [4. Command Line Arguments for Training](#4-command-line-arguments-for-training)
- [4.1 Arguments for Pretraining](#41-arguments-for-pretraining)
- [4.2 Arguments for Supervised Fine-tuning](#42-arguments-for-supervised-fine-tuning)
- [5. Running Command](#5-running-command)
- [5.1 Command for Pretraining](#51-command-for-pretraining)
- [5.2 Command for Supervised Fine-tuning](#52-command-for-supervised-fine-tuning)
- [Technical Insights](#technical-insights)
- [Data](#data)
- [Tokenizer](#tokenizer)
@@ -117,7 +126,8 @@ We also recorded the training logs for the experiment
<img src="https://github.com/hpcaitech/public_assets/blob/main/applications/colossal-llama-2/trainingLossByTokens.jpeg?raw=true" width=600/>
</p>

### Inference
#### Import from HuggingFace
To load the Colossal-LLaMA-2-7B-base model with Transformers, use the following code:
```Python
from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -135,14 +145,15 @@ pred = model.generate(**inputs,
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True)[len(input):])
```
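Since the diff view above collapses the model-loading and generation lines, here is a self-contained sketch of the full snippet for reference. The `bfloat16` dtype is an assumption, and the generation settings mirror the ModelScope example below:
```Python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "hpcai-tech/Colossal-LLaMA-2-7b-base"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # assumption: bf16 to reduce memory; fp32 also works
    trust_remote_code=True,
).eval()

input = '离离原上草,'
inputs = tokenizer(input, return_tensors="pt").to(model.device)
pred = model.generate(**inputs,
                      max_new_tokens=256,
                      do_sample=True,
                      top_p=0.95,
                      temperature=0.3)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True)[len(input):])
```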

#### Import from Modelscope
You can also load our model with ModelScope using the following code:
```Python
from modelscope import AutoModelForCausalLM, AutoTokenizer, snapshot_download
model_dir = snapshot_download('colossalai/Colossal-LLaMA-2-7b-base', revision='v1.0.1')
tokenizer = AutoTokenizer.from_pretrained(model_dir, device_map="auto", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True).eval()
generation_kwargs = {"max_new_tokens": 256,
                     "top_p": 0.95,
"temperature": 0.3
}
input = '离离原上草,'
@@ -153,6 +164,30 @@ print(tokenizer.decode(output.cpu()[0], skip_special_tokens=True)[len(input):])
```
You can download model weights from [🤗HuggingFace](https://huggingface.co/hpcai-tech/Colossal-LLaMA-2-7b-base) or [👾Modelscope](https://modelscope.cn/models/colossalai/Colossal-LLaMA-2-7b-base/summary).

#### Quick Start
You can use [`inference_example.py`](inference_example.py) to quickly try inference with our base model, loading the model weights from HuggingFace.

Command to run the script:
```bash
python inference_example.py \
--model_path "<HF_REPO_NAME_OR_LOCAL_PATH_TO_MODEL>" \
--device "cuda:0" \
--max_new_tokens 512 \
--do_sample True \
--temperature 0.3 \
--top_k 50 \
--top_p 0.95 \
--input_txt "YOUR_PROMPT_OR_QUESTION"
```
Here are the details of the CLI arguments:
* Model path: `--model_path`. HF repo name or local path of the model.
* Device: `--device`. Set the device.
* Max new tokens: `--max_new_tokens`. Set the maximum number of tokens to generate, ignoring the number of tokens in the prompt.
* Do sample: `--do_sample`. Set whether or not to use sampling.
* Temperature: `--temperature`. Set the temperature value.
* Top_k: `--top_k`. Set the top_k value for top-k filtering.
* Top_p: `--top_p`. Set the top_p value for generation.
* Input_txt: `--input_txt`. The prompt string input to the model.
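For example, a typical invocation (the prompt is illustrative) looks like this:
```bash
python inference_example.py \
--model_path "hpcai-tech/Colossal-LLaMA-2-7b-base" \
--device "cuda:0" \
--max_new_tokens 256 \
--do_sample True \
--temperature 0.3 \
--top_k 50 \
--top_p 0.95 \
--input_txt "离离原上草,"
```
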
## Usage
### Install

@@ -218,6 +253,8 @@ Here is details about CLI arguments:
❗️**Important**: Once you initialize the new model checkpoint, copy your new tokenizer files (`special_tokens_map.json`, `tokenizer.model` and `tokenizer_config.json`) to your new model folder.
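For example (directory names are placeholders):
```bash
cp <TOKENIZER_DIR>/special_tokens_map.json <TOKENIZER_DIR>/tokenizer.model <TOKENIZER_DIR>/tokenizer_config.json <NEW_MODEL_DIR>/
```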

#### 3. Data Preparation

##### 3.1 Data for Pretraining
Raw data should be in `jsonl` format. Each data point should have the following fields:
* `source` (str, compulsory): This part is ignored when calculating loss. It can be left empty.
* `target` (str, compulsory): Loss is calculated on this part.
@@ -250,7 +287,31 @@ Here is details about CLI arguments:
* Max length: `max_length`. Maximum length of spliced samples. The default value is 4096.
* Number of bins for each category: `num_spliced_dataset_bins`. Used for bucket-based training.
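For illustration, a minimal pretraining sample using the two fields described above could look like this (the content itself is invented):
```JSON
{"source": "", "target": "Colossal-AI provides a collection of parallel components for scaling model training to distributed environments."}
```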

##### 3.2 Data for Supervised Fine-tuning
We prepare data for supervised fine-tuning in a similar way. The main difference lies in the data format. Each data point should have the following field:
* `messages` (list, compulsory): This part consists of a conversation between a human and an assistant. The length of `messages` can vary; only content from the `assistant` is used for calculating loss.

Examples:
```JSON
{"messages": [{"from": "human", "content": "What are the three primary colors?"}, {"from": "assistant", "content": "The three primary colors are red, blue, and yellow."}]}
{"messages": [{"from": "human", "content": "解释个人电脑和服务器之间的区别。"}, {"from": "assistant", "content": "个人电脑和服务器是两种不同类型的计算机系统,它们的主要区别在于用途、硬件配置和性能。 个人电脑,顾名思义,是为个人使用而设计的计算机。它们通常用于日常的工作、娱乐和学习,可以运行各种各样的应用程序和游戏。个人电脑的硬件配置一般是按照标准配置来设计的,不过也可以根据个人需求进行定制。 而服务器是为了满足大量用户的需求而设计的计算机系统,它们通常用于为用户提供各种网络服务,如网站、电子邮件和文件传输等。服务器通常需要高性能的硬件配置,并且可以承受高负载和长时间的运行。由于服务器需要支持大量用户的访问,它们通常配备多核处理器、大容量内存和大容量硬盘驱动器,以提高系统的运行速度和稳定性。 总之,个人电脑和服务器之间的主要区别在于它们的用途、硬件配置和性能。个人电脑用于个人使用,而服务器用于支持大量用户的访问。服务器的硬件配置通常比个人电脑更高,以保证系统的性能和稳定性。"}]}
```
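Before converting, you can sanity-check the jsonl files against this schema; here is a small sketch (the path is a placeholder):
```Python
import json

def validate_sft_jsonl(path: str) -> None:
    """Check that each line has a non-empty `messages` list of human/assistant turns."""
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            messages = json.loads(line)["messages"]
            assert isinstance(messages, list) and messages, f"line {lineno}: `messages` must be a non-empty list"
            for turn in messages:
                assert turn["from"] in ("human", "assistant"), f"line {lineno}: unexpected role {turn['from']!r}"
                assert isinstance(turn["content"], str), f"line {lineno}: `content` must be a string"

validate_sft_jsonl("<JSONL_DIR_1>/data.jsonl")  # placeholder path
```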

The command to convert a jsonl dataset to arrow format is similar to the command in [3.1 Data for Pretraining](#31-data-for-pretraining). In `prepare_sft_dataset.py`, we don't concatenate different data samples.
```bash
python prepare_sft_dataset.py \
--data_input_dirs "<JSONL_DIR_1>,<JSONL_DIR_2>,<JSONL_DIR_3>" \
--tokenizer_dir "<TOKENIZER_DIR>" \
--data_cache_dir "jsonl_to_arrow_cache" \
--data_jsonl_output_dir "spliced_tokenized_output_jsonl" \
--data_arrow_output_dir "spliced_tokenized_output_arrow" \
--max_length 4096 \
--num_spliced_dataset_bins 10
```

#### 4. Command Line Arguments for Training

##### 4.1 Arguments for Pretraining
You can use `colossalai run` to launch multi-node training:
```bash
colossalai run --nproc_per_node YOUR_GPU_PER_NODE --hostfile YOUR_HOST_FILE \
@@ -288,7 +349,16 @@ Here is details about CLI arguments:
* Tensor parallelism size: `--tp`. TP size for 3D parallelism. The default value is 1.
* ZeRO stage: `--zero`. ZeRO stage for 3D parallelism. The default value is 1.

##### 4.2 Arguments for Supervised Fine-tuning
We add support for gradient accumulation and NEFTune for supervised fine-tuning, so there are two more arguments in addition to those listed in [4.1 Arguments for Pretraining](#41-arguments-for-pretraining).

Here are the details of the two additional CLI arguments:
* Accumulation steps: `--accumulation_steps`. The number of micro-batches to accumulate before each optimizer step. The default value is `8`.
* NEFTune: `--use_neft`. The default value is `False`. Enabling it can help improve the performance of chat models.
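For intuition, NEFTune perturbs the input embeddings with uniform noise during training. A minimal sketch of the idea (not this script's actual implementation) is:
```Python
import math
import torch

def neftune_hook(module: torch.nn.Module, inputs, output: torch.Tensor,
                 noise_alpha: float = 5.0) -> torch.Tensor:
    """Add uniform noise to token embeddings during training (NEFTune, arXiv:2310.05914).

    The noise magnitude is noise_alpha / sqrt(seq_len * hidden_dim), following the
    paper; no noise is added at evaluation time.
    """
    if module.training:
        seq_len, hidden_dim = output.size(1), output.size(2)
        mag_norm = noise_alpha / math.sqrt(seq_len * hidden_dim)
        output = output + torch.zeros_like(output).uniform_(-mag_norm, mag_norm)
    return output

# Attach the hook to the model's input embedding layer, e.g.:
#   model.get_input_embeddings().register_forward_hook(neftune_hook)
```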

#### 5. Running Command

##### 5.1 Command for Pretraining
An [example bash script](train.example.sh) is also provided for the experiment. Here are the steps to run the experiment:
* Create your own hostfile: `cp hostfile.example hostfile`.
* Create your own bash script: `cp train.example.sh train.sh`.
Expand All @@ -310,6 +380,10 @@ declare -a dataset=(
"<DIR_2>/part-00000"
)
```

##### 5.2 Command for Supervised Fine-tuning
An [example bash script](train_sft.example.sh) is provided. The only difference from the pretraining command is the two additional arguments (`--accumulation_steps` and `--use_neft`) in the script. You can refer to [4.2 Arguments for Supervised Fine-tuning](#42-arguments-for-supervised-fine-tuning) for more details.

## Technical Insights
To enhance LLaMA-2's capabilities for understanding and generating Chinese content, the [Colossal-AI](https://github.com/hpcaitech/ColossalAI) team proposes continuing the pre-training of the LLaMA-2 model on both Chinese and English corpora. The overall pipeline can be described as follows:

@@ -416,3 +490,11 @@ Applying the above process to perform knowledge transfer in any field allows for
year={2023}
}
```
```bibtex
@article{jain2023neftune,
title={NEFTune: Noisy Embeddings Improve Instruction Finetuning},
author={Jain, Neel and Chiang, Ping-yeh and Wen, Yuxin and Kirchenbauer, John and Chu, Hong-Min and Somepalli, Gowthami and Bartoldson, Brian R and Kailkhura, Bhavya and Schwarzschild, Avi and Saha, Aniruddha and others},
journal={arXiv preprint arXiv:2310.05914},
year={2023}
}
```
@@ -0,0 +1,96 @@
# Copyright 2023 lm-sys@FastChat
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import dataclasses
from enum import Enum, auto
from typing import List, Optional


class SeparatorStyle(Enum):
    ADD_BOS_EOS_TOKEN = auto()


@dataclasses.dataclass
class Conversation:
    """Tracks a multi-turn conversation and renders it into a model prompt."""

    system: str
    roles: List[str]
    messages: List[List[str]]
    offset: int
    sep_style: SeparatorStyle
    seps: List[str]

    def clear(self):
        self.messages = []

    def get_prompt(self, length: Optional[int] = None):
        """Render the first `length` messages (default: all) into a prompt string."""
        if length is None:
            length = len(self.messages)

        if self.sep_style == SeparatorStyle.ADD_BOS_EOS_TOKEN:
            ret = self.system
            for role, message in self.messages[0:length]:
                if message:
                    ret += role + ": " + self.seps[0] + message + self.seps[1]
                else:
                    ret += role + ": " + self.seps[0]
            return ret
        else:
            raise ValueError(f"Invalid style: {self.sep_style}")

    def save_prompt(self):
        """Like `get_prompt`, but separates completed turns with newlines for saving."""
        if self.sep_style == SeparatorStyle.ADD_BOS_EOS_TOKEN:
            ret = self.system
            for role, message in self.messages:
                if message:
                    ret += role + ": " + self.seps[0] + message + self.seps[1] + "\n"
                else:
                    ret += role + ": " + self.seps[0]
            return ret
        else:
            raise ValueError(f"Invalid style: {self.sep_style}")

    def append_message(self, role, message):
        self.messages.append([role, message])

    def copy(self):
        return Conversation(
            system=self.system,
            roles=self.roles,
            messages=[[x, y] for x, y in self.messages],
            offset=self.offset,
            sep_style=self.sep_style,
            seps=self.seps,
        )

    def dict(self):
        return {
            "system": self.system,
            "roles": self.roles,
            "messages": self.messages,
            "offset": self.offset,
            "seps": self.seps,
        }


conv = Conversation(
    system="A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
    roles=("Human", "Assistant"),
    messages=[],
    offset=0,
    sep_style=SeparatorStyle.ADD_BOS_EOS_TOKEN,
    seps=["<s>", "</s>"],
)

default_conversation = conv
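

# Usage sketch (illustrative): build a prompt from the default template above.
if __name__ == "__main__":
    demo = default_conversation.copy()
    demo.append_message(demo.roles[0], "What are the three primary colors?")
    demo.append_message(demo.roles[1], None)
    # Prints the system text followed by
    # "Human: <s>What are the three primary colors?</s>Assistant: <s>"
    print(demo.get_prompt())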