From 3fb3bdece274b33b0e312970f255109af668af68 Mon Sep 17 00:00:00 2001
From: CjhHa1
Date: Wed, 30 Aug 2023 13:34:38 +0800
Subject: [PATCH 1/3] create readme

---
 colossalai/inference/README.md           | 0
 tests/kit/model_zoo/torchrec/__init__.py | 2 +-
 2 files changed, 1 insertion(+), 1 deletion(-)
 create mode 100644 colossalai/inference/README.md

diff --git a/colossalai/inference/README.md b/colossalai/inference/README.md
new file mode 100644
index 000000000000..e69de29bb2d1
diff --git a/tests/kit/model_zoo/torchrec/__init__.py b/tests/kit/model_zoo/torchrec/__init__.py
index 43952e6998cf..4a19f2449602 100644
--- a/tests/kit/model_zoo/torchrec/__init__.py
+++ b/tests/kit/model_zoo/torchrec/__init__.py
@@ -1 +1 @@
-from .torchrec import *
+#from .torchrec import *

From f534293882cc5030d6ac82dc2714de9d51fd2544 Mon Sep 17 00:00:00 2001
From: CjhHa1
Date: Wed, 30 Aug 2023 16:52:01 +0800
Subject: [PATCH 2/3] add readme.md

---
 colossalai/inference/README.md           | 91 ++++++++++++++++++++++++
 tests/kit/model_zoo/torchrec/__init__.py |  2 +-
 2 files changed, 92 insertions(+), 1 deletion(-)

diff --git a/colossalai/inference/README.md b/colossalai/inference/README.md
index e69de29bb2d1..0a8c2c9acd5d 100644
--- a/colossalai/inference/README.md
+++ b/colossalai/inference/README.md
@@ -0,0 +1,91 @@
+# 🚀 Colossal-Inference
+
+## Table of contents
+
+## Introduction
+
+`Colossal Inference` is a module that contains colossal-ai designed inference framework, featuring high performance, steady and easy usability. `Colossal Inference` incorporated the advantages of the latest open-source inference systems, including TGI, vLLM, FasterTransformer, Lightly and flash attention. while combining the design of Colossal AI, especially Shardformer, to reduce the learning curve for users.
+
+## Design
+
+Colossal Inference is composed of three main components:
+
+1. High performance kernels and ops: which are inspired by existing libraries and modified accordingly.
+2. Efficient memory management mechanism: which includes the key-value cache manager, allowing for zero memory waste during inference.
+   1. `cache manager`: serves as a memory manager to help manage the key-value cache; it integrates functions such as memory allocation, indexing, and release.
+   2. `batch_infer_info`: holds all essential elements of a batch inference, which is updated every batch.
+3. High-level inference engine combined with `Shardformer`: it allows the our inference framework to easily invoke and utilize various parallel methods.
+   1. `engine.TPInferEngine`: it is a high-level interface that integrates with shardformer, especially for multi-card (tensor parallel) inference.
+   2. `modeling.llama.LlamaInferenceForwards`: contains the `forward` methods for llama inference.
+   3. `policies.llama.LlamaModelInferPolicy` : contains the policies for `llama` models, which is used to call `shardformer` and segmentation the model forward in tensor parallelism way.
+
+## Pipeline of inference:
+
+In this section we discuss how Colossal Inference works and integrates with `Shardformer`. The details can be found in our code.
+
+![Colossal-inference-2.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/1747151c-bfac-4f31-b780-828dd517fa96/Colossal-inference-2.png)
+
+## Roadmap of our implementation
+
+- [x] Design cache manager and batch infer state
+- [x] Design TpInference engine to integrate with `Shardformer`
+- [x] Register corresponding high-performance `kernel` and `ops`
+- [x] Design policies and forwards (e.g. `Llama` and `Bloom`
+  - [x] policy
+  - [x] context forward
+  - [x] token forward
+- [ ] Replace the kernels with `faster-transformer` in token-forward stage
+- [ ] Support all models
+  - [x] Llama
+  - [x] Bloom
+  - [ ] Chatglm2
+- [ ] Benchmarking for all models
+
+## Get stated
+
+### Installation
+
+```bash
+pip install -e .
+```
+
+### Requirements
+
+Dependencies:
+
+```bash
+pytorch= 1.13.1 (gpu)
+transformers= 4.30.2
+triton==2.0.0.dev20221202
+vllm=
+flash-attention=
+```
+
+### Doker
+
+You can use our official doker container as well.
+
+```bash
+doker..
+```
+
+### Dive into fast-inference!
+
+Example files are in:
+
+```bash
+cd colossalai.examples
+python xx
+```
+
+## Performance
+
+### Environment
+
+We conducted [benchmark tests](https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/shardformer/examples/performance_benchmark.py) to evaluate the performance. We compared the inference `latency` and `throughput` between `colossal-inference` and `torch`.
+
+We set the batch size to 4, the number of attention heads to 8, and the head dimension to 64. `N_CTX` refers to the sequence length.
+
+In the case of using 2 GPUs, the results are as follows.
+
+###
diff --git a/tests/kit/model_zoo/torchrec/__init__.py b/tests/kit/model_zoo/torchrec/__init__.py
index 4a19f2449602..43952e6998cf 100644
--- a/tests/kit/model_zoo/torchrec/__init__.py
+++ b/tests/kit/model_zoo/torchrec/__init__.py
@@ -1 +1 @@
-#from .torchrec import *
+from .torchrec import *

From 83f4e4f800b945257fdd673dabf50cf2796ce215 Mon Sep 17 00:00:00 2001
From: CjhHa1
Date: Wed, 30 Aug 2023 17:43:24 +0800
Subject: [PATCH 3/3] fix typos

---
 colossalai/inference/README.md | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/colossalai/inference/README.md b/colossalai/inference/README.md
index 0a8c2c9acd5d..abfd1ff9070a 100644
--- a/colossalai/inference/README.md
+++ b/colossalai/inference/README.md
@@ -4,7 +4,7 @@

 ## Introduction

-`Colossal Inference` is a module that contains colossal-ai designed inference framework, featuring high performance, steady and easy usability. `Colossal Inference` incorporated the advantages of the latest open-source inference systems, including TGI, vLLM, FasterTransformer, Lightly and flash attention. while combining the design of Colossal AI, especially Shardformer, to reduce the learning curve for users.
+`Colossal Inference` is a module containing Colossal-AI's self-designed inference framework, featuring high performance, stability, and ease of use. `Colossal Inference` incorporates the advantages of the latest open-source inference systems, including TGI, vLLM, FasterTransformer, LightLLM, and FlashAttention, while combining the design of Colossal-AI, especially Shardformer, to reduce the learning curve for users.

 ## Design

@@ -14,10 +14,10 @@
 2. Efficient memory management mechanism: which includes the key-value cache manager, allowing for zero memory waste during inference.
    1. `cache manager`: serves as a memory manager to help manage the key-value cache; it integrates functions such as memory allocation, indexing, and release.
    2. `batch_infer_info`: holds all essential elements of a batch inference, which is updated every batch.
-3. High-level inference engine combined with `Shardformer`: it allows the our inference framework to easily invoke and utilize various parallel methods.
+3. High-level inference engine combined with `Shardformer`: it allows our inference framework to easily invoke and utilize various parallel methods.
    1. `engine.TPInferEngine`: it is a high-level interface that integrates with shardformer, especially for multi-card (tensor parallel) inference.
    2. `modeling.llama.LlamaInferenceForwards`: contains the `forward` methods for llama inference.
-   3. `policies.llama.LlamaModelInferPolicy` : contains the policies for `llama` models, which is used to call `shardformer` and segmentation the model forward in tensor parallelism way.
+   3. `policies.llama.LlamaModelInferPolicy`: contains the policies for `llama` models, which are used to call `shardformer` and segment the model forward in a tensor-parallel way.
 ## Pipeline of inference:

@@ -30,7 +30,7 @@ In this section we discuss how Colossal Inference works and integrates with
 - [x] Design cache manager and batch infer state
 - [x] Design TpInference engine to integrate with `Shardformer`
 - [x] Register corresponding high-performance `kernel` and `ops`
-- [x] Design policies and forwards (e.g. `Llama` and `Bloom`
+- [x] Design policies and forwards (e.g. `Llama` and `Bloom`)
   - [x] policy
   - [x] context forward
   - [x] token forward
@@ -41,7 +41,7 @@ In this section we discuss how Colossal Inference works and integrates with
   - [ ] Chatglm2
 - [ ] Benchmarking for all models

-## Get stated
+## Get started

 ### Installation

@@ -61,12 +61,12 @@ vllm=
 flash-attention=
 ```

-### Doker
+### Docker

-You can use our official doker container as well.
+You can use our official docker container as well.

 ```bash
-doker..
+docker..
 ```

 ### Dive into fast-inference!
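The README added by this patch series describes a `cache manager` (memory allocation, indexing, release for the key-value cache) and a per-batch `batch_infer_info` state. The sketch below illustrates those responsibilities only; it is not the Colossal-AI implementation, and every name, field, and the block-based allocation scheme is an illustrative assumption based on the README's wording:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class BatchInferInfo:
    # Illustrative stand-in for the README's `batch_infer_info`:
    # per-batch state, refreshed every batch (fields are assumptions).
    batch_size: int
    seq_lens: List[int]          # current length of each sequence
    block_ids: List[List[int]]   # KV-cache blocks assigned per sequence

class CacheManager:
    """Toy KV-cache manager covering the three duties the README names:
    memory allocation, indexing, and release."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks: List[int] = list(range(num_blocks))
        self.allocated: Dict[str, List[int]] = {}

    def allocate(self, req_id: str, n_blocks: int) -> List[int]:
        # Hand out fixed-size cache blocks from the free pool.
        if len(self.free_blocks) < n_blocks:
            raise RuntimeError("KV cache exhausted")
        blocks = [self.free_blocks.pop() for _ in range(n_blocks)]
        self.allocated.setdefault(req_id, []).extend(blocks)
        return blocks

    def index(self, req_id: str) -> List[int]:
        # Look up which cache blocks hold a request's keys/values.
        return self.allocated[req_id]

    def release(self, req_id: str) -> None:
        # Returning blocks to the free pool is what lets a manager like
        # this approach the "zero memory waste" goal across batches.
        self.free_blocks.extend(self.allocated.pop(req_id, []))

mgr = CacheManager(num_blocks=8)
a = mgr.allocate("req0", 3)
b = mgr.allocate("req1", 2)
info = BatchInferInfo(batch_size=2, seq_lens=[3, 2], block_ids=[a, b])
mgr.release("req0")
print(len(mgr.free_blocks))  # 6
```

A real engine would pair such bookkeeping with device tensors and the tensor-parallel forwards registered through `Shardformer`; this sketch only shows the allocation lifecycle the README attributes to the cache manager.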