From 3fb3bdece274b33b0e312970f255109af668af68 Mon Sep 17 00:00:00 2001
From: CjhHa1
Date: Wed, 30 Aug 2023 13:34:38 +0800
Subject: [PATCH 1/3] create readme

---
 colossalai/inference/README.md           | 0
 tests/kit/model_zoo/torchrec/__init__.py | 2 +-
 2 files changed, 1 insertion(+), 1 deletion(-)
 create mode 100644 colossalai/inference/README.md

diff --git a/colossalai/inference/README.md b/colossalai/inference/README.md
new file mode 100644
index 000000000000..e69de29bb2d1
diff --git a/tests/kit/model_zoo/torchrec/__init__.py b/tests/kit/model_zoo/torchrec/__init__.py
index 43952e6998cf..4a19f2449602 100644
--- a/tests/kit/model_zoo/torchrec/__init__.py
+++ b/tests/kit/model_zoo/torchrec/__init__.py
@@ -1 +1 @@
-from .torchrec import *
+#from .torchrec import *

From f534293882cc5030d6ac82dc2714de9d51fd2544 Mon Sep 17 00:00:00 2001
From: CjhHa1
Date: Wed, 30 Aug 2023 16:52:01 +0800
Subject: [PATCH 2/3] add readme.md

---
 colossalai/inference/README.md           | 91 ++++++++++++++++++++++++
 tests/kit/model_zoo/torchrec/__init__.py |  2 +-
 2 files changed, 92 insertions(+), 1 deletion(-)

diff --git a/colossalai/inference/README.md b/colossalai/inference/README.md
index e69de29bb2d1..0a8c2c9acd5d 100644
--- a/colossalai/inference/README.md
+++ b/colossalai/inference/README.md
@@ -0,0 +1,91 @@
+# 🚀 Colossal-Inference
+
+## Table of contents
+
+## Introduction
+
+`Colossal Inference` is a module that contains colossal-ai designed inference framework, featuring high performance, steady and easy usability. `Colossal Inference` incorporated the advantages of the latest open-source inference systems, including TGI, vLLM, FasterTransformer, Lightly and flash attention. while combining the design of Colossal AI, especially Shardformer, to reduce the learning curve for users.
+
+## Design
+
+Colossal Inference is composed of three main components:
+
+1. High performance kernels and ops: which are inspired by existing libraries and modified accordingly.
+2. Efficient memory management mechanism: which includes the key-value cache manager, allowing for zero memory waste during inference.
+   1. `cache manager`: serves as a memory manager to help manage the key-value cache; it integrates functions such as memory allocation, indexing, and release.
+   2. `batch_infer_info`: holds all essential elements of a batch inference, which is updated every batch.
+3. High-level inference engine combined with `Shardformer`: it allows the our inference framework to easily invoke and utilize various parallel methods.
+   1. `engine.TPInferEngine`: it is a high-level interface that integrates with shardformer, especially for multi-card (tensor parallel) inference.
+   2. `modeling.llama.LlamaInferenceForwards`: contains the `forward` methods for llama inference.
+   3. `policies.llama.LlamaModelInferPolicy` : contains the policies for `llama` models, which is used to call `shardformer` and segmentation the model forward in tensor parallelism way.
+
+## Pipeline of inference:
+
+In this section we discuss how Colossal Inference works and integrates with `Shardformer`. The details can be found in our code.
+
+![Colossal-inference-2.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/1747151c-bfac-4f31-b780-828dd517fa96/Colossal-inference-2.png)
+
+## Roadmap of our implementation
+
+- [x] Design cache manager and batch infer state
+- [x] Design TpInference engine to integrate with `Shardformer`
+- [x] Register corresponding high-performance `kernel` and `ops`
+- [x] Design policies and forwards (e.g. `Llama` and `Bloom`
+  - [x] policy
+  - [x] context forward
+  - [x] token forward
+- [ ] Replace the kernels with `faster-transformer` in token-forward stage
+- [ ] Support all models
+  - [x] Llama
+  - [x] Bloom
+  - [ ] Chatglm2
+- [ ] Benchmarking for all models
+
+## Get stated
+
+### Installation
+
+```bash
+pip install -e .
+```
+
+### Requirements
+
+Dependencies:
+
+```bash
+pytorch= 1.13.1 (gpu)
+transformers= 4.30.2
+triton==2.0.0.dev20221202
+vllm=
+flash-attention=
+```
+
+### Doker
+
+You can use our official doker container as well.
+
+```bash
+doker..
+```
+
+### Dive into fast-inference!
+
+Example files are in:
+
+```bash
+cd colossalai.examples
+python xx
+```
+
+## Performance
+
+### Environment
+
+We conducted [benchmark tests](https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/shardformer/examples/performance_benchmark.py) to evaluate the performance. We compared the inference `latency` and `throughput` between `colossal-inference` and `torch`.
+
+We set the batch size to 4, the number of attention heads to 8, and the head dimension to 64. `N_CTX` refers to the sequence length.
+
+In the case of using 2 GPUs, the results are as follows.
+
+###
diff --git a/tests/kit/model_zoo/torchrec/__init__.py b/tests/kit/model_zoo/torchrec/__init__.py
index 4a19f2449602..43952e6998cf 100644
--- a/tests/kit/model_zoo/torchrec/__init__.py
+++ b/tests/kit/model_zoo/torchrec/__init__.py
@@ -1 +1 @@
-#from .torchrec import *
+from .torchrec import *

From 83f4e4f800b945257fdd673dabf50cf2796ce215 Mon Sep 17 00:00:00 2001
From: CjhHa1
Date: Wed, 30 Aug 2023 17:43:24 +0800
Subject: [PATCH 3/3] fix typos

---
 colossalai/inference/README.md | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/colossalai/inference/README.md b/colossalai/inference/README.md
index 0a8c2c9acd5d..abfd1ff9070a 100644
--- a/colossalai/inference/README.md
+++ b/colossalai/inference/README.md
@@ -4,7 +4,7 @@

 ## Introduction

-`Colossal Inference` is a module that contains colossal-ai designed inference framework, featuring high performance, steady and easy usability. `Colossal Inference` incorporated the advantages of the latest open-source inference systems, including TGI, vLLM, FasterTransformer, Lightly and flash attention. while combining the design of Colossal AI, especially Shardformer, to reduce the learning curve for users.
+`Colossal Inference` is a module containing Colossal-AI's self-designed inference framework, featuring high performance, stability, and ease of use. `Colossal Inference` incorporates the advantages of the latest open-source inference systems, including TGI, vLLM, FasterTransformer, LightLLM, and FlashAttention, while combining the design of Colossal-AI, especially Shardformer, to reduce the learning curve for users.

 ## Design

@@ -14,10 +14,10 @@
 2. Efficient memory management mechanism: which includes the key-value cache manager, allowing for zero memory waste during inference.
    1. `cache manager`: serves as a memory manager to help manage the key-value cache; it integrates functions such as memory allocation, indexing, and release.
    2. `batch_infer_info`: holds all essential elements of a batch inference, which is updated every batch.
-3. High-level inference engine combined with `Shardformer`: it allows the our inference framework to easily invoke and utilize various parallel methods.
+3. High-level inference engine combined with `Shardformer`: it allows our inference framework to easily invoke and utilize various parallel methods.
    1. `engine.TPInferEngine`: it is a high-level interface that integrates with shardformer, especially for multi-card (tensor parallel) inference.
    2. `modeling.llama.LlamaInferenceForwards`: contains the `forward` methods for llama inference.
-   3. `policies.llama.LlamaModelInferPolicy` : contains the policies for `llama` models, which is used to call `shardformer` and segmentation the model forward in tensor parallelism way.
+   3. `policies.llama.LlamaModelInferPolicy`: contains the policies for `llama` models, which are used to call `shardformer` and segment the model forward in a tensor-parallel way.
 ## Pipeline of inference:

@@ -30,7 +30,7 @@ In this section we discuss how Colossal Inference works and integrates with
 - [x] Design cache manager and batch infer state
 - [x] Design TpInference engine to integrate with `Shardformer`
 - [x] Register corresponding high-performance `kernel` and `ops`
-- [x] Design policies and forwards (e.g. `Llama` and `Bloom`
+- [x] Design policies and forwards (e.g. `Llama` and `Bloom`)
   - [x] policy
   - [x] context forward
   - [x] token forward
@@ -41,7 +41,7 @@ In this section we discuss how Colossal Inference works and integrates with
   - [ ] Chatglm2
 - [ ] Benchmarking for all models

-## Get stated
+## Get started

 ### Installation

@@ -61,12 +61,12 @@ vllm=
 flash-attention=
 ```

-### Doker
+### Docker

-You can use our official doker container as well.
+You can use our official docker container as well.

 ```bash
-doker..
+docker..
 ```

 ### Dive into fast-inference!
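The README added by this patch series describes a `cache manager` (memory allocation, indexing, release for the key-value cache) and a per-batch `batch_infer_info` state. The sketch below illustrates those responsibilities only; it is not the Colossal-AI implementation, and every name, field, and the block-based allocation scheme is an illustrative assumption based on the README's wording:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class BatchInferInfo:
    # Illustrative stand-in for the README's `batch_infer_info`:
    # per-batch state, refreshed every batch (fields are assumptions).
    batch_size: int
    seq_lens: List[int]          # current length of each sequence
    block_ids: List[List[int]]   # KV-cache blocks assigned per sequence

class CacheManager:
    """Toy KV-cache manager covering the three duties the README names:
    memory allocation, indexing, and release."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks: List[int] = list(range(num_blocks))
        self.allocated: Dict[str, List[int]] = {}

    def allocate(self, req_id: str, n_blocks: int) -> List[int]:
        # Hand out fixed-size cache blocks from the free pool.
        if len(self.free_blocks) < n_blocks:
            raise RuntimeError("KV cache exhausted")
        blocks = [self.free_blocks.pop() for _ in range(n_blocks)]
        self.allocated.setdefault(req_id, []).extend(blocks)
        return blocks

    def index(self, req_id: str) -> List[int]:
        # Look up which cache blocks hold a request's keys/values.
        return self.allocated[req_id]

    def release(self, req_id: str) -> None:
        # Returning blocks to the free pool is what lets a manager like
        # this approach the "zero memory waste" goal across batches.
        self.free_blocks.extend(self.allocated.pop(req_id, []))

mgr = CacheManager(num_blocks=8)
a = mgr.allocate("req0", 3)
b = mgr.allocate("req1", 2)
info = BatchInferInfo(batch_size=2, seq_lens=[3, 2], block_ids=[a, b])
mgr.release("req0")
print(len(mgr.free_blocks))  # 6
```

A real engine would pair such bookkeeping with device tensors and the tensor-parallel forwards registered through `Shardformer`; this sketch only shows the allocation lifecycle the README attributes to the cache manager.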