From 070ae85e9e1201780c4362b06fdd9c52369f0d5e Mon Sep 17 00:00:00 2001
From: "cuiqing.li"
Date: Fri, 1 Sep 2023 16:17:55 +0800
Subject: [PATCH 1/2] update readme

---
 colossalai/inference/README.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/colossalai/inference/README.md b/colossalai/inference/README.md
index 591d3c93a220..62583e10cf55 100644
--- a/colossalai/inference/README.md
+++ b/colossalai/inference/README.md
@@ -94,6 +94,8 @@ For various models, experiments were conducted using multiple batch sizes under
 
 ### Single GPU Performance:
 
+Currently the stats below are calculated based on A100 (single GPU), and we calculate token latency based average values of context-forward and decoding forward process, which means we combine both of process to calculate results. We are actively developing new features and methods to furthur optimize the performance of LLM models. Please stay tuned.
+
 #### Llama
 
 | batch_size | 8 | 16 | 32 |
@@ -103,7 +105,7 @@ For various models, experiments were conducted using multiple batch sizes under
 
 ![llama](https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/inference/Infer-llama.png)
 
-###
+### Bloom
 
 | batch_size | 4 | 8 |
 | :---------------------: | :----: | :----: |

From 032f7a5701a0d7426e3e1f352d4988cb2d68b0eb Mon Sep 17 00:00:00 2001
From: Cuiqing Li
Date: Fri, 1 Sep 2023 16:20:30 +0800
Subject: [PATCH 2/2] Update README.md

---
 colossalai/inference/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/colossalai/inference/README.md b/colossalai/inference/README.md
index 62583e10cf55..2fb255e03a04 100644
--- a/colossalai/inference/README.md
+++ b/colossalai/inference/README.md
@@ -94,7 +94,7 @@ For various models, experiments were conducted using multiple batch sizes under
 
 ### Single GPU Performance:
 
-Currently the stats below are calculated based on A100 (single GPU), and we calculate token latency based average values of context-forward and decoding forward process, which means we combine both of process to calculate results. We are actively developing new features and methods to furthur optimize the performance of LLM models. Please stay tuned.
+Currently, the stats below are measured on a single A100 GPU. We calculate token latency from the average of the context-forward and decoding-forward passes, i.e. both processes are combined when computing token generation times. We are actively developing new features and methods to further optimize the performance of LLM models. Please stay tuned.
 
 #### Llama
 
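For reference, below is a minimal sketch of the latency bookkeeping the README paragraph describes: the context (prefill) forward pass and the decoding forward passes are timed together and the total is divided by the number of generated tokens. The `generate_fn` callable is a hypothetical stand-in for whatever inference entry point is being benchmarked, not part of the Colossal-AI inference API.

```python
import time


def average_token_latency(generate_fn, prompt, max_new_tokens):
    """Return the average per-token latency (in seconds) for one generation.

    `generate_fn` is a hypothetical callable that runs the context (prefill)
    forward pass and then decodes `max_new_tokens` tokens.
    """
    start = time.perf_counter()
    generate_fn(prompt, max_new_tokens)  # prefill + decode, timed together
    elapsed = time.perf_counter() - start

    # Both stages share one wall-clock measurement, so the per-token figure
    # amortizes the prefill cost across all generated tokens.
    return elapsed / max_new_tokens
```

Repeating this measurement at each batch size would produce per-configuration numbers like those in the Llama (8/16/32) and Bloom (4/8) tables referenced by the patch.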