For various models, experiments were conducted using multiple batch sizes under

### Single GPU Performance:

Currently, the stats below are measured on a single A100 GPU. Token latency is calculated from the average of the context (prefill) forward pass and the decoding forward passes, meaning both stages are combined when computing token generation time. We are actively developing new features and methods to further optimize the performance of LLM models. Please stay tuned.
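As a rough illustration of this metric, the sketch below combines the prefill time and the per-token decoding times into an average token latency. This is a hypothetical helper for clarity only, not part of the Colossal-AI inference API:

```python
def combined_token_latency(prefill_time_s, decode_times_s):
    """Average per-token latency combining the context (prefill) forward
    pass and the subsequent decoding forward passes.

    prefill_time_s:  wall-clock time of the single context forward pass
    decode_times_s:  list of wall-clock times, one per decoding step
    """
    total_time = prefill_time_s + sum(decode_times_s)
    # The prefill step also yields the first generated token, so the
    # total token count is the number of decoding steps plus one.
    num_tokens = len(decode_times_s) + 1
    return total_time / num_tokens


# Example: one 120 ms prefill followed by 31 decoding steps of 20 ms each
latency = combined_token_latency(0.12, [0.02] * 31)
print(f"{latency * 1000:.2f} ms/token")
```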

#### Llama

| batch_size | 8 | 16 | 32 |

![llama](https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/inference/Infer-llama.png)

### Bloom

| batch_size | 4 | 8 |
| :---------------------: | :----: | :----: |