Merged
24 commits
- 6df844b [release] grok-1 314b inference (#5490) (binmakeswell, Mar 22, 2024)
- 5fcd779 [example] update Grok-1 inference (#5495) (yuanheng-zhao, Mar 24, 2024)
- bb0a668 [hotfix] set return_outputs=False in examples and polish code (#5404) (cwher, Mar 25, 2024)
- 34e9092 [release] grok-1 inference benchmark (#5500) (binmakeswell, Mar 25, 2024)
- 0688d92 [shardformer]Fix lm parallel. (#5480) (flybird11111, Mar 25, 2024)
- 131f32a [fix] fix grok-1 example typo (#5506) (yuanheng-zhao, Mar 26, 2024)
- a7790a9 [devops] fix example test ci (#5504) (ver217, Mar 26, 2024)
- cbe34c5 Fix ColoTensorSpec for py11 (#5440) (dementrock, Mar 26, 2024)
- 61da3fb fixed layout converter caching and updated tester (Edenzzzz, Mar 26, 2024)
- 18edcd5 Empty-Commit (Edenzzzz, Mar 26, 2024)
- 9a3321e Merge pull request #5515 from Edenzzzz/fix_layout_convert (Edenzzzz, Mar 26, 2024)
- 19e1a5c [shardformer] update colo attention to support custom mask (#5510) (ver217, Mar 27, 2024)
- e6707a6 [format] applied code formatting on changed files in pull request 551… (github-actions[bot], Mar 27, 2024)
- 00525f7 [shardformer] fix pipeline forward error if custom layer distribution… (insujang, Mar 27, 2024)
- 36c4bb2 [Fix] Grok-1 use tokenizer from the same pretrained path (#5532) (yuanheng-zhao, Mar 28, 2024)
- df5e9c5 [ColossalChat] Update RLHF V2 (#5286) (YeAnbang, Mar 29, 2024)
- e614aa3 [shardformer, pipeline] add `gradient_checkpointing_ratio` and hetero… (cwher, Apr 1, 2024)
- 7e0ec5a fix incorrect sharding without zero (#5545) (Edenzzzz, Apr 2, 2024)
- 8e412a5 [shardformer] Sequence Parallelism Optimization (#5533) (KKZ20, Apr 3, 2024)
- 15055f9 [hotfix] quick fixes to make legacy tutorials runnable (#5559) (Edenzzzz, Apr 7, 2024)
- a799ca3 [fix] fix typo s/muiti-node /multi-node etc. (#5448) (digger-yu, Apr 7, 2024)
- 341263d [hotfix] fix typo s/get_defualt_parser /get_default_parser (#5548) (digger-yu, Apr 7, 2024)
- 641b1ee [devops] remove post commit ci (#5566) (ver217, Apr 8, 2024)
- dfabcc3 reslove conflicts (flybird11111, Apr 10, 2024)
1 change: 1 addition & 0 deletions .github/pull_request_template.md
@@ -3,6 +3,7 @@
 - [ ] I have created an issue for this PR for traceability
 - [ ] The title follows the standard format: `[doc/gemini/tensor/...]: A concise description`
 - [ ] I have added relevant tags if possible for us to better distinguish different PRs
+- [ ] I have installed pre-commit: `pip install pre-commit && pre-commit install`


 ## 🚨 Issue number
2 changes: 1 addition & 1 deletion .github/workflows/build_on_pr.yml
@@ -117,7 +117,7 @@ jobs:
 cd TensorNVMe
 conda install cmake
 pip install -r requirements.txt
-pip install -v .
+DISABLE_URING=1 pip install -v .

 - name: Store TensorNVMe Cache
 run: |
2 changes: 1 addition & 1 deletion .github/workflows/build_on_schedule.yml
@@ -44,7 +44,7 @@ jobs:
 cd TensorNVMe
 conda install cmake
 pip install -r requirements.txt
-pip install -v .
+DISABLE_URING=1 pip install -v .

 - uses: actions/checkout@v2
 if: steps.check-avai.outputs.avai == 'true'
2 changes: 1 addition & 1 deletion .github/workflows/compatiblity_test_on_dispatch.yml
@@ -66,7 +66,7 @@ jobs:
 cd TensorNVMe
 apt update && apt install -y cmake
 pip install -r requirements.txt
-pip install -v .
+DISABLE_URING=1 pip install -v .
 - uses: actions/checkout@v2
 with:
 ssh-key: ${{ secrets.SSH_KEY_FOR_CI }}
2 changes: 1 addition & 1 deletion .github/workflows/compatiblity_test_on_pr.yml
@@ -60,7 +60,7 @@ jobs:
 cd TensorNVMe
 apt update && apt install -y cmake
 pip install -r requirements.txt
-pip install -v .
+DISABLE_URING=1 pip install -v .
 - uses: actions/checkout@v2
 with:
 ssh-key: ${{ secrets.SSH_KEY_FOR_CI }}
2 changes: 1 addition & 1 deletion .github/workflows/compatiblity_test_on_schedule.yml
@@ -56,7 +56,7 @@ jobs:
 cd TensorNVMe
 apt update && apt install -y cmake
 pip install -r requirements.txt
-pip install -v .
+DISABLE_URING=1 pip install -v .
 - uses: actions/checkout@v2
 with:
 ssh-key: ${{ secrets.SSH_KEY_FOR_CI }}
4 changes: 1 addition & 3 deletions .github/workflows/example_check_on_dispatch.yml
@@ -46,7 +46,7 @@ jobs:
 matrix: ${{fromJson(needs.manual_check_matrix_preparation.outputs.matrix)}}
 container:
 image: hpcaitech/pytorch-cuda:2.1.0-12.1.0
-options: --gpus all --rm -v /data/scratch/examples-data:/data/
+options: --gpus all --rm -v /data/scratch/examples-data:/data/ -v /dev/shm
 timeout-minutes: 15
 steps:
 - name: 📚 Checkout
@@ -60,5 +60,3 @@ jobs:
 echo "Testing ${dir} now"
 cd "${PWD}/examples/${dir}"
 bash test_ci.sh
-env:
-NCCL_SHM_DISABLE: 1
4 changes: 1 addition & 3 deletions .github/workflows/example_check_on_pr.yml
@@ -78,7 +78,7 @@ jobs:
 matrix: ${{fromJson(needs.detect-changed-example.outputs.matrix)}}
 container:
 image: hpcaitech/pytorch-cuda:2.1.0-12.1.0
-options: --gpus all --rm -v /data/scratch/examples-data:/data/
+options: --gpus all --rm -v /data/scratch/examples-data:/data/ -v /dev/shm
 timeout-minutes: 20
 concurrency:
 group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}-run-example-${{ matrix.directory }}
@@ -95,5 +95,3 @@ jobs:
 example_dir=${{ matrix.directory }}
 cd "${PWD}/examples/${example_dir}"
 bash test_ci.sh
-env:
-NCCL_SHM_DISABLE: 1
3 changes: 1 addition & 2 deletions .github/workflows/example_check_on_schedule.yml
@@ -35,6 +35,7 @@ jobs:
 matrix: ${{fromJson(needs.matrix_preparation.outputs.matrix)}}
 container:
 image: hpcaitech/pytorch-cuda:2.1.0-12.1.0
+options: --gpus all --rm -v /data/scratch/examples-data:/data/ -v /dev/shm
 timeout-minutes: 10
 steps:
 - name: 📚 Checkout
@@ -50,8 +51,6 @@ jobs:
 echo "Testing ${example_dir} now"
 cd "${PWD}/examples/${example_dir}"
 bash test_ci.sh
-env:
-NCCL_SHM_DISABLE: 1

 - name: Notify Lark
 id: message-preparation
97 changes: 0 additions & 97 deletions .github/workflows/post_commit.yml

This file was deleted.

29 changes: 19 additions & 10 deletions .github/workflows/run_chatgpt_examples.yml
@@ -19,35 +19,44 @@ jobs:
 runs-on: [self-hosted, gpu]
 container:
 image: hpcaitech/pytorch-cuda:2.1.0-12.1.0
-options: --gpus all --rm -v /data/scratch/github_actions/chat:/data/scratch/github_actions/chat --shm-size=10.24gb
-timeout-minutes: 30
+options: --gpus all --rm -v /data/scratch/examples-data:/data/scratch/examples-data --shm-size=10.24gb
+timeout-minutes: 60
 defaults:
 run:
 shell: bash
 steps:
 - name: Checkout ColossalAI
 uses: actions/checkout@v2

+- name: Install Colossal-AI
+run: |
+BUILD_EXT=1 pip install -v -e .
+
 - name: Install ChatGPT
 run: |
-cd applications/Chat
+cd applications/ColossalChat
 pip install -v .
 export BUILD_EXT=1
 pip install -r examples/requirements.txt

 - name: Install Transformers
 run: |
-pip install transformers==4.30.2
+pip install transformers==4.34.1

 - name: Execute Examples
 run: |
-cd applications/Chat
+cd applications/ColossalChat
 rm -rf ~/.cache/colossalai
-./tests/test_inference.sh
-./tests/test_benchmarks.sh
+mkdir models
+mkdir sft_data
+mkdir prompt_data
+mkdir preference_data
+./tests/test_data_preparation.sh
+./tests/test_train.sh
 env:
 NCCL_SHM_DISABLE: 1
 MAX_JOBS: 8
-SFT_DATASET: /data/scratch/github_actions/chat/data.json
-PROMPT_DATASET: /data/scratch/github_actions/chat/prompts_en.jsonl
-PRETRAIN_DATASET: /data/scratch/github_actions/chat/alpaca_data.json
+PRETRAINED_MODEL_PATH: ./models
+SFT_DATASET: ./sft_data
+PROMPT_DATASET: ./prompt_data
+PREFERENCE_DATASET: ./preference_data
10 changes: 6 additions & 4 deletions .github/workflows/run_chatgpt_unit_tests.yml
@@ -21,7 +21,7 @@ jobs:
 runs-on: [self-hosted, gpu]
 container:
 image: hpcaitech/pytorch-cuda:2.1.0-12.1.0
-options: --gpus all --rm -v /data/scratch/chatgpt:/data/scratch/chatgpt
+options: --gpus all --rm -v /data/scratch/examples-data:/data/scratch/examples-data
 timeout-minutes: 30
 defaults:
 run:
@@ -32,15 +32,17 @@

 - name: Install ChatGPT
 run: |
-cd applications/Chat
+cd applications/ColossalChat
 pip install -v .
 pip install -r requirements-test.txt
+pip install pytest

 - name: Execute Unit Testing
 run: |
-cd applications/Chat
+cd applications/ColossalChat
 rm -rf ~/.cache/colossalai
-pytest tests/
+cd ./tests
+./test_templating.sh
 env:
 NCCL_SHM_DISABLE: 1
 MAX_JOBS: 8
4 changes: 4 additions & 0 deletions .gitignore
@@ -159,3 +159,7 @@ coverage.xml
 # ignore testmon and coverage files
 .coverage
 .testmondata*
+
+# log, test files - ColossalChat
+applications/ColossalChat/logs
+applications/ColossalChat/tests/logs
14 changes: 14 additions & 0 deletions README.md
@@ -25,6 +25,7 @@
 </div>

 ## Latest News
+* [2024/03] [314 Billion Parameter Grok-1 Inference Accelerated by 3.8x, Efficient and Easy-to-Use PyTorch+HuggingFace version is Here](https://hpc-ai.com/blog/314-billion-parameter-grok-1-inference-accelerated-by-3.8x-efficient-and-easy-to-use-pytorchhuggingface-version-is-here)
 * [2024/03] [Open-Sora: Revealing Complete Model Parameters, Training Details, and Everything for Sora-like Video Generation Models](https://hpc-ai.com/blog/open-sora-v1.0)
 * [2024/03] [Open-Sora:Sora Replication Solution with 46% Cost Reduction, Sequence Expansion to Nearly a Million](https://hpc-ai.com/blog/open-sora)
 * [2024/01] [Inference Performance Improved by 46%, Open Source Solution Breaks the Length Limit of LLM for Multi-Round Conversations](https://hpc-ai.com/blog/Colossal-AI-SwiftInfer)
@@ -72,6 +73,7 @@
 <li>
 <a href="#Inference">Inference</a>
 <ul>
+<li><a href="#Grok-1">Grok-1: 314B model of PyTorch + HuggingFace Inference</a></li>
 <li><a href="#SwiftInfer">SwiftInfer:Breaks the Length Limit of LLM for Multi-Round Conversations with 46% Acceleration</a></li>
 <li><a href="#GPT-3-Inference">GPT-3</a></li>
 <li><a href="#OPT-Serving">OPT-175B Online Serving for Text Generation</a></li>
@@ -365,6 +367,18 @@ Please visit our [documentation](https://www.colossalai.org/) and [examples](htt


 ## Inference
+### Grok-1
+<p id="Grok-1" align="center">
+<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/examples/images/grok-1-inference.jpg" width=600/>
+</p>
+
+- 314 Billion Parameter Grok-1 Inference Accelerated by 3.8x, an easy-to-use Python + PyTorch + HuggingFace version for Inference.
+
+[[code]](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/grok-1)
+[[blog]](https://hpc-ai.com/blog/314-billion-parameter-grok-1-inference-accelerated-by-3.8x-efficient-and-easy-to-use-pytorchhuggingface-version-is-here)
+[[HuggingFace Grok-1 PyTorch model weights]](https://huggingface.co/hpcai-tech/grok-1)
+[[ModelScope Grok-1 PyTorch model weights]](https://www.modelscope.cn/models/colossalai/grok-1-pytorch/summary)
+
 <p id="SwiftInfer" align="center">
 <img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/SwiftInfer.jpg" width=800/>
 </p>
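The Grok-1 weights added to the README above load through the standard HuggingFace API. The following is a minimal sketch, not the example shipped in `examples/language/grok-1`; it assumes the `hpcai-tech/grok-1` repository ships custom modeling code (hence `trust_remote_code=True`), and actually running generation requires multiple high-memory GPUs for the 314B parameters.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# HuggingFace weights linked in the README section above.
MODEL_ID = "hpcai-tech/grok-1"


def generate(prompt: str, max_new_tokens: int = 64) -> str:
    """Sketch: load Grok-1 and greedily generate a continuation."""
    # trust_remote_code=True is assumed necessary because the checkpoint
    # repo provides its own PyTorch modeling code.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,
        device_map="auto",  # shard the 314B parameters across visible GPUs
        trust_remote_code=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A call such as `generate("The capital of France is")` would download the checkpoint on first use; the repository's own inference script remains the authoritative entry point.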
38 changes: 0 additions & 38 deletions applications/Chat/benchmarks/README.md

This file was deleted.
