Merged
Changes from all commits (54 commits)
2bdf76f
fix typo change lazy_iniy to lazy_init (#5099)
digger-yu Nov 24, 2023
d5661f0
[nfc] fix typo change directoty to directory (#5111)
digger-yu Nov 27, 2023
7b789f4
[FEATURE] Add Safety Eval Datasets to ColossalEval (#5095)
Orion-Zheng Nov 27, 2023
126cf18
[hotfix] fixed memory usage of shardformer module replacement (#5122)
kurisusnowdeng Nov 28, 2023
7172459
[shardformer]: support gpt-j, falcon, Mistral and add interleaved pip…
cwher Nov 28, 2023
177c79f
[doc] add moe news (#5128)
binmakeswell Nov 28, 2023
2899cfd
[doc] updated paper citation (#5131)
FrankLeeeee Nov 29, 2023
9110406
fix typo change JOSNL TO JSONL etc. (#5116)
digger-yu Nov 29, 2023
d10ee42
[format] applied code formatting on changed files in pull request 508…
github-actions[bot] Nov 29, 2023
9b36640
[format] applied code formatting on changed files in pull request 512…
github-actions[bot] Nov 29, 2023
f6731db
[format] applied code formatting on changed files in pull request 511…
github-actions[bot] Nov 29, 2023
2a2ec49
[plugin]fix 3d checkpoint load when booster boost without optimizer. …
flybird11111 Nov 30, 2023
c7fd9a5
[ColossalQA] refactor server and webui & add new feature (#5138)
MichelleMa8 Nov 30, 2023
368b5e3
[doc] fix colossalqa document (#5146)
MichelleMa8 Dec 1, 2023
3dbbf83
fix (#5158)
flybird11111 Dec 5, 2023
b397104
[Colossal-Llama-2] Add finetuning Colossal-Llama-2 example (#4878)
chengeharrison Dec 7, 2023
21aa5de
[gemini] hotfix NaN loss while using Gemini + tensor_parallel (#5150)
flybird11111 Dec 8, 2023
b07a6f4
[colossalqa] fix pangu api (#5170)
MichelleMa8 Dec 11, 2023
cefdc32
[ColossalEval] Support GSM, Data Leakage Evaluation and Tensor Parall…
chengeharrison Dec 12, 2023
79718fa
[shardformer] llama support DistCrossEntropy (#5176)
flybird11111 Dec 12, 2023
3ff60d1
Fix ColossalEval (#5186)
chengeharrison Dec 15, 2023
681d9b1
[doc] update pytorch version in documents. (#5177)
flybird11111 Dec 15, 2023
af95267
polish readme in application/chat (#5194)
ht-zhou Dec 20, 2023
4fa689f
[pipeline]: fix p2p comm, add metadata cache and support llama interl…
cwher Dec 22, 2023
eae01b6
Improve logic for selecting metrics (#5196)
chengeharrison Dec 22, 2023
64519eb
[doc] Update required third-party library list for testing and torch …
KKZ20 Dec 27, 2023
02d2328
support linear accumulation fusion (#5199)
flybird11111 Dec 29, 2023
3c0d82b
[pipeline]: support arbitrary batch size in forward_only mode (#5201)
cwher Jan 2, 2024
d799a30
[pipeline]: add p2p fallback order and fix interleaved pp deadlock (#…
cwher Jan 3, 2024
7f3400b
[devops] update torch versoin in ci (#5217)
ver217 Jan 3, 2024
365671b
fix-test (#5210)
flybird11111 Jan 3, 2024
451e914
fix flash attn (#5209)
flybird11111 Jan 3, 2024
b0b53a1
[nfc] fix typo colossalai/shardformer/ (#5133)
digger-yu Jan 4, 2024
d992b55
[Colossal-LLaMA-2] Release Colossal-LLaMA-2-13b-base model (#5224)
TongLi3701 Jan 5, 2024
915b465
[doc] Update README.md of Colossal-LLAMA2 (#5233)
Camille7777 Jan 6, 2024
ce65127
[doc] Make leaderboard format more uniform and good-looking (#5231)
zhimin-z Jan 6, 2024
b9b32b1
[doc] add Colossal-LLaMA-2-13B (#5234)
binmakeswell Jan 7, 2024
4fb4a22
[format] applied code formatting on changed files in pull request 523…
github-actions[bot] Jan 7, 2024
7bc6969
[doc] SwiftInfer release (#5236)
binmakeswell Jan 8, 2024
d565df3
[pipeline] A more general _communicate in p2p (#5062)
zeyugao Jan 8, 2024
41e52c1
[doc] fix typo in Colossal-LLaMA-2/README.md (#5247)
digger-yu Jan 10, 2024
edf94a3
[workflow] fixed build CI (#5240)
FrankLeeeee Jan 10, 2024
d5eeeb1
[ci] fixed booster test (#5251)
FrankLeeeee Jan 11, 2024
2b83418
[ci] fixed ddp test (#5254)
FrankLeeeee Jan 11, 2024
756c400
fix typo in applications/ColossalEval/README.md (#5250)
digger-yu Jan 11, 2024
e830ef9
[ci] fix shardformer tests. (#5255)
flybird11111 Jan 11, 2024
c174c4f
[doc] fix doc typo (#5256)
binmakeswell Jan 11, 2024
ef4f0ee
[hotfix]: add pp sanity check and fix mbs arg (#5268)
cwher Jan 15, 2024
04244aa
[workflow] fixed incomplete bash command (#5272)
FrankLeeeee Jan 16, 2024
d69cd2e
[workflow] fixed oom tests (#5275)
FrankLeeeee Jan 16, 2024
2a0558d
[ci] fix test_hybrid_parallel_plugin_checkpoint_io.py (#5276)
flybird11111 Jan 17, 2024
46e0916
[shardformer] hybridparallelplugin support gradients accumulation. (#…
flybird11111 Jan 17, 2024
5d9a0ae
[hotfix] Fix ShardFormer test execution path when using sequence para…
KKZ20 Jan 17, 2024
1484693
Merge branch 'main' into sync/npu
ver217 Jan 18, 2024
3 changes: 1 addition & 2 deletions .compatibility
@@ -1,3 +1,2 @@
1.12.0-11.3.0
1.13.0-11.6.0
2.0.0-11.7.0
2.1.0-11.8.0
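The `.compatibility` file appears to list the PyTorch–CUDA version pairs (one `torch-cuda` pair per line) that the compatibility-testing workflow builds its image matrix from; this change drops the 1.12.0 and 1.13.0 entries and adds 2.1.0 with CUDA 11.8. As a minimal sketch (an assumption about the consumer, which is not part of this diff), each pair can be split and mapped onto an `hpcaitech/pytorch-cuda` image tag like so:

```bash
#!/usr/bin/env bash
# Hypothetical illustration of how a CI job might consume .compatibility;
# the actual consuming workflow is not shown in this PR.
while read -r pair; do
  torch_version="${pair%%-*}"   # text before the first dash, e.g. 2.1.0
  cuda_version="${pair#*-}"     # text after the first dash, e.g. 11.8.0
  echo "torch ${torch_version} + CUDA ${cuda_version} -> hpcaitech/pytorch-cuda:${pair}"
done < .compatibility
```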
138 changes: 15 additions & 123 deletions .github/workflows/build_on_pr.yml
@@ -22,57 +22,6 @@ on:
delete:

jobs:
prepare_cache:
name: Prepare testmon cache
if: |
github.event_name == 'create' &&
github.event.ref_type == 'branch' &&
github.event.repository.full_name == 'hpcaitech/ColossalAI'
runs-on: [self-hosted, gpu]
container:
image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
options: --rm
timeout-minutes: 5
defaults:
run:
shell: bash
steps:
- name: Copy testmon cache
run: | # branch name may contain slash, we need to replace it with space
export REF_BRANCH=$(echo ${{ github.event.ref }} | sed "s/\// /")
if [ -d /github/home/testmon_cache/${MAIN_BRANCH} ]; then
cp -p -r /github/home/testmon_cache/${MAIN_BRANCH} "/github/home/testmon_cache/${REF_BRANCH}"
fi
env:
MAIN_BRANCH: ${{ github.event.master_branch }}

prepare_cache_for_pr:
name: Prepare testmon cache for PR
if: |
github.event_name == 'pull_request' &&
(github.event.action == 'opened' || github.event.action == 'reopened' || (github.event.action == 'edited' && github.event.changes.base != null)) &&
github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI'
runs-on: [self-hosted, gpu]
container:
image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
options: --rm
timeout-minutes: 5
defaults:
run:
shell: bash
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}-repare-cache
cancel-in-progress: true
steps:
- name: Copy testmon cache
run: | # branch name may contain slash, we need to replace it with space
export BASE=$(echo ${{ github.event.pull_request.base.ref }} | sed "s/\// /")
if [ -d "/github/home/testmon_cache/${BASE}" ] && [ ! -z "$(ls -A "/github/home/testmon_cache/${BASE}")" ]; then
mkdir -p /github/home/testmon_cache/_pull/${PR_NUMBER} && cp -p -r "/github/home/testmon_cache/${BASE}"/.testmondata* /github/home/testmon_cache/_pull/${PR_NUMBER}
fi
env:
PR_NUMBER: ${{ github.event.number }}

detect:
name: Detect file change
if: |
@@ -140,8 +89,8 @@ jobs:
if: needs.detect.outputs.anyLibraryFileChanged == 'true'
runs-on: [self-hosted, gpu]
container:
image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
options: --gpus all --rm -v /data/scratch/cifar-10:/data/scratch/cifar-10 -v /data/scratch/llama-tiny:/data/scratch/llama-tiny
image: hpcaitech/pytorch-cuda:2.1.0-12.1.0
options: --gpus all --rm -v /dev/shm -v /data/scratch/llama-tiny:/data/scratch/llama-tiny
timeout-minutes: 60
defaults:
run:
@@ -174,6 +123,7 @@ jobs:
run: |
cd TensorNVMe
cp -p -r ./build /github/home/tensornvme_cache/
cp -p -r ./cmake-build /github/home/tensornvme_cache/

- name: Checkout Colossal-AI
uses: actions/checkout@v2
@@ -198,31 +148,24 @@ jobs:
# -p flag is required to preserve the file timestamp to avoid ninja rebuild
cp -p -r /__w/ColossalAI/ColossalAI/build /github/home/cuda_ext_cache/

- name: Restore Testmon Cache
run: |
if [ -d /github/home/testmon_cache/_pull/${PR_NUMBER} ] && [ ! -z "$(ls -A /github/home/testmon_cache/_pull/${PR_NUMBER})" ]; then
cp -p -r /github/home/testmon_cache/_pull/${PR_NUMBER}/.testmondata* /__w/ColossalAI/ColossalAI/
fi
env:
PR_NUMBER: ${{ github.event.number }}

- name: Execute Unit Testing
run: |
CURL_CA_BUNDLE="" PYTHONPATH=$PWD pytest -m "not largedist" --testmon --testmon-forceselect --testmon-cov=. --durations=10 tests/
CURL_CA_BUNDLE="" PYTHONPATH=$PWD FAST_TEST=1 pytest \
-m "not largedist" \
--durations=0 \
--ignore tests/test_analyzer \
--ignore tests/test_auto_parallel \
--ignore tests/test_fx \
--ignore tests/test_autochunk \
--ignore tests/test_gptq \
--ignore tests/test_infer_ops \
--ignore tests/test_legacy \
--ignore tests/test_smoothquant \
tests/
env:
DATA: /data/scratch/cifar-10
NCCL_SHM_DISABLE: 1
LD_LIBRARY_PATH: /github/home/.tensornvme/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
TESTMON_CORE_PKGS: /__w/ColossalAI/ColossalAI/requirements/requirements.txt,/__w/ColossalAI/ColossalAI/requirements/requirements-test.txt
LLAMA_PATH: /data/scratch/llama-tiny

- name: Store Testmon Cache
run: |
mkdir -p /github/home/testmon_cache/_pull/${PR_NUMBER}
cp -p -r /__w/ColossalAI/ColossalAI/.testmondata* /github/home/testmon_cache/_pull/${PR_NUMBER}/
env:
PR_NUMBER: ${{ github.event.number }}

- name: Collate artifact
env:
PR_NUMBER: ${{ github.event.number }}
@@ -259,54 +202,3 @@ jobs:
with:
name: report
path: report/

store_cache:
name: Store testmon cache for PR
if: |
github.event_name == 'pull_request' &&
github.event.action == 'closed' &&
github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI'
runs-on: [self-hosted, gpu]
container:
image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
options: --rm
timeout-minutes: 5
defaults:
run:
shell: bash
steps:
- name: Store testmon cache if possible
if: github.event.pull_request.merged == true
run: | # branch name may contain slash, we need to replace it with space
export BASE=$(echo ${{ github.event.pull_request.base.ref }} | sed "s/\// /")
if [ -d /github/home/testmon_cache/_pull/${PR_NUMBER} ] && [ ! -z "$(ls -A /github/home/testmon_cache/_pull/${PR_NUMBER})" ]; then
cp -p -r /github/home/testmon_cache/_pull/${PR_NUMBER}/.testmondata* "/github/home/testmon_cache/${BASE}/"
fi
env:
PR_NUMBER: ${{ github.event.pull_request.number }}

- name: Remove testmon cache
run: |
rm -rf /github/home/testmon_cache/_pull/${PR_NUMBER}
env:
PR_NUMBER: ${{ github.event.pull_request.number }}

remove_cache:
name: Remove testmon cache
if: |
github.event_name == 'delete' &&
github.event.ref_type == 'branch' &&
github.event.repository.full_name == 'hpcaitech/ColossalAI'
runs-on: [self-hosted, gpu]
container:
image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
options: --rm
timeout-minutes: 5
defaults:
run:
shell: bash
steps:
- name: Remove testmon cache
run: | # branch name may contain slash, we need to replace it with space
export BASE=$(echo ${{ github.event.ref }} | sed "s/\// /")
rm -rf "/github/home/testmon_cache/${BASE}"
23 changes: 14 additions & 9 deletions .github/workflows/build_on_schedule.yml
@@ -10,20 +10,22 @@ jobs:
build:
name: Build and Test Colossal-AI
if: github.repository == 'hpcaitech/ColossalAI'
runs-on: [self-hosted, 8-gpu]
runs-on: [self-hosted, gpu]
container:
image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
options: --gpus all --rm -v /data/scratch/cifar-10:/data/scratch/cifar-10 -v /data/scratch/llama-tiny:/data/scratch/llama-tiny
timeout-minutes: 40
image: hpcaitech/pytorch-cuda:2.0.0-11.7.0
options: --gpus all --rm -v /dev/shm -v /data/scratch/llama-tiny:/data/scratch/llama-tiny
timeout-minutes: 90
steps:
- name: Check GPU Availability # ensure all GPUs have enough memory
id: check-avai
run: |
avai=true
for i in $(seq 0 7);
ngpu=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
endIndex=$(($ngpu-1))
for i in $(seq 0 $endIndex);
do
gpu_used=$(nvidia-smi -i $i --query-gpu=memory.used --format=csv,noheader,nounits)
[ "$gpu_used" -gt "10000" ] && avai=false
[ "$gpu_used" -gt "2000" ] && avai=false
done

echo "GPU is available: $avai"
@@ -60,9 +62,12 @@ jobs:
- name: Unit Testing
if: steps.check-avai.outputs.avai == 'true'
run: |
PYTHONPATH=$PWD pytest --durations=0 tests
PYTHONPATH=$PWD pytest \
-m "not largedist" \
--durations=0 \
tests/
env:
DATA: /data/scratch/cifar-10
NCCL_SHM_DISABLE: 1
LD_LIBRARY_PATH: /github/home/.tensornvme/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
LLAMA_PATH: /data/scratch/llama-tiny

@@ -71,7 +76,7 @@
if: ${{ failure() }}
run: |
url=$SERVER_URL/$REPO/actions/runs/$RUN_ID
msg="Scheduled Build and Test failed on 8 GPUs, please visit $url for details"
msg="Scheduled Build and Test failed, please visit $url for details"
echo $msg
python .github/workflows/scripts/send_message_to_lark.py -m "$msg" -u $WEBHOOK_URL
env:
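For readability, the revised GPU availability check in build_on_schedule.yml can be reconstructed from the hunk above as the following sketch: rather than hard-coding eight GPUs, it now counts the visible devices and marks the runner unavailable if any device already has more than about 2 GB of memory in use.

```bash
# Sketch reconstructed from the diff above (not the verbatim workflow step).
avai=true
ngpu=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)   # number of visible GPUs
endIndex=$(($ngpu-1))
for i in $(seq 0 $endIndex); do
  gpu_used=$(nvidia-smi -i $i --query-gpu=memory.used --format=csv,noheader,nounits)
  [ "$gpu_used" -gt "2000" ] && avai=false                          # > ~2 GB already in use
done
echo "GPU is available: $avai"
```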
2 changes: 1 addition & 1 deletion .github/workflows/doc_test_on_pr.yml
@@ -56,7 +56,7 @@ jobs:
needs: detect-changed-doc
runs-on: [self-hosted, gpu]
container:
image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
image: hpcaitech/pytorch-cuda:2.0.0-11.7.0
options: --gpus all --rm
timeout-minutes: 20
defaults:
2 changes: 1 addition & 1 deletion .github/workflows/doc_test_on_schedule.yml
@@ -12,7 +12,7 @@ jobs:
name: Test the changed Doc
runs-on: [self-hosted, gpu]
container:
image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
image: hpcaitech/pytorch-cuda:2.1.0-12.1.0
options: --gpus all --rm
timeout-minutes: 60
steps:
2 changes: 1 addition & 1 deletion .github/workflows/example_check_on_dispatch.yml
@@ -45,7 +45,7 @@ jobs:
fail-fast: false
matrix: ${{fromJson(needs.manual_check_matrix_preparation.outputs.matrix)}}
container:
image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
image: hpcaitech/pytorch-cuda:2.0.0-11.7.0
options: --gpus all --rm -v /data/scratch/examples-data:/data/
timeout-minutes: 15
steps:
4 changes: 2 additions & 2 deletions .github/workflows/example_check_on_pr.yml
@@ -77,9 +77,9 @@ jobs:
fail-fast: false
matrix: ${{fromJson(needs.detect-changed-example.outputs.matrix)}}
container:
image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
image: hpcaitech/pytorch-cuda:2.0.0-11.7.0
options: --gpus all --rm -v /data/scratch/examples-data:/data/
timeout-minutes: 15
timeout-minutes: 20
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}-run-example-${{ matrix.directory }}
cancel-in-progress: true
4 changes: 2 additions & 2 deletions .github/workflows/example_check_on_schedule.yml
@@ -34,8 +34,8 @@ jobs:
fail-fast: false
matrix: ${{fromJson(needs.matrix_preparation.outputs.matrix)}}
container:
image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
timeout-minutes: 15
image: hpcaitech/pytorch-cuda:2.0.0-11.7.0
timeout-minutes: 10
steps:
- name: 📚 Checkout
uses: actions/checkout@v3