Merged

Ra #48

59 commits
ad6460c
[NFC] fix typo applications/ and colossalai/ (#3735)
digger-yu May 15, 2023
b37797e
[booster] support torch fsdp plugin in booster (#3697)
wukong1992 May 15, 2023
afb239b
[devops] update torch version of CI (#3725)
ver217 May 15, 2023
6050f37
[booster] removed models that don't support fsdp (#3744)
wukong1992 May 15, 2023
7386c66
[fix] Add init to fix import error when importing _analyzer (#3668)
Wesley-Jzy May 16, 2023
1baeb39
[NFC] fix typo with colossalai/auto_parallel/tensor_shard (#3742)
digger-yu May 17, 2023
c03bd7c
[devops] make build on PR run automatically (#3748)
ver217 May 17, 2023
5dd573c
[devops] fix ci for document check (#3751)
ver217 May 17, 2023
0575983
[chat] fix bugs in stage 3 training (#3759)
chengeharrison May 17, 2023
d449525
[doc] update booster tutorials (#3718)
flybird11111 May 18, 2023
15024e4
[auto] fix install cmd (#3772)
binmakeswell May 18, 2023
48bd056
[doc] update hybrid parallelism doc (#3770)
flybird11111 May 18, 2023
2703a37
[amp] Add naive amp demo (#3774)
flybird11111 May 18, 2023
5452df6
[plugin] torch ddp plugin supports sharded model checkpoint (#3775)
ver217 May 18, 2023
5ce6c9d
[doc] add tutorial for cluster utils (#3763)
ver217 May 19, 2023
21e29e2
[doc] add tutorial for booster plugins (#3758)
ver217 May 19, 2023
32f81f1
[NFC] fix typo colossalai/amp auto_parallel autochunk (#3756)
digger-yu May 19, 2023
b4788d6
[devops] fix doc test on pr (#3782)
ver217 May 19, 2023
ad2cf58
[chat] add performance and tutorial (#3786)
binmakeswell May 19, 2023
60e6a15
[doc] add tutorial for booster checkpoint (#3785)
ver217 May 19, 2023
3c07a28
[plugin] a workaround for zero plugins' optimizer checkpoint (#3780)
ver217 May 19, 2023
72688ad
[doc] add booster docstring and fix autodoc (#3789)
ver217 May 22, 2023
d9393b8
[doc] add deprecated warning on doc Basics section (#3754)
Yanjia0 May 22, 2023
fe1561a
[doc] update gradient cliping document (#3778)
flybird11111 May 22, 2023
62c7e67
[format] applied code formatting on changed files in pull request 378…
github-actions[bot] May 22, 2023
4d29c0f
Fix/docker action (#3266)
liuzeming-yuxi May 22, 2023
788e07d
[workflow] fixed the docker build workflow (#3794)
FrankLeeeee May 22, 2023
f5c425c
fixed the example docstring for booster (#3795)
FrankLeeeee May 22, 2023
ef02d7e
[doc] update gradient accumulation (#3771)
flybird11111 May 23, 2023
ad93c73
[workflow] enable testing for develop & feature branch (#3801)
FrankLeeeee May 23, 2023
615e2e5
[test] fixed lazy init test import error (#3799)
FrankLeeeee May 23, 2023
e871e34
[API] add docstrings and initialization to apex amp, naive amp (#3783)
flybird11111 May 23, 2023
9265f2d
[NFC]fix typo colossalai/auto_parallel nn utils etc. (#3779)
digger-yu May 23, 2023
8c62e50
[doc] update amp document
flybird11111 May 23, 2023
1167bf5
[doc] update amp document
flybird11111 May 23, 2023
a520610
[doc] update amp document
flybird11111 May 23, 2023
75272ef
[doc] add removed warning
flybird11111 May 23, 2023
c425a69
[doc] add removed change of config.py
flybird11111 May 23, 2023
6b305a9
[booster] torch fsdp fix ckpt (#3788)
wukong1992 May 23, 2023
19d1530
[doc] add warning about fsdp plugin (#3813)
ver217 May 23, 2023
1e3b64f
[workflow] enblaed doc build from a forked repo (#3815)
FrankLeeeee May 23, 2023
8aa1fb2
[doc]fix
flybird11111 May 23, 2023
278fcbc
[doc]fix
flybird11111 May 23, 2023
725365f
Merge pull request #3810 from jiangmingyan/amp
flybird11111 May 23, 2023
7f8203a
fix typo colossalai/auto_parallel autochunk fx/passes etc. (#3808)
digger-yu May 24, 2023
269150b
[Docker] Fix a couple of build issues (#3691)
ymwangg May 24, 2023
05b8a8d
[workflow] changed to doc build to be on schedule and release (#3825)
FrankLeeeee May 24, 2023
3496637
[evaluation] add automatic evaluation pipeline (#3821)
chengeharrison May 24, 2023
e90fdb1
fix typo docs/
digger-yu May 24, 2023
518b31c
[docs] change placememt_policy to placement_policy (#3829)
digger-yu May 24, 2023
84500b7
[workflow] fixed testmon cache in build CI (#3806)
FrankLeeeee May 24, 2023
3229f93
[booster] add warning for torch fsdp plugin doc (#3833)
wukong1992 May 25, 2023
54e97ed
[workflow] supported test on CUDA 10.2 (#3841)
FrankLeeeee May 25, 2023
a64df3f
[doc] update document of gemini instruction. (#3842)
flybird11111 May 25, 2023
e2d81eb
[nfc] fix typo colossalai/ applications/ (#3831)
digger-yu May 25, 2023
d42b1be
[release] bump to v0.3.0 (#3830)
FrankLeeeee May 25, 2023
ae959a7
[workflow] fixed workflow check for docker build (#3849)
FrankLeeeee May 25, 2023
b047487
[doc] update nvme offload documents. (#3850)
flybird11111 May 25, 2023
815f9bb
Merge pull request #47 from hpcaitech/main
jamesthesnake May 27, 2023
4 changes: 4 additions & 0 deletions .coveragerc
@@ -0,0 +1,4 @@
[run]
concurrency = multiprocessing
parallel = true
sigterm = true
6 changes: 3 additions & 3 deletions .github/workflows/README.md
@@ -14,7 +14,7 @@
- [Compatibility Test on Dispatch](#compatibility-test-on-dispatch)
- [Release](#release)
- [User Friendliness](#user-friendliness)
- [Commmunity](#commmunity)
- [Community](#community)
- [Configuration](#configuration)
- [Progress Log](#progress-log)

@@ -43,7 +43,7 @@ I will provide the details of each workflow below.

| Workflow Name | File name | Description |
| ---------------------- | -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `Build on PR` | `build_on_pr.yml` | This workflow is triggered when the label `Run build and Test` is assigned to a PR. It will run all the unit tests in the repository with 4 GPUs. |
+| `Build on PR` | `build_on_pr.yml` | This workflow is triggered when a PR changes essential files. It will run all the unit tests in the repository with 4 GPUs. |
| `Build on Schedule` | `build_on_schedule.yml` | This workflow will run the unit tests everyday with 8 GPUs. The result is sent to Lark. |
| `Report test coverage` | `report_test_coverage.yml` | This PR will put up a comment to report the test coverage results when `Build` is done. |

@@ -97,7 +97,7 @@ This workflow is triggered by manually dispatching the workflow. It has the foll
| `Synchronize submodule` | `submodule.yml` | This workflow will check if any git submodule is updated. If so, it will create a PR to update the submodule pointers. |
| `Close inactive issues` | `close_inactive.yml` | This workflow will close issues which are stale for 14 days. |

-### Commmunity
+### Community

| Workflow Name | File name | Description |
| -------------------------------------------- | -------------------------------- | -------------------------------------------------------------------------------- |
43 changes: 33 additions & 10 deletions .github/workflows/build_on_pr.yml
@@ -2,16 +2,29 @@ name: Build on PR

on:
pull_request:
types: [synchronize, labeled]
types: [synchronize, opened, reopened]
branches:
- "main"
- "develop"
- "feature/**"
paths:
- ".github/workflows/build_on_pr.yml" # run command & env variables change
- "colossalai/**" # source code change
- "!colossalai/**.md" # ignore doc change
- "op_builder/**" # cuda extension change
- "!op_builder/**.md" # ignore doc change
- "requirements/**" # requirements change
- "tests/**" # test change
- "!tests/**.md" # ignore doc change
- "pytest.ini" # test config change
- "setup.py" # install command change
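Note on the `paths` filter above: GitHub evaluates the patterns in order, and a later match overrides an earlier one, so a `!` pattern placed after a positive pattern excludes the files it matches. The sketch below imitates that rule in bash for illustration only. The function name `should_trigger` is hypothetical, and bash globbing only approximates the Actions matcher (here `*` and `**` both match across `/`).

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints "yes" if a changed path would trigger the
# workflow under the filter list from this diff, otherwise "no".
should_trigger() {
    local path="$1"
    local patterns=(
        '.github/workflows/build_on_pr.yml'
        'colossalai/**'
        '!colossalai/**.md'
        'op_builder/**'
        '!op_builder/**.md'
        'requirements/**'
        'tests/**'
        '!tests/**.md'
        'pytest.ini'
        'setup.py'
    )
    local result="no" pat glob
    for pat in "${patterns[@]}"; do
        if [[ "$pat" == '!'* ]]; then
            # Negated pattern: a match excludes the path again.
            glob="${pat:1}"
            if [[ "$path" == $glob ]]; then
                result="no"
            fi
        else
            # Positive pattern: a match includes the path.
            if [[ "$path" == $pat ]]; then
                result="yes"
            fi
        fi
    done
    echo "$result"
}
```

For example, `should_trigger colossalai/nn/layer.py` prints `yes`, while a Markdown file under `colossalai/` is excluded by the negated pattern that follows.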

jobs:
detect:
name: Detect file change
if: |
github.event.pull_request.draft == false &&
github.base_ref == 'main' &&
github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI' &&
contains( github.event.pull_request.labels.*.name, 'Run Build and Test')
github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI'
outputs:
changedExtenisonFiles: ${{ steps.find-extension-change.outputs.all_changed_files }}
anyExtensionFileChanged: ${{ steps.find-extension-change.outputs.any_changed }}
@@ -66,11 +79,12 @@ jobs:
build:
name: Build and Test Colossal-AI
needs: detect
if: needs.detect.outputs.anyLibraryFileChanged == 'true'
runs-on: [self-hosted, gpu]
container:
-image: hpcaitech/pytorch-cuda:1.11.0-11.3.0
+image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
options: --gpus all --rm -v /data/scratch/cifar-10:/data/scratch/cifar-10
-timeout-minutes: 40
+timeout-minutes: 60
defaults:
run:
shell: bash
@@ -110,7 +124,6 @@ jobs:
[ ! -z "$(ls -A /github/home/cuda_ext_cache/)" ] && cp -p -r /github/home/cuda_ext_cache/* /__w/ColossalAI/ColossalAI/

- name: Install Colossal-AI
if: needs.detect.outputs.anyLibraryFileChanged == 'true'
run: |
CUDA_EXT=1 pip install -v -e .
pip install -r requirements/requirements-test.txt
@@ -120,15 +133,25 @@ jobs:
# -p flag is required to preserve the file timestamp to avoid ninja rebuild
cp -p -r /__w/ColossalAI/ColossalAI/build /github/home/cuda_ext_cache/

- name: Restore Testmon Cache
run: |
if [ -d /github/home/testmon_cache ]; then
[ ! -z "$(ls -A /github/home/testmon_cache)" ] && cp -p -r /github/home/testmon_cache/.testmondata* /__w/ColossalAI/ColossalAI/
fi

- name: Execute Unit Testing
if: needs.detect.outputs.anyLibraryFileChanged == 'true'
run: |
-CURL_CA_BUNDLE="" PYTHONPATH=$PWD pytest --cov=. --cov-report xml tests/
+CURL_CA_BUNDLE="" PYTHONPATH=$PWD pytest --testmon --testmon-cov=. tests/
env:
DATA: /data/scratch/cifar-10
NCCL_SHM_DISABLE: 1
LD_LIBRARY_PATH: /github/home/.tensornvme/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64

- name: Store Testmon Cache
run: |
[ -d /github/home/testmon_cache ] || mkdir /github/home/testmon_cache
cp -p -r /__w/ColossalAI/ColossalAI/.testmondata* /github/home/testmon_cache/
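Note on the two testmon cache steps: the restore step copies any `.testmondata*` files from a persistent directory into the workspace before pytest runs, and the store step copies them back afterwards. A minimal sketch of that round trip follows, with the directories passed as arguments so it can be tried outside CI. The function names `restore_cache` and `store_cache` are hypothetical, and the `compgen -G` existence check is an addition that the workflow itself does not contain.

```shell
#!/usr/bin/env bash
# restore_cache CACHE_DIR WORKSPACE
# Copies cached .testmondata* files into the workspace, if any exist.
restore_cache() {
    local cache_dir="$1" workspace="$2"
    # compgen -G succeeds only when the glob matches at least one file,
    # so a missing or empty cache directory is silently skipped.
    if compgen -G "$cache_dir/.testmondata*" > /dev/null; then
        # -p preserves timestamps, as in the workflow steps.
        cp -p -r "$cache_dir"/.testmondata* "$workspace"/
    fi
}

# store_cache CACHE_DIR WORKSPACE
# Saves the workspace's .testmondata* files for the next run.
store_cache() {
    local cache_dir="$1" workspace="$2"
    mkdir -p "$cache_dir"
    if compgen -G "$workspace/.testmondata*" > /dev/null; then
        cp -p -r "$workspace"/.testmondata* "$cache_dir"/
    fi
}
```

In the workflow, the cache directory would correspond to `/github/home/testmon_cache` and the workspace to `/__w/ColossalAI/ColossalAI`.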

- name: Collate artifact
env:
PR_NUMBER: ${{ github.event.number }}
@@ -140,7 +163,7 @@ jobs:
echo $PR_NUMBER > ./report/pr_number

# generate coverage.xml if any
-if [ "$anyLibraryFileChanged" == "true" ]; then
+if [ "$anyLibraryFileChanged" == "true" ] && [ -e .coverage ]; then
allFiles=""
for file in $changedLibraryFiles; do
if [ "$allFiles" == "" ]; then
2 changes: 1 addition & 1 deletion .github/workflows/build_on_schedule.yml
@@ -12,7 +12,7 @@ jobs:
if: github.repository == 'hpcaitech/ColossalAI'
runs-on: [self-hosted, 8-gpu]
container:
-image: hpcaitech/pytorch-cuda:1.11.0-11.3.0
+image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
options: --gpus all --rm -v /data/scratch/cifar-10:/data/scratch/cifar-10
timeout-minutes: 40
steps:
47 changes: 29 additions & 18 deletions .github/workflows/compatiblity_test_on_dispatch.yml
@@ -19,26 +19,26 @@ jobs:
outputs:
matrix: ${{ steps.set-matrix.outputs.matrix }}
steps:
- id: set-matrix
env:
TORCH_VERSIONS: ${{ inputs.torch_version }}
CUDA_VERSIONS: ${{ inputs.cuda_version }}
run: |
IFS=','
DOCKER_IMAGE=()
- id: set-matrix
env:
TORCH_VERSIONS: ${{ inputs.torch_version }}
CUDA_VERSIONS: ${{ inputs.cuda_version }}
run: |
IFS=','
DOCKER_IMAGE=()

for tv in $TORCH_VERSIONS
do
for cv in $CUDA_VERSIONS
do
DOCKER_IMAGE+=("\"hpcaitech/pytorch-cuda:${tv}-${cv}\"")
done
done
for tv in $TORCH_VERSIONS
do
for cv in $CUDA_VERSIONS
do
DOCKER_IMAGE+=("\"hpcaitech/pytorch-cuda:${tv}-${cv}\"")
done
done

container=$( IFS=',' ; echo "${DOCKER_IMAGE[*]}" )
container="[${container}]"
echo "$container"
echo "::set-output name=matrix::{\"container\":$(echo "$container")}"
container=$( IFS=',' ; echo "${DOCKER_IMAGE[*]}" )
container="[${container}]"
echo "$container"
echo "::set-output name=matrix::{\"container\":$(echo "$container")}"
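Note on the `set-matrix` step: it splits the two comma-separated inputs, forms every torch/CUDA pair as a quoted image tag, and joins the tags into a JSON array for the job matrix. The sketch below wraps the same logic in a function so it can be run locally. The name `build_matrix` is hypothetical, and the sample versions in the usage note are illustrative inputs, not values taken from the PR.

```shell
#!/usr/bin/env bash
# build_matrix TORCH_VERSIONS CUDA_VERSIONS
# Both arguments are comma-separated lists; prints the matrix JSON.
build_matrix() {
    local torch_versions="$1" cuda_versions="$2"
    local images=() tv cv
    # A function-local IFS keeps the comma splitting from leaking out.
    local IFS=','
    for tv in $torch_versions; do
        for cv in $cuda_versions; do
            images+=("\"hpcaitech/pytorch-cuda:${tv}-${cv}\"")
        done
    done
    # "${images[*]}" joins elements with the first character of IFS (a comma).
    echo "{\"container\":[${images[*]}]}"
}
```

Calling `build_matrix "1.12.1,1.13.0" "11.3.0"` prints a JSON object whose `container` array holds one image tag per pair. GitHub has deprecated the `::set-output` command used in the step; the documented replacement is to append to the file named by `GITHUB_OUTPUT`, for example `echo "matrix=$(build_matrix "$TORCH_VERSIONS" "$CUDA_VERSIONS")" >> "$GITHUB_OUTPUT"`.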

build:
name: Test for PyTorch Compatibility
@@ -70,6 +70,17 @@ jobs:
- uses: actions/checkout@v2
with:
ssh-key: ${{ secrets.SSH_KEY_FOR_CI }}
- name: Download cub for CUDA 10.2
run: |
CUDA_VERSION=$(cat $CUDA_HOME/version.txt | grep "CUDA Version" | awk '{print $NF}' | cut -d. -f1,2)

# check if it is CUDA 10.2
# download cub
if [ "$CUDA_VERSION" = "10.2" ]; then
wget https://github.com/NVIDIA/cub/archive/refs/tags/1.8.0.zip
unzip 1.8.0.zip
cp -r cub-1.8.0/cub/ colossalai/kernel/cuda_native/csrc/kernels/include/
fi
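Note on the version check in the step above: the pipeline takes the last field of the `CUDA Version` line and keeps only the major and minor components, so a full version string reduces to `10.2` before the comparison. The sketch below isolates that parsing so it can be checked without a CUDA install. The function name `cuda_major_minor` and the sample version strings in the usage note are assumptions for illustration.

```shell
#!/usr/bin/env bash
# cuda_major_minor: reads version text on stdin and prints "MAJOR.MINOR".
cuda_major_minor() {
    # grep keeps the "CUDA Version" line, awk takes its last field
    # (the full version), and cut keeps the first two dot-separated parts.
    grep "CUDA Version" | awk '{print $NF}' | cut -d. -f1,2
}
```

For instance, `echo "CUDA Version 10.2.89" | cuda_major_minor` prints `10.2`, which is the value the step compares against before downloading cub.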
- name: Install Colossal-AI
run: |
pip install -r requirements/requirements.txt
16 changes: 14 additions & 2 deletions .github/workflows/compatiblity_test_on_pr.yml
@@ -3,8 +3,8 @@ name: Compatibility Test on PR
on:
pull_request:
paths:
- 'version.txt'
- '.compatibility'
- "version.txt"
- ".compatibility"

jobs:
matrix_preparation:
@@ -58,6 +58,18 @@ jobs:
- uses: actions/checkout@v2
with:
ssh-key: ${{ secrets.SSH_KEY_FOR_CI }}
- name: Download cub for CUDA 10.2
run: |
CUDA_VERSION=$(cat $CUDA_HOME/version.txt | grep "CUDA Version" | awk '{print $NF}' | cut -d. -f1,2)

# check if it is CUDA 10.2
# download cub
if [ "$CUDA_VERSION" = "10.2" ]; then
wget https://github.com/NVIDIA/cub/archive/refs/tags/1.8.0.zip
unzip 1.8.0.zip
cp -r cub-1.8.0/cub/ colossalai/kernel/cuda_native/csrc/kernels/include/
fi

- name: Install Colossal-AI
run: |
pip install -v --no-cache-dir .
@@ -1,18 +1,16 @@
-name: Build Documentation After Merge
+name: Build Documentation On Schedule & After Release

on:
workflow_dispatch:
pull_request:
paths:
- 'version.txt'
- 'docs/**'
types:
- closed
schedule:
- cron: "0 12 * * *" # build doc every day at 8pm Singapore time (12pm UTC time)
release:
types: [published]

jobs:
build-doc:
name: Trigger Documentation Build Workflow
-if: ( github.event_name == 'workflow_dispatch' || github.event.pull_request.merged == true ) && github.repository == 'hpcaitech/ColossalAI'
+if: github.repository == 'hpcaitech/ColossalAI'
runs-on: ubuntu-latest
steps:
- name: trigger workflow in ColossalAI-Documentation
35 changes: 20 additions & 15 deletions .github/workflows/doc_check_on_pr.yml
@@ -2,57 +2,62 @@ name: Check Documentation on PR

on:
pull_request:
branches:
- "main"
- "develop"
- "feature/**"
paths:
- 'docs/**'
- "docs/**"

jobs:
check-i18n:
name: Check docs in diff languages
if: |
github.event.pull_request.draft == false &&
github.base_ref == 'main' &&
github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI'
github.event.pull_request.draft == false &&
github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2

- uses: actions/setup-python@v2
with:
python-version: '3.8.14'
python-version: "3.8.14"

- run: python .github/workflows/scripts/check_doc_i18n.py -d docs/source

check-doc-build:
name: Test if the docs can be built
if: |
github.event.pull_request.draft == false &&
github.base_ref == 'main' &&
github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI'
github.event.pull_request.draft == false &&
github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
with:
path: './ColossalAI'
path: "./ColossalAI"
fetch-depth: 0

- uses: actions/checkout@v2
with:
path: './ColossalAI-Documentation'
repository: 'hpcaitech/ColossalAI-Documentation'
path: "./ColossalAI-Documentation"
repository: "hpcaitech/ColossalAI-Documentation"

- uses: actions/setup-python@v2
with:
python-version: '3.8.14'
python-version: "3.8.14"

# we use the versions in the main branch as the guide for versions to display
# checkout will give your merged branch
# therefore, we need to make the merged branch as the main branch
# there is no main branch, so it's safe to checkout the main branch from the merged branch
# docer will rebase the remote main branch to the merged branch, so we have to config user
- name: Make the merged branch main
run: |
cd ColossalAI
curBranch=$(git rev-parse --abbrev-ref HEAD)
git checkout main
git merge $curBranch # fast-forward master up to the merge
git checkout -b main
git branch -u origin/main
git config user.name 'github-actions'
git config user.email 'github-actions@github.com'

- name: Build docs
run: |