Merged
215 commits
f5c5d4c
init
oahzxl Oct 27, 2022
87cddf7
rename and remove useless func
oahzxl Oct 27, 2022
78cfe43
basic chunk
oahzxl Nov 2, 2022
86f2a31
add evoformer
oahzxl Nov 2, 2022
820ea4d
align evoformer
oahzxl Nov 2, 2022
f8aeece
add meta
oahzxl Nov 3, 2022
c35718e
basic chunk
oahzxl Nov 4, 2022
d95cfe2
basic memory
oahzxl Nov 7, 2022
12301dd
finish basic inference memory estimation
oahzxl Nov 8, 2022
8cca684
finish memory estimation
oahzxl Nov 8, 2022
22f9c60
fix bug
oahzxl Nov 9, 2022
d7634af
finish memory estimation
oahzxl Nov 11, 2022
1607d04
add part of index tracer
oahzxl Nov 14, 2022
c36dba0
finish basic index tracer
oahzxl Nov 14, 2022
70a98b8
add doc string
oahzxl Nov 14, 2022
f379d1a
add doc str
oahzxl Nov 15, 2022
7e2bd1e
polish code
oahzxl Nov 15, 2022
fad3b6d
polish code
oahzxl Nov 15, 2022
54a34a7
update active log
oahzxl Nov 15, 2022
d9ca2f8
polish code
oahzxl Nov 15, 2022
7330d90
add possible region search
oahzxl Dec 4, 2022
3b7d671
finish region search loop
oahzxl Dec 6, 2022
f24c418
finish chunk define
oahzxl Dec 6, 2022
a9d6437
support new op
oahzxl Dec 6, 2022
6d99994
rename index tracer
oahzxl Dec 6, 2022
2b4ebcc
finishi codegen on msa
oahzxl Dec 8, 2022
979e61d
redesign index tracer, add source and change compute
oahzxl Dec 9, 2022
9294451
pass outproduct mean
oahzxl Dec 10, 2022
d31e146
code format
oahzxl Dec 10, 2022
5de9e46
code format
oahzxl Dec 10, 2022
31a2c5d
work with outerproductmean and msa
oahzxl Dec 12, 2022
b7b67c3
code style
oahzxl Dec 12, 2022
5cdfcfe
code style
oahzxl Dec 12, 2022
8511d90
code style
oahzxl Dec 12, 2022
98f9728
code style
oahzxl Dec 12, 2022
8754fa2
change threshold
oahzxl Dec 12, 2022
1e0fd11
support check_index_duplicate
oahzxl Dec 13, 2022
cda3e85
support index dupilictae and update loop
oahzxl Dec 13, 2022
de65e6c
support output
oahzxl Dec 13, 2022
e83e3c6
update memory estimate
oahzxl Dec 16, 2022
e66a18a
optimise search
oahzxl Dec 16, 2022
9d516fa
fix layernorm
oahzxl Dec 18, 2022
d734529
move flow tracer
oahzxl Dec 21, 2022
d361d53
refactor flow tracer
oahzxl Dec 21, 2022
ded1005
format code
oahzxl Dec 21, 2022
774d34f
refactor flow search
oahzxl Dec 23, 2022
522f017
code style
oahzxl Dec 23, 2022
d309e93
adapt codegen to prepose node
oahzxl Dec 23, 2022
49ba619
code style
oahzxl Dec 23, 2022
4d89525
remove abandoned function
oahzxl Dec 23, 2022
4f5e105
remove flow tracer
oahzxl Dec 23, 2022
fa5e6fb
code style
oahzxl Dec 23, 2022
e0ae68e
code style
oahzxl Dec 23, 2022
884a228
reorder nodes
oahzxl Dec 23, 2022
51ef838
finish node reorder
oahzxl Dec 23, 2022
9b1b890
update run
oahzxl Dec 23, 2022
786a398
code style
oahzxl Dec 23, 2022
1b8a066
add chunk select class
oahzxl Dec 26, 2022
8f5a0ed
add chunk select
oahzxl Dec 26, 2022
378a49d
code style
oahzxl Dec 27, 2022
6be89a3
add chunksize in emit, fix bug in reassgin shape
oahzxl Dec 27, 2022
a2b4755
code style
oahzxl Dec 27, 2022
cb2dd1a
turn off print mem
oahzxl Dec 27, 2022
69af931
add evoformer openfold init
oahzxl Dec 29, 2022
fff493c
init openfold
oahzxl Dec 29, 2022
1d7ca02
add benchmark
oahzxl Dec 29, 2022
5a916c0
add print
oahzxl Dec 29, 2022
7a23deb
code style
oahzxl Dec 29, 2022
efe6fe3
code style
oahzxl Dec 29, 2022
289f3a4
init openfold
oahzxl Dec 29, 2022
5c4df01
update openfold
oahzxl Dec 29, 2022
f7d8092
align openfold
oahzxl Dec 29, 2022
f5515e9
use max_mem to control stratge
oahzxl Dec 29, 2022
e5a5fbb
update source add
oahzxl Dec 30, 2022
966e4ea
add reorder in mem estimator
oahzxl Dec 30, 2022
80efd70
improve reorder efficeincy
oahzxl Dec 31, 2022
5f24f4f
support ones_like, add prompt if fit mode search fail
oahzxl Dec 31, 2022
7fd3b45
fix a bug in ones like, dont gen chunk if dim size is 1
oahzxl Jan 1, 2023
9c5e028
fix bug again
oahzxl Jan 1, 2023
55cb713
update min memory stratege, reduce mem usage by 30%
oahzxl Jan 5, 2023
71e72c4
last version of benchmark
oahzxl Jan 5, 2023
27ab524
refactor structure
oahzxl Jan 6, 2023
efb1c64
restruct dir
oahzxl Jan 6, 2023
06a5355
update test
oahzxl Jan 6, 2023
d1f0773
rename
oahzxl Jan 6, 2023
1a6d2a7
take apart chunk code gen
oahzxl Jan 6, 2023
8a634af
close mem and code print
oahzxl Jan 6, 2023
2bde9d2
code format
oahzxl Jan 6, 2023
fd87d78
rename ambiguous variable
oahzxl Jan 6, 2023
ae27a8b
seperate flow tracer
oahzxl Jan 6, 2023
f4a1607
seperate input node dim search
oahzxl Jan 6, 2023
f856611
seperate prepose_nodes
oahzxl Jan 6, 2023
6685a9d
seperate non chunk input
oahzxl Jan 6, 2023
c3d72f7
seperate reorder
oahzxl Jan 6, 2023
da40768
rename
oahzxl Jan 6, 2023
4748967
ad reorder graph
oahzxl Jan 6, 2023
a6cdbf9
seperate trace flow
oahzxl Jan 6, 2023
c3a2bf4
code style
oahzxl Jan 6, 2023
8a989a0
code style
oahzxl Jan 6, 2023
4d223e1
fix typo
oahzxl Jan 9, 2023
cb68ee8
set benchmark
oahzxl Jan 9, 2023
18a51c8
rename test
oahzxl Jan 9, 2023
74b8139
update codegen test
oahzxl Jan 9, 2023
9880fd2
Fix state_dict key missing issue of the ZeroDDP (#2363)
eric8607242 Jan 9, 2023
3abbaf8
update codegen test
oahzxl Jan 9, 2023
a005965
update codegen test
oahzxl Jan 9, 2023
d106b27
add chunk search test
oahzxl Jan 9, 2023
d5c4f0b
code style
oahzxl Jan 9, 2023
aafc351
add available
oahzxl Jan 9, 2023
498b5ca
[hotfix] fix gpt gemini example (#2404)
1SAA Jan 9, 2023
19cc64b
remove autochunk_available
oahzxl Jan 9, 2023
d3f5ce9
[workflow] added nightly release to pypi (#2403)
FrankLeeeee Jan 9, 2023
212b5b1
add comments
oahzxl Jan 9, 2023
1951f7f
code style
oahzxl Jan 9, 2023
a68d240
add doc for search chunk
oahzxl Jan 9, 2023
85e045b
[doc] updated readme regarding pypi installation (#2406)
FrankLeeeee Jan 9, 2023
065f0b4
add doc for search
oahzxl Jan 9, 2023
551cafe
[doc] updated kernel-related optimisers' docstring (#2385)
FrankLeeeee Jan 9, 2023
0ea903b
rename trace_index to trace_indice
oahzxl Jan 9, 2023
cb9817f
rename function from index to indice
oahzxl Jan 9, 2023
1bb1f2a
rename
oahzxl Jan 9, 2023
a4ed5b0
rename in doc
oahzxl Jan 9, 2023
ea13a20
[polish] polish code for get_static_torch_model (#2405)
1SAA Jan 9, 2023
865f2e0
rename
oahzxl Jan 9, 2023
d914a21
rename
oahzxl Jan 9, 2023
0b6af55
remove useless function
oahzxl Jan 9, 2023
53bb868
[worfklow] added coverage test (#2399)
FrankLeeeee Jan 9, 2023
1be0ac3
add doc for trace indice
oahzxl Jan 9, 2023
8de8de9
[docker] updated Dockerfile and release workflow (#2410)
FrankLeeeee Jan 10, 2023
7d4abaa
add doc
oahzxl Jan 10, 2023
615e7e6
update doc
oahzxl Jan 10, 2023
a591d45
add available
oahzxl Jan 10, 2023
fd818cf
change imports
oahzxl Jan 10, 2023
c1492e5
add test in import
oahzxl Jan 10, 2023
8327932
[workflow] refactored the example check workflow (#2411)
FrankLeeeee Jan 10, 2023
7d5640b
Update parallel_context.py (#2408)
haofanwang Jan 10, 2023
e532679
Merge branch 'main' of https://github.com/oahzxl/ColossalAI into chunk
oahzxl Jan 10, 2023
d84e747
[hotfix] add DISTPAN argument for benchmark (#2412)
1SAA Jan 10, 2023
4befaab
[workflow] added precommit check for code consistency (#2401)
FrankLeeeee Jan 10, 2023
7ab2db2
adapt new fx
oahzxl Jan 10, 2023
9d43223
[workflow] added translation for non-english comments (#2414)
FrankLeeeee Jan 10, 2023
2445279
[setup] refactored setup.py for dependency graph (#2413)
FrankLeeeee Jan 10, 2023
36ab2cb
change import
oahzxl Jan 10, 2023
61fdd34
update doc
oahzxl Jan 10, 2023
57b6157
[workflow] auto comment if precommit check fails (#2417)
FrankLeeeee Jan 10, 2023
dddacd2
[hotfix] add norm clearing for the overflow step (#2416)
1SAA Jan 10, 2023
93f62dd
[autochunk] add autochunk feature
feifeibear Jan 10, 2023
fe0f797
[examples] adding tflops to PaLM (#2365)
ZijianYY Jan 10, 2023
b3472d3
[workflow]auto comment with test coverage report (#2419)
FrankLeeeee Jan 10, 2023
cd38167
[doc] added documentation for CI/CD (#2420)
FrankLeeeee Jan 10, 2023
63be79d
[example] removed duplicated stable diffusion example (#2424)
FrankLeeeee Jan 11, 2023
bb4e9a3
[zero] add inference mode and its unit test (#2418)
1SAA Jan 11, 2023
2125667
[workflow] report test coverage even if below threshold (#2431)
FrankLeeeee Jan 11, 2023
a3e5496
[example] improved the clarity yof the example readme (#2427)
FrankLeeeee Jan 11, 2023
7829aa0
[ddp] add is_ddp_ignored (#2434)
1SAA Jan 11, 2023
1b7587d
[workflow] make test coverage report collapsable (#2436)
FrankLeeeee Jan 11, 2023
41429b9
[autoparallel] add shard option (#2423)
YuliangLiu0306 Jan 11, 2023
c41e59e
[fx] allow native ckpt trace and codegen. (#2438)
super-dainiu Jan 11, 2023
c72c827
[cli] provided more details if colossalai run fail (#2442)
FrankLeeeee Jan 11, 2023
2731531
[autoparallel] integrate device mesh initialization into autoparallel…
YuliangLiu0306 Jan 11, 2023
5521af7
[zero] fix state_dict and load_state_dict for ddp ignored parameters …
1SAA Jan 11, 2023
3916341
[example] updated the hybrid parallel tutorial (#2444)
FrankLeeeee Jan 11, 2023
2bfeb24
[zero] add warning for ignored parameters (#2446)
1SAA Jan 11, 2023
ac18a44
[example] updated large-batch optimizer tutorial (#2448)
FrankLeeeee Jan 11, 2023
cfd1d5e
[example] fixed seed error in train_dreambooth_colossalai.py (#2445)
haofanwang Jan 11, 2023
483efda
[workflow] fixed the on-merge condition check (#2452)
FrankLeeeee Jan 11, 2023
c9ec519
[workflow] automated the compatiblity test (#2453)
FrankLeeeee Jan 11, 2023
8221fd7
[autoparallel] update binary elementwise handler (#2451)
YuliangLiu0306 Jan 12, 2023
32c46e1
[workflow] automated bdist wheel build (#2459)
FrankLeeeee Jan 12, 2023
9358262
Fix False warning in initialize.py (#2456)
haofanwang Jan 12, 2023
c20529f
[examples] update autoparallel tutorial demo (#2449)
YuliangLiu0306 Jan 12, 2023
14d9299
[cli] fixed hostname mismatch error (#2465)
FrankLeeeee Jan 12, 2023
e6943e2
[example] integrate autoparallel demo with CI (#2466)
FrankLeeeee Jan 12, 2023
867c8c2
[zero] low level optim supports ProcessGroup (#2464)
feifeibear Jan 13, 2023
8e85d24
[example] update vit ci script (#2469)
ver217 Jan 13, 2023
8b7495d
[example] integrate seq-parallel tutorial with CI (#2463)
FrankLeeeee Jan 13, 2023
a5dc425
[zero] polish low level optimizer (#2473)
1SAA Jan 13, 2023
fef5c94
polish pp middleware (#2476)
Wesley-Jzy Jan 13, 2023
f525d1f
[example] update gpt gemini example ci test (#2477)
ver217 Jan 13, 2023
21c8822
[zero] add unit test for low-level zero init (#2474)
1SAA Jan 15, 2023
579dba5
[workflow] fixed the skip condition of example weekly check workflow…
FrankLeeeee Jan 16, 2023
f78bad2
[example] stable diffusion add roadmap
feifeibear Jan 16, 2023
9cba38b
add dummy test_ci.sh
feifeibear Jan 16, 2023
e4c38ba
[example] stable diffusion add roadmap (#2482)
feifeibear Jan 16, 2023
7c31706
[CI] add test_ci.sh for palm, opt and gpt (#2475)
feifeibear Jan 16, 2023
e64a05b
polish code
feifeibear Jan 16, 2023
236b419
Merge branch 'main' of https://github.com/hpcaitech/ColossalAI into d…
feifeibear Jan 16, 2023
37baea2
[example] titans for gpt
feifeibear Jan 16, 2023
315e143
polish readme
feifeibear Jan 16, 2023
92f65fb
remove license
feifeibear Jan 16, 2023
38424db
polish code
feifeibear Jan 16, 2023
438ea60
update readme
feifeibear Jan 16, 2023
3a21485
[example] titans for gpt (#2484)
feifeibear Jan 16, 2023
67e1912
[autoparallel] support origin activation ckpt on autoprallel system (…
YuliangLiu0306 Jan 16, 2023
4953b4a
[autochunk] support evoformer tracer (#2485)
oahzxl Jan 16, 2023
fcc6d61
[example] fix requirements (#2488)
binmakeswell Jan 17, 2023
d565a24
[zero] add unit testings for hybrid parallelism (#2486)
1SAA Jan 18, 2023
8208fd0
Merge branch 'main' of https://github.com/hpcaitech/ColossalAI into d…
feifeibear Jan 18, 2023
a4b75b7
[hotfix] gpt example titans bug #2493
feifeibear Jan 18, 2023
e58cc44
polish code and fix dataloader bugs
feifeibear Jan 18, 2023
e327e95
[hotfix] gpt example titans bug #2493 (#2494)
feifeibear Jan 18, 2023
5db3a5b
[fx] allow control of ckpt_codegen init (#2498)
oahzxl Jan 18, 2023
025b482
[example] dreambooth example
feifeibear Jan 18, 2023
7f822a5
Merge branch 'main' of https://github.com/hpcaitech/ColossalAI into d…
feifeibear Jan 18, 2023
32390cb
add test_ci.sh to dreambooth
feifeibear Jan 19, 2023
304f1ba
Merge pull request #2499 from feifeibear/dev0116_10
Fazziekey Jan 19, 2023
ecccc91
[autochunk] support autochunk on evoformer (#2497)
oahzxl Jan 19, 2023
99d9713
Revert "Update parallel_context.py (#2408)"
kurisusnowdeng Jan 19, 2023
0f02b8c
add avg partition (#2483)
Wesley-Jzy Jan 19, 2023
72341e6
[auto-chunk] support extramsa (#3) (#2504)
oahzxl Jan 20, 2023
35c0c00
[utils] lazy init. (#2148)
super-dainiu Jan 20, 2023
c04f183
[autochunk] support parsing blocks (#2506)
oahzxl Jan 20, 2023
2d1a7df
[zero] add strict ddp mode (#2508)
1SAA Jan 20, 2023
a6a1061
[doc] update opt and tutorial links (#2509)
binmakeswell Jan 20, 2023
0af7938
[workflow] fixed changed file detection (#2515)
FrankLeeeee Jan 26, 2023
24 changes: 24 additions & 0 deletions .bdist.json
@@ -0,0 +1,24 @@
{
  "build": [
    {
      "torch_version": "1.11.0",
      "cuda_image": "hpcaitech/cuda-conda:10.2"
    },
    {
      "torch_version": "1.11.0",
      "cuda_image": "hpcaitech/cuda-conda:11.3"
    },
    {
      "torch_version": "1.12.1",
      "cuda_image": "hpcaitech/cuda-conda:10.2"
    },
    {
      "torch_version": "1.12.1",
      "cuda_image": "hpcaitech/cuda-conda:11.3"
    },
    {
      "torch_version": "1.12.1",
      "cuda_image": "hpcaitech/cuda-conda:11.6"
    }
  ]
}
3 changes: 3 additions & 0 deletions .compatibility
@@ -0,0 +1,3 @@
1.12.0-11.3.0
1.11.0-11.3.0
1.10.1-11.3.0
149 changes: 149 additions & 0 deletions .github/workflows/README.md
@@ -0,0 +1,149 @@
# CI/CD

## Table of Contents

- [CI/CD](#cicd)
  - [Table of Contents](#table-of-contents)
  - [Overview](#overview)
  - [Workflows](#workflows)
    - [Checks on Pull Requests](#checks-on-pull-requests)
    - [Regular Checks](#regular-checks)
    - [Release](#release)
    - [Manual Dispatch](#manual-dispatch)
      - [Release bdist wheel](#release-bdist-wheel)
      - [Dispatch Example Test](#dispatch-example-test)
      - [Compatibility Test](#compatibility-test)
    - [User Friendliness](#user-friendliness)
  - [Configuration](#configuration)
  - [Progress Log](#progress-log)

## Overview

Automation makes our development more efficient, as the machine automatically runs pre-defined tasks for the contributors.
This saves a lot of manual work and allows developers to focus fully on features and bug fixes.
In Colossal-AI, we use [GitHub Actions](https://github.com/features/actions) to automate a wide range of workflows to ensure the robustness of the software.
The sections below describe the available workflows in detail.

## Workflows

### Checks on Pull Requests

| Workflow Name | File name | Description |
| --------------------------- | ------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------- |
| `Build` | `build.yml` | This workflow is triggered when the label `Run build and Test` is assigned to a PR. It will run all the unit tests in the repository with 4 GPUs. |
| `Pre-commit` | `pre_commit.yml` | This workflow runs pre-commit checks for code style consistency. |
| `Report pre-commit failure` | `report_precommit_failure.yml` | This workflow posts a comment on the PR explaining the pre-commit failure and how to fix it. It is executed when `Pre-commit` completes. |
| `Report test coverage` | `report_test_coverage.yml` | This workflow posts a comment on the PR reporting the test coverage results. It is executed when `Build` completes. |
| `Test example` | `auto_example_check.yml` | An example is automatically tested if its files are changed in the PR. |

### Regular Checks

| Workflow Name | File name | Description |
| ----------------------- | ----------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `Test example` | `auto_example_check.yml` | This workflow will test all examples every Sunday |
| `Compatibility Test` | `auto_compatibility_test.yml` | This workflow checks the compatibility of Colossal-AI with PyTorch and CUDA every Sunday. The PyTorch and CUDA versions are specified in `.compatibility`. |
| `Build on 8 GPUs` | `build_gpu_8.yml` | This workflow runs the unit tests every day with 8 GPUs. |
| `Synchronize submodule` | `submodule.yml` | This workflow will check if any git submodule is updated. If so, it will create a PR to update the submodule pointers. |
| `Close inactive issues` | `close_inactive.yml` | This workflow will close issues which are stale for 14 days. |

### Release

| Workflow Name | File name | Description |
| --------------------------- | ------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `Draft GitHub Release Post` | `draft_github_release_post.yml` | Compose a GitHub release post draft based on the commit history. Triggered when the change of `version.txt` is merged. |
| `Release to PyPI` | `release_pypi.yml` | Build and release the wheel to PyPI. Triggered when the change of `version.txt` is merged. |
| `Release Nightly to PyPI` | `release_nightly.yml` | Build and release the nightly wheel to PyPI as `colossalai-nightly`. Automatically executed every Sunday. |
| `Release Docker` | `release_docker.yml` | Build and release the Docker image to DockerHub. Triggered when the change of `version.txt` is merged. |
| `Release bdist wheel` | `release_bdist.yml` | Build binary wheels with pre-built PyTorch extensions. Manually dispatched. See more details in the next section. |
| `Auto Release bdist wheel` | `auto_release_bdist.yml` | Build binary wheels with pre-built PyTorch extensions. Triggered when the change of `version.txt` is merged. Build specifications are stored in `.bdist.json`. |
| `Auto Compatibility Test` | `auto_compatibility_test.yml` | Check Colossal-AI's compatibility with the PyTorch and CUDA versions specified in `.compatibility`. Triggered when `version.txt` is changed in a PR. |

### Manual Dispatch

| Workflow Name | File name | Description |
| ---------------------------- | -------------------------------- | ------------------------------------------------------ |
| `Release bdist wheel` | `release_bdist.yml` | Build binary wheels with pre-built PyTorch extensions. |
| `Dispatch Example Test` | `dispatch_example_check.yml` | Manually test a specified example. |
| `Dispatch Compatiblity Test` | `dispatch_compatiblity_test.yml` | Test compatibility with the specified PyTorch and CUDA versions. |

Refer to this [documentation](https://docs.github.com/en/actions/managing-workflow-runs/manually-running-a-workflow) on how to manually trigger a workflow.
The details of each workflow are given below.

#### Release bdist wheel

Parameters:
- `torch version`: the torch versions to test against. Multiple versions are supported and must be separated by commas. The default value is `all`, which tests all available torch versions listed in this [repository](https://github.com/hpcaitech/public_assets/tree/main/colossalai/torch_build/torch_wheels), which is regularly updated.
- `cuda version`: the CUDA versions to test against. Multiple versions are supported and must be separated by commas. The CUDA versions must be present in our [DockerHub repository](https://hub.docker.com/r/hpcaitech/cuda-conda).
- `ref`: the branch or tag name to build the wheel from.
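
As a rough illustration of how the comma-separated `torch version` and `cuda version` inputs could be expanded into individual builds, here is a minimal Python sketch. The function names `parse_versions` and `build_jobs` are hypothetical, pairing every torch version with every CUDA version is an assumption about the workflow's behaviour, and the special value `all` is not resolved here.

```python
from __future__ import annotations

from itertools import product


def parse_versions(raw: str) -> list[str]:
    """Split a comma-separated input such as "1.11.0, 1.12.1" into clean items."""
    return [item.strip() for item in raw.split(",") if item.strip()]


def build_jobs(torch_versions: str, cuda_versions: str) -> list[dict[str, str]]:
    """Pair every requested torch version with every requested CUDA image (assumed behaviour)."""
    return [
        {"torch_version": torch, "cuda_image": f"hpcaitech/cuda-conda:{cuda}"}
        for torch, cuda in product(parse_versions(torch_versions), parse_versions(cuda_versions))
    ]


jobs = build_jobs("1.11.0, 1.12.1", "11.3")
for job in jobs:
    print(job)
```

The resulting dictionaries use the same `torch_version` and `cuda_image` keys as the entries in `.bdist.json`.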

#### Dispatch Example Test

Parameters:
- `example_directory`: the example directory to test. Multiple directories are supported and must be separated by commas, for example `language/gpt, images/vit`. Entering only `language` or only `gpt` does not work.
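
To make the expected input format concrete, the following Python sketch checks an `example_directory` value before it is used. The helper `parse_example_directories` is hypothetical, and the rule it enforces (each entry needs at least an area and an example name, such as `language/gpt`) is an assumption drawn from the examples above, not the workflow's actual validation.

```python
from __future__ import annotations


def parse_example_directories(raw: str) -> list[str]:
    """Split a comma-separated list and reject entries that are not `<area>/<example>` paths."""
    directories = []
    for item in raw.split(","):
        path = item.strip().strip("/")
        if not path:
            continue  # ignore empty items such as a trailing comma
        parts = path.split("/")
        # Assumed rule: a bare "language" or "gpt" is not a valid example directory.
        if len(parts) < 2 or not all(parts):
            raise ValueError(f"expected '<area>/<example>', got {path!r}")
        directories.append(path)
    return directories


print(parse_example_directories("language/gpt, images/vit"))
```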


#### Compatibility Test

Parameters:
- `torch version`: the torch versions to test against. Multiple versions are supported and must be separated by commas. The default value is `all`, which tests all available torch versions listed in this [repository](https://github.com/hpcaitech/public_assets/tree/main/colossalai/torch_build/torch_wheels).
- `cuda version`: the CUDA versions to test against. Multiple versions are supported and must be separated by commas. The CUDA versions must be present in our [DockerHub repository](https://hub.docker.com/r/hpcaitech/cuda-conda).

> It only tests the compatibility of the main branch.


### User Friendliness

| Workflow Name | File name | Description |
| ----------------- | ----------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
| `issue-translate` | `translate_comment.yml` | This workflow is triggered when a new issue comment is created. The comment will be translated into English if not written in English. |


## Configuration

This section lists the files used to configure the workflow.

1. `.compatibility`

This `.compatibility` file tells GitHub Actions which PyTorch and CUDA versions to test against. Each line in the file has the format `${torch-version}-${cuda-version}`, which is a Docker image tag. This tag must therefore be present in the [docker registry](https://hub.docker.com/r/pytorch/conda-cuda) for the test to run.
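
The mapping from `.compatibility` lines to a job matrix can be sketched in Python as follows. This mirrors what the shell step in `auto_compatibility_test.yml` appears to do (prefixing each tag with `hpcaitech/pytorch-cuda:`), but the function `container_matrix` is a hypothetical helper written for illustration only.

```python
import json


def container_matrix(compatibility_text: str, image: str = "hpcaitech/pytorch-cuda") -> str:
    """Turn the lines of `.compatibility` into a JSON matrix of container images."""
    # Each non-empty line is a tag of the form "<torch-version>-<cuda-version>".
    tags = [line.strip() for line in compatibility_text.splitlines() if line.strip()]
    return json.dumps({"container": [f"{image}:{tag}" for tag in tags]})


sample = "1.12.0-11.3.0\n1.11.0-11.3.0\n1.10.1-11.3.0\n"
print(container_matrix(sample))
```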

2. `.bdist.json`

This file controls which PyTorch/CUDA-compatible pre-built releases are built and published. You can add a new entry following the JSON schema below if a new wheel needs to be built with AOT compilation of PyTorch extensions.

```json
{
  "build": [
    {
      "torch_version": "",
      "cuda_image": ""
    }
  ]
}
```
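
Before adding an entry, it can help to check that the file still matches this shape. The sketch below is one possible check in Python; `load_bdist_builds` and `REQUIRED_KEYS` are hypothetical names, and the workflow itself may not perform any such validation.

```python
from __future__ import annotations

import json

# Keys taken from the schema shown above.
REQUIRED_KEYS = {"torch_version", "cuda_image"}


def load_bdist_builds(text: str) -> list[dict[str, str]]:
    """Parse the content of `.bdist.json` and verify that every build entry has both keys."""
    builds = json.loads(text)["build"]
    for index, entry in enumerate(builds):
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            raise ValueError(f"build[{index}] is missing {sorted(missing)}")
    return builds


sample = '{"build": [{"torch_version": "1.12.1", "cuda_image": "hpcaitech/cuda-conda:11.6"}]}'
print(load_bdist_builds(sample))
```

Note that `json.loads` rejects trailing commas, so the last entry in the `build` list must not be followed by one.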

## Progress Log

- [x] unit testing
  - [x] test on PR
  - [x] report test coverage
  - [x] regular test
- [x] release
  - [x] official release
  - [x] nightly build
  - [x] binary build
  - [x] docker build
  - [x] draft release post
- [x] pre-commit
  - [x] check on PR
  - [x] report failure
- [x] example check
  - [x] check on PR
  - [x] regular check
  - [x] manual dispatch
- [x] compatibility check
  - [x] manual dispatch
  - [x] auto test when release
- [x] helpers
  - [x] comment translation
  - [x] submodule update
  - [x] close inactive issue
74 changes: 74 additions & 0 deletions .github/workflows/auto_compatibility_test.yml
@@ -0,0 +1,74 @@
name: Compatibility Test

on:
  pull_request:
    paths:
      - 'version.txt'
      - '.compatibility'
  # run at 03:00 every Sunday (Singapore time), which is 19:00 UTC on Saturday
  schedule:
    - cron: '0 19 * * 6'

jobs:
  matrix_preparation:
    name: Prepare Container List
    runs-on: ubuntu-latest
    outputs:
      matrix: ${{ steps.set-matrix.outputs.matrix }}
    steps:
      - uses: actions/checkout@v3
      - id: set-matrix
        run: |
          IFS=','
          DOCKER_IMAGE=()

          while read tag; do
            DOCKER_IMAGE+=("\"hpcaitech/pytorch-cuda:${tag}\"")
          done <.compatibility

          container=$( IFS=',' ; echo "${DOCKER_IMAGE[*]}" )
          container="[${container}]"
          echo "$container"
          echo "::set-output name=matrix::{\"container\":$(echo "$container")}"

  build:
    name: Test for PyTorch Compatibility
    needs: matrix_preparation
    if: github.repository == 'hpcaitech/ColossalAI'
    runs-on: [self-hosted, gpu]
    strategy:
      fail-fast: false
      matrix: ${{fromJson(needs.matrix_preparation.outputs.matrix)}}
    container:
      image: ${{ matrix.container }}
      options: --gpus all --rm -v /data/scratch/cifar-10:/data/scratch/cifar-10
    timeout-minutes: 120
    steps:
      - name: Install dependencies
        run: |
          pip install -U pip setuptools wheel --user
      - uses: actions/checkout@v2
        with:
          repository: hpcaitech/TensorNVMe
          ssh-key: ${{ secrets.SSH_KEY_FOR_CI }}
          path: TensorNVMe
      - name: Install tensornvme
        run: |
          cd TensorNVMe
          conda install cmake
          pip install -r requirements.txt
          pip install -v .
      - uses: actions/checkout@v2
        with:
          ssh-key: ${{ secrets.SSH_KEY_FOR_CI }}
      - name: Install Colossal-AI
        run: |
          pip install -v --no-cache-dir .
          pip install -r requirements/requirements-test.txt
      - name: Unit Testing
        run: |
          PYTHONPATH=$PWD pytest tests
        env:
          DATA: /data/scratch/cifar-10
          NCCL_SHM_DISABLE: 1
          LD_LIBRARY_PATH: /github/home/.tensornvme/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
143 changes: 143 additions & 0 deletions .github/workflows/auto_example_check.yml
@@ -0,0 +1,143 @@
name: Test Example
on:
  pull_request:
    # any change in the examples folder will trigger a check for the corresponding example.
    paths:
      - 'examples/**'
  # run at 00:00 every Sunday (Singapore time), which is 16:00 UTC on Saturday
  schedule:
    - cron: '0 16 * * 6'

jobs:
  # This job detects changed example files and outputs a matrix containing the corresponding directory names.
  detect-changed-example:
    if: |
      github.event.pull_request.draft == false &&
      github.base_ref == 'main' &&
      github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI' && github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    outputs:
      matrix: ${{ steps.setup-matrix.outputs.matrix }}
      anyChanged: ${{ steps.setup-matrix.outputs.anyChanged }}
    name: Detect changed example files
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0
          ref: ${{ github.event.pull_request.head.sha }}

      - name: Locate base commit
        id: locate-base-sha
        run: |
          curBranch=$(git rev-parse --abbrev-ref HEAD)
          commonCommit=$(git merge-base origin/main $curBranch)
          echo $commonCommit
          echo "baseSHA=$commonCommit" >> $GITHUB_OUTPUT

      - name: Get all changed example files
        id: changed-files
        uses: tj-actions/changed-files@v35
        with:
          base_sha: ${{ steps.locate-base-sha.outputs.baseSHA }}

      - name: setup matrix
        id: setup-matrix
        run: |
          changedFileName=""
          for file in ${{ steps.changed-files.outputs.all_changed_files }}; do
            changedFileName="${file}:${changedFileName}"
          done
          echo "$changedFileName was changed"
          res=`python .github/workflows/scripts/example_checks/detect_changed_example.py --fileNameList $changedFileName`
          echo "All changed examples are $res"

          if [ "$res" = "[]" ]; then
            echo "anyChanged=false" >> $GITHUB_OUTPUT
            echo "matrix=null" >> $GITHUB_OUTPUT
          else
            dirs=$( IFS=',' ; echo "${res[*]}" )
            echo "anyChanged=true" >> $GITHUB_OUTPUT
            echo "matrix={\"directory\":$(echo "$dirs")}" >> $GITHUB_OUTPUT
          fi

  # If no file is changed, this job would report an error because the matrix has no value.
  check-changed-example:
    # Add this condition to avoid executing this job if the trigger event is workflow_dispatch.
    if: |
      github.event.pull_request.draft == false &&
      github.base_ref == 'main' &&
      github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI' && github.event_name == 'pull_request' &&
      needs.detect-changed-example.outputs.anyChanged == 'true'
    name: Test the changed example
    needs: detect-changed-example
    runs-on: [self-hosted, gpu]
    strategy:
      matrix: ${{fromJson(needs.detect-changed-example.outputs.matrix)}}
    container:
      image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
      options: --gpus all --rm -v /data/scratch/examples-data:/data/
    timeout-minutes: 10
    steps:
      - uses: actions/checkout@v3

      - name: Install Colossal-AI
        run: |
          pip install -v .

      - name: Test the example
        run: |
          example_dir=${{ matrix.directory }}
          cd "${PWD}/examples/${example_dir}"
          bash test_ci.sh
        env:
          NCCL_SHM_DISABLE: 1

  # This is the weekly check of all examples. Specifically, this job finds all the example directories.
  matrix_preparation:
    if: |
      github.repository == 'hpcaitech/ColossalAI' &&
      github.event_name == 'schedule'
    name: Prepare matrix for weekly check
    runs-on: ubuntu-latest
    outputs:
      matrix: ${{ steps.setup-matrix.outputs.matrix }}
    steps:
      - name: 📚 Checkout
        uses: actions/checkout@v3

      - name: setup matrix
        id: setup-matrix
        run: |
          res=`python .github/workflows/scripts/example_checks/check_example_weekly.py`
          all_loc=$( IFS=',' ; echo "${res[*]}" )
          echo "Found the examples: $all_loc"
          echo "matrix={\"directory\":$(echo "$all_loc")}" >> $GITHUB_OUTPUT

  weekly_check:
    if: |
      github.repository == 'hpcaitech/ColossalAI' &&
      github.event_name == 'schedule'
    name: Weekly check all examples
    needs: matrix_preparation
    runs-on: [self-hosted, gpu]
    strategy:
      matrix: ${{fromJson(needs.matrix_preparation.outputs.matrix)}}
    container:
      image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
    timeout-minutes: 10
    steps:
      - name: 📚 Checkout
        uses: actions/checkout@v3

      - name: Install Colossal-AI
        run: |
          pip install -v .

      - name: Traverse all files
        run: |
          # fixed typo: the matrix key is `directory`, not `diretory`
          example_dir=${{ matrix.directory }}
          echo "Testing ${example_dir} now"
          cd "${PWD}/examples/${example_dir}"
          bash test_ci.sh
        env:
          NCCL_SHM_DISABLE: 1