Closed
Changes from all commits · 426 commits
1a60dc0
[chat] typo accimulation_steps -> accumulation_steps (#3662)
tanitna Apr 28, 2023
bfbf650
fix spelling error
digger-yu May 4, 2023
8ba7858
Update generate_gpt35_answers.py
digger-yu May 4, 2023
7bd0bee
[chat] add opt attn kernel (#3655)
ver217 May 4, 2023
6650dae
[doc] fix chat spelling error (#3671)
digger-yu May 5, 2023
0f785cb
[chat] PPO stage3 doc enhancement (#3679)
Camille7777 May 5, 2023
307894f
[booster] gemini plugin support shard checkpoint (#3610)
flybird11111 May 5, 2023
b36e67c
Merge pull request #3680 from digger-yu/digger-yu-patch-2
TongLi3701 May 5, 2023
b49020c
[CI] Update test_sharded_optim_with_sync_bn.py (#3688)
digger-yu May 5, 2023
d0915f5
[booster] refactor all dp fashion plugins (#3684)
ver217 May 5, 2023
65bdc31
fix some spelling error with applications/Chat/examples/ (#3692)
digger-yu May 6, 2023
d556648
[example] add finetune bert with booster example (#3693)
ver217 May 6, 2023
2da5d81
[chat] fix train_prompts.py gemini strategy bug (#3666)
zhang-yi-chi May 6, 2023
2629f97
[tensor] Refactor handle_trans_spec in DistSpecManager
yhna940 May 6, 2023
f83ea81
[example] add train resnet/vit with booster example (#3694)
ver217 May 8, 2023
3bf09ef
[booster] update prepare dataloader method for plugin (#3706)
ver217 May 8, 2023
6552cbf
[booster] fix no_sync method (#3709)
ver217 May 9, 2023
20068ba
[booster] add tests for ddp and low level zero's checkpointio (#3715)
flybird11111 May 10, 2023
f7361ee
[chat] fix community example ray (#3719)
MisterLin1995 May 10, 2023
b7141c3
[CI] fix some spelling errors (#3707)
digger-yu May 10, 2023
899aa86
[CI] fix typo with tests components (#3695)
digger-yu May 11, 2023
1f73609
[CI] fix typo with tests/ etc. (#3727)
digger-yu May 11, 2023
ad6460c
[NFC] fix typo applications/ and colossalai/ (#3735)
digger-yu May 15, 2023
b37797e
[booster] support torch fsdp plugin in booster (#3697)
wukong1992 May 15, 2023
afb239b
[devops] update torch version of CI (#3725)
ver217 May 15, 2023
6050f37
[booster] removed models that don't support fsdp (#3744)
wukong1992 May 15, 2023
7386c66
[fix] Add init to fix import error when importing _analyzer (#3668)
Wesley-Jzy May 16, 2023
1baeb39
[NFC] fix typo with colossalai/auto_parallel/tensor_shard (#3742)
digger-yu May 17, 2023
c03bd7c
[devops] make build on PR run automatically (#3748)
ver217 May 17, 2023
5dd573c
[devops] fix ci for document check (#3751)
ver217 May 17, 2023
0575983
[chat] fix bugs in stage 3 training (#3759)
chengeharrison May 17, 2023
d449525
[doc] update booster tutorials (#3718)
flybird11111 May 18, 2023
15024e4
[auto] fix install cmd (#3772)
binmakeswell May 18, 2023
48bd056
[doc] update hybrid parallelism doc (#3770)
flybird11111 May 18, 2023
2703a37
[amp] Add naive amp demo (#3774)
flybird11111 May 18, 2023
5452df6
[plugin] torch ddp plugin supports sharded model checkpoint (#3775)
ver217 May 18, 2023
5ce6c9d
[doc] add tutorial for cluster utils (#3763)
ver217 May 19, 2023
21e29e2
[doc] add tutorial for booster plugins (#3758)
ver217 May 19, 2023
32f81f1
[NFC] fix typo colossalai/amp auto_parallel autochunk (#3756)
digger-yu May 19, 2023
b4788d6
[devops] fix doc test on pr (#3782)
ver217 May 19, 2023
ad2cf58
[chat] add performance and tutorial (#3786)
binmakeswell May 19, 2023
60e6a15
[doc] add tutorial for booster checkpoint (#3785)
ver217 May 19, 2023
3c07a28
[plugin] a workaround for zero plugins' optimizer checkpoint (#3780)
ver217 May 19, 2023
72688ad
[doc] add booster docstring and fix autodoc (#3789)
ver217 May 22, 2023
d9393b8
[doc] add deprecated warning on doc Basics section (#3754)
Yanjia0 May 22, 2023
fe1561a
[doc] update gradient cliping document (#3778)
flybird11111 May 22, 2023
62c7e67
[format] applied code formatting on changed files in pull request 378…
github-actions[bot] May 22, 2023
4d29c0f
Fix/docker action (#3266)
liuzeming-yuxi May 22, 2023
788e07d
[workflow] fixed the docker build workflow (#3794)
FrankLeeeee May 22, 2023
f5c425c
fixed the example docstring for booster (#3795)
FrankLeeeee May 22, 2023
ef02d7e
[doc] update gradient accumulation (#3771)
flybird11111 May 23, 2023
ad93c73
[workflow] enable testing for develop & feature branch (#3801)
FrankLeeeee May 23, 2023
615e2e5
[test] fixed lazy init test import error (#3799)
FrankLeeeee May 23, 2023
e871e34
[API] add docstrings and initialization to apex amp, naive amp (#3783)
flybird11111 May 23, 2023
9265f2d
[NFC]fix typo colossalai/auto_parallel nn utils etc. (#3779)
digger-yu May 23, 2023
8c62e50
[doc] update amp document
flybird11111 May 23, 2023
1167bf5
[doc] update amp document
flybird11111 May 23, 2023
a520610
[doc] update amp document
flybird11111 May 23, 2023
75272ef
[doc] add removed warning
flybird11111 May 23, 2023
c425a69
[doc] add removed change of config.py
flybird11111 May 23, 2023
6b305a9
[booster] torch fsdp fix ckpt (#3788)
wukong1992 May 23, 2023
19d1530
[doc] add warning about fsdp plugin (#3813)
ver217 May 23, 2023
1e3b64f
[workflow] enblaed doc build from a forked repo (#3815)
FrankLeeeee May 23, 2023
8aa1fb2
[doc]fix
flybird11111 May 23, 2023
278fcbc
[doc]fix
flybird11111 May 23, 2023
725365f
Merge pull request #3810 from jiangmingyan/amp
flybird11111 May 23, 2023
7f8203a
fix typo colossalai/auto_parallel autochunk fx/passes etc. (#3808)
digger-yu May 24, 2023
269150b
[Docker] Fix a couple of build issues (#3691)
ymwangg May 24, 2023
05b8a8d
[workflow] changed to doc build to be on schedule and release (#3825)
FrankLeeeee May 24, 2023
3496637
[evaluation] add automatic evaluation pipeline (#3821)
chengeharrison May 24, 2023
e90fdb1
fix typo docs/
digger-yu May 24, 2023
518b31c
[docs] change placememt_policy to placement_policy (#3829)
digger-yu May 24, 2023
84500b7
[workflow] fixed testmon cache in build CI (#3806)
FrankLeeeee May 24, 2023
7c9f2ed
[dtensor] polish sharding spec docstring (#3838)
ver217 May 25, 2023
3229f93
[booster] add warning for torch fsdp plugin doc (#3833)
wukong1992 May 25, 2023
54e97ed
[workflow] supported test on CUDA 10.2 (#3841)
FrankLeeeee May 25, 2023
a64df3f
[doc] update document of gemini instruction. (#3842)
flybird11111 May 25, 2023
e2d81eb
[nfc] fix typo colossalai/ applications/ (#3831)
digger-yu May 25, 2023
d42b1be
[release] bump to v0.3.0 (#3830)
FrankLeeeee May 25, 2023
ae959a7
[workflow] fixed workflow check for docker build (#3849)
FrankLeeeee May 25, 2023
b047487
[doc] update nvme offload documents. (#3850)
flybird11111 May 25, 2023
2506e27
[evaluation] improvement on evaluation (#3862)
chengeharrison May 30, 2023
5f79008
[example] update gemini examples (#3868)
flybird11111 May 30, 2023
281b33f
[doc] update document of zero with chunk. (#3855)
flybird11111 May 30, 2023
46503c3
Modify torch version requirement to adapt torch 2.0
MaruyamaAya Jun 1, 2023
70c8cde
[nfc] fix typo colossalai/cli fx kernel (#3847)
digger-yu Jun 2, 2023
60ec33b
Add a new example of Dreambooth training using the booster API
MaruyamaAya Jun 2, 2023
42e3232
roll back
MaruyamaAya Jun 2, 2023
25447d4
modify path
MaruyamaAya Jun 5, 2023
dbb3269
[lazy] refactor lazy init (#3891)
ver217 Jun 5, 2023
8065cc5
Modify torch version requirement to adapt torch 2.0 (#3896)
MaruyamaAya Jun 5, 2023
07cb211
[doc]update moe chinese document. (#3890)
flybird11111 Jun 5, 2023
ae02d4e
[bf16] add bf16 support (#3882)
ver217 Jun 5, 2023
1878749
[nfc] fix typo colossalai/nn (#3887)
digger-yu Jun 5, 2023
57a6d76
support evaluation for english (#3880)
chengeharrison Jun 5, 2023
ec9bbc0
[devops] improving testmon cache (#3902)
ver217 Jun 6, 2023
c1535cc
[doc] fix docs about booster api usage (#3898)
Fridge003 Jun 6, 2023
0e484e6
[nfc]fix typo colossalai/pipeline tensor nn (#3899)
digger-yu Jun 6, 2023
176010f
update performance evaluation
MaruyamaAya Jun 6, 2023
b56c7f4
update shell file
MaruyamaAya Jun 6, 2023
1c1f71c
fixing insecure hash function
MaruyamaAya Jun 6, 2023
b29e1f0
change directory
MaruyamaAya Jun 6, 2023
d3379f0
fixed model saving bugs
MaruyamaAya Jun 6, 2023
79c9f77
fixed port
MaruyamaAya Jun 6, 2023
b4437e8
fixed port
MaruyamaAya Jun 6, 2023
41fb723
[devops] hotfix CI about testmon cache (#3910)
ver217 Jun 6, 2023
b5f0566
[chat] add distributed PPO trainer (#3740)
ver217 Jun 7, 2023
4fc8bc6
modify file path
MaruyamaAya Jun 7, 2023
9c88b6c
[lazy] fix compatibility problem on torch 1.13 (#3911)
ver217 Jun 7, 2023
c622bb3
Merge pull request #3915 from FrankLeeeee/update/develop
FrankLeeeee Jun 7, 2023
d51e83d
Merge pull request #3916 from FrankLeeeee/sync/dtensor-with-develop
FrankLeeeee Jun 7, 2023
c25d421
[devops] hotfix testmon cache clean logic (#3917)
ver217 Jun 7, 2023
5e2132d
[workflow] added docker latest tag for release (#3920)
FrankLeeeee Jun 7, 2023
a55fb00
[booster] update bert example, using booster api (#3885)
wukong1992 Jun 7, 2023
b306cec
[example] Modify palm example with the new booster API (#3913)
MaruyamaAya Jun 7, 2023
a9d1cad
fix typo with colossalai/trainer utils zero (#3908)
digger-yu Jun 7, 2023
c94a335
modify shell for check
MaruyamaAya Jun 7, 2023
12c90db
[doc] add lazy init tutorial (#3922)
ver217 Jun 7, 2023
de0d7df
[nfc] fix typo colossalai/zero (#3923)
digger-yu Jun 7, 2023
9166988
[devops] update torch version in compability test (#3919)
ver217 Jun 8, 2023
eb39154
[dtensor] updated api and doc (#3845)
FrankLeeeee Jun 8, 2023
cf4792c
modify shell for check
MaruyamaAya Jun 8, 2023
e417dd0
[example] update opt example using booster api (#3918)
Fridge003 Jun 8, 2023
039854b
modify shell for check
MaruyamaAya Jun 8, 2023
49567d5
modify shell for check
MaruyamaAya Jun 8, 2023
730a092
modify shell for check
MaruyamaAya Jun 8, 2023
407aa48
fix typo examples/community/roberta (#3925)
digger-yu Jun 8, 2023
a98e16e
Merge pull request #3926 from hpcaitech/feature/dtensor
FrankLeeeee Jun 8, 2023
9b5e7ce
modify shell for check
MaruyamaAya Jun 8, 2023
6a69b44
[shardformer] init shardformer code structure (#3731)
FoolPlayer May 22, 2023
58f6432
[shardformer]: Feature/shardformer, add some docstring and readme (#3…
FoolPlayer May 24, 2023
bc19024
[shardformer] updated readme (#3827)
FrankLeeeee May 24, 2023
537a52b
[shardformer] refactored the user api (#3828)
FrankLeeeee May 24, 2023
997544c
[shardformer] update readme with modules implement doc (#3834)
FoolPlayer May 24, 2023
21a3915
[shardformer] add Dropout layer support different dropout pattern (#3…
FoolPlayer Jun 1, 2023
6370a93
update README (#3909)
FoolPlayer Jun 6, 2023
ef15377
[shardformer] add gpt2 policy and modify shard and slicer to support …
FoolPlayer Jun 7, 2023
33eef71
fix typo examples and docs (#3932)
digger-yu Jun 8, 2023
21c4c0b
support UniEval and add CHRF metric (#3924)
chengeharrison Jun 8, 2023
e277534
Merge pull request #3905 from MaruyamaAya/dreambooth
MaruyamaAya Jun 9, 2023
24651fd
Merge pull request #3931 from FrankLeeeee/sync/develop-to-shardformer
FoolPlayer Jun 9, 2023
ddcf58c
Revert "[sync] sync feature/shardformer with develop"
FrankLeeeee Jun 9, 2023
bd2c7c3
Merge pull request #3942 from hpcaitech/revert-3931-sync/develop-to-s…
FoolPlayer Jun 9, 2023
bd1ab98
[gemini] fixed the gemini checkpoint io (#3934)
FrankLeeeee Jun 9, 2023
e61ffc7
fix typo tests/ (#3936)
digger-yu Jun 9, 2023
1aadeed
fix typo .github/workflows/scripts/ (#3946)
digger-yu Jun 9, 2023
4110d1f
[workflow] cancel duplicated workflow jobs (#3960)
FrankLeeeee Jun 12, 2023
b3ab7fb
[example] update ViT example using booster api (#3940)
Jun 12, 2023
71fe527
[gemini] fixed the gemini checkpoint io (#3934)
FrankLeeeee Jun 9, 2023
6718a2f
[workflow] cancel duplicated workflow jobs (#3960)
FrankLeeeee Jun 12, 2023
2bf6547
Merge pull request #3967 from ver217/update-develop
FrankLeeeee Jun 12, 2023
9d02590
[chat] refactor actor class (#3968)
cwher Jun 13, 2023
8bcad73
[workflow] fixed the directory check in build (#3980)
FrankLeeeee Jun 13, 2023
2925f47
[evaluate] support gpt evaluation with reference (#3972)
chengeharrison Jun 13, 2023
e8ad3c8
[doc] add a note about unit-testing to CONTRIBUTING.md (#3970)
Jun 14, 2023
d4fb7bf
fix typo applications/Chat/coati/ (#3947)
digger-yu Jun 15, 2023
c9cff7e
[checkpointio] General Checkpointing of Sharded Optimizers (#3984)
Jun 15, 2023
725af3e
[booster] make optimizer argument optional for boost (#3993)
cwher Jun 15, 2023
822c3d4
[checkpointio] sharded optimizer checkpoint for DDP plugin (#4002)
Jun 16, 2023
a5883aa
[test] fixed codefactor format report (#4026)
FrankLeeeee Jun 16, 2023
ca768eb
Merge pull request #4025 from hpcaitech/develop
FrankLeeeee Jun 19, 2023
727c459
[nfc] fix dim not defined and fix typo (#3991)
digger-yu Jun 19, 2023
160c64c
[example] fix bucket size in example of gpt gemini (#4028)
Gy-Lu Jun 19, 2023
a52f620
[format] applied code formatting on changed files in pull request 402…
github-actions[bot] Jun 19, 2023
4a81faa
[devops] fix build on pr ci (#4043)
ver217 Jun 19, 2023
b463651
[workflow] cover all public repositories in weekly report (#4069)
FrankLeeeee Jun 22, 2023
0bb0b48
[gemini] fix argument naming during chunk configuration searching
Jun 25, 2023
153b957
[chat] refactor strategy class with booster api (#3987)
cwher Jun 25, 2023
2c8ae37
Merge pull request #4056 from Fridge003/hotfix/fix_gemini_chunk_confi…
Jun 25, 2023
e89b127
[chat]: fix chat evaluation possible bug (#4064)
MichelleMa8 Jun 26, 2023
4da324c
[hotfix]fix argument naming in docs and examples (#4083)
Jun 26, 2023
95e95b6
[testing] move pytest to be inside the function (#4087)
FrankLeeeee Jun 27, 2023
31dc302
[examples] copy resnet example to image (#4090)
CjhHa1 Jun 27, 2023
1ee947f
[workflow] added status check for test coverage workflow (#4106)
FrankLeeeee Jun 28, 2023
2d40759
fix #3852 path error (#4058)
digger-yu Jun 28, 2023
769cddc
fix typo docs/ (#4033)
digger-yu Jun 28, 2023
711e2b4
[doc] update and revise some typos and errs in docs (#4107)
CjhHa1 Jun 28, 2023
b03d64d
[chat] refactor trainer class (#4080)
cwher Jun 29, 2023
edd75a5
[chat] remove naive strategy and split colossalai strategy (#4094)
cwher Jun 29, 2023
09fe9dc
[nfc]fix ColossalaiOptimizer is not defined (#4122)
digger-yu Jun 30, 2023
7e46bc8
fix CheckpointIndexFile is not defined (#4109)
digger-yu Jul 3, 2023
8abc877
fix Tensor is not defined (#4129)
digger-yu Jul 3, 2023
1350ece
[hotfix] fix import bug in checkpoint_io (#4142)
Jul 3, 2023
3d8d5d0
[chat] use official transformers and fix some issues (#4117)
cwher Jul 4, 2023
8d68de7
[shardformer] init shardformer code structure (#3731)
FoolPlayer May 22, 2023
8cc1123
[shardformer]: Feature/shardformer, add some docstring and readme (#3…
FoolPlayer May 24, 2023
235792f
[shardformer] updated readme (#3827)
FrankLeeeee May 24, 2023
4972e1f
[shardformer] refactored the user api (#3828)
FrankLeeeee May 24, 2023
c594dc2
[shardformer] update readme with modules implement doc (#3834)
FoolPlayer May 24, 2023
ab8a47f
[shardformer] add Dropout layer support different dropout pattern (#3…
FoolPlayer Jun 1, 2023
70173e3
update README (#3909)
FoolPlayer Jun 6, 2023
79f8d5d
[shardformer] add gpt2 policy and modify shard and slicer to support …
FoolPlayer Jun 7, 2023
f1cb5ac
[shardformer] Align bert value (#3907)
FoolPlayer Jun 9, 2023
a731304
[shardformer] Unit test (#3928)
FoolPlayer Jun 12, 2023
45927d5
[shardformer] Add dropout layer in shard model and refactor policy ap…
FoolPlayer Jun 12, 2023
6b30dfb
[shardformer] support llama model using shardformer (#3969)
wukong1992 Jun 13, 2023
c1c672d
[shardformer] shardformer support t5 model (#3994)
wukong1992 Jun 15, 2023
f7774ec
[Shardformer] Downstream bert (#3979)
FoolPlayer Jun 15, 2023
a2f9af8
[shardformer] fix an error in readme (#3988)
FoolPlayer Jun 15, 2023
6119712
[device] support init device mesh from process group (#3990)
FrankLeeeee Jun 15, 2023
d3bc530
[shardformer] Refactor shardformer api (#4001)
FoolPlayer Jun 15, 2023
015af59
[shardformer] integrated linear 1D with dtensor (#3996)
FrankLeeeee Jun 15, 2023
dfca967
integrate with dist layer (#4011)
FoolPlayer Jun 16, 2023
3893fa1
[shardformer] refactored embedding and dropout to parallel module (#4…
FrankLeeeee Jun 16, 2023
45d9384
[shardformer] removed inplace tensor sharding (#4018)
FrankLeeeee Jun 16, 2023
507c0ad
add vocabembedding layer
FoolPlayer Jun 16, 2023
df018fc
support bert with new api
FoolPlayer Jun 16, 2023
e253a07
[shardformer] updated doc (#4016)
FrankLeeeee Jun 16, 2023
74d176c
[shardformer] fix bert and gpt downstream with new api (#4024)
FoolPlayer Jun 19, 2023
c1d5453
[shardformer] adapted llama to the new API (#4036)
FrankLeeeee Jun 19, 2023
d857f3d
[shardformer] supported T5 and its variants (#4045)
FrankLeeeee Jun 19, 2023
4021b9a
[shardformer] add gpt2 test and layer class refactor (#4041)
FoolPlayer Jun 20, 2023
58df720
[shardformer] adapted T5 and LLaMa test to use kit (#4049)
FrankLeeeee Jun 21, 2023
f22ddac
[shardformer] refactored the shardformer layer structure (#4053)
FrankLeeeee Jun 21, 2023
7740c55
support kit use for bert/gpt test (#4055)
FoolPlayer Jun 22, 2023
8eb09a4
[shardformer] support module saving and loading (#4062)
FrankLeeeee Jun 22, 2023
0803a61
[shardformer] add linearconv1d test (#4067)
FoolPlayer Jun 22, 2023
70c58cf
[shardformer] supported fused qkv checkpoint (#4073)
FrankLeeeee Jun 23, 2023
92f6791
[shardformer] Add layernorm (#4072)
FoolPlayer Jun 23, 2023
c4b1b65
[test] fixed tests failed due to dtensor change (#4082)
FrankLeeeee Jun 26, 2023
d33a44e
[shardformer] refactored layernorm (#4086)
FrankLeeeee Jun 26, 2023
ac80937
[shardformer] shardformer support opt models (#4091)
flybird11111 Jun 27, 2023
8af29ee
[shardformer] support vision transformer (#4096)
klhhhhh Jun 28, 2023
b1c2901
[shardformer] supported bloom model (#4098)
FrankLeeeee Jun 28, 2023
f3b6aaa
[shardformer] supported fused normalization (#4112)
FrankLeeeee Jun 30, 2023
6a88bae
[shardformer] integrate with data parallelism (#4103)
FrankLeeeee Jun 30, 2023
44a190e
[shardformer] import huggingface implicitly (#4101)
FrankLeeeee Jun 30, 2023
ae035d3
[shardformer] added embedding gradient check (#4124)
FrankLeeeee Jun 30, 2023
7f9b303
[shardformer] write an shardformer example with bert finetuning (#4126)
flybird11111 Jun 30, 2023
74257cb
[shardformer] refactored some doc and api (#4137)
FrankLeeeee Jul 3, 2023
1fb0d95
[shardformer] made tensor parallelism configurable (#4144)
FrankLeeeee Jul 4, 2023
89f45ed
[shardformer] added development protocol for standardization (#4149)
FrankLeeeee Jul 4, 2023
f447ca1
[chat] removed cache file (#4155)
FrankLeeeee Jul 4, 2023
c77b3b1
[format] applied code formatting on changed files in pull request 415…
github-actions[bot] Jul 4, 2023
2ac2404
fix some typo colossalai/shardformer (#4160)
digger-yu Jul 4, 2023
1908caa
[cli] hotfix launch command for multi-nodes (#4165)
ver217 Jul 4, 2023
cc3cbe9
[workflow] show test duration (#4159)
FrankLeeeee Jul 4, 2023
190a6ea
[dtensor] fixed readme file name and removed deprecated file (#4162)
FrankLeeeee Jul 4, 2023
fee32a3
[docker] added ssh and rdma support for docker (#4192)
FrankLeeeee Jul 7, 2023
5891344
Next commit [checkpointio] Unsharded Optimizer Checkpoint for Gemini …
Jul 7, 2023
c1cf752
[docker] fixed ninja build command (#4203)
FrankLeeeee Jul 10, 2023
4e9b09c
Automated submodule synchronization (#4217)
github-actions[bot] Jul 12, 2023
9a4842c
revise shardformer readme (#4246)
CjhHa1 Jul 17, 2023
7ff11b5
[example] add llama pretraining (#4257)
binmakeswell Jul 17, 2023
4b97754
[Kernels] added triton-implemented of self attention for colossal-ai …
tiandiao123 Jul 18, 2023
fc5cef2
[lazy] support init on cuda (#4269)
ver217 Jul 19, 2023
c6f6005
[checkpointio] Sharded Optimizer Checkpoint for Gemini Plugin (#4302)
Jul 21, 2023
917ac28
[chat] train sft support tensorboard
ver217 Jul 21, 2023
fcb0280
[chat] train sft support optimizer save load
ver217 Jul 21, 2023
6be1cad
add tensorboard close logic
CZYCW Jul 26, 2023
4 changes: 2 additions & 2 deletions .compatibility
Original file line number Diff line number Diff line change
@@ -1,3 +1,3 @@
 1.12.0-11.3.0
-1.11.0-11.3.0
-1.10.1-11.3.0
+1.13.0-11.6.0
+2.0.0-11.7.0
4 changes: 4 additions & 0 deletions .coveragerc
@@ -0,0 +1,4 @@
[run]
concurrency = multiprocessing
parallel = true
sigterm = true
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/config.yml
@@ -8,4 +8,4 @@ contact_links:
about: This issue tracker is not for technical support. Please use WeChat, and ask the community for help.
- name: 😊 Advanced question - GitHub Discussions
url: https://github.com/hpcaitech/ColossalAI/discussions
-about: Use GitHub Discussions for advanced and unanswered technical questions, requiring a maintainer's answer.
+about: Use GitHub Discussions for advanced and unanswered technical questions, requiring a maintainer's answer.
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/feature_request.yml
@@ -22,7 +22,7 @@ body:
If applicable, add screenshots to help explain your problem.
**Suggest a potential alternative/fix**
Tell us how we could improve this project.
-**Optional: Affiliation**
+**Optional: Affiliation**
Institution/email information helps better analyze and evaluate users to improve the project. Welcome to establish in-depth cooperation.
placeholder: |
A clear and concise description of your idea.
30 changes: 19 additions & 11 deletions .github/workflows/README.md
@@ -14,7 +14,7 @@
- [Compatibility Test on Dispatch](#compatibility-test-on-dispatch)
- [Release](#release)
- [User Friendliness](#user-friendliness)
-- [Commmunity](#commmunity)
+- [Community](#community)
- [Configuration](#configuration)
- [Progress Log](#progress-log)

@@ -30,7 +30,7 @@ In the section below, we will dive into the details of different workflows avail
Refer to this [documentation](https://docs.github.com/en/actions/managing-workflow-runs/manually-running-a-workflow) on how to manually trigger a workflow.
I will provide the details of each workflow below.

-**A PR which changes the `version.txt` is considered as a release PR in the following coontext.**
+**A PR which changes the `version.txt` is considered as a release PR in the following context.**


### Code Style Check
@@ -43,10 +43,18 @@ I will provide the details of each workflow below.

| Workflow Name | File name | Description |
| ---------------------- | -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `Build on PR` | `build_on_pr.yml` | This workflow is triggered when the label `Run build and Test` is assigned to a PR. It will run all the unit tests in the repository with 4 GPUs. |
+| `Build on PR` | `build_on_pr.yml` | This workflow is triggered when a PR changes essential files and a branch is created/deleted. It will run all the unit tests in the repository with 4 GPUs. |
| `Build on Schedule` | `build_on_schedule.yml` | This workflow will run the unit tests everyday with 8 GPUs. The result is sent to Lark. |
| `Report test coverage` | `report_test_coverage.yml` | This PR will put up a comment to report the test coverage results when `Build` is done. |

To reduce the average time of the unit test on PR, `Build on PR` workflow manages testmon cache.

1. When creating a new branch, it copies `cache/main/.testmondata*` to `cache/<branch>/`.
2. When creating a new PR or change the base branch of a PR, it copies `cache/<base_ref>/.testmondata*` to `cache/_pull/<pr_number>/`.
3. When running unit tests for each PR, it restores testmon cache from `cache/_pull/<pr_number>/`. After the test, it stores the cache back to `cache/_pull/<pr_number>/`.
4. When a PR is closed, if it's merged, it copies `cache/_pull/<pr_number>/.testmondata*` to `cache/<base_ref>/`. Otherwise, it just removes `cache/_pull/<pr_number>`.
5. When a branch is deleted, it removes `cache/<ref>`.
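
The five cache-management steps above can be sketched as plain shell. This is an illustrative sketch only — the real logic lives inside `build_on_pr.yml`, and the `CACHE_ROOT` path and function names here are assumptions made for demonstration:

```shell
#!/usr/bin/env bash
# Sketch of the testmon cache lifecycle described above (illustrative only).
CACHE_ROOT=${CACHE_ROOT:-/github/home/testmon_cache}

# Step 1: a new branch is seeded from the main branch's cache.
seed_branch() {
  local branch="$1"
  if [ -d "$CACHE_ROOT/main" ]; then
    cp -p -r "$CACHE_ROOT/main" "$CACHE_ROOT/$branch"
  fi
}

# Step 2: a new PR is seeded from its base branch's cache.
seed_pr() {
  local base="$1" pr="$2"
  mkdir -p "$CACHE_ROOT/_pull/$pr"
  cp -p "$CACHE_ROOT/$base"/.testmondata* "$CACHE_ROOT/_pull/$pr/"
}

# Step 4: on close, a merged PR's cache flows back to its base branch;
# an abandoned PR's cache is simply discarded.
close_pr() {
  local base="$1" pr="$2" merged="$3"
  if [ "$merged" = "true" ]; then
    cp -p "$CACHE_ROOT/_pull/$pr"/.testmondata* "$CACHE_ROOT/$base/"
  fi
  rm -rf "$CACHE_ROOT/_pull/$pr"
}

# Step 5: deleting a branch removes its cache directory.
drop_branch() {
  rm -rf "$CACHE_ROOT/${1:?branch name required}"
}
```

Step 3 (restore before the test run, store back after) is the same copy in each direction and is omitted here for brevity.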

### Example Test

| Workflow Name | File name | Description |
@@ -58,23 +66,23 @@ I will provide the details of each workflow below.
#### Example Test on Dispatch

This workflow is triggered by manually dispatching the workflow. It has the following input parameters:
-- `example_directory`: the example directory to test. Multiple directories are supported and must be separated b$$y comma. For example, language/gpt, images/vit. Simply input language or simply gpt does not work.
+- `example_directory`: the example directory to test. Multiple directories are supported and must be separated by comma. For example, language/gpt, images/vit. Simply input language or simply gpt does not work.
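
A comma-separated input like this is typically split inside a workflow step before use. The snippet below is a hypothetical sketch of such parsing (the variable names are illustrative, not taken from the workflow file):

```shell
# Hypothetical sketch: split a comma-separated `example_directory` input
# into individual example directories, as a dispatch step might do.
input="language/gpt,images/vit"   # example value from the dispatch form
IFS=',' read -ra dirs <<< "$input"
for d in "${dirs[@]}"; do
  echo "testing example: $d"
done
```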

### Compatibility Test

| Workflow Name | File name | Description |
| -------------------------------- | ------------------------------------ | -------------------------------------------------------------------------------------------------------------------- |
-| `Compatibility Test on PR` | `compatibility_test_on_pr.yml` | Check Colossal-AI's compatiblity when `version.txt` is changed in a PR. |
-| `Compatibility Test on Schedule` | `compatibility_test_on_schedule.yml` | This workflow will check the compatiblity of Colossal-AI against PyTorch specified in `.compatibility` every Sunday. |
-| `Compatiblity Test on Dispatch` | `compatibility_test_on_dispatch.yml` | Test PyTorch Compatibility manually. |
+| `Compatibility Test on PR` | `compatibility_test_on_pr.yml` | Check Colossal-AI's compatibility when `version.txt` is changed in a PR. |
+| `Compatibility Test on Schedule` | `compatibility_test_on_schedule.yml` | This workflow will check the compatibility of Colossal-AI against PyTorch specified in `.compatibility` every Sunday. |
+| `Compatibility Test on Dispatch` | `compatibility_test_on_dispatch.yml` | Test PyTorch Compatibility manually. |


#### Compatibility Test on Dispatch
This workflow is triggered by manually dispatching the workflow. It has the following input parameters:
- `torch version`:torch version to test against, multiple versions are supported but must be separated by comma. The default is value is all, which will test all available torch versions listed in this [repository](https://github.com/hpcaitech/public_assets/tree/main/colossalai/torch_build/torch_wheels).
- `cuda version`: cuda versions to test against, multiple versions are supported but must be separated by comma. The CUDA versions must be present in our [DockerHub repository](https://hub.docker.com/r/hpcaitech/cuda-conda).

-> It only test the compatiblity of the main branch
+> It only test the compatibility of the main branch


### Release
@@ -97,7 +105,7 @@ This workflow is triggered by manually dispatching the workflow. It has the foll
| `Synchronize submodule` | `submodule.yml` | This workflow will check if any git submodule is updated. If so, it will create a PR to update the submodule pointers. |
| `Close inactive issues` | `close_inactive.yml` | This workflow will close issues which are stale for 14 days. |

-### Commmunity
+### Community

| Workflow Name | File name | Description |
| -------------------------------------------- | -------------------------------- | -------------------------------------------------------------------------------- |
Expand All @@ -113,7 +121,7 @@ This `.compatibility` file is to tell GitHub Actions which PyTorch and CUDA vers

2. `.cuda_ext.json`

-This file controls which CUDA versions will be checked against CUDA extenson built. You can add a new entry according to the json schema below to check the AOT build of PyTorch extensions before release.
+This file controls which CUDA versions will be checked against CUDA extension built. You can add a new entry according to the json schema below to check the AOT build of PyTorch extensions before release.

```json
{
@@ -144,7 +152,7 @@ This file controls which CUDA versions will be checked against CUDA extenson bui
- [x] check on PR
- [x] regular check
- [x] manual dispatch
-- [x] compatiblity check
+- [x] compatibility check
- [x] check on PR
- [x] manual dispatch
- [x] auto test when release
178 changes: 160 additions & 18 deletions .github/workflows/build_on_pr.yml
@@ -2,22 +2,93 @@ name: Build on PR

on:
pull_request:
-types: [synchronize, labeled]
+types: [synchronize, opened, reopened, ready_for_review, closed, edited]
branches:
- "main"
- "develop"
- "feature/**"
paths:
- ".github/workflows/build_on_pr.yml" # run command & env variables change
- "colossalai/**" # source code change
- "!colossalai/**.md" # ignore doc change
- "op_builder/**" # cuda extension change
- "!op_builder/**.md" # ignore doc change
- "requirements/**" # requirements change
- "tests/**" # test change
- "!tests/**.md" # ignore doc change
- "pytest.ini" # test config change
- "setup.py" # install command change
create:
delete:

jobs:
prepare_cache:
name: Prepare testmon cache
if: |
github.event_name == 'create' &&
github.event.ref_type == 'branch' &&
github.event.repository.full_name == 'hpcaitech/ColossalAI'
runs-on: [self-hosted, gpu]
container:
image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
options: --rm
timeout-minutes: 5
defaults:
run:
shell: bash
steps:
- name: Copy testmon cache
run: | # branch name may contain slash, we need to replace it with space
export REF_BRANCH=$(echo ${{ github.event.ref }} | sed "s/\// /")
if [ -d /github/home/testmon_cache/${MAIN_BRANCH} ]; then
cp -p -r /github/home/testmon_cache/${MAIN_BRANCH} "/github/home/testmon_cache/${REF_BRANCH}"
fi
env:
MAIN_BRANCH: ${{ github.event.master_branch }}
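
The `sed "s/\// /"` call in the step above is worth a note: a ref such as `feature/shardformer` contains a slash, which would otherwise create a nested cache directory, so the workflow flattens it into a space. A minimal sketch (the helper name is illustrative; note that without the `/g` flag only the first slash is replaced, exactly as the workflow writes it):

```shell
# Illustrative helper mirroring the workflow's sed call: map a git ref
# that may contain a slash to a flat directory name.
sanitize_ref() {
  echo "$1" | sed "s/\// /"
}
```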

prepare_cache_for_pr:
name: Prepare testmon cache for PR
if: |
github.event_name == 'pull_request' &&
(github.event.action == 'opened' || github.event.action == 'reopened' || (github.event.action == 'edited' && github.event.changes.base != null)) &&
github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI'
runs-on: [self-hosted, gpu]
container:
image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
options: --rm
timeout-minutes: 5
defaults:
run:
shell: bash
concurrency:
group: ${{ github.head_ref }}
cancel-in-progress: false
steps:
- name: Copy testmon cache
run: | # branch name may contain slash, we need to replace it with space
export BASE=$(echo ${{ github.event.pull_request.base.ref }} | sed "s/\// /")
if [ -d "/github/home/testmon_cache/${BASE}" ] && [ ! -z "$(ls -A "/github/home/testmon_cache/${BASE}")" ]; then
mkdir -p /github/home/testmon_cache/_pull/${PR_NUMBER} && cp -p -r "/github/home/testmon_cache/${BASE}"/.testmondata* /github/home/testmon_cache/_pull/${PR_NUMBER}
fi
env:
PR_NUMBER: ${{ github.event.number }}

detect:
name: Detect file change
if: |
-github.event.pull_request.draft == false &&
-github.base_ref == 'main' &&
-github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI' &&
-contains( github.event.pull_request.labels.*.name, 'Run Build and Test')
+github.event_name == 'pull_request' &&
+(github.event.action == 'synchronize' || github.event.action == 'opened' || github.event.action == 'reopened' || github.event.action == 'ready_for_review') &&
+github.event.pull_request.draft == false &&
+github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI'
outputs:
changedExtenisonFiles: ${{ steps.find-extension-change.outputs.all_changed_files }}
anyExtensionFileChanged: ${{ steps.find-extension-change.outputs.any_changed }}
changedLibraryFiles: ${{ steps.find-lib-change.outputs.all_changed_files }}
anyLibraryFileChanged: ${{ steps.find-lib-change.outputs.any_changed }}
runs-on: ubuntu-latest
concurrency:
group: ${{ github.head_ref }}
cancel-in-progress: false
steps:
- uses: actions/checkout@v2
with:
- name: Locate base commit
id: locate-base-sha
run: |
curBranch=$(git rev-parse --abbrev-ref HEAD)
commonCommit=$(git merge-base origin/main $curBranch)
echo $commonCommit
echo "baseSHA=$commonCommit" >> $GITHUB_OUTPUT

- name: Find the changed extension-related files
id: find-extension-change
echo "$file was changed"
done


build:
name: Build and Test Colossal-AI
needs: detect
if: needs.detect.outputs.anyLibraryFileChanged == 'true'
runs-on: [self-hosted, gpu]
container:
image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
options: --gpus all --rm -v /data/scratch/cifar-10:/data/scratch/cifar-10
timeout-minutes: 60
defaults:
run:
shell: bash
concurrency:
group: ${{ github.head_ref }}
cancel-in-progress: false
steps:
- name: Checkout TensorNVMe
uses: actions/checkout@v2

- name: Restore TensorNVMe Cache
run: |
[ ! -z "$(ls -A /github/home/tensornvme_cache/)" ] && cp -p -r /github/home/tensornvme_cache/* /__w/ColossalAI/ColossalAI/TensorNVMe
if [ -d /github/home/tensornvme_cache ] && [ ! -z "$(ls -A /github/home/tensornvme_cache/)" ]; then
cp -p -r /github/home/tensornvme_cache/* /__w/ColossalAI/ColossalAI/TensorNVMe
fi

- name: Install TensorNVMe
run: |
if: needs.detect.outputs.anyExtensionFileChanged != 'true'
run: |
# -p flag is required to preserve the file timestamp to avoid ninja rebuild
[ ! -z "$(ls -A /github/home/cuda_ext_cache/)" ] && cp -p -r /github/home/cuda_ext_cache/* /__w/ColossalAI/ColossalAI/
if [ -d /github/home/cuda_ext_cache ] && [ ! -z "$(ls -A /github/home/cuda_ext_cache/)" ]; then
cp -p -r /github/home/cuda_ext_cache/* /__w/ColossalAI/ColossalAI/
fi

- name: Install Colossal-AI
if: needs.detect.outputs.anyLibraryFileChanged == 'true'
run: |
CUDA_EXT=1 pip install -v -e .
pip install -r requirements/requirements-test.txt
# -p flag is required to preserve the file timestamp to avoid ninja rebuild
cp -p -r /__w/ColossalAI/ColossalAI/build /github/home/cuda_ext_cache/

- name: Restore Testmon Cache
run: |
if [ -d /github/home/testmon_cache/_pull/${PR_NUMBER} ] && [ ! -z "$(ls -A /github/home/testmon_cache/_pull/${PR_NUMBER})" ]; then
cp -p -r /github/home/testmon_cache/_pull/${PR_NUMBER}/.testmondata* /__w/ColossalAI/ColossalAI/
fi
env:
PR_NUMBER: ${{ github.event.number }}

- name: Execute Unit Testing
if: needs.detect.outputs.anyLibraryFileChanged == 'true'
run: |
CURL_CA_BUNDLE="" PYTHONPATH=$PWD pytest --testmon --testmon-cov=. --durations=10 tests/
env:
DATA: /data/scratch/cifar-10
NCCL_SHM_DISABLE: 1
LD_LIBRARY_PATH: /github/home/.tensornvme/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64

- name: Store Testmon Cache
run: |
mkdir -p /github/home/testmon_cache/_pull/${PR_NUMBER}
cp -p -r /__w/ColossalAI/ColossalAI/.testmondata* /github/home/testmon_cache/_pull/${PR_NUMBER}/
env:
PR_NUMBER: ${{ github.event.number }}

- name: Collate artifact
env:
PR_NUMBER: ${{ github.event.number }}
echo $PR_NUMBER > ./report/pr_number

# generate coverage.xml if any
if [ "$anyLibraryFileChanged" == "true" ] && [ -e .coverage ]; then
allFiles=""
for file in $changedLibraryFiles; do
if [ "$allFiles" == "" ]; then
with:
name: report
path: report/

store_cache:
name: Store testmon cache for PR
if: |
github.event_name == 'pull_request' &&
github.event.action == 'closed' &&
github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI'
runs-on: [self-hosted, gpu]
container:
image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
options: --rm
timeout-minutes: 5
defaults:
run:
shell: bash
steps:
- name: Store testmon cache if possible
if: github.event.pull_request.merged == true
run: | # branch name may contain slash, we need to replace it with space
export BASE=$(echo ${{ github.event.pull_request.base.ref }} | sed "s/\// /g")
if [ -d /github/home/testmon_cache/_pull/${PR_NUMBER} ] && [ ! -z "$(ls -A /github/home/testmon_cache/_pull/${PR_NUMBER})" ]; then
mkdir -p "/github/home/testmon_cache/${BASE}" && cp -p -r /github/home/testmon_cache/_pull/${PR_NUMBER}/.testmondata* "/github/home/testmon_cache/${BASE}/"
fi
env:
PR_NUMBER: ${{ github.event.pull_request.number }}

- name: Remove testmon cache
run: |
rm -rf /github/home/testmon_cache/_pull/${PR_NUMBER}
env:
PR_NUMBER: ${{ github.event.pull_request.number }}

remove_cache:
name: Remove testmon cache
if: |
github.event_name == 'delete' &&
github.event.ref_type == 'branch' &&
github.event.repository.full_name == 'hpcaitech/ColossalAI'
runs-on: [self-hosted, gpu]
container:
image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
options: --rm
timeout-minutes: 5
defaults:
run:
shell: bash
steps:
- name: Remove testmon cache
run: | # branch name may contain slash, we need to replace it with space
export BASE=$(echo ${{ github.event.ref }} | sed "s/\// /g")
rm -rf "/github/home/testmon_cache/${BASE}"