[sync] sync feature/dtensor with develop by FrankLeeeee · Pull Request #3916 · hpcaitech/ColossalAI

FrankLeeeee · 2023-06-07T03:49:23Z

📌 Checklist before creating the PR

I have created an issue for this PR for traceability
The title follows the standard format: [doc/gemini/tensor/...]: A concise description
I have added relevant tags if possible for us to better distinguish different PRs

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge

e.g. fixed #1234, closed #1234, resolved #1234

N/A

📝 What does this PR do?

Summarize your work here.
if you have any plots/diagrams/screenshots/tables, please attach them here.

This PR syncs the feature/dtensor with the develop branch.

💥 Checklist before requesting a review

I have linked my PR to an issue (instruction)
My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
I have performed a self-review of my code
I have added thorough tests.
I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

🌝 Yes, I do.
🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

* fix typo colossalai/autochunk auto_parallel amp * fix typo colossalai/auto_parallel nn utils etc. * fix typo colossalai/auto_parallel autochunk fx/passes etc. * fix typo docs/ * change placememt_policy to placement_policy in docs/ and examples/

* [workflow] fixed testmon cache in build CI * polish code

* [doc] update meet_gemini.md * [doc] update meet_gemini.md * [doc] fix parentheses * [doc] fix parentheses * [doc] fix doc test * [doc] fix doc test * [doc] fix doc

* fix typo colossalai/autochunk auto_parallel amp * fix typo colossalai/auto_parallel nn utils etc. * fix typo colossalai/auto_parallel autochunk fx/passes etc. * fix typo docs/ * change placememt_policy to placement_policy in docs/ and examples/ * fix typo colossalai/ applications/

* fix a bug when the config file contains one category but the answer file doesn't contains that category * fix Chinese prompt file * support gpt-3.5-turbo and gpt-4 evaluation * polish and update README * resolve pr comments --------- Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>

* [example]update gemini examples * [example]update gemini examples

* [doc] fix title of mixed precision * [doc]update document of zero with chunk * [doc] update document of zero with chunk, fix * [doc] update document of zero with chunk, fix * [doc] update document of zero with chunk, fix * [doc] update document of zero with chunk, add doc test * [doc] update document of zero with chunk, add doc test * [doc] update document of zero with chunk, fix installation * [doc] update document of zero with chunk, fix zero with chunk doc * [doc] update document of zero with chunk, fix zero with chunk doc

* fix typo colossalai/autochunk auto_parallel amp * fix typo colossalai/auto_parallel nn utils etc. * fix typo colossalai/auto_parallel autochunk fx/passes etc. * fix typo docs/ * change placememt_policy to placement_policy in docs/ and examples/ * fix typo colossalai/ applications/ * fix typo colossalai/cli fx kernel

* [lazy] remove old lazy init * [lazy] refactor lazy init folder structure * [lazy] fix lazy tensor deepcopy * [test] update lazy init test

* [doc]update-moe * [doc]update-moe * [doc]update-moe * [doc]update-moe * [doc]update-moe

* [bf16] add bf16 support for fused adam (hpcaitech#3844) * [bf16] fused adam kernel support bf16 * [test] update fused adam kernel test * [test] update fused adam test * [bf16] cpu adam and hybrid adam optimizers support bf16 (hpcaitech#3860) * [bf16] implement mixed precision mixin and add bf16 support for low level zero (hpcaitech#3869) * [bf16] add mixed precision mixin * [bf16] low level zero optim support bf16 * [text] update low level zero test * [text] fix low level zero grad acc test * [bf16] add bf16 support for gemini (hpcaitech#3872) * [bf16] gemini support bf16 * [test] update gemini bf16 test * [doc] update gemini docstring * [bf16] add bf16 support for plugins (hpcaitech#3877) * [bf16] add bf16 support for legacy zero (hpcaitech#3879) * [zero] init context support bf16 * [zero] legacy zero support bf16 * [test] add zero bf16 test * [doc] add bf16 related docstring for legacy zero

* fix typo colossalai/autochunk auto_parallel amp * fix typo colossalai/auto_parallel nn utils etc. * fix typo colossalai/auto_parallel autochunk fx/passes etc. * fix typo docs/ * change placememt_policy to placement_policy in docs/ and examples/ * fix typo colossalai/ applications/ * fix typo colossalai/cli fx kernel * fix typo colossalai/nn * revert change warmuped

Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>

* [devops] improving testmon cache * [devops] fix branch name with slash * [devops] fix branch name with slash * [devops] fix edit action * [devops] fix edit action * [devops] fix edit action * [devops] fix edit action * [devops] fix edit action * [devops] fix edit action * [devops] update readme

* fix typo colossalai/autochunk auto_parallel amp * fix typo colossalai/auto_parallel nn utils etc. * fix typo colossalai/auto_parallel autochunk fx/passes etc. * fix typo docs/ * change placememt_policy to placement_policy in docs/ and examples/ * fix typo colossalai/ applications/ * fix typo colossalai/cli fx kernel * fix typo colossalai/nn * revert change warmuped * fix typo colossalai/pipeline tensor nn

* [devops] hotfix CI about testmon cache * [devops] fix testmon cahe on pr

* Detached ppo (#9) * run the base * working on dist ppo * sync * detached trainer * update detached trainer. no maker update function * facing init problem * 1 maker 1 trainer detached run. but no model update * facing cuda problem * fix save functions * verified maker update * nothing * add ignore * analyize loss issue * remove some debug codes * facing 2m1t stuck issue * 2m1t verified * do not use torchrun * working on 2m2t * working on 2m2t * initialize strategy in ray actor env * facing actor's init order issue * facing ddp model update issue (need unwarp ddp) * unwrap ddp actor * checking 1m2t stuck problem * nothing * set timeout for trainer choosing. It solves the stuck problem! * delete some debug output * rename to sync with upstream * rename to sync with upstream * coati rename * nothing * I am going to detach the replaybuffer from trainer and make it a Ray Actor. Two benefits: 1. support TP trainer. 2. asynchronized buffer operations * experience_maker_holder performs target-revolving _send_experience() instead of length comparison. * move code to ray subfolder * working on pipeline inference * apply comments * working on pipeline strategy. in progress. * remove pipeline code. clean this branch * update remote parameters by state_dict. no test * nothing * state_dict sharding transfer * merge debug branch * gemini _unwrap_model fix * simplify code * simplify code & fix LoRALinear AttributeError * critic unwrapped state_dict --------- Co-authored-by: csric <richcsr256@gmail.com> * [chat] add perfomance evaluator and fix bugs (#10) * [chat] add performance evaluator for ray * [chat] refactor debug arg * [chat] support hf config * [chat] fix generation * [chat] add 1mmt dummy example * [chat] fix gemini ckpt * split experience to send (hpcaitech#11) Co-authored-by: csric <richcsr256@gmail.com> * [chat] refactor trainer and maker (hpcaitech#12) * [chat] refactor experience maker holder * [chat] refactor model init * [chat] refactor trainer args * [chat] refactor model init * [chat] refactor trainer * [chat] refactor experience sending logic and training loop args (hpcaitech#13) * [chat] refactor experience send logic * [chat] refactor trainer * [chat] refactor trainer * [chat] refactor experience maker * [chat] refactor pbar * [chat] refactor example folder (hpcaitech#14) * [chat] support quant (hpcaitech#15) * [chat] add quant * [chat] add quant example * prompt example (hpcaitech#16) * prompt example * prompt load csv data * remove legacy try --------- Co-authored-by: csric <richcsr256@gmail.com> * [chat] add mmmt dummy example and refactor experience sending (hpcaitech#17) * [chat] add mmmt dummy example * [chat] refactor naive strategy * [chat] fix struck problem * [chat] fix naive strategy * [chat] optimize experience maker sending logic * [chat] refactor sending assignment * [chat] refactor performance evaluator (hpcaitech#18) * Prompt Example & requires_grad state_dict & sharding state_dict (hpcaitech#19) * prompt example * prompt load csv data * remove legacy try * maker models require_grad set to False * working on zero redundancy update * mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad. * remove legacy examples * remove legacy examples * remove replay buffer tp state. bad design --------- Co-authored-by: csric <richcsr256@gmail.com> * state_dict sending adapts to new unwrap function (hpcaitech#20) * prompt example * prompt load csv data * remove legacy try * maker models require_grad set to False * working on zero redundancy update * mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad. * remove legacy examples * remove legacy examples * remove replay buffer tp state. bad design * opt benchmark * better script * nothing * [chat] strategy refactor unwrap model * [chat] strategy refactor save model * [chat] add docstr * [chat] refactor trainer save model * [chat] fix strategy typing * [chat] refactor trainer save model * [chat] update readme * [chat] fix unit test * working on lora reconstruction * state_dict sending adapts to new unwrap function * remove comments --------- Co-authored-by: csric <richcsr256@gmail.com> Co-authored-by: ver217 <lhx0217@gmail.com> * [chat-ray] add readme (hpcaitech#21) * add readme * transparent graph * add note background --------- Co-authored-by: csric <richcsr256@gmail.com> * [chat] get images from url (hpcaitech#22) * Refactor/chat ray (hpcaitech#23) * [chat] lora add todo * [chat] remove unused pipeline strategy * [chat] refactor example structure * [chat] setup ci for ray * [chat-ray] Support LoRA trainer. LoRA weights reconstruction. (hpcaitech#24) * lora support prototype * lora support * 1mmt lora & remove useless code --------- Co-authored-by: csric <richcsr256@gmail.com> * [chat] fix test ci for ray * [chat] fix test ci requirements for ray * [chat] fix ray runtime env * [chat] fix ray runtime env * [chat] fix example ci docker args * [chat] add debug info in trainer * [chat] add nccl debug info * [chat] skip ray test * [doc] fix typo --------- Co-authored-by: csric <59389055+CsRic@users.noreply.github.com> Co-authored-by: csric <richcsr256@gmail.com>

[sync] update develop with main

digger-yu and others added 27 commits May 24, 2023 13:57

fix typo docs/

e90fdb1

[workflow] fixed testmon cache in build CI (hpcaitech#3806)

84500b7

* [workflow] fixed testmon cache in build CI * polish code

[booster] add warning for torch fsdp plugin doc (hpcaitech#3833)

3229f93

[workflow] supported test on CUDA 10.2 (hpcaitech#3841)

54e97ed

[doc] update document of gemini instruction. (hpcaitech#3842)

a64df3f

* [doc] update meet_gemini.md * [doc] update meet_gemini.md * [doc] fix parentheses * [doc] fix parentheses * [doc] fix doc test * [doc] fix doc test * [doc] fix doc

[release] bump to v0.3.0 (hpcaitech#3830)

d42b1be

[workflow] fixed workflow check for docker build (hpcaitech#3849)

ae959a7

[doc] update nvme offload documents. (hpcaitech#3850)

b047487

[example] update gemini examples (hpcaitech#3868)

5f79008

* [example]update gemini examples * [example]update gemini examples

[lazy] refactor lazy init (hpcaitech#3891)

dbb3269

* [lazy] remove old lazy init * [lazy] refactor lazy init folder structure * [lazy] fix lazy tensor deepcopy * [test] update lazy init test

Modify torch version requirement to adapt torch 2.0 (hpcaitech#3896)

8065cc5

[doc]update moe chinese document. (hpcaitech#3890)

07cb211

* [doc]update-moe * [doc]update-moe * [doc]update-moe * [doc]update-moe * [doc]update-moe

support evaluation for english (hpcaitech#3880)

57a6d76

Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>

[doc] fix docs about booster api usage (hpcaitech#3898)

c1535cc

[devops] hotfix CI about testmon cache (hpcaitech#3910)

41fb723

* [devops] hotfix CI about testmon cache * [devops] fix testmon cahe on pr

[lazy] fix compatibility problem on torch 1.13 (hpcaitech#3911)

9c88b6c

Merge pull request hpcaitech#3915 from FrankLeeeee/update/develop

c622bb3

[sync] update develop with main

FoolPlayer approved these changes Jun 7, 2023

View reviewed changes

FrankLeeeee merged commit d51e83d into hpcaitech:feature/dtensor Jun 7, 2023

FrankLeeeee deleted the sync/dtensor-with-develop branch June 7, 2023 03:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[sync] sync feature/dtensor with develop#3916

[sync] sync feature/dtensor with develop#3916
FrankLeeeee merged 27 commits intohpcaitech:feature/dtensorfrom
FrankLeeeee:sync/dtensor-with-develop

FrankLeeeee commented Jun 7, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants

Conversation

FrankLeeeee commented Jun 7, 2023

📌 Checklist before creating the PR

🚨 Issue number

📝 What does this PR do?

💥 Checklist before requesting a review

⭐️ Do you enjoy contributing to Colossal-AI?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants