L by jamesthesnake · Pull Request #65 · jamesthesnake/ColossalAI

jamesthesnake · 2023-06-14T07:22:46Z

📌 Checklist before creating the PR

I have created an issue for this PR for traceability
The title follows the standard format: [doc/gemini/tensor/...]: A concise description
I have added relevant tags if possible for us to better distinguish different PRs

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge

e.g. fixed #1234, closed #1234, resolved #1234

📝 What does this PR do?

Summarize your work here.
If you have any plots/diagrams/screenshots/tables, please attach them here.

💥 Checklist before requesting a review

I have linked my PR to an issue (instruction)
My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
I have performed a self-review of my code
I have added thorough tests.
I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

🌝 Yes, I do.
🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

* [devops] improving testmon cache
* [devops] fix branch name with slash
* [devops] fix branch name with slash
* [devops] fix edit action
* [devops] fix edit action
* [devops] fix edit action
* [devops] fix edit action
* [devops] fix edit action
* [devops] fix edit action
* [devops] update readme

* fix typo colossalai/autochunk auto_parallel amp
* fix typo colossalai/auto_parallel nn utils etc.
* fix typo colossalai/auto_parallel autochunk fx/passes etc.
* fix typo docs/
* change placememt_policy to placement_policy in docs/ and examples/
* fix typo colossalai/ applications/
* fix typo colossalai/cli fx kernel
* fix typo colossalai/nn
* revert change warmuped
* fix typo colossalai/pipeline tensor nn

* Detached ppo (#9)
* run the base
* working on dist ppo
* sync
* detached trainer
* update detached trainer. no maker update function
* facing init problem
* 1 maker 1 trainer detached run. but no model update
* facing cuda problem
* fix save functions
* verified maker update
* nothing
* add ignore
* analyize loss issue
* remove some debug codes
* facing 2m1t stuck issue
* 2m1t verified
* do not use torchrun
* working on 2m2t
* working on 2m2t
* initialize strategy in ray actor env
* facing actor's init order issue
* facing ddp model update issue (need unwarp ddp)
* unwrap ddp actor
* checking 1m2t stuck problem
* nothing
* set timeout for trainer choosing. It solves the stuck problem!
* delete some debug output
* rename to sync with upstream
* rename to sync with upstream
* coati rename
* nothing
* I am going to detach the replaybuffer from trainer and make it a Ray Actor. Two benefits: 1. support TP trainer. 2. asynchronized buffer operations
* experience_maker_holder performs target-revolving _send_experience() instead of length comparison.
* move code to ray subfolder
* working on pipeline inference
* apply comments
* working on pipeline strategy. in progress.
* remove pipeline code. clean this branch
* update remote parameters by state_dict. no test
* nothing
* state_dict sharding transfer
* merge debug branch
* gemini _unwrap_model fix
* simplify code
* simplify code & fix LoRALinear AttributeError
* critic unwrapped state_dict
---------
Co-authored-by: csric <richcsr256@gmail.com>
* [chat] add perfomance evaluator and fix bugs (#10)
* [chat] add performance evaluator for ray
* [chat] refactor debug arg
* [chat] support hf config
* [chat] fix generation
* [chat] add 1mmt dummy example
* [chat] fix gemini ckpt
* split experience to send (#11)
Co-authored-by: csric <richcsr256@gmail.com>
* [chat] refactor trainer and maker (#12)
* [chat] refactor experience maker holder
* [chat] refactor model init
* [chat] refactor trainer args
* [chat] refactor model init
* [chat] refactor trainer
* [chat] refactor experience sending logic and training loop args (#13)
* [chat] refactor experience send logic
* [chat] refactor trainer
* [chat] refactor trainer
* [chat] refactor experience maker
* [chat] refactor pbar
* [chat] refactor example folder (#14)
* [chat] support quant (#15)
* [chat] add quant
* [chat] add quant example
* prompt example (#16)
* prompt example
* prompt load csv data
* remove legacy try
---------
Co-authored-by: csric <richcsr256@gmail.com>
* [chat] add mmmt dummy example and refactor experience sending (#17)
* [chat] add mmmt dummy example
* [chat] refactor naive strategy
* [chat] fix struck problem
* [chat] fix naive strategy
* [chat] optimize experience maker sending logic
* [chat] refactor sending assignment
* [chat] refactor performance evaluator (#18)
* Prompt Example & requires_grad state_dict & sharding state_dict (#19)
* prompt example
* prompt load csv data
* remove legacy try
* maker models require_grad set to False
* working on zero redundancy update
* mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad.
* remove legacy examples
* remove legacy examples
* remove replay buffer tp state. bad design
---------
Co-authored-by: csric <richcsr256@gmail.com>
* state_dict sending adapts to new unwrap function (#20)
* prompt example
* prompt load csv data
* remove legacy try
* maker models require_grad set to False
* working on zero redundancy update
* mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad.
* remove legacy examples
* remove legacy examples
* remove replay buffer tp state. bad design
* opt benchmark
* better script
* nothing
* [chat] strategy refactor unwrap model
* [chat] strategy refactor save model
* [chat] add docstr
* [chat] refactor trainer save model
* [chat] fix strategy typing
* [chat] refactor trainer save model
* [chat] update readme
* [chat] fix unit test
* working on lora reconstruction
* state_dict sending adapts to new unwrap function
* remove comments
---------
Co-authored-by: csric <richcsr256@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>
* [chat-ray] add readme (#21)
* add readme
* transparent graph
* add note background
---------
Co-authored-by: csric <richcsr256@gmail.com>
* [chat] get images from url (#22)
* Refactor/chat ray (#23)
* [chat] lora add todo
* [chat] remove unused pipeline strategy
* [chat] refactor example structure
* [chat] setup ci for ray
* [chat-ray] Support LoRA trainer. LoRA weights reconstruction. (#24)
* lora support prototype
* lora support
* 1mmt lora & remove useless code
---------
Co-authored-by: csric <richcsr256@gmail.com>
* [chat] fix test ci for ray
* [chat] fix test ci requirements for ray
* [chat] fix ray runtime env
* [chat] fix ray runtime env
* [chat] fix example ci docker args
* [chat] add debug info in trainer
* [chat] add nccl debug info
* [chat] skip ray test
* [doc] fix typo
---------
Co-authored-by: csric <59389055+CsRic@users.noreply.github.com>
Co-authored-by: csric <richcsr256@gmail.com>

* init shardformer code structure
* add implement of sharder (inject and replace)
* add implement of replace layer to colossal layer
* separate different layer policy, add some notion
* implement 1d and 2d slicer, can tell col or row
* fix bug when slicing and inject model
* fix some bug; add inference test example

…caitech#3816)
* init shardformer code structure
* add implement of sharder (inject and replace)
* add implement of replace layer to colossal layer
* separate different layer policy, add some notion
* implement 1d and 2d slicer, can tell col or row
* fix bug when slicing and inject model
* fix some bug; add inference test example
* add share weight and train example
* add train
* add docstring and readme
* add docstring for other files
* pre-commit

…caitech#3856)
* add dropout layer, add dropout test
* modify seed manager as context manager
* add a copy of col_nn.layer
* add dist_crossentropy loss; separate module test
* polish the code
* fix dist crossentropy loss

* refactor: separate log_probs fn from Actor forward fn
* refactor: separate generate fn from Actor class
* feat: update unwrap_model and get_base_model
* unwrap_model returns model not wrapped by Strategy
* get_base_model returns HF model for Actor, Critic and RewardModel
* feat: simplify Strategy.prepare
* style: remove get_base_model method of Actor
* perf: tokenize text in batches
* refactor: move calc_action_log_probs to utils of model
* test: update test with new forward fn
* style: rename forward fn args
* fix: do not unwrap model in save_model fn of naive strategy
* test: add gemini test for train_prompts
* fix: fix _set_default_generate_kwargs

jamesthesnake and others added 30 commits May 8, 2023 12:57

Merge pull request #38 from jamesthesnake/ra

b68f7f9

Ra

Merge pull request #41 from hpcaitech/main

20873a5

as

[dtensor] polish sharding spec docstring (hpcaitech#3838)

7c9f2ed

* [dtensor] polish sharding spec docstring
* [dtensor] polish sharding spec example docstring

Modify torch version requirement to adapt torch 2.0

46503c3

Merge pull request #50 from hpcaitech/main

fb06bd0

raa

Add a new example of Dreambooth training using the booster API

60ec33b

roll back

42e3232

Merge pull request #55 from jamesthesnake/ra

5fc120c

Ra

modify path

25447d4

Merge pull request #56 from hpcaitech/main

3898942

f

Merge pull request #58 from jamesthesnake/ra

be6afda

Ra

[doc] fix docs about booster api usage (hpcaitech#3898)

c1535cc

update performance evaluation

176010f

update shell file

b56c7f4

fixing insecure hash function

1c1f71c

change directory

b29e1f0

fixed model saving bugs

d3379f0

fixed port

79c9f77

fixed port

b4437e8

[devops] hotfix CI about testmon cache (hpcaitech#3910)

41fb723

* [devops] hotfix CI about testmon cache
* [devops] fix testmon cahe on pr

modify file path

4fc8bc6

[lazy] fix compatibility problem on torch 1.13 (hpcaitech#3911)

9c88b6c

Merge pull request hpcaitech#3915 from FrankLeeeee/update/develop

c622bb3

[sync] update develop with main

Merge pull request hpcaitech#3916 from FrankLeeeee/sync/dtensor-with-…

d51e83d

…develop [sync] sync feature/dtensor with develop

[devops] hotfix testmon cache clean logic (hpcaitech#3917)

c25d421

[workflow] added docker latest tag for release (hpcaitech#3920)

5e2132d

[booster] update bert example, using booster api (hpcaitech#3885)

a55fb00

MaruyamaAya and others added 29 commits June 8, 2023 11:15

modify shell for check

cf4792c

[example] update opt example using booster api (hpcaitech#3918)

e417dd0

modify shell for check

039854b

modify shell for check

49567d5

modify shell for check

730a092

fix typo examples/community/roberta (hpcaitech#3925)

407aa48

Merge pull request hpcaitech#3926 from hpcaitech/feature/dtensor

a98e16e

[feature] updated device mesh and dtensor

modify shell for check

9b5e7ce

[shardformer] updated readme (hpcaitech#3827)

bc19024

[shardformer] refactored the user api (hpcaitech#3828)

537a52b

* [shardformer] refactored the user api
* polish code

[shardformer] update readme with modules implement doc (hpcaitech#3834)

997544c

* update readme with modules content
* remove img

update README (hpcaitech#3909)

6370a93

[shardformer] add gpt2 policy and modify shard and slicer to support (h…

ef15377

…pcaitech#3883)
* add gpt2 policy and modify shard and slicer to support
* remove unused code
* polish code

fix typo examples and docs (hpcaitech#3932)

33eef71

support UniEval and add CHRF metric (hpcaitech#3924)

21c4c0b

Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>

Merge pull request hpcaitech#3905 from MaruyamaAya/dreambooth

e277534

[example] Adding an example of training dreambooth with the new booster API

Merge pull request hpcaitech#3931 from FrankLeeeee/sync/develop-to-sh…

24651fd

…ardformer [sync] sync feature/shardformer with develop

Revert "[sync] sync feature/shardformer with develop"

ddcf58c

Merge pull request hpcaitech#3942 from hpcaitech/revert-3931-sync/dev…

bd2c7c3

…elop-to-shardformer Revert "[sync] sync feature/shardformer with develop"

fix typo tests/ (hpcaitech#3936)

e61ffc7

fix typo .github/workflows/scripts/ (hpcaitech#3946)

1aadeed

[example] update ViT example using booster api (hpcaitech#3940)

b3ab7fb

Merge pull request #62 from hpcaitech/main

eabae7a

fz

[evaluate] support gpt evaluation with reference (hpcaitech#3972)

2925f47

Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>

Merge pull request #64 from hpcaitech/main

49246fb

f

jamesthesnake merged commit 52918cc into ra Jun 14, 2023

jamesthesnake merged 69 commits into ra from l

11 participants
