L by jamesthesnake · Pull Request #24 · jamesthesnake/ColossalAI

jamesthesnake · 2023-04-20T22:29:15Z

📌 Checklist before creating the PR

I have created an issue for this PR for traceability
The title follows the standard format: [doc/gemini/tensor/...]: A concise description
I have added relevant tags if possible for us to better distinguish different PRs

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge

e.g. fixed #1234, closed #1234, resolved #1234

📝 What does this PR do?

Summarize your work here.
If you have any plots/diagrams/screenshots/tables, please attach them here.

💥 Checklist before requesting a review

I have linked my PR to an issue (instruction)
My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
I have performed a self-review of my code
I have added thorough tests.
I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

🌝 Yes, I do.
🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

…ming format of hf checkpoint files (hpcaitech#3479) * [checkpoint] support huggingface style sharded checkpoint, to be compatible with hf file naming format * [checkpoint] support huggingface style sharded checkpoint, to be compatible with hf file naming format * [checkpoint] Shard saved checkpoint add 'variant' field to customize filename * [checkpoint] Shard saved checkpoint add 'variant' field to customize filename * [checkpoint] Shard saved checkpoint add 'variant' field to customize filename * [checkpoint] Shard saved checkpoint add 'variant' field to customize filename --------- Co-authored-by: luchen <luchen@luchendeMacBook-Pro.local> Co-authored-by: luchen <luchen@luchendeMBP.lan>
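The file naming described in the commit message above can be illustrated with a minimal sketch. It assumes the Hugging Face convention of `<stem>-<index>-of-<total>.<ext>` shard names, with an optional variant inserted before the extension. The helper names `add_variant` and `shard_filenames` are hypothetical and are not taken from the Colossal-AI code.

```python
from typing import List, Optional


def add_variant(weights_name: str, variant: Optional[str] = None) -> str:
    """Insert `variant` before the file extension (hypothetical helper)."""
    if variant is None:
        return weights_name
    stem, ext = weights_name.rsplit(".", 1)
    return f"{stem}.{variant}.{ext}"


def shard_filenames(weights_name: str, num_shards: int,
                    variant: Optional[str] = None) -> List[str]:
    """Build one file name per shard, numbered from 1 (hypothetical helper)."""
    base = add_variant(weights_name, variant)
    stem, ext = base.rsplit(".", 1)
    return [f"{stem}-{i:05d}-of-{num_shards:05d}.{ext}"
            for i in range(1, num_shards + 1)]


if __name__ == "__main__":
    # Assumed example: a checkpoint split into two shards, saved with variant "fp16".
    for name in shard_filenames("pytorch_model.bin", 2, variant="fp16"):
        print(name)
```

Under these assumptions, a two-shard checkpoint saved with the variant `fp16` would be named `pytorch_model.fp16-00001-of-00002.bin` and `pytorch_model.fp16-00002-of-00002.bin`.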

* [gemini] fix nvme optimizer init * [gemini] gemini supports lazy init * [gemini] add init example * [gemini] add fool model * [zero] update gemini ddp * [zero] update init example * add chunk method * add chunk method * [lazyinit] fix lazy tensor tolist * [gemini] fix buffer materialization * [misc] remove useless file * [booster] update gemini plugin * [test] update gemini plugin test * [test] fix gemini plugin test * [gemini] fix import * [gemini] fix import * [lazyinit] use new metatensor * [lazyinit] use new metatensor * [lazyinit] fix __set__ method

* [feat][chatgpt]train prompts on ray example * [fix]simplify code * [fix]remove depreciated parameter * [fix]add dependencies * [fix]method calling * [fix]experience maker * [fix]missing loss function * [fix]init optimizer * [feat]add usage comment * [fix]rename files * [fix]add readme * [fix]file path * [fix]move directory --------- Co-authored-by: jiangwen <zxl265370@antgroup.com>

Display format optimization, fix bug #3562. Specific changes: 1. Translate "This is called a column-parallel fashion" into Chinese. 2. Use the ```math code block syntax to display a math expression as a block; the formula content is not modified. Please check that the math formulas are displayed correctly. If OK, I will change the format of the formulas in the English version in parallel.

* run the base * working on dist ppo * sync * detached trainer * update detached trainer. no maker update function * facing init problem * 1 maker 1 trainer detached run. but no model update * facing cuda problem * fix save functions * verified maker update * nothing * add ignore * analyize loss issue * remove some debug codes * facing 2m1t stuck issue * 2m1t verified * do not use torchrun * working on 2m2t * working on 2m2t * initialize strategy in ray actor env * facing actor's init order issue * facing ddp model update issue (need unwarp ddp) * unwrap ddp actor * checking 1m2t stuck problem * nothing * set timeout for trainer choosing. It solves the stuck problem! * delete some debug output * rename to sync with upstream * rename to sync with upstream * coati rename * nothing * I am going to detach the replaybuffer from trainer and make it a Ray Actor. Two benefits: 1. support TP trainer. 2. asynchronized buffer operations * experience_maker_holder performs target-revolving _send_experience() instead of length comparison. * move code to ray subfolder * working on pipeline inference * apply comments --------- Co-authored-by: csric <richcsr256@gmail.com>

update Update test_ci.sh Update test_ci.sh Update test_ci.sh Update test_ci.sh Update test_ci.sh Update test_ci.sh Update run_chatgpt_examples.yml Update run_chatgpt_examples.yml Update run_chatgpt_examples.yml Update run_chatgpt_examples.yml Update run_chatgpt_examples.yml Update run_chatgpt_examples.yml Update test_ci.sh Update test_ci.sh update Update run_chatgpt_examples.yml Update run_chatgpt_examples.yml update ci Update test_ci.sh Update run_chatgpt_examples.yml Update run_chatgpt_examples.yml Update run_chatgpt_examples.yml Update run_chatgpt_examples.yml Update run_chatgpt_examples.yml Update run_chatgpt_examples.yml Update run_chatgpt_examples.yml Update test_ci.sh Update test_ci.sh Update run_chatgpt_examples.yml Update test_ci.sh Update test_ci.sh Update test_ci.sh update test ci RoBERTa for RLHF Stage 2 & 3 (still in testing) Revert "Add RoBERTa for RLHF Stage 2 & 3 (test)" This reverts commit 06741d8. Add RoBERTa for RLHF stage 2 & 3 1. add roberta folder under model folder 2. add roberta option in train_reward_model.py 3. add some test in testci Update test_ci.sh Revert "Update test_ci.sh" This reverts commit 9c7352b. Add RoBERTa for RLHF Stage 2 & 3 (test) RoBERTa for RLHF Stage 2 & 3 (still in testing) Revert "Add RoBERTa for RLHF Stage 2 & 3 (test)" This reverts commit 06741d8. Add RoBERTa for RLHF stage 2 & 3 1. add roberta folder under model folder 2. add roberta option in train_reward_model.py 3. add some test in testci Update test_ci.sh Revert "Update test_ci.sh" This reverts commit 9c7352b. update roberta with coati chat ci update Revert "chat ci update" This reverts commit 17ae7ae. [test]chat_update_ci Update test_ci.sh Update test_ci.sh test Update gpt_critic.py Update gpt_critic.py Update run_chatgpt_unit_tests.yml update test ci update update update update Update test_ci.sh update Update test_ci.sh Update test_ci.sh Update run_chatgpt_examples.yml Update run_chatgpt_examples.yml

* Detached ppo (#9) * run the base * working on dist ppo * sync * detached trainer * update detached trainer. no maker update function * facing init problem * 1 maker 1 trainer detached run. but no model update * facing cuda problem * fix save functions * verified maker update * nothing * add ignore * analyize loss issue * remove some debug codes * facing 2m1t stuck issue * 2m1t verified * do not use torchrun * working on 2m2t * working on 2m2t * initialize strategy in ray actor env * facing actor's init order issue * facing ddp model update issue (need unwarp ddp) * unwrap ddp actor * checking 1m2t stuck problem * nothing * set timeout for trainer choosing. It solves the stuck problem! * delete some debug output * rename to sync with upstream * rename to sync with upstream * coati rename * nothing * I am going to detach the replaybuffer from trainer and make it a Ray Actor. Two benefits: 1. support TP trainer. 2. asynchronized buffer operations * experience_maker_holder performs target-revolving _send_experience() instead of length comparison. * move code to ray subfolder * working on pipeline inference * apply comments * working on pipeline strategy. in progress. * remove pipeline code. clean this branch * update remote parameters by state_dict. 
no test * nothing * state_dict sharding transfer * merge debug branch * gemini _unwrap_model fix * simplify code * simplify code & fix LoRALinear AttributeError * critic unwrapped state_dict --------- Co-authored-by: csric <richcsr256@gmail.com> * [chat] add perfomance evaluator and fix bugs (#10) * [chat] add performance evaluator for ray * [chat] refactor debug arg * [chat] support hf config * [chat] fix generation * [chat] add 1mmt dummy example * [chat] fix gemini ckpt * split experience to send (#11) Co-authored-by: csric <richcsr256@gmail.com> * [chat] refactor trainer and maker (#12) * [chat] refactor experience maker holder * [chat] refactor model init * [chat] refactor trainer args * [chat] refactor model init * [chat] refactor trainer * [chat] refactor experience sending logic and training loop args (#13) * [chat] refactor experience send logic * [chat] refactor trainer * [chat] refactor trainer * [chat] refactor experience maker * [chat] refactor pbar * [chat] refactor example folder (#14) * [chat] support quant (#15) * [chat] add quant * [chat] add quant example * prompt example (#16) * prompt example * prompt load csv data * remove legacy try --------- Co-authored-by: csric <richcsr256@gmail.com> * [chat] add mmmt dummy example and refactor experience sending (#17) * [chat] add mmmt dummy example * [chat] refactor naive strategy * [chat] fix struck problem * [chat] fix naive strategy * [chat] optimize experience maker sending logic * [chat] refactor sending assignment * [chat] refactor performance evaluator (#18) * Prompt Example & requires_grad state_dict & sharding state_dict (#19) * prompt example * prompt load csv data * remove legacy try * maker models require_grad set to False * working on zero redundancy update * mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad. * remove legacy examples * remove legacy examples * remove replay buffer tp state. 
bad design --------- Co-authored-by: csric <richcsr256@gmail.com> * state_dict sending adapts to new unwrap function (#20) * prompt example * prompt load csv data * remove legacy try * maker models require_grad set to False * working on zero redundancy update * mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad. * remove legacy examples * remove legacy examples * remove replay buffer tp state. bad design * opt benchmark * better script * nothing * [chat] strategy refactor unwrap model * [chat] strategy refactor save model * [chat] add docstr * [chat] refactor trainer save model * [chat] fix strategy typing * [chat] refactor trainer save model * [chat] update readme * [chat] fix unit test * working on lora reconstruction * state_dict sending adapts to new unwrap function * remove comments --------- Co-authored-by: csric <richcsr256@gmail.com> Co-authored-by: ver217 <lhx0217@gmail.com> * [chat-ray] add readme (#21) * add readme * transparent graph * add note background --------- Co-authored-by: csric <richcsr256@gmail.com> * [chat] get images from url (#22) * Refactor/chat ray (#23) * [chat] lora add todo * [chat] remove unused pipeline strategy * [chat] refactor example structure * [chat] setup ci for ray * [chat-ray] Support LoRA trainer. LoRA weights reconstruction. (#24) * lora support prototype * lora support * 1mmt lora & remove useless code --------- Co-authored-by: csric <richcsr256@gmail.com> * [chat] fix test ci for ray * [chat] fix test ci requirements for ray * [chat] fix ray runtime env * [chat] fix ray runtime env * [chat] fix example ci docker args * [chat] add debug info in trainer * [chat] add nccl debug info * [chat] skip ray test * [doc] fix typo --------- Co-authored-by: csric <59389055+CsRic@users.noreply.github.com> Co-authored-by: csric <richcsr256@gmail.com>

chengeharrison and others added 30 commits April 12, 2023 15:47

[chat]add examples of training with limited resources in chat readme (h…

7182ac2

…pcaitech#3536) Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>

Polish Code

de84c03

[doc] Update README-zh-Hans.md (hpcaitech#3541)

a3ac48e

Fixing document link errors using absolute paths

Update README.md (hpcaitech#3548)

3f760da

Delete the extra ")"

[doc] Update README.md (hpcaitech#3549)

77efdfe

Format optimization, add [] outside of DeepSpeed

[chat] polish tutorial doc (hpcaitech#3551)

535b896

* [chat] clean up duplicate tutorial * [chat] clean up duplicate tutorial * [chat] clean up duplicate tutorial * [chat] clean up duplicate tutorial

[example] reorganize for community examples (hpcaitech#3557)

f1b3d60

[lazyinit] fix clone and deepcopy (hpcaitech#3553)

4341f5e

[misc] add verbose arg for zero and op builder (hpcaitech#3552)

173dad0

* [misc] add print verbose * [gemini] add print verbose * [zero] add print verbose for low level * [misc] add print verbose for op builder

[doc] Update 1D_tensor_parallel.md (hpcaitech#3573)

9edeadf

Display format optimization, same as fix #3562. Simultaneous modification of the English version.

Add docstr for zero3 chunk search utils (hpcaitech#3572)

d329c29

[chat] update reward model sh (hpcaitech#3578)

cc1eec2

[coati] add costom model suppor tguide (hpcaitech#3579)

6b1a39b

[doc] Update .github/workflows/README.md (hpcaitech#3577)

6e7e43c

Optimization Code: I think there were two extra $ entered here, which have been deleted.

fix: fix sft (hpcaitech#3568)

7788e0b

[gemini] support save state dict in shards (hpcaitech#3581)

f313bab

* [gemini] support state dict shard * [gemini] add test state dict shard * [gemini] polish docstr * [gemini] fix merge * [gemini] polish code

[example] fix community doc (hpcaitech#3586)

d0fbd4b

Adjusted the style of Community Examples to be consistent with other titles

[fx] fix meta tensor registration (hpcaitech#3589)

dac127d

* [meta] fix torch 1.13.1 * [meta] fix torch 2.0.0 * [meta] fix torch 1.13.0 * [meta] polish code

reconstruct chat trainer and fix training script (hpcaitech#3588)

1ec0d38

Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>

[coati] fix install cmd (hpcaitech#3592)

5a79cff

[misc] op_builder/builder.py (hpcaitech#3593)

d96567b

Optimization Code: the source code has not been modified; only a few spelling errors in the comments have been corrected.

[bot] Automated submodule synchronization (hpcaitech#3596)

d544ed4

Co-authored-by: github-actions <github-actions@github.com>

[gemini] state dict supports fp16 (hpcaitech#3590)

12eff9e

* [gemini] save state dict support fp16 * [gemini] save state dict shard support fp16 * [gemini] fix state dict * [gemini] fix state dict

[doc] fix op_builder/README.md (hpcaitech#3597)

7570d9a

Optimization Code: change "requries" to "requires"

digger-yu and others added 6 commits April 19, 2023 17:28

[doc] fix setup.py typo (hpcaitech#3603)

becd3b0

Optimization Code: change "vairable" to "variable"

[doc] .github/workflows/README.md (hpcaitech#3605)

633bac2

Fixed several spelling errors, e.g. changed "compatiblity" to "compatibility".

Chat evaluate (hpcaitech#3608)

c4709d3

Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>

[chat] polish code note typo (hpcaitech#3612)

d7bf284

Merge pull request #22 from hpcaitech/main

e8cd2e4

,m

Merge pull request #23 from jamesthesnake/ra

f662eac

Ra

jamesthesnake merged commit d4e4396 into co Apr 20, 2023

jamesthesnake merged 36 commits into co from l


13 participants
