Raf by jamesthesnake · Pull Request #119 · jamesthesnake/ColossalAI

jamesthesnake · 2023-08-04T07:58:48Z

📌 Checklist before creating the PR

I have created an issue for this PR for traceability
The title follows the standard format: [doc/gemini/tensor/...]: A concise description
I have added relevant tags if possible for us to better distinguish different PRs

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge

e.g. fixed #1234, closed #1234, resolved #1234

📝 What does this PR do?

Summarize your work here.
If you have any plots, diagrams, screenshots, or tables, please attach them here.

💥 Checklist before requesting a review

I have linked my PR to an issue (instruction)
My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
I have performed a self-review of my code
I have added thorough tests.
I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

🌝 Yes, I do.
🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

Co

Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>

support session-based training (hpcaitech#4313)

* refactor low level zero
* fix zero2 and support cpu offload
* avg gradient and modify unit test
* refactor grad store, support layer drop
* refactor bucket store, support grad accumulation
* fix and update unit test of zero and ddp
* compatible with tp, ga and unit test
* fix memory leak and polish
* add zero layer drop unittest
* polish code
* fix import err in unit test
* support different comm dtype, modify docstring style
* polish code
* test padding and fix
* fix unit test of low level zero
* fix pad recording in bucket store
* support some models
* polish

* support no sync for zero1 plugin
* polish
* polish

* allow passing process group to zero12
* union tp-zero and normal-zero
* polish code

* add state dict for zero
* fix unit test
* polish

* support shard optimizer of zero
* polish code
* support sync grad manually

* optimize the optimizer step time
* fix corner case
* polish
* replace all-reduce with all-gather
* set comm device to cuda

* [release] update version
* [devops] hotfix cuda extension building
* [devops] pytest ignore useless folders

* [test] remove legacy zero test
* [test] remove lazy distribute test
* [test] remove outdated checkpoint io

* style: rename replay buffer
  Experience replay is typically used for off-policy algorithms, so using this name in PPO may be misleading.
* fix: fix wrong zero2 default arg
* test: update experience tests
* style: rename zero_pad fn
* fix: defer init in CycledDataLoader
* test: add benchmark test
* style: rename internal fn of generation
* style: rename internal fn of lora
* fix: remove unused loss fn
* fix: remove unused utils fn
* refactor: remove generate_with_actor fn
* fix: fix type annotation
* test: add models tests
* fix: skip llama due to long execution time
* style: modify dataset
* style: apply formatter
* perf: update reward dataset
* fix: fix wrong IGNORE_INDEX in sft dataset
* fix: remove DataCollatorForSupervisedDataset
* test: add dataset tests
* style: apply formatter
* style: rename test_ci to test_train
* feat: add llama in inference
* test: add inference tests
* test: change test scripts directory
* fix: update ci
* fix: fix typo
* fix: skip llama due to oom
* fix: fix file mod
* style: apply formatter
* refactor: remove duplicated llama_gptq
* style: apply formatter
* to: update rm test
* feat: add tokenizer arg
* feat: add download model script
* test: update train tests
* fix: modify gemini load and save pretrained
* test: update checkpoint io test
* to: modify nproc_per_node
* fix: do not remove existing dir
* fix: modify save path
* test: add random choice
* fix: fix sft path
* fix: enlarge nproc_per_node to avoid oom
* fix: add num_retry
* fix: make lora config of rm and critic consistent
* fix: add warning about lora weights
* fix: skip some gpt2 tests
* fix: remove grad ckpt in rm and critic due to errors
* refactor: directly use Actor in train_sft
* test: add more arguments
* fix: disable grad ckpt when using lora
* fix: fix save_pretrained and related tests
* test: enable zero2 tests
* revert: remove useless fn
* style: polish code
* test: modify test args

ffa

Merge pull request #119 from jamesthesnake/ra

jamesthesnake and others added 16 commits July 26, 2023 22:30

Merge pull request #109 from jamesthesnake/co

44e2351

Co

support session-based training (hpcaitech#4313)

5187c96

Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>

Merge pull request #112 from hpcaitech/main

530ef3a

support session-based training (hpcaitech#4313)

[zero]support no_sync method for zero1 plugin (hpcaitech#4138)

79cf1b5

* support no sync for zero1 plugin
* polish
* polish

[zero] allow passing process group to zero12 (hpcaitech#4153)

c668801

* allow passing process group to zero12
* union tp-zero and normal-zero
* polish code

[zero] add state dict for low level zero (hpcaitech#4179)

dd7cc58

* add state dict for zero
* fix unit test
* polish

[zero] support shard optimizer state dict of zero (hpcaitech#4194)

1a49a5e

* support shard optimizer of zero
* polish code
* support sync grad manually

[zero] optimize the optimizer step time (hpcaitech#4221)

45b08f0

* optimize the optimizer step time
* fix corner case
* polish
* replace all-reduce with all-gather
* set comm device to cuda

fix localhost measurement (hpcaitech#4320)

03654c0

[chat] fix compute_approx_kl (hpcaitech#4338)

75c5389

[release] update version (hpcaitech#4332)

8064771

* [release] update version
* [devops] hotfix cuda extension building
* [devops] pytest ignore useless folders

[hotfix] update gradio 3.11 to 3.34.0 (hpcaitech#4329)

16c0acc

[test] remove useless tests (hpcaitech#4359)

16bf4c0

* [test] remove legacy zero test
* [test] remove lazy distribute test
* [test] remove outdated checkpoint io

Merge pull request #117 from hpcaitech/main

218ae34

ffa

jamesthesnake merged commit 1038c2b into co Aug 4, 2023

jamesthesnake added a commit that referenced this pull request Aug 4, 2023

Merge pull request #120 from jamesthesnake/co

c299445

Merge pull request #119 from jamesthesnake/ra

jamesthesnake merged 16 commits into co from ra
6 participants