[MoE/ZeRO] Moe refactor with zero refactor by Hz188 · Pull Request #5821 · hpcaitech/ColossalAI

Hz188 · 2024-06-14T10:13:47Z

📌 Checklist before creating the PR

I have created an issue for this PR for traceability
The title follows the standard format: [doc/gemini/tensor/...]: A concise description
I have added relevant tags if possible for us to better distinguish different PRs
I have installed pre-commit: pip install pre-commit && pre-commit install

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge

e.g. fixed #1234, closed #1234, resolved #1234

📝 What does this PR do?

Summarize your work here.
if you have any plots/diagrams/screenshots/tables, please attach them here.

💥 Checklist before requesting a review

I have linked my PR to an issue (instruction)
My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
I have performed a self-review of my code
I have added thorough tests.
I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

🌝 Yes, I do.
🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

* cherry pick from refractor-moe branch
* tests passed
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* support ep + zero
---------
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [zero] refactor low level optimizer
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [Fix/Example] Fix Llama Inference Loading Data Type (#5763)
* [fix/example] fix llama inference loading dtype
* revise loading dtype of benchmark llama3
* [release] update version (#5752)
* [release] update version
* [devops] update compatibility test
* [devops] update compatibility test
* [devops] update compatibility test
* [devops] update compatibility test
* [test] fix ddp plugin test
* [test] fix gptj and rpc test
* [devops] fix cuda ext compatibility
* [inference] fix flash decoding test
* [inference] fix flash decoding test
* fix (#5765)
* [test] Fix/fix testcase (#5770)
* [fix] branch for fix testcase;
* [fix] fix test_analyzer & test_auto_parallel;
* [fix] remove local change about moe;
* [fix] rm local change moe;
* [Hotfix] Add missing init file in inference.executor (#5774)
* [CI/tests] simplify some test case to reduce testing time (#5755)
* [ci/tests] simplify some test case to reduce testing time
* [ci/tests] continue to remove test case to reduce ci time cost
* restore some test config
* [ci/tests] continue to reduce ci time cost
* [misc] update dockerfile (#5776)
* [misc] update dockerfile
* [misc] update dockerfile
* [devops] fix docker ci (#5780)
* [Inference]Add Streaming LLM (#5745)
* Add Streaming LLM
* add some parameters to llama_generation.py
* verify streamingllm config
* add test_streamingllm.py
* modified according to the opinions of review
* add Citation
* change _block_tables tolist
* [hotfix] fix llama flash attention forward (#5777)
* [misc] Accelerate CI for zero and dist optim (#5758)
* remove fp16 from lamb
* remove d2h copy in checking states
---------
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
* [Test/CI] remove test cases to reduce CI duration (#5753)
* [test] smaller gpt2 test case
* [test] reduce test cases: tests/test_zero/test_gemini/test_zeroddp_state_dict.py
* [test] reduce test cases: tests/test_zero/test_gemini/test_grad_accum.py
* [test] reduce test cases tests/test_zero/test_gemini/test_optim.py
* Revert "[test] smaller gpt2 test case" Some tests might depend on the size of model (num of chunks) This reverts commit df705a5.
* [test] reduce test cases: tests/test_checkpoint_io/test_gemini_checkpoint_io.py
* [CI] smaller test model for two mwo the two modifid cases
* [CI] hardcode gpt model for tests/test_zero/test_gemini/test_search.py since we need a fixed answer there
* [hotfix] fix testcase in test_fx/test_tracer (#5779)
* [fix] branch for fix testcase;
* [fix] fix test_analyzer & test_auto_parallel;
* [fix] remove local change about moe;
* [fix] rm local change moe;
* [fix] fix test_deepfm_model & test_dlrf_model;
* [fix] fix test_hf_albert & test_hf_gpt;
* [gemini] optimize reduce scatter d2h copy (#5760)
* [gemini] optimize reduce scatter d2h copy
* [fix] fix missing reduce variable
* [refactor] remove legacy async reduce scatter code
* [gemini] missing sync
* Revert "[refactor] remove legacy async reduce scatter code" This reverts commit 58ad76d.
* [gemini] further optimize with async all reduce
* [fix] pass flag from manager to chunk
* Allow building cuda extension without a device. (#5535) Added FORCE_CUDA environment variable support, to enable building extensions where a GPU device is not present but cuda libraries are.
* [misc] fix dist logger (#5782)
* [install]fix setup (#5786)
* fix
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [misc] update requirements (#5787)
* [shardformer] fix import (#5788)
* upgrade colossal-chat support tp_group>1, add sp for sft
* upgrade ppo dpo rm script
* run pre-commit
* moupdate ci tests, st ci test cases passed, tp failed in generation for ppo, sp is buggy
* fix training script
* fix ci
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fix transformers version
* remove duplicated test
* fix datasets version
* remove models that require huggingface auth from ci
* remove local data path
* update ci
* remove baichuan from template test due to transformer version conflict
* merge
* Refactor modeling by adding attention backend Signed-off-by: char-1ee <xingjianli59@gmail.com>
* Fix tests and naming Signed-off-by: char-1ee <xingjianli59@gmail.com>
* Pass inference model shard configs for module init Signed-off-by: char-1ee <xingjianli59@gmail.com>
* Clean up Signed-off-by: char-1ee <xingjianli59@gmail.com>
* replace the customized dataloader setup with the build-in one
* replace the customized dataloader setup with the build-in one
* Remove flash attention backend Signed-off-by: char-1ee <xingjianli59@gmail.com>
* fix readme
* Fix test import Signed-off-by: char-1ee <xingjianli59@gmail.com>
* update sft trainning script
* [Inference]refactor baichuan (#5791)
* refactor baichuan
* remove unused code and add TODO for lazyinit
* [test] fix chatglm test kit (#5793)
* [shardformer] fix modeling of bloom and falcon (#5796)
* [test] fix qwen2 pytest distLarge (#5797)
* [Inference] Fix flash-attn import and add model test (#5794)
* Fix torch int32 dtype Signed-off-by: char-1ee <xingjianli59@gmail.com>
* Fix flash-attn import Signed-off-by: char-1ee <xingjianli59@gmail.com>
* Add generalized model test Signed-off-by: char-1ee <xingjianli59@gmail.com>
* Remove exposed path to model Signed-off-by: char-1ee <xingjianli59@gmail.com>
* Add default value for use_flash_attn Signed-off-by: char-1ee <xingjianli59@gmail.com>
* Rename model test Signed-off-by: char-1ee <xingjianli59@gmail.com>
---------
Signed-off-by: char-1ee <xingjianli59@gmail.com>
* [Gemini] Use async stream to prefetch and h2d data moving (#5781)
* use async stream to prefetch and h2d data moving
* Remove redundant code
* [gemini] quick fix on possible async operation (#5803)
* [gemini] quick fix on possible async operation
* [gemini] quick fix on possible async operation
* [shardformer] upgrade transformers to 4.39.3 (#5815)
* [shardformer]upgrade transformers for gpt2/gptj/whisper (#5807)
* [shardformer] fix modeling of gpt2 and gptj
* [shardformer] fix whisper modeling
* [misc] update requirements
---------
Co-authored-by: ver217 <lhx0217@gmail.com>
* [shardformer]upgrade transformers for mistral (#5808)
* upgrade transformers for mistral
* fix
* fix
* [shardformer]upgrade transformers for llama (#5809)
* update transformers fix
* fix
* fix
* [inference] upgrade transformers (#5810)
* update transformers fix
* fix
* fix
* fix
* fix
* [gemini] update transformers for gemini (#5814)
---------
Co-authored-by: ver217 <lhx0217@gmail.com>
* Support 4d parallel + flash attention (#5789)
* support tp + sp + pp
* remove comments
---------
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
---------
Signed-off-by: char-1ee <xingjianli59@gmail.com>
Co-authored-by: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Co-authored-by: Hongxin Liu <lhx0217@gmail.com>
Co-authored-by: flybird11111 <1829166702@qq.com>
Co-authored-by: duanjunwen <935724073@qq.com>
Co-authored-by: yuehuayingxueluo <867460659@qq.com>
Co-authored-by: Edenzzzz <wenxuan.tan@wisc.edu>
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
Co-authored-by: botbw <wang1570@e.ntu.edu.sg>
Co-authored-by: Charles Coulombe <ccoulombe@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: YeAnbang <anbangy2@outlook.com>
Co-authored-by: char-1ee <xingjianli59@gmail.com>
Co-authored-by: Runyu Lu <77330637+LRY89757@users.noreply.github.com>
Co-authored-by: YeAnbang <44796419+YeAnbang@users.noreply.github.com>
Co-authored-by: Guangyao Zhang <xjtu521@qq.com>

FrankLeeeee and others added 30 commits May 29, 2024 16:39

[moe] removed openmoe-coupled code and rectify mixtral code (#5471)

f1d4167

add mixtral auto policy & move pipeline forward code to modeling folder

d49fd63

[moe refactor] modify kernel test without Route Class

d2e07fc

[moe refactor] add moe tensor test path environment variable to github workflow

7556b8f

fix typos

16329d5

fix moe test bug due to the code rebase

b934437

[moe refactor] fix moe zero test, and little bug in low level zero

a792e83

fix typo

d203ba8

add moe tensor path to github workflow

55c7416

remove some useless code

8915e9d

fix typo & unify global variable XX_AXIS logic without using -1

7963fb0

fix typo & prettify the code

32ced74

remove print code & support zero 2 test

3100c1b

remove useless code

928ee39

rename function

6dc0cfc

fix typo

4417840

fix typo

eb35655

Further improve the test code

d1d446b

remove print code

09a5188

[moe refactor] change test model from fake moe model to mixtral moe layer and remove useless test

4c6ea42

[moe refactor] skip some unit test which will be refactored later

80b6586

[moe refactor] fix unit import error

7d06220

[moe refactor] fix circular import issues

fb41f42

[moe refactor] remove debug code

e99b69c

[moe refactor] update github workflow

af9ade6

Merge pull request #5775 from Hz188/feature/moe

49d74f3

[Feature] MoE refactor

[Feature] MoE refactor with newest version of ZeRO (#5801)

88f318a

[zero] remove redundant members in BucketStore (#5802)

b2ac7e5

botbw and others added 7 commits June 17, 2024 17:08

[zero] fix missing hook removal (#5824)

4cd4a1f

[zero] fix hook bug

d9ea6d4

Merge branch 'main' into feature/moe

b04e99c

[zero] add low level optimizer back (#5839)

62cd25d

* [zero] fix param & refactor
* [zero] add back original low level opt
* [zero] remove moe related
* [zero] pass zero tests
* [zero] refactor
* [chore] add del func back

[zero] comments and naming (#5840)

204d25c

[zero] modify api (#5843)

efdfa06

* [zero] modify api
* [test] remove _grad_store access in tests

Hz188 self-assigned this Jun 25, 2024

botbw and others added 3 commits June 26, 2024 11:08

[test] fix (#5857)

44aeccc

[CI] skip openmoe CI check

9398484

[CI] fix pre-commit

5e551f8

ver217 reviewed Jun 27, 2024

Comment thread colossalai/zero/low_level/bookkeeping/gradient_store.py

Comment thread colossalai/zero/low_level/low_level_optim.py Outdated

[zero] remove redundant member init (#5862)

2ff332c

ver217 reviewed Jun 27, 2024

Comment thread colossalai/checkpoint_io/moe_checkpoint.py Outdated

Comment thread colossalai/checkpoint_io/moe_checkpoint.py Outdated

ver217 reviewed Jun 27, 2024

Comment thread tests/test_moe/test_moe_checkpoint.py Outdated

Hz188 and others added 4 commits June 27, 2024 08:52

[misc] remove useless code, modify the pg mesh implementation

75be843

Merge branch 'hpcaitech:feature/moe' into feature/moe

1855442

[misc] remove useless code, modify the pg mesh implementation

3a25166

[misc] use tempfile

502e514

Hz188 force-pushed the feature/moe branch from b606612 to 502e514 Compare June 27, 2024 10:27

Hz188 added 3 commits June 27, 2024 11:49

resolve conflict with main branch

494b8a2

resolve conflict with main branch

961e96f

[misc] use tempfile in test_moe_checkpoint.py

95c4c0b

Hz188 changed the title ~~[MoE/ZeRO] Moe refactor with newest version of low level zero~~ [MoE/ZeRO] Moe refactor with zero refactor Jun 27, 2024

Hz188 added 2 commits June 28, 2024 03:47

[misc] remove useless code, add assertion about sequence parallel, move logger into function

9e966b9

[misc] remove useless code

165e894

ver217 approved these changes Jun 28, 2024

ver217 merged commit 416580b into main Jun 28, 2024

ver217 deleted the feature/moe branch June 28, 2024 06:06

ver217 merged 54 commits into main from feature/moe

5 participants