
[moe] support hybrid zero strategy. #4877

Merged
ver217 merged 32 commits into hpcaitech:feature/MoE from oahzxl:bench
Oct 11, 2023

Conversation

@oahzxl (Contributor) commented Oct 10, 2023

  • Support a hybrid ZeRO strategy:
    • For non-MoE params, ZeRO is applied over the global DP group.
    • For MoE params, we combine expert parallelism with ZeRO, where ep group * zero group = global dp group. ZeRO is therefore applied only within the MoE ZeRO group.
  • Add a unit test for hybrid ZeRO.
  • Enable the MoE unit tests.
  • Fix previously failing unit tests.
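The decomposition above (ep group * zero group = global dp group) can be sketched as a rank-grid split. This is a minimal illustration only: `split_moe_groups` is a hypothetical helper, and the consecutive-rank EP layout with strided ZeRO groups is an assumed convention, not necessarily the layout the plugin uses.

```python
def split_moe_groups(global_dp_size: int, ep_size: int):
    """Partition global DP ranks into expert-parallel (EP) groups and MoE
    ZeRO groups, assuming ep_size * zero_size == global_dp_size.

    Illustrative sketch only; the actual plugin builds torch.distributed
    process groups rather than plain rank lists.
    """
    assert global_dp_size % ep_size == 0, "ep_size must divide the global DP size"
    zero_size = global_dp_size // ep_size
    # Assumed layout: consecutive ranks form one EP group (they shard experts).
    ep_groups = [list(range(i * ep_size, (i + 1) * ep_size)) for i in range(zero_size)]
    # Ranks holding the same expert shard across EP groups form a MoE ZeRO group.
    zero_groups = [list(range(j, global_dp_size, ep_size)) for j in range(ep_size)]
    return ep_groups, zero_groups


# Example: 8 DP ranks with ep_size=2 yields 4 EP groups and 2 ZeRO groups.
ep, zg = split_moe_groups(8, 2)
```

With this layout, ZeRO shards each MoE param only among the ranks in its ZeRO group, while non-MoE params are sharded across all 8 ranks.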

@github-actions

The code coverage for the changed files is %.

Click me to view the complete report
Name                                                                 Stmts   Miss  Cover
----------------------------------------------------------------------------------------
colossalai/booster/plugin/hybrid_parallel_plugin.py                    214     13    94%
colossalai/booster/plugin/moe_hybrid_parallel_plugin.py                 92     26    72%
colossalai/context/__init__.py                                           6      0   100%
colossalai/context/random/__init__.py                                    2      0   100%
colossalai/context/random/_helper.py                                    46      9    80%
colossalai/initialize.py                                               180    133    26%
colossalai/kernel/triton/__init__.py                                    14      3    79%
colossalai/kernel/triton/llama_act_combine_kernel.py                    89     67    25%
colossalai/legacy/engine/gradient_handler/__init__.py                    6      0   100%
colossalai/legacy/engine/gradient_handler/_moe_gradient_handler.py      20     20     0%
colossalai/moe/__init__.py                                               6      0   100%
colossalai/moe/_operation.py                                           159     44    72%
colossalai/moe/checkpoint.py                                           133     31    77%
colossalai/moe/experts.py                                               95     14    85%
colossalai/moe/layers.py                                               114     34    70%
colossalai/moe/loss.py                                                  21     21     0%
colossalai/moe/manager.py                                               81      7    91%
colossalai/moe/routers.py                                              176     15    91%
colossalai/moe/utils.py                                                 79     26    67%
colossalai/nn/layer/moe/__init__.py                                     12      0   100%
colossalai/nn/loss/__init__.py                                           0      0   100%
colossalai/tensor/moe_tensor/__init__.py                                 0      0   100%
colossalai/tensor/moe_tensor/api.py                                     24      2    92%
colossalai/tensor/moe_tensor/moe_info.py                                13      0   100%
colossalai/zero/low_level/low_level_optim.py                           431     32    93%
examples/language/openmoe/model/__init__.py                              0      0   100%
examples/language/openmoe/model/modeling_openmoe.py                    462    328    29%
examples/language/openmoe/model/openmoe_policy.py                      248    157    37%
tests/test_infer_ops/triton/test_llama_act_combine.py                   40     23    42%
tests/test_moe/moe_utils.py                                             97      0   100%
tests/test_moe/test_grad_handler.py                                     59      1    98%
tests/test_moe/test_kernel.py                                           57      1    98%
tests/test_moe/test_moe_checkpoint.py                                   69      2    97%
tests/test_moe/test_moe_ep_tp.py                                        48      1    98%
tests/test_moe/test_moe_group.py                                        54      2    96%
tests/test_moe/test_moe_hybrid_zero.py                                  67      2    97%
tests/test_moe/test_moe_local.py                                        49      1    98%
tests/test_moe/test_moe_router.py                                       25      1    96%
tests/test_moe/test_moe_zero_fwd_bwd.py                                 77      3    96%
tests/test_moe/test_moe_zero_optim.py                                   66      9    86%
----------------------------------------------------------------------------------------
TOTAL                                                                 3431   1028    70%

Comment thread colossalai/zero/low_level/low_level_optim.py Outdated
@github-actions

The code coverage for the changed files is %.

Click me to view the complete report
Name                                                                 Stmts   Miss  Cover
----------------------------------------------------------------------------------------
colossalai/booster/plugin/hybrid_parallel_plugin.py                    214     13    94%
colossalai/booster/plugin/moe_hybrid_parallel_plugin.py                 92     26    72%
colossalai/context/__init__.py                                           6      0   100%
colossalai/context/random/__init__.py                                    2      0   100%
colossalai/context/random/_helper.py                                    46     14    70%
colossalai/initialize.py                                               180    134    26%
colossalai/kernel/triton/__init__.py                                    14      3    79%
colossalai/kernel/triton/llama_act_combine_kernel.py                    89     67    25%
colossalai/legacy/engine/gradient_handler/__init__.py                    6      0   100%
colossalai/legacy/engine/gradient_handler/_moe_gradient_handler.py      20     20     0%
colossalai/moe/__init__.py                                               6      0   100%
colossalai/moe/_operation.py                                           159     44    72%
colossalai/moe/checkpoint.py                                           133     31    77%
colossalai/moe/experts.py                                               95     14    85%
colossalai/moe/layers.py                                               114     34    70%
colossalai/moe/loss.py                                                  21     21     0%
colossalai/moe/manager.py                                               81      7    91%
colossalai/moe/routers.py                                              176     15    91%
colossalai/moe/utils.py                                                 79     26    67%
colossalai/nn/layer/moe/__init__.py                                     12      0   100%
colossalai/nn/loss/__init__.py                                           0      0   100%
colossalai/tensor/moe_tensor/__init__.py                                 0      0   100%
colossalai/tensor/moe_tensor/api.py                                     24      2    92%
colossalai/tensor/moe_tensor/moe_info.py                                13      0   100%
colossalai/zero/low_level/low_level_optim.py                           431     32    93%
examples/language/openmoe/model/__init__.py                              0      0   100%
examples/language/openmoe/model/modeling_openmoe.py                    462    328    29%
examples/language/openmoe/model/openmoe_policy.py                      248    157    37%
tests/test_moe/moe_utils.py                                             97      0   100%
tests/test_moe/test_grad_handler.py                                     59      1    98%
tests/test_moe/test_kernel.py                                           57      1    98%
tests/test_moe/test_moe_checkpoint.py                                   69      2    97%
tests/test_moe/test_moe_ep_tp.py                                        48      1    98%
tests/test_moe/test_moe_group.py                                        54      2    96%
tests/test_moe/test_moe_hybrid_zero.py                                  66      2    97%
tests/test_moe/test_moe_local.py                                        49      1    98%
tests/test_moe/test_moe_router.py                                       25      1    96%
tests/test_moe/test_moe_zero_fwd_bwd.py                                 77      3    96%
tests/test_moe/test_moe_zero_optim.py                                   66      9    86%
----------------------------------------------------------------------------------------
TOTAL                                                                 3390   1011    70%

Comment thread colossalai/booster/plugin/hybrid_parallel_plugin.py Outdated
Comment thread colossalai/moe/manager.py Outdated
@github-actions

The code coverage for the changed files is 67%.

Click me to view the complete report
Name                                                                 Stmts   Miss  Cover
----------------------------------------------------------------------------------------
colossalai/booster/plugin/hybrid_parallel_plugin.py                    214     13    94%
colossalai/booster/plugin/moe_hybrid_parallel_plugin.py                102     32    69%
colossalai/context/__init__.py                                           6      0   100%
colossalai/context/random/__init__.py                                    2      0   100%
colossalai/context/random/_helper.py                                    46     14    70%
colossalai/initialize.py                                               180    134    26%
colossalai/kernel/triton/__init__.py                                    14      3    79%
colossalai/kernel/triton/llama_act_combine_kernel.py                    89     67    25%
colossalai/legacy/engine/gradient_handler/__init__.py                    6      0   100%
colossalai/legacy/engine/gradient_handler/_moe_gradient_handler.py      20     20     0%
colossalai/moe/__init__.py                                               6      0   100%
colossalai/moe/_operation.py                                           159     44    72%
colossalai/moe/checkpoint.py                                           133     31    77%
colossalai/moe/experts.py                                               95     14    85%
colossalai/moe/layers.py                                               114     34    70%
colossalai/moe/loss.py                                                  21     21     0%
colossalai/moe/manager.py                                               82      7    91%
colossalai/moe/routers.py                                              176     34    81%
colossalai/moe/utils.py                                                 79     26    67%
colossalai/nn/layer/moe/__init__.py                                     12      0   100%
colossalai/nn/loss/__init__.py                                           0      0   100%
colossalai/tensor/moe_tensor/__init__.py                                 0      0   100%
colossalai/tensor/moe_tensor/api.py                                     24      2    92%
colossalai/tensor/moe_tensor/moe_info.py                                13      0   100%
colossalai/zero/low_level/low_level_optim.py                           431    115    73%
examples/language/openmoe/model/__init__.py                              0      0   100%
examples/language/openmoe/model/modeling_openmoe.py                    462    328    29%
examples/language/openmoe/model/openmoe_policy.py                      248    157    37%
tests/test_moe/moe_utils.py                                             97      0   100%
tests/test_moe/test_grad_handler.py                                     59      1    98%
tests/test_moe/test_kernel.py                                           57      1    98%
tests/test_moe/test_moe_checkpoint.py                                   69      2    97%
tests/test_moe/test_moe_ep_tp.py                                        48      1    98%
tests/test_moe/test_moe_group.py                                        54      2    96%
tests/test_moe/test_moe_hybrid_zero.py                                  66      2    97%
tests/test_moe/test_moe_local.py                                        49      1    98%
tests/test_moe/test_moe_zero_fwd_bwd.py                                 77      3    96%
tests/test_moe/test_moe_zero_optim.py                                   66      9    86%
----------------------------------------------------------------------------------------
TOTAL                                                                 3376   1118    67%

@ver217 ver217 merged commit 0d599f6 into hpcaitech:feature/MoE Oct 11, 2023
@oahzxl oahzxl deleted the bench branch October 18, 2023 02:21
oahzxl added a commit to oahzxl/ColossalAI that referenced this pull request Oct 26, 2023
* overlap comm

* fix typo

* update bench script

* add option

* update script

* update bench

* param init

* support dp zero

* fix zero dp

* fxi bug

* update pg bug

* update experts

* fix optim bug

* update config

* kaishen niubi

* fix bug

* embed

* Merge branch 'feature/MoE' of https://github.com/hpcaitech/ColossalAI into bench

* update bench

* update optim

* update doc

* update sync

* fix test

* fix arg

* update ckpt

* update test

* fix

* remove print

* polish code

* update hybrid zero optim

* update print
