Update sharded_moe.py to support top2 gate with Tutel #6948
xenshinu wants to merge 4 commits into deepspeedai:master
Conversation
@microsoft-github-policy-service agree company="University of Michigan"
Not sure if this check is needed; I didn't see any specific check for a non-zero mask.
Hi @xenshinu - FYI, you'll need to update this PR from the CLA to the DCO requirements if you're able to.
Signed-off-by: Xueshen Liu <liuxs@umich.edu>
@xenshinu - looks like the formatting check is failing, could you run the pre-commit formatter?
@xenshinu - curious if you would be able to follow up on this?
Thanks for the reminder. I'll push the top-k version soon and fix the formatting issue.
Tutel has been forced off for k > 1 since #2053.
Given that routing each token to multiple experts is very common, and the gather and scatter operations without Tutel are quite inefficient, I added Tutel support to the top-2 gate and tested it on the pipeline engine. This can actually be done for any k; I'll push that later when I have time to test.
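For context, here is a minimal sketch (not the exact diff in this PR) of what a Tutel-enabled top-2 gate can return, assuming Tutel is installed and following the conventions of the existing `use_tutel` branch in `top1gating`: per-k lists of expert indices, capacity locations, and gate weights, with `tutel_moe.fast_cumsum_sub_one` replacing the plain cumsum. The function name is illustrative, and gumbel noise on the second choice plus the auxiliary load-balancing loss are omitted to keep the Tutel-specific parts visible:

```python
# Illustrative sketch only; requires Tutel to be installed.
import torch
import torch.nn.functional as F
from tutel import moe as tutel_moe  # assumed available, as in sharded_moe.py


def top2gating_tutel_sketch(logits: torch.Tensor, capacity: int):
    """Top-2 gate returning per-k lists in the form Tutel's dispatcher consumes."""
    gates = F.softmax(logits, dim=1)          # [S, E] token-to-expert scores
    num_experts = int(gates.shape[1])

    # First and second expert choice per token.
    indices1_s = torch.argmax(gates, dim=1)
    mask1 = F.one_hot(indices1_s, num_classes=num_experts)
    logits_except1 = logits.masked_fill(mask1.bool(), float("-inf"))
    indices2_s = torch.argmax(logits_except1, dim=1)
    mask2 = F.one_hot(indices2_s, num_classes=num_experts)

    # Slot of each token inside its expert's capacity buffer.
    # Tutel's fused kernel replaces torch.cumsum(mask, dim=0) - 1.
    locations1 = tutel_moe.fast_cumsum_sub_one(mask1)
    locations2 = tutel_moe.fast_cumsum_sub_one(mask2)
    # Second-choice tokens are placed after all first-choice tokens.
    locations2 += torch.sum(mask1, dim=0, keepdim=True)

    # Drop tokens that overflow the expert capacity.
    mask1 *= torch.lt(locations1, capacity)
    mask2 *= torch.lt(locations2, capacity)

    # Scalar location and renormalized gate weight per token and choice.
    locations1_s = torch.sum(locations1 * mask1, dim=1)
    locations2_s = torch.sum(locations2 * mask2, dim=1)
    gates1_s = (gates * mask1).sum(dim=1)
    gates2_s = (gates * mask2).sum(dim=1)
    denom_s = torch.clamp(gates1_s + gates2_s, min=torch.finfo(gates.dtype).eps)
    gates1_s = gates1_s / denom_s
    gates2_s = gates2_s / denom_s

    # One list entry per k, mirroring top1gating's use_tutel return shape.
    return ([indices1_s, indices2_s],
            [locations1_s, locations2_s],
            [gates1_s, gates2_s])
```

On the `MOELayer` side, the existing Tutel path already builds a `tutel_moe.fast_dispatcher` and feeds it `update(indices_, locations_, gates_, capacity=C)` followed by `encode`/`decode`, so in principle extending the gate's return values to these per-k lists is what lets the layer skip the einsum-based dispatch and combine masks for k = 2.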