
[WIP] Fix naive for loops for MoE models resulting in sub 20% downstream MFU for training with trl, etc. (Qwen3, Deepseek V3, Ernie 4.5, GLM 4.5, Dots1) #40016

Closed
perinmclaughlin wants to merge 5 commits into huggingface:main from perinmclaughlin:V3ScatterMoE

Conversation

@perinmclaughlin

What does this PR do?

Fixes the longstanding issue of MoE training being bottlenecked by naive for loops in models with more than 8 experts.
This can result in sub-20% MFU in downstream training frameworks such as unsloth and trl (measured with Qwen3 30B on an H800).

There have already been several downstream issues from training frameworks, such as unslothai/unsloth#2582, and open-source community members have made custom patches such as https://huggingface.co/Doctor-Shotgun/Qwen3-235B-A22B-Instruct-2507-ScatterMoE. I've also heard several complaints about this issue in the Axolotl and BeaverAI Discords, though those aren't publicly visible.

This PR mainly replaces the moe() method from Deepseek V3 with a mathematically equivalent but faster Scatter MoE implementation, makes the other sparse MoE blocks inherit from DeepseekV3MoE, and modifies the forward and __init__ of those modules accordingly so they use moe().

Also, from modular_deepseek_v3.py:
"""
CALL FOR CONTRIBUTION! I don't have time to optimise this right now, but expert weights need to be fused
to not have to do a loop here (deepseek has 256 experts soooo yeah).
"""

Before submitting

  • [N] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [Y] Did you read the contributor guideline,
    Pull Request section?
  • [Y] Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?


@perinmclaughlin perinmclaughlin changed the title from "[WIP] Fix naive for loops for MoE models resulting in sub 30% downstream MFU for training with trl, etc. (Qwen3, Deepseek V3, Ernie 4.5, GLM 4.5, Dots1)" to "[WIP] Fix naive for loops for MoE models resulting in sub 20% downstream MFU for training with trl, etc. (Qwen3, Deepseek V3, Ernie 4.5, GLM 4.5, Dots1)" on Aug 7, 2025
@DocShotgun

Interesting!

One question I have: will this be compatible with PEFT?

I uploaded a few of those Qwen 3 ScatterMoE conversions based on Charles Goddard's original remote-code implementation, but the problem I got stuck on was that the fused MoE layers were not compatible with PEFT, so we could only target the attention tensors and the router during LoRA training.

@perinmclaughlin
Author

This PR in its current state should be fully compatible with PEFT, and more specifically LoRA, even without the target_parameters PR in peft, as it does not currently fuse the experts.
I initially believed there was little point in fusing the experts, since most fused-expert implementations use bmm, which requires either significant wasted computation and memory access for padding, or token dropping.
However, I realized about a day after posting this PR that you can avoid padding by grouping experts by the number of assigned tokens and running bmm on each group, so I may switch to fused experts if that method turns out to be significantly faster. A sketch of that idea follows.
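As a rough illustration of that grouping idea (hypothetical names and shapes; each expert is reduced to a single weight matrix for brevity, whereas real experts are gated MLPs):

```python
import torch
from collections import defaultdict

def grouped_bmm_experts(sorted_tokens, counts, w):
    # sorted_tokens: (total_slots, d_in), already sorted by expert id
    # counts: tokens assigned to each expert; w: (num_experts, d_in, d_out)
    offsets = [0]
    for n in counts:
        offsets.append(offsets[-1] + n)

    # Bucket experts by load: experts with identical token counts can share
    # a single bmm call with zero padding rows.
    by_load = defaultdict(list)
    for eid, n in enumerate(counts):
        if n:
            by_load[n].append(eid)

    out = sorted_tokens.new_empty(sorted_tokens.shape[0], w.shape[-1])
    for n, eids in by_load.items():
        x = torch.stack([sorted_tokens[offsets[e]:offsets[e] + n] for e in eids])
        y = torch.bmm(x, w[eids])              # (len(eids), n, d_out), no padding
        for i, e in enumerate(eids):
            out[offsets[e]:offsets[e] + n] = y[i]
    return out
```

How many distinct load buckets exist (and therefore how many bmm launches) depends on how evenly the router spreads tokens, which is presumably why the author hedges on whether this ends up significantly faster.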

@perinmclaughlin perinmclaughlin deleted the V3ScatterMoE branch August 12, 2025 15:56
@ArthurZucker
Collaborator

Hey!
I was about to review!

@ArthurZucker
Collaborator

ArthurZucker commented Aug 12, 2025

Happy to have a better version than what we currently have, and also to make sure it is TP-compatible. For the best performance we can also use https://huggingface.co/kernels-community/megablocks/tree/main/torch-ext/megablocks

@perinmclaughlin perinmclaughlin restored the V3ScatterMoE branch August 12, 2025 16:09
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: deepseek_v3, ernie4_5_moe, qwen3_moe

@perinmclaughlin
Author

perinmclaughlin commented Aug 12, 2025

My bad; I realized that without a custom kernel the performance would likely still be poor, that there aren't any good torch ops for fused experts, and that I was maybe a bit out of my depth.
Torch bmm has the aforementioned issue of requiring significant padding if expert load is not uniform.
I looked into the torchtune MoE implementation and they're using an undocumented torch function, which also seems to have some quirky compatibility issues, as usual.
Currently I've mostly just pulled some ops out of the per-expert for loop and batched them, but I couldn't find a good way in pure torch to get rid of the kernel launch overhead for each expert.
Megablocks does look promising, though.
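To make the bmm padding cost concrete, here is a hypothetical sketch of the padded variant being argued against: every expert's slice is padded to the largest load, so wasted FLOPs grow with router imbalance. `padded_bmm_experts` is an illustrative name, with the same single-matrix-per-expert simplification as before.

```python
import torch

def padded_bmm_experts(sorted_tokens, counts, w):
    # sorted_tokens: (total_slots, d_in), sorted by expert id
    # counts: per-expert token counts; w: (num_experts, d_in, d_out)
    num_experts, d_in, d_out = w.shape
    max_n = max(counts)
    x = sorted_tokens.new_zeros(num_experts, max_n, d_in)
    start = 0
    for eid, n in enumerate(counts):
        x[eid, :n] = sorted_tokens[start:start + n]   # real rows; the rest stay zero
        start += n
    y = torch.bmm(x, w)     # one launch, but computes num_experts * max_n rows
    return torch.cat([y[eid, :n] for eid, n in enumerate(counts)], dim=0)
```

With 256 experts and a skewed router, max_n can be many times the mean load, so most of that single bmm is multiplying zeros.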

@ArthurZucker
Collaborator

> get rid of the kernel launch overhead for each expert

Does cudagraph not help (compile with reduce-overhead)?
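For reference, a minimal sketch of what that suggestion looks like in practice; the MoE block here is a hypothetical stand-in module, as the point is only the compile mode:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for an MoE block; the point is the compile mode.
moe_block = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64)).cuda()

# "reduce-overhead" records CUDA graphs, so after warmup the many small kernel
# launches replay as a single graph launch instead of Python-driven calls.
compiled = torch.compile(moe_block, mode="reduce-overhead")

x = torch.randn(8, 64, device="cuda")
for _ in range(3):      # first iterations warm up / capture, later ones replay
    out = compiled(x)
```

Note that CUDA graph capture assumes static shapes, while per-expert token counts vary from step to step, so mileage may vary for MoE dispatch.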

@ArthurZucker
Collaborator

Thanks a lot for the detailed explanation

@woct0rdho
Contributor

woct0rdho commented Sep 4, 2025

> One question I have: will this be compatible with PEFT?

Great to see progress on this! @DocShotgun, I've written something that may help: https://github.com/woct0rdho/transformers-qwen3-moe-fused

@ArthurZucker
Collaborator

#41580 fixed this :)
