feat(deepseek_v3): Add grouped GEMM kernel for faster MoE computation #40583
bzantium wants to merge 3 commits into huggingface:main
Conversation
[For maintainers] Suggested jobs to run (before merge): run-slow: deepseek_v3, dots1, glm4_moe, glm4v_moe
Great to see progress on this! Previously, a common concern has been how to support PEFT/LoRA, such as in #40016. I've written something that may help: https://github.com/woct0rdho/transformers-qwen3-moe-fused
@woct0rdho Thanks for sharing this great work! I will check it out.
Will have a look, thanks for the PR. Glad to see you here again! 🤗
ArthurZucker
left a comment
Thanks a lot for the PR!
This IS planned, but not this way!
#40132 is taking care of isolating the expert class; it will then be followed up by on-the-fly weight conversion to avoid needing _fuse_experts!
We also don't want to add extra deps now that we have kernels!
But the "naive" path will be using torch's gemm (and probably with a fallback for older torch version with a naive for loop / bmm)
Thank you for the review and the clear direction! I appreciate you pointing me to #40132. The plan to isolate the expert class first and then use on-the-fly weight conversion makes a lot of sense.
Awesome, the first PR is close to being merged!
Waiting for support! |
Waiting for support as well, would really appreciate MoE LoRA capability! |
Will all MoE models benefit from this? Even with the Trainer?
Thanks for all the attention to this feature! I will start working on it now that #40132 has been successfully merged.
Since #42697 has been merged, I will close this PR.
What does this PR do?
This PR introduces an optimization that significantly accelerates the Mixture-of-Experts (MoE) layer computations in the `DeepseekV3` model by integrating the `grouped_gemm` library. This enhances performance for both training and inference.

The original MoE implementation processed expert networks sequentially using a Python loop, which created a performance bottleneck due to high GPU kernel launch overhead. This PR addresses that issue with the following key changes:
🚀 Grouped GEMM Kernel Integration: A new `grouped_forward` operational path replaces the iterative Python loop over experts with a single, high-performance kernel call from the `grouped_gemm` library. This minimizes GPU overhead and maximizes throughput.

🧩 Expert Module Fusing: To efficiently leverage the `grouped_gemm` kernel, this PR implements a `fuse_experts()` utility and a `GroupedDeepseekV3MLP` module. These tools combine the weights of multiple experts into a single, contiguous tensor, which is a prerequisite for the kernel.

⚙️ Configuration and Usability: A `use_grouped_gemm` flag has been added to `DeepseekV3Config` to enable this optimization.

Important: If you set `use_grouped_gemm=True` directly when loading a model (e.g., in `.from_pretrained()`), you must provide a `state_dict` where the expert weights have already been fused. For standard checkpoints, the recommended workflow is to load the model normally and then call the `model.fuse_experts()` method. If `use_grouped_gemm` is enabled but the library isn't found, a clear `ImportError` with installation instructions is raised.

How to use
Install the required library:
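Assuming the library is published on PyPI under the same name, this should amount to `pip install grouped_gemm`; otherwise, follow the install instructions in the `grouped_gemm` repository.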
Load a standard model and fuse the experts after loading (Recommended method):
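A minimal sketch of that flow, with a placeholder checkpoint path; the only API assumed beyond standard `from_pretrained` usage is the `model.fuse_experts()` method added by this PR:

```python
import torch
from transformers import AutoModelForCausalLM

# Load a standard (unfused) DeepseekV3 checkpoint as usual; the path is a placeholder.
model = AutoModelForCausalLM.from_pretrained(
    "path/to/deepseek-v3-checkpoint",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Fuse the per-expert weights into the contiguous layout required by the grouped GEMM kernel.
model.fuse_experts()
```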
For checkpoints that have already been saved in the fused format, you can load them directly by setting `use_grouped_gemm=True` in the `.from_pretrained()` call.
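A sketch of this direct-load path, again with a placeholder path; it assumes `use_grouped_gemm=True` can be passed as a config override to `from_pretrained` and that the checkpoint was saved after fusing:

```python
import torch
from transformers import AutoModelForCausalLM

# The checkpoint at this (placeholder) path must already contain fused expert weights,
# e.g. one produced by save_pretrained() after calling model.fuse_experts().
model = AutoModelForCausalLM.from_pretrained(
    "path/to/deepseek-v3-fused-checkpoint",
    use_grouped_gemm=True,
    torch_dtype=torch.bfloat16,
)
```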
Fixes #40582
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@ArthurZucker @Rocketknight1