
feat(deepseek_v3): Add grouped GEMM kernel for faster MoE computation #40583

Closed

bzantium wants to merge 3 commits into huggingface:main from bzantium:feature/#40582

Conversation

@bzantium
Contributor

@bzantium bzantium commented Sep 1, 2025

What does this PR do?

This PR introduces an optimization that significantly accelerates the Mixture-of-Experts (MoE) layer computations in the DeepseekV3 model by integrating the grouped_gemm library. This enhances performance for both training and inference.

The original MoE implementation processed expert networks sequentially using a Python loop, which created a performance bottleneck due to high GPU kernel launch overhead. This PR addresses that issue with the following key changes:

  • 🚀 Grouped GEMM Kernel Integration: A new grouped_forward operational path replaces the iterative Python loop over experts with a single, high-performance kernel call from the grouped_gemm library. This minimizes GPU overhead and maximizes throughput.

  • 🧩 Expert Module Fusing: To efficiently leverage the grouped_gemm kernel, this PR implements a fuse_experts() utility and a GroupedDeepseekV3MLP module. These tools combine the weights of multiple experts into a single, contiguous tensor, which is a prerequisite for the kernel (a conceptual sketch follows this list).

  • ⚙️ Configuration and Usability: A use_grouped_gemm flag has been added to DeepseekV3Config to enable this optimization.
    Important: If you set use_grouped_gemm=True directly when loading a model (e.g., in .from_pretrained()), you must provide a state_dict where the expert weights have already been fused. For standard checkpoints, the recommended workflow is to load the model normally and then call the model.fuse_experts() method.

  • ⚠️ Dependency Handling: The model now gracefully handles the optional dependency. If use_grouped_gemm is enabled but the library isn't found, it raises a clear ImportError with installation instructions.

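To make the fusing and grouped-call idea concrete, here is a minimal, illustrative sketch (all tensor names and shapes are hypothetical, and the grouped_gemm entry point shown in the comment is an assumption about that library's API, not code from this PR):

    import torch
    import torch.nn as nn

    # Hypothetical sizes: 4 experts, hidden size 8, FFN size 16.
    num_experts, hidden_dim, ffn_dim = 4, 8, 16
    experts = [nn.Linear(hidden_dim, ffn_dim, bias=False) for _ in range(num_experts)]

    # "Fusing" stacks the per-expert weights into one contiguous tensor of shape
    # (num_experts, hidden_dim, ffn_dim), which is what a grouped kernel consumes.
    fused_weight = torch.stack([e.weight.t() for e in experts])

    # Tokens routed to each expert, already sorted by expert id.
    tokens_per_expert = torch.tensor([3, 1, 2, 2])
    x = torch.randn(int(tokens_per_expert.sum()), hidden_dim)

    # Naive path: one matmul (and one kernel launch) per expert.
    outs, start = [], 0
    for e, n in enumerate(tokens_per_expert.tolist()):
        outs.append(x[start:start + n] @ fused_weight[e])
        start += n
    naive_out = torch.cat(outs)

    # Grouped path: the grouped_gemm library performs all of these matmuls in a
    # single kernel call. The call below uses an assumed signature; check the
    # library for the real one.
    # import grouped_gemm
    # grouped_out = grouped_gemm.ops.gmm(x, fused_weight, tokens_per_expert)

The fuse_experts() utility in this PR performs the weight stacking once, so the per-token hot path reduces to the single grouped call.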

How to use

  1. Install the required library:

    pip install git+https://github.com/fanshiqing/grouped_gemm@main
  2. Load a standard model and fuse the experts after loading (Recommended method):

    from transformers import AutoModelForCausalLM
    
    # Load a standard checkpoint from the Hub
    model = AutoModelForCausalLM.from_pretrained(
        "moonshotai/Moonlight-16B-A3B-Instruct",
        device_map="cuda:0",
        # Do not set use_grouped_gemm=True here
    )
    
    # Fuse the experts in-place
    model.fuse_experts()
    
    # The model is now optimized and ready for faster inference or training

For checkpoints that have already been saved in the fused format, you can load them directly by setting use_grouped_gemm=True in the .from_pretrained() call.
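For example, a sketch of that round trip (the local path is hypothetical; `model` is the fused model from step 2 above):

    from transformers import AutoModelForCausalLM

    # Save the fused model from step 2 above
    model.save_pretrained("./moonlight-16b-fused")  # hypothetical local path

    # Later, load the fused checkpoint directly with the grouped GEMM path enabled
    fused_model = AutoModelForCausalLM.from_pretrained(
        "./moonlight-16b-fused",
        use_grouped_gemm=True,
        device_map="cuda:0",
    )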

Fixes #40582

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@ArthurZucker @Rocketknight1

@github-actions
Contributor

github-actions Bot commented Sep 1, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: deepseek_v3, dots1, glm4_moe, glm4v_moe

@woct0rdho
Contributor

Great to see progress on this! Previously, a common concern was how to support PEFT/LoRA, such as in #40016. I've written something that may help: https://github.com/woct0rdho/transformers-qwen3-moe-fused

@bzantium
Contributor Author

bzantium commented Sep 4, 2025

@woct0rdho Thanks for sharing your great work! I will check it out.
@ArthurZucker @Rocketknight1 please review this and let me know what more I should do to integrate this kind of work.

@ArthurZucker
Collaborator

Will have a look, thanks for the PR. Glad to see you here again! 🤗

Collaborator

@ArthurZucker ArthurZucker left a comment


Thanks a lot for the PR!
This IS planned, but not this way!

#40132 is taking care of isolating the expert class; it will then be followed up by an on-the-fly weight conversion to avoid having _fuse_experts!

We also don't want to add extra deps now that we have kernels!
But the "naive" path will be using torch's gemm (and probably with a fallback for older torch versions using a naive for loop / bmm).
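To illustrate the bmm fallback idea mentioned above (an illustrative sketch only, not the planned implementation): each expert's routed tokens can be padded to a common length so a single batched matmul covers every expert.

    import torch

    # Hypothetical shapes: 4 experts, hidden size 8, FFN size 16.
    E, H, F = 4, 8, 16
    fused_weight = torch.randn(E, H, F)               # stacked expert weights
    tokens_per_expert = torch.tensor([3, 1, 2, 2])
    x = torch.randn(int(tokens_per_expert.sum()), H)  # tokens sorted by expert id

    # Pad each expert's slice of tokens to the same length ...
    max_t = int(tokens_per_expert.max())
    padded = x.new_zeros(E, max_t, H)
    start = 0
    for e, n in enumerate(tokens_per_expert.tolist()):
        padded[e, :n] = x[start:start + n]
        start += n

    # ... so one torch.bmm replaces the per-expert loop, at the cost of some
    # wasted compute on the padding rows.
    out_padded = torch.bmm(padded, fused_weight)      # (E, max_t, F)
    out = torch.cat([out_padded[e, :n]
                     for e, n in enumerate(tokens_per_expert.tolist())])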

@bzantium
Contributor Author

bzantium commented Sep 20, 2025

Thank you for the review and the clear direction! I appreciate you pointing me to #40132. The plan to isolate the expert class first and then use on-the-fly weight conversion makes a lot of sense.
to: @ArthurZucker

@ArthurZucker
Collaborator

Awesome, the first PR is close to being merged!

@glide-the

Waiting for support!

@litanli

litanli commented Oct 21, 2025

Waiting for support as well, would really appreciate MoE LoRA capability!

@zenyanbo

Will all MoE models benefit from this? Even the Trainer?

@bzantium
Contributor Author

Thanks for all the attention to this feature! I will start working on it now that #40132 has been successfully merged.

@bzantium
Contributor Author

bzantium commented Jan 8, 2026

Since #42697 has been merged, I will close this PR.

@bzantium bzantium closed this Jan 8, 2026
