
Conversation

@Wanglongzhi2001 (Collaborator) commented Oct 24, 2025

Motivation

In some scenarios, the MoE input token count can be very large. To reduce the activation memory of MoE, this PR adds chunked MoE, which splits the MoE input into multiple parts.

Modifications

New feature: a chunked MoE forward pass that splits the MoE input into chunks.
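For context, a minimal, hedged sketch of the idea (the names chunked_moe_forward, moe_layer, and chunk_size are illustrative, not the exact code in this PR): the input is split along the token dimension, each chunk is run through the fused MoE, and the chunk outputs are concatenated back together.

import paddle

# Illustrative sketch only: split the MoE input into chunks of at most
# `chunk_size` tokens, apply the MoE to each chunk, and concatenate the results.
def chunked_moe_forward(moe_layer, x, gate, chunk_size=1024):
    token_num = x.shape[0]
    num_chunk = max(1, -(-token_num // chunk_size))  # ceiling division
    chunks = paddle.tensor_split(x, num_chunk, axis=0)
    outs = [moe_layer.quant_method.apply(moe_layer, chunk, gate) for chunk in chunks]
    return paddle.concat(outs, axis=0)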

Usage or Command

Just add two extra parameters, enable-chunked-moe and chunked-moe-size:

python -m fastdeploy.entrypoints.openai.multi_api_server \
       --ports "8280,8281,8282,8283,8284,8285,8286,8287" \
       --metrics-ports "8480,8481,8482,8488,8484,8485,8486,8487" \
       --num-servers 8 \
       --args --tensor-parallel-size 1 \
       --data-parallel-size 8 \
       --enable-expert-parallel \
       --enable-chunked-moe \
       --chunked-moe-size 1024 \
       --engine-worker-queue-port "8380,8381,8382,8383,8384,8385,8386,8387" \
       --max-model-len 16384 \
       --max-num-seqs 128 \
       --gpu-memory-utilization 0.9 \
       --model "$MODEL_PATH" \
       --num-gpu-blocks-override 12288 \
       --enable-mm-output \
       --prealloc-dec-block-slot-num-threshold 15 \
       --no-enable-prefix-caching \
       --quantization block_wise_fp8 \
       --ips $ip_list

Accuracy Tests

Does not affect model outputs.

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.


paddle-bot bot commented Oct 24, 2025

Thanks for your contribution!

@RichardWooSJTU (Collaborator) left a comment:

Is it possible to force using low_latency_dispatch if the chunk size is limited to 256?

if i == num_chunk - 1:
    out[i * chunk_size:, :] = self.quant_method.apply(self, x[i * chunk_size:, :], gate)
else:
    out[i * chunk_size:(i + 1) * chunk_size, :] = self.quant_method.apply(self, x[i * chunk_size:(i + 1) * chunk_size, :], gate)
Collaborator:

Consider using paddle.split before the loop instead of slicing at each step, which causes a lot of kernel launch overhead.

Collaborator:

Same for out, which can use paddle.concat after the loop.

assert out is not None, "FusedMOE forward got error result"
return out

def forward_chunked_moe(self, x, gate):
Collaborator:

Can CI cover this function right now? If not, a unit test needs to be added.

Collaborator Author:

OK, I will add a unit test later.

Comment on lines 617 to 620
if num_chunk == max_num_chunk:
    for i in range(num_chunk):
        out_split_list[i] = self.quant_method.apply(self, x_split_list[i], gate)
else:  # num_chunk < max_num_chunk
Collaborator:

These lines can be removed; the logic is redundant.

"""
out = self.quant_method.apply(self, x, gate)
out = None
if self.fd_config.parallel_config.use_ep and self.fd_config.parallel_config.enable_chunked_moe:
@yuanlehome (Collaborator) commented Oct 24, 2025:

This could also be used without EP; can it be written more generically?

Copilot AI (Contributor) left a comment:

Pull request overview

This PR introduces chunked MoE (Mixture of Experts) support, enabling MoE layers to process inputs in configurable chunks to optimize memory usage and synchronization across distributed ranks.

Key changes:

  • Added enable_chunked_moe and chunked_moe_size configuration parameters
  • Implemented chunked MoE forward pass that splits inputs into chunks and synchronizes across ranks
  • Added distributed status collection to coordinate chunk sizes across ranks (see the sketch after this list)
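A hedged sketch of what that coordination could look like (illustrative only; the function name coordinate_num_chunks and the variable names are assumptions, not the PR's exact code): each rank derives its local chunk count from its token count, then all ranks take the maximum so every rank issues the same number of MoE steps.

import paddle
import paddle.distributed as dist

def coordinate_num_chunks(token_num: int, chunk_size: int) -> int:
    # Local number of chunks for this rank's tokens (ceiling division).
    local_num_chunk = max(1, -(-token_num // chunk_size))
    # Agree on the maximum across ranks so all ranks run the same number of steps.
    t = paddle.to_tensor([local_num_chunk], dtype="int32")
    dist.all_reduce(t, op=dist.ReduceOp.MAX)
    return int(t.item())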

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 12 comments.

Summary per file:
tests/layers/test_chunked_moe.py: New test file validating chunked MoE functionality with a multi-rank setup
fastdeploy/worker/worker_process.py: Added CLI arguments for chunked MoE configuration
fastdeploy/worker/model_runner_base.py: Introduced dataclasses for tracking distributed status
fastdeploy/worker/gpu_model_runner.py: Implemented distributed status collection and chunk size coordination
fastdeploy/model_executor/layers/moe/moe.py: Added chunked MoE forward pass implementation
fastdeploy/engine/engine.py: Updated worker service to pass chunked MoE configuration
fastdeploy/engine/async_llm.py: Updated async worker service to pass chunked MoE configuration
fastdeploy/engine/args_utils.py: Added CLI argument definitions for chunked MoE
fastdeploy/config.py: Added chunked MoE configuration fields to ParallelConfig


if i <= self.fd_config.parallel_config.moe_num_chunk - 1:
    out_split_list[i] = self.quant_method.apply(self, x_split_list[i], gate)
else:
    self.quant_method.apply(self, x, gate)
Copilot AI commented Nov 26, 2025:

The result of quant_method.apply() is discarded when i > moe_num_chunk - 1. This appears to be a synchronization mechanism but wastes computation. If synchronization is needed, consider using explicit barrier operations instead of dummy computation. If this is intentional for performance reasons, add a comment explaining why.

Suggested change:
- self.quant_method.apply(self, x, gate)
+ # Synchronization is required here to ensure all ranks are aligned.
+ # Replacing dummy computation with an explicit barrier for clarity and efficiency.
+ paddle.distributed.barrier()

Comment on lines 636 to 675
for i in range(self.fd_config.parallel_config.max_moe_num_chunk):
    out = self.quant_method.apply(self, x, gate)
Copilot AI commented Nov 26, 2025:

When token_num <= chunk_size, the same computation is repeated max_moe_num_chunk times, with only the last result being used. This appears to be for cross-rank synchronization but wastes significant compute resources. Consider using explicit communication primitives (e.g., paddle.distributed.barrier()) instead of redundant computation.

Suggested change:
- for i in range(self.fd_config.parallel_config.max_moe_num_chunk):
-     out = self.quant_method.apply(self, x, gate)
+ out = self.quant_method.apply(self, x, gate)
+ paddle.distributed.barrier()


@dataclass
class DistributedOut:
    if_only_decode: bool = None
Copilot AI commented Nov 26, 2025:

Using None as default for a boolean field is unconventional in dataclasses. Consider using Optional[bool] = None with the proper import, or use a default boolean value like False if a default state can be defined.

Suggested change:
- if_only_decode: bool = None
+ if_only_decode: Optional[bool] = None

"--chunked-moe-size",
type=int,
default=EngineArgs.chunked_moe_size,
help="chunked size of moe input.",
Copilot AI commented Nov 26, 2025:

Inconsistent capitalization in help text. The help text should start with a capital letter or follow the pattern of other help messages in the file. Change to 'Chunked size of moe input.' or 'Chunk size of MoE input.'

Suggested change:
- help="chunked size of moe input.",
+ help="Chunked size of MoE input.",


return out

def forward_normal(self, x, gate):
Copilot AI commented Nov 26, 2025:

Missing docstring for forward_normal method. Add documentation explaining this is the standard non-chunked MoE forward pass.

Suggested change:
  def forward_normal(self, x, gate):
+     """
+     Standard non-chunked MoE forward pass.
+
+     Args:
+         x (Tensor): Input tensor to the MoE layer.
+         gate (nn.Layer): Gating layer for expert selection.
+
+     Returns:
+         Tensor: Output tensor after applying the MoE experts.
+     """

    self.quant_method = MockQuantMethod()

def forward(self, x, gate):
    return self.quant_method.apply(x, gate)
Copilot AI commented Nov 26, 2025:

Mock implementation incorrectly passes only 2 arguments to quant_method.apply(), but the actual implementation (line 630, 632, 637 in moe.py) passes 3 arguments: self, x, gate. This makes the mock inconsistent with the real code and may not catch interface bugs. Update to return self.quant_method.apply(self, x, gate).

Suggested change:
- return self.quant_method.apply(x, gate)
+ return self.quant_method.apply(self, x, gate)

Comment on lines 85 to 86
def apply(self, layer, x, gate):
    return x
Copilot AI commented Nov 26, 2025:

The mock MockQuantMethod.apply ignores the layer and gate parameters and simply returns x. This doesn't validate that the actual chunked MoE logic correctly passes these parameters to the quant method. Consider adding assertions to verify the parameters are passed correctly.
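One way this could be done (a minimal sketch; the constructor arguments and the num_calls counter are hypothetical additions by this review, not part of the PR, and assume the test constructs the mock with the layer and gate objects it expects):

class MockQuantMethod:
    def __init__(self, expected_layer=None, expected_gate=None):
        self.expected_layer = expected_layer
        self.expected_gate = expected_gate
        self.num_calls = 0

    def apply(self, layer, x, gate):
        # Check that the chunked MoE path forwards the layer and gate it was given.
        if self.expected_layer is not None:
            assert layer is self.expected_layer, "apply() received an unexpected layer"
        if self.expected_gate is not None:
            assert gate is self.expected_gate, "apply() received an unexpected gate"
        self.num_calls += 1
        return x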


codecov-commenter commented Nov 28, 2025

Codecov Report

❌ Patch coverage is 97.22222% with 2 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@051b82b). Learn more about missing BASE report.

Files with missing lines (Patch %, Lines):
fastdeploy/model_executor/layers/moe/moe.py: 90.90% (1 Missing, 1 Partial) ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #4575   +/-   ##
==========================================
  Coverage           ?   59.74%           
==========================================
  Files              ?      324           
  Lines              ?    39669           
  Branches           ?     5965           
==========================================
  Hits               ?    23701           
  Misses             ?    14087           
  Partials           ?     1881           
Flag GPU: Coverage Δ 59.74% <97.22%> (?)

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@RichardWooSJTU (Collaborator) left a comment:

LGTM


if_only_decode = dist_status.if_only_decode
if self.fd_config.parallel_config.enable_chunked_moe:
    self.fd_config.parallel_config.max_moe_num_chunk = dist_status.max_moe_num_chunk
Collaborator:

Why is forward_meta no longer passed to the MoE layer?


Comment on lines +543 to +546
self.enable_chunked_moe = False
self.chunked_moe_size = 256
self.max_moe_num_chunk = 1
self.moe_num_chunk = 1
Collaborator:

There are quite a few chunked MoE related parameters now; with more than two, consider packing them into a dict.
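For illustration, a hedged sketch of what the suggested grouping could look like (the class name ChunkedMoEConfig is hypothetical; the field names mirror the attributes shown in the diff context above):

from dataclasses import dataclass

@dataclass
class ChunkedMoEConfig:
    enable_chunked_moe: bool = False
    chunked_moe_size: int = 256
    max_moe_num_chunk: int = 1
    moe_num_chunk: int = 1

# e.g. ParallelConfig would then hold a single attribute:
# self.chunked_moe = ChunkedMoEConfig()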

Collaborator Author:

OK, I'll change that together with the upcoming forward_meta PR; this one needs to go out for testing today.

Collaborator:

@gongshaotian please approve~

x_split_list = paddle.tensor_split(x, self.fd_config.parallel_config.moe_num_chunk, axis=0)
out_split_list = [None] * self.fd_config.parallel_config.moe_num_chunk

for i in range(self.fd_config.parallel_config.max_moe_num_chunk):
Collaborator:

Is max_moe_num_chunk dynamic? I have a feeling this can't be captured by cudaGraph.

@Wanglongzhi2001 (Collaborator Author) commented Dec 1, 2025:

> Is max_moe_num_chunk dynamic? I have a feeling this can't be captured by cudaGraph.

Currently this is only used for image generation, and the chunked size is set to 1024 or larger, so when max_moe_num_chunk changes, cudagraph is not entered.

Collaborator Author:

> Is max_moe_num_chunk dynamic? I have a feeling this can't be captured by cudaGraph.

My understanding is that the scenarios that need this feature are also the ones with a very large number of tokens, which are already incompatible with cudagraph. I can add an assert in the next PR so that this and cudagraph cannot both be enabled.
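A hedged sketch of what that follow-up check could look like (the attribute names parallel_config.enable_chunked_moe and graph_opt_config.use_cudagraph are assumptions based on this PR and FastDeploy's config layout; the real assert will land in a later PR):

# Hypothetical config validation; attribute names are assumed, not confirmed.
if fd_config.parallel_config.enable_chunked_moe:
    assert not fd_config.graph_opt_config.use_cudagraph, (
        "Chunked MoE is not compatible with CUDA Graph capture; disable one of them."
    )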

@gongshaotian (Collaborator) left a comment:

LGTM

@Wanglongzhi2001 merged commit add524d into PaddlePaddle:develop on Dec 1, 2025
15 of 19 checks passed
@Wanglongzhi2001 deleted the chunked_moe branch on January 29, 2026 at 13:47
