
Conversation

@bukejiyu (Collaborator) commented Jul 30, 2025

qwen3moe support default loader v1
todo:
1. Add unit tests for the new loader.

New-loader model test with topp=0:

  • qwen3moe: token-by-token alignment (a sketch of this check follows below)

Old-loader model tests: CI passed
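
For illustration, a minimal sketch of such a token-by-token alignment check (a hypothetical helper, not part of this PR's test code):

# Hypothetical sketch: compare greedy (topp=0) generations from the old and
# new loaders and require identical token ids at every decoding step.
def assert_token_aligned(tokens_old: list, tokens_new: list) -> None:
    assert len(tokens_old) == len(tokens_new), "generation lengths differ"
    for step, (a, b) in enumerate(zip(tokens_old, tokens_new)):
        assert a == b, f"first divergence at step {step}: {a} != {b}"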

paddle-bot (bot) commented Jul 30, 2025

Thanks for your contribution!

@bukejiyu force-pushed the refactor_moe branch 4 times, most recently from 1429b97 to 0a03421, on August 1, 2025 at 07:31
@bukejiyu changed the title from "qwen3moe support" to "qwen3moe" on Aug 1, 2025
Comment on lines 49 to 60
layer.weight = layer.create_parameter(
shape=layer.weight_shape,
dtype=layer.weight_dtype,
is_bias=False,
default_initializer=paddle.nn.initializer.Constant(0),
)

layer.weight = layer.create_parameter(
shape=layer.weight_shape,
dtype=layer.weight_dtype,
is_bias=False,
)
Collaborator:

Why is layer.weight assigned twice here? That looks like a bug.

Collaborator (Author):

done
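
For reference, a minimal sketch of the fix: layer.weight is created exactly once, keeping the zero initializer from the first of the two original calls.

layer.weight = layer.create_parameter(
    shape=layer.weight_shape,
    dtype=layer.weight_dtype,
    is_bias=False,
    default_initializer=paddle.nn.initializer.Constant(0),
)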

Comment on lines 655 to 657
# if fd_config.quant_config:
# self.quant_method = fd_config.quant_config.get_quant_method(self)
# self.quant_method.create_weights(self)
Collaborator:

Please delete the commented-out code.

Collaborator (Author):

done

Comment on lines 677 to 707
# self.init_weight()

# def init_weight(self):
# """
# Initialize the weights and biases.
# """
# if self.skip_quant:
# self.weight_dtype = self._dtype

# self.weight = self.create_parameter(
# shape=self.weight_shape,
# dtype=self.weight_dtype,
# is_bias=False,
# default_initializer=paddle.nn.initializer.Constant(0),
# )

# self.bias = None
# if self.with_bias:
# self.bias = self.create_parameter(
# shape=[self.hidden_size],
# dtype=self._dtype,
# is_bias=True,
# )

# if self.nranks > 0:
# # row parallel
# _set_var_distributed(self.weight, split_axis=0)

# # smooth quant
# self.linear_shift = None
# self.linear_smooth = None
Collaborator:

Is this code still needed?

Collaborator (Author):

done

weight_key_map=weight_key_map,
)

self.gate = ReplicatedLinear(
Collaborator:

Why is there an extra self.gate here?

Collaborator (Author):

The gate was pulled out of FusedMoE to align with vLLM, so that all linear-related weight creation is controlled in linear.py (see the toy sketch below).
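
A toy sketch of the resulting structure, using plain Paddle layers (ToyMoEBlock and its internals are illustrative, not FastDeploy's real classes; in the PR the gate is a ReplicatedLinear created via linear.py):

import paddle
import paddle.nn as nn

class ToyMoEBlock(nn.Layer):
    """Illustrative only: the router gate is a plain linear owned by the block,
    mirroring vLLM, and the expert container consumes precomputed gate logits."""

    def __init__(self, hidden_size: int, num_experts: int, inter_size: int):
        super().__init__()
        # the gate lives outside the fused-MoE container
        self.gate = nn.Linear(hidden_size, num_experts, bias_attr=False)
        self.experts = nn.LayerList(
            [nn.Sequential(nn.Linear(hidden_size, inter_size), nn.GELU(),
                           nn.Linear(inter_size, hidden_size))
             for _ in range(num_experts)]
        )

    def forward(self, x):
        gate_out = self.gate(x)                  # router logits from the external gate
        top1 = paddle.argmax(gate_out, axis=-1)  # top-1 routing, for brevity
        out = paddle.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = (top1 == i).astype(x.dtype).unsqueeze(-1)
            out = out + mask * expert(x)         # the real FusedMoE fuses this dispatch
        return out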

fake_hidden_states: Optional[paddle.Tensor] = None


class Ernie4_5_VMoeBlock(nn.Layer):
Collaborator:

Ernie4_5_VMoeBlock -> Ernie4_5_VLMoeBlock?

}

self.fused_moe = FusedMoE(
self.expert = FusedMoE(
Collaborator:

Some places call it self.experts and others self.expert?

Collaborator (Author):

done

Comment on lines 176 to 178
if self.weight_key not in state_dict:
# TODO(bukejiyu): Temporary hack for Ernie4.5-VL, remove after loader refactor
self.weight_key = self.weight_key + "_1"
Collaborator:

Why is this written here?

Collaborator (Author):

Changed it so that a special weight_key can be passed in (a sketch of the idea follows).
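
A minimal sketch of the revised approach (resolve_weight_key is a hypothetical helper name, not the merged code):

from typing import Optional

def resolve_weight_key(state_dict: dict, default_key: str,
                       override_key: Optional[str] = None) -> str:
    """Pick the checkpoint key for a weight. Callers with unusual checkpoints
    (e.g. Ernie4.5-VL's "<key>_1" naming) pass the special key explicitly
    instead of the loader guessing a suffix."""
    key = override_key or default_key
    if key not in state_dict:
        raise KeyError(f"weight key {key!r} not found in checkpoint")
    return key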

)

if forward_meta.max_enc_len_this_time:
if forward_meta.max_len_tensor_cpu[1]:
Collaborator:

Why was this changed?

Collaborator (Author):

DeepSeek doesn't run unless this is changed. I'll ask Ruian to review this part as well.

Comment on lines +55 to +62
# mlp.gate.weight is precision-sensitive, so we cast it to float32 for computation
if layer.weight.dtype != weights.dtype:
weights = weights.cast(layer.weight.dtype)
Collaborator:

Wasn't weight_dtype already specified when self.gate was initialized? Why is the cast needed here?

Collaborator (Author):

Because the weights on disk are still bf16.

Comment on lines +77 to +79
# mlp.gate.weight is precision-sensitive, so we cast it to float32 for computation
if param.dtype != loaded_weight.dtype:
loaded_weight = loaded_weight.cast(param.dtype)
Collaborator:

Why is the cast needed here?

Collaborator (Author):

Because with the new loader, ordinary linear layers go through this function to load weights, and the gate weight on disk is bf16 while the param is float32, so a cast is needed (a sketch of the pattern follows).
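
A minimal sketch of that load-time pattern, assuming Paddle's in-place Tensor.copy_ (load_gate_weight is an illustrative name):

import paddle

def load_gate_weight(param: paddle.Tensor, loaded_weight: paddle.Tensor) -> None:
    # the gate param is kept in float32 for routing precision, while the
    # checkpoint tensor may be bf16, so cast before copying into the param
    if param.dtype != loaded_weight.dtype:
        loaded_weight = loaded_weight.cast(param.dtype)  # e.g. bf16 -> fp32
    assert param.shape == loaded_weight.shape
    param.copy_(loaded_weight, False)  # in-place copy into the existing parameter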

Collaborator:

Aren't they fp32 on disk? [screenshot]

Collaborator (Author):

We can revisit this later if needed; this PR only aims to stay consistent with develop.

if hasattr(layer, "nranks") and layer.nranks > 0:
split_axis = extra_weight_attrs.get("split_axis")
_set_var_distributed(layer.weight, split_axis=split_axis)
set_weight_attrs(layer.weight, {"output_dim": extra_weight_attrs.get("output_dim")})
Collaborator:

What does output_dim mean?

Collaborator (Author):

It controls which dimension is split for tensor parallelism (see the sketch below).
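
A toy sketch of what that attribute drives at load time (shard_for_tp is an illustrative helper, not the loader's real code):

import paddle

def shard_for_tp(loaded_weight: paddle.Tensor, output_dim: bool,
                 tp_rank: int, tp_size: int) -> paddle.Tensor:
    # output_dim=True  -> shard along the output (last) axis, column-parallel style
    # output_dim=False -> shard along the input (first) axis, row-parallel style
    axis = loaded_weight.ndim - 1 if output_dim else 0
    block = loaded_weight.shape[axis] // tp_size
    return paddle.slice(loaded_weight, axes=[axis],
                        starts=[tp_rank * block], ends=[(tp_rank + 1) * block])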

self.quant_method.create_weights(
self,
split_axis=1,
output_dim=True,
Collaborator:

It would be best to add some comments: what the layout is, which dimension split_axis refers to, and what output_dim means.

Collaborator (Author):

done
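
For readers of this thread, an illustrative version of the annotated call (the comments assume a 2-D weight laid out [input_dim, output_dim]):

self.quant_method.create_weights(
    self,
    split_axis=1,      # partition the parameter along axis 1 across TP ranks
    output_dim=True,   # the split axis is the output dim (column parallel), so
)                      # each rank produces a slice of the outputs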

self.quant_method.create_weights(
self,
split_axis=0,
output_dim=False,
Collaborator:

Same as above.

Collaborator (Author):

done


class UnquantizedFusedMoEMethod(MoEMethodBase):
def create_weights(self, layer: nn.Layer, **extra_weight_attrs):
from fastdeploy.platforms import current_platform
Collaborator:

Why is this imported here?

Collaborator (Author):

Moved it to the top of the file; done (see below).
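
i.e., a sketch of the adjustment:

# moved to module scope, so the import runs once instead of on every
# create_weights call
from fastdeploy.platforms import current_platform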

self.down_proj_weight_shape = [layer.num_experts, layer.moe_intermediate_size, layer.hidden_size]
else:
return self.apply_tp(layer, x, gate_out)
self.up_gate_proj_weight_shape = [layer.num_experts, layer.moe_intermediate_size * 2, layer.hidden_size]
Collaborator:

Do all non-GPU hardware backends use this weight layout?

Collaborator (Author):

Yes. All hardware we currently support takes the else branch (see the sketch below).
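
A small sketch of that else-branch layout, with illustrative sizes (the numbers are examples, not a real config):

# non-GPU ("else") expert weight layout: all experts stacked on axis 0
num_experts, hidden_size, moe_intermediate_size = 64, 2048, 768  # example sizes
up_gate_proj_weight_shape = [num_experts, moe_intermediate_size * 2, hidden_size]
down_proj_weight_shape = [num_experts, moe_intermediate_size, hidden_size]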

tp_size={self.tp_size}."
)

def weight_loader(self, param, loaded_weight, expert_id, shard_id: Optional[str] = None):
Collaborator:

Please add parameter annotations, e.g. what shard_id means. Also, "id" usually suggests an int rather than a string.

Collaborator (Author):

weight_loader already includes assert shard_id in ["gate", "down", "up"]; it just uses the same name as vLLM (an annotated sketch follows).
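
For the record, a sketch of how the signature could be annotated (the docstring is illustrative, not the merged code):

from typing import Optional
import paddle

def weight_loader(self, param: paddle.Tensor, loaded_weight: paddle.Tensor,
                  expert_id: int, shard_id: Optional[str] = None) -> None:
    """Load one expert shard into a fused MoE parameter.

    Args:
        param: destination fused parameter (up_gate_proj or down_proj weights).
        loaded_weight: checkpoint tensor for a single expert's projection.
        expert_id: index of the expert this shard belongs to.
        shard_id: which projection this is, named as in vLLM:
            "gate" / "up" (the halves of up_gate_proj) or "down".
    """
    assert shard_id in ["gate", "down", "up"], f"unexpected shard_id: {shard_id}"
    # per-shard slicing/copy logic follows here (omitted in this sketch)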

"up_gate_proj_expert_in_scale_key": f"{prefix}.experts.{{}}.up_gate_proj.activation_scale",
"down_proj_expert_in_scale_key": f"{prefix}.experts.{{}}.down_proj.activation_scale",
}
else:
Collaborator:

What moe_quant_type does the else branch correspond to? A comment would help.

Collaborator (Author):

done

@@ -401,834 +401,834 @@ resampler_model.mlp.weight
resampler_model.mlp.bias
Collaborator:

Is this file auto-generated or manually edited?

Collaborator (Author):

Auto-generated, then manually tweaked; otherwise the RL side couldn't pass CI.

@bukejiyu force-pushed the refactor_moe branch 2 times, most recently from 8f2a735 to 2534428, on August 5, 2025 at 17:25
@Jiang-Jia-Jun merged commit 20839ab into PaddlePaddle:develop on Aug 6, 2025
16 of 21 checks passed
gzy19990617 pushed a commit to gzy19990617/FastDeploy that referenced this pull request Aug 7, 2025
Jiang-Jia-Jun pushed a commit that referenced this pull request Aug 8, 2025
* fix noaux_tc op

* fix

* update

* fix qk norm

* fix linear for prequant loader

* test

* fix

* fix

* rm some print

* fix noaux_tc op

* test

* Fix the confused enable_early_stop when only set early_stop_config (#3214)

* fix the confused early_stop_config when only set early_stop_config

* pre-commit

* write a general method

* Add ci case for min token and max token (#3229)

Co-authored-by: xujing43 <xujing43@baidu.com>

* add some evil cases (#3240)

* add repetition early stop cases

* add repetition early stop cases

* add bad cases

* add bad cases

* add evil cases

* qwen3_moe (#3084)

* [Feature] support seed parameter (#3161)

* support seed

* fix

* add SamplingMetadata seed test

* The next_tokens values are inconsistent!

* add air and rejection seed test

* fix

* add SamplingParams seed test

* fix seed=0

* Default to default

* fix

* fix args_utils

* fix review

* fix review

* fix

* fix

* add xpu,gcu,iluvatar support seed

* fix

* 【Fix Bug】Fix bug in fa3 centralized-deployment support (#3235)

* fix fa3 centralized-deployment bug

* add qknorm parameter

* fix qk norm

* fix

* update

* fix linear for prequant loader

* fix

* fix

* rm some print

* fix

* fix moe init weight&scale

* fix moe init weight&scale

---------

Co-authored-by: bukejiyu <395822456@qq.com>
Co-authored-by: yuanxiaolan <yuanxiaolan01@baidu.com>
Co-authored-by: Zero Rains <linjunlu@zerorains.top>
Co-authored-by: xjkmfa <108254620+xjkmfa@users.noreply.github.com>
Co-authored-by: xujing43 <xujing43@baidu.com>
Co-authored-by: Divano <dddivano@outlook.com>
Co-authored-by: bukejiyu <52310069+bukejiyu@users.noreply.github.com>
Co-authored-by: lizexu123 <39205361+lizexu123@users.noreply.github.com>
Co-authored-by: yangjianfengo1 <125249383+yangjianfengo1@users.noreply.github.com>
Co-authored-by: qingqing01 <dangqingqing@baidu.com>