[Feature] Unify quant ops #6021
Conversation
Thanks for your contribution!

/re-run all-failed
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## release/2.4 #6021 +/- ##
==============================================
Coverage ? 58.70%
==============================================
Files ? 329
Lines ? 41046
Branches ? 6261
==============================================
Hits ? 24094
Misses ? 15065
Partials ? 1887
Flags with carried forward coverage won't be shown.
☔ View full report in Codecov by Sentry.
/re-run all-failed
Pull request overview
This PR unifies the quantization operators used by the training framework and the inference engine, replacing the custom per_token_quant and per_token_quant_padding operators with the standard Paddle API paddle.incubate.nn.functional.fp8_quant_blockwise to keep training and inference consistent.
Changes:
- Added a scale_wrapper function in fastdeploy/model_executor/layers/utils.py that implements the FP8 quantization scaling logic
- Updated per_block_cast_to_fp8 to use the new scale_wrapper function
- Replaced fastdeploy.model_executor.ops.gpu.per_token_quant with paddle.incubate.nn.functional.fp8_quant_blockwise in several MoE-related files
- Added a check to the SiluAndMul activation layer so that paddle.nn.functional.swiglu is used directly when there is no bias and quant_scale is -1 (see the sketch after this list)
- Updated the corresponding test cases to match the new behavior
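A minimal sketch of the fast path described in the last two bullets, assuming it reduces to a plain swiglu (split, silu-gate, multiply); this is not the actual forward_cuda implementation, and the helper name is hypothetical:

```python
import paddle
import paddle.nn.functional as F

def silu_and_mul_fast_path(x: paddle.Tensor) -> paddle.Tensor:
    """Swiglu fast path taken when bias is None and quant_scale == -1."""
    # Split the last dimension in half and gate one half with silu; with no
    # bias and no dequant scale this is all SiluAndMul needs to compute.
    gate, up = paddle.chunk(x, 2, axis=-1)
    return F.silu(gate) * up

# With an all-ones input the output is silu(1) * 1 ≈ 0.7310586, which matches
# (within float32 precision) the 0.73105854 expected by the updated test.
out = silu_and_mul_fast_path(paddle.ones([2, 8], dtype="float32"))
```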
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| fastdeploy/model_executor/layers/utils.py | Adds the scale_wrapper function and updates per_block_cast_to_fp8 to use the unified scaling logic |
| fastdeploy/model_executor/layers/quantization/block_wise_fp8.py | Replaces the custom per_token_quant_padding with the standard Paddle API fp8_quant_blockwise |
| fastdeploy/model_executor/layers/moe/fused_moe_triton_backend.py | Switches the Triton MoE backend to the unified fp8_quant_blockwise API |
| fastdeploy/model_executor/layers/moe/fused_moe_deepgemm_backend.py | Switches the DeepGEMM MoE backend to the unified fp8_quant_blockwise API and adjusts scale tensor handling |
| fastdeploy/model_executor/layers/activation.py | Adds a fast path to the forward_cuda method of the SiluAndMul layer |
| tests/layers/test_activation.py | Updates the test to cover the new conditional branch |
```diff
-            x, self.quant_config.weight_block_size[0]
+        x, x_scale_tensor = paddle.incubate.nn.functional.fp8_quant_blockwise(
+            x, using_pow2_scale=False, output_scale_transpose=False
+        )
```
Copilot AI · Jan 23, 2026
This call to paddle.incubate.nn.functional.fp8_quant_blockwise uses output_scale_transpose=False and then slices the returned scale tensor with x_scale_tensor[: x.shape[0]]. The same slicing appears in several places (lines 161, 232, 389, 433). Consider adding a shared comment explaining why the slice is needed, or double-check whether every call site actually requires it.
```suggestion
        )
        # fp8_quant_blockwise may return an extra padded dimension on the scale tensor
        # when output_scale_transpose=False. Slice by x.shape[0] to keep only the
        # valid batch entries so that x_scale_tensor matches the layout expected by EP.
```
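For context, a minimal sketch of the call-and-slice pattern the comment above refers to; the shapes are illustrative and the padding behavior of the returned scale tensor is an assumption based on this review discussion:

```python
import paddle

# Illustrative input: 6 rows of activations in bfloat16.
x = paddle.randn([6, 256]).astype("bfloat16")
x_q, x_scale_tensor = paddle.incubate.nn.functional.fp8_quant_blockwise(
    x, using_pow2_scale=False, output_scale_transpose=False
)
# The scale tensor may carry extra padded rows, so keep one scale row per
# valid input row before handing both tensors to the downstream kernel.
x_scale_tensor = x_scale_tensor[: x.shape[0]]
```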
```diff
-        ffn_in_x, ffn_in_x_scale_tensor = fastdeploy.model_executor.ops.gpu.per_token_quant(
-            ffn_out, self.quant_config.weight_block_size[0]
+        ffn_in_x, ffn_in_x_scale_tensor = paddle.incubate.nn.functional.fp8_quant_blockwise(
+            ffn_out, using_pow2_scale=False
```
Copilot AI · Jan 23, 2026
This call to fp8_quant_blockwise does not specify the output_scale_transpose argument. Based on the surrounding code and the other call sites, it should explicitly pass either output_scale_transpose=True (since the result is transposed with .T afterwards) or output_scale_transpose=False. Spelling out the argument improves readability and consistency.
```suggestion
            ffn_out,
            using_pow2_scale=False,
            output_scale_transpose=True,
```
```python
        if layer.bias is None and layer.quant_scale == -1:
            self.assertTrue((out.numpy() == 0.73105854).all())
        else:
            self.assertTrue((out.numpy() == 1).all())
            mock_fused.assert_called_once()
```
Copilot AI · Jan 23, 2026
The test logic is flawed. In test_forward_cuda, the layer is constructed with DummyFDConfig() (no bias argument), so layer.bias is None and layer.quant_scale is -1 (the default). The condition layer.bias is None and layer.quant_scale == -1 is therefore always true, the else branch is never executed, and mock_fused.assert_called_once() is never reached.
With this change, when bias is None and quant_scale is -1, forward_cuda calls paddle.nn.functional.swiglu(x) directly instead of fused_bias_act, so the second (else) branch of this test is unreachable dead code.
Suggestions:
- Split this into two independent test cases: one with a bias or quant_scale != -1, and one with bias None and quant_scale == -1
- Or create two different layer instances within this test to cover both paths (see the sketch after this list)
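A minimal sketch of the suggested split, reusing the SiluAndMul, DummyFDConfig, and mocked fused_bias_act fixtures already present in the test file; the constructor's bias argument and the patch target path are assumptions:

```python
import unittest
from unittest import mock

import numpy as np
import paddle


class TestSiluAndMulForwardCuda(unittest.TestCase):
    def test_fast_path_uses_swiglu(self):
        # bias is None and quant_scale defaults to -1, so swiglu runs directly.
        layer = SiluAndMul(DummyFDConfig())
        out = layer.forward_cuda(paddle.ones([2, 8], dtype="float32"))
        np.testing.assert_allclose(out.numpy(), 0.73105854, rtol=1e-6)

    def test_fused_path_calls_fused_bias_act(self):
        # A non-None bias (hypothetical constructor argument) should force the
        # fused kernel path instead of the swiglu shortcut.
        layer = SiluAndMul(DummyFDConfig(), bias=paddle.zeros([8]))
        with mock.patch(
            "fastdeploy.model_executor.layers.activation.fused_bias_act",
            return_value=paddle.ones([2, 4]),
        ) as mock_fused:
            out = layer.forward_cuda(paddle.ones([2, 8], dtype="float32"))
        mock_fused.assert_called_once()
        self.assertTrue((out.numpy() == 1).all())
```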
```diff
-        ffn_in_x, ffn_in_x_scale_tensor = fastdeploy.model_executor.ops.gpu.per_token_quant(
-            ffn_out, self.quant_config.weight_block_size[0]
+        ffn_in_x, ffn_in_x_scale_tensor = paddle.incubate.nn.functional.fp8_quant_blockwise(
+            ffn_out, using_pow2_scale=False
```
Copilot AI · Jan 23, 2026
This call to paddle.incubate.nn.functional.fp8_quant_blockwise does not specify output_scale_transpose. Judging from the other call sites, the default when the argument is omitted is likely False, and here the returned scale tensor is subsequently transposed (.T) and sliced.
For consistency and readability, consider:
- Explicitly passing output_scale_transpose=True (if the default behavior already matches what is needed)
- Or explicitly passing output_scale_transpose=False and keeping the .T operation
- Comparing with the usage at line 229 of block_wise_fp8.py, which passes output_scale_transpose=True and also applies .T, possibly indicating a redundant operation
```suggestion
                ffn_out,
                using_pow2_scale=False,
                output_scale_transpose=False,
```
```python
    Args:
        x_amax (paddle.Tensor): amax tensor (float32 recommended)
        eps (float): epsilon to avoid division by zero
    Returns:
        paddle.Tensor: scale tensor, same shape as x_amax
    """
    fp8_max = 448.0
    float_max = paddle.finfo(paddle.float32).max
    amax_mod = paddle.maximum(
        x_amax,
        paddle.full_like(x_amax, eps),
    )
    scale = fp8_max / amax_mod
    scale = paddle.where(
        amax_mod == 0,
        paddle.ones_like(scale),
        scale,
    )
    scale = paddle.where(
```
Copilot AI · Jan 23, 2026
The logic in scale_wrapper is redundant. At lines 234-237 the code uses paddle.maximum to clamp x_amax to at least eps, and then at lines 239-242 it checks for amax_mod == 0. Since amax_mod is already guaranteed to be at least eps (when eps > 0), that check can never be true (assuming eps > 0).
Suggestions:
- If eps is always greater than 0, remove the check at lines 239-242, because it never triggers
- If eps can be 0, replace the logic at lines 234-237 with paddle.clip(x_amax, min=eps) or reorder the condition checks
- Alternatively, change the condition at lines 239-242 to check x_amax == 0 instead of amax_mod == 0
Also consider documenting the expected range of the eps parameter in the docstring.
```suggestion
    Args:
        x_amax (paddle.Tensor): amax tensor (float32 recommended)
        eps (float): Non-negative epsilon to avoid division by zero.
            When eps == 0.0 and x_amax contains zeros, the corresponding
            scale values are set to 1.0 to avoid infinite results.
    Returns:
        paddle.Tensor: scale tensor, same shape as x_amax
    """
    fp8_max = 448.0
    float_max = paddle.finfo(paddle.float32).max
    amax_floor = paddle.full_like(x_amax, eps)
    amax_mod = paddle.maximum(x_amax, amax_floor)
    scale = fp8_max / amax_mod
    # Only apply zero-guard when eps <= 0.0; for eps > 0.0, amax_mod is
    # already guaranteed to be at least eps, so this condition would be redundant.
    if eps <= 0.0:
        scale = paddle.where(
            amax_mod == 0,
            paddle.ones_like(scale),
            scale,
        )
    scale = paddle.where(
```
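For reference, a minimal sketch of how a scale produced by this convention (scale = 448.0 / amax) is typically applied. The 128-wide block size and the eps value are assumptions, and the inline scale_wrapper below is a simplified stand-in for the proposed utils function (it omits the float_max clamp and zero-guard):

```python
import paddle

def scale_wrapper(x_amax, eps=1e-12):
    # Same convention as the proposed utils.scale_wrapper: scale = 448 / max(amax, eps).
    return 448.0 / paddle.maximum(x_amax, paddle.full_like(x_amax, eps))

# Blockwise scaling of a single row split into 128-wide blocks (block size assumed).
x = paddle.randn([1, 256], dtype="float32")
blocks = x.reshape([-1, 128])
x_amax = blocks.abs().max(axis=-1, keepdim=True)   # per-block amax
scale = scale_wrapper(x_amax)                      # shape [num_blocks, 1]
x_scaled = blocks * scale                          # now bounded by the FP8 E4M3 max of 448
x_back = x_scaled / scale                          # dividing back recovers the input
```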
```diff
-                intermediate_cache2, self.quant_config.weight_block_size[0]
+            x_q, x_scale = paddle.incubate.nn.functional.fp8_quant_blockwise(
+                intermediate_cache2, using_pow2_scale=False, output_scale_transpose=False
+            )
```
Copilot AI · Jan 23, 2026
Here x_scale is sliced with x_scale[: x_q.shape[0]], consistent with the similar code above, but a comment explaining why the slice is needed would make this easier to maintain.
```suggestion
            )
            # Align the activation scale with the quantized activation rows.
            # fp8_quant_blockwise may return extra scale rows due to block padding,
            # but the fused Triton kernel expects one scale row per row in x_q.
```
```diff
 import paddle

-import fastdeploy
 from fastdeploy import envs
```
Copilot AI · Jan 23, 2026
The import of the fastdeploy module was removed from block_wise_fp8.py, but other files (such as fused_moe_deepgemm_backend.py) still use other functions from fastdeploy.model_executor.ops.gpu. Please confirm that removing this import does not affect other functionality and whether any related import statements need to be updated.
```diff
         # 2. Dynamic compute blockwise quantization scales
-        x, x_scale_tensor = fastdeploy.model_executor.ops.gpu.per_token_quant(
-            x, self.quant_config.weight_block_size[0]
+        x, x_scale_tensor = paddle.incubate.nn.functional.fp8_quant_blockwise(
```
Copilot AI · Jan 23, 2026
The PR description lacks detail. Per the contribution guidelines, the description should at least explain why the change is made and what problem it solves. The current "Modifications" section is a single sentence ("Unify the quantization operators used by the framework and inference to keep training and inference consistent"); consider adding:
- Which operators were replaced, and with what
- Why the unification is needed (for example, performance gains or maintainability improvements)
- The impact of this change on existing functionality
- Whether sufficient testing was done to verify training/inference consistency
In addition, several items in the Checklist are unchecked, notably the unit tests and accuracy tests; please add the relevant information or tick the corresponding items.
```python
            x, x_scale_tensor = paddle.incubate.nn.functional.fp8_quant_blockwise(
                x, using_pow2_scale=False, output_scale_transpose=True
            )
            x_scale_tensor = x_scale_tensor.T
```
Copilot AI · Jan 23, 2026
output_scale_transpose=True is used here, and then line 231 transposes again with x_scale_tensor.T. This looks redundant: if output_scale_transpose=True already returns a transposed result, transposing again restores the original shape.
Suggestions:
- Check the documentation of paddle.incubate.nn.functional.fp8_quant_blockwise to confirm the exact behavior of output_scale_transpose=True
- If output_scale_transpose=True already performs the transpose, remove the .T at line 231
- If two transposes are needed to reach a specific dimension order, add a comment explaining why
- Keep this consistent with the usage in fused_moe_deepgemm_backend.py
```suggestion
            # output_scale_transpose=True returns the scale tensor in the layout required by deep_gemm
            x, x_scale_tensor = paddle.incubate.nn.functional.fp8_quant_blockwise(
                x, using_pow2_scale=False, output_scale_transpose=True
            )
```
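If the exact behavior of output_scale_transpose is in doubt, a quick shape probe along these lines makes the redundancy easy to confirm; the shapes are illustrative and the op's dtype and alignment requirements are assumptions:

```python
import paddle

x = paddle.randn([4, 256]).astype("bfloat16")
for transpose in (False, True):
    _, s = paddle.incubate.nn.functional.fp8_quant_blockwise(
        x, using_pow2_scale=False, output_scale_transpose=transpose
    )
    # Comparing the two scale shapes shows whether the caller's extra .T is
    # still needed or simply undoes the transpose performed by the op.
    print(transpose, s.shape)
```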
```diff
-            self.assertTrue((out.numpy() == 1).all())
-            mock_fused.assert_called_once()
+        if layer.bias is None and layer.quant_scale == -1:
+            self.assertTrue((out.numpy() == 0.73105854).all())
```
Copilot AI · Jan 23, 2026
The test uses an exact floating-point comparison, out.numpy() == 0.73105854. Since floating-point arithmetic can introduce precision errors, prefer an approximate comparison with numpy.allclose or self.assertAlmostEqual rather than ==, for example: numpy.allclose(out.numpy(), 0.73105854, rtol=1e-6). See the sketch below.
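A minimal, self-contained sketch of the approximate assertion; silu(1) * 1 = 1 / (1 + e^-1) ≈ 0.7310586, so a small relative tolerance suffices:

```python
import numpy as np
import paddle
import paddle.nn.functional as F

# Reproduce the expected fast-path output for an all-ones input.
gate, up = paddle.chunk(paddle.ones([2, 8], dtype="float32"), 2, axis=-1)
out = F.silu(gate) * up

# Approximate comparison instead of exact float equality.
np.testing.assert_allclose(out.numpy(), 0.73105854, rtol=1e-6)
```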
This reverts commit 9a48206.
This reverts commit da9b356.
Motivation
Modifications
Unify the quantization operators used by the framework and inference to keep training and inference consistent.
Usage or Command
Accuracy Tests
Checklist
Tag list: [[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- Run pre-commit before commit.
- For a release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.