
Conversation

@fxyfxy777
Contributor

Motivation

💡 If this PR is a Cherry Pick, the PR title must follow the format: add the [Cherry-Pick] label at the very beginning and append the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Unify the quantization operators used by the training framework and the inference engine so that training and inference stay consistent.

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code and run pre-commit before committing.
  • Add unit tests. If no unit tests are added, explain the reason in this PR.
  • Provide accuracy results.
  • If the current PR targets the release branch, make sure it has first been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot bot commented Jan 13, 2026

Thanks for your contribution!

@fxyfxy777
Contributor Author

/re-run all-failed

@codecov-commenter

codecov-commenter commented Jan 14, 2026

Codecov Report

❌ Patch coverage is 65.38462% with 9 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/2.4@9a91a5c). Learn more about missing BASE report.

Files with missing lines                                Patch %   Lines
..._executor/layers/moe/fused_moe_deepgemm_backend.py    0.00%    8 Missing ⚠️
fastdeploy/model_executor/layers/activation.py          50.00%    0 Missing and 1 partial ⚠️
Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.4    #6021   +/-   ##
==============================================
  Coverage               ?   58.70%           
==============================================
  Files                  ?      329           
  Lines                  ?    41046           
  Branches               ?     6261           
==============================================
  Hits                   ?    24094           
  Misses                 ?    15065           
  Partials               ?     1887           
Flag Coverage Δ
GPU 58.70% <65.38%> (?)

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@fxyfxy777
Contributor Author

/re-run all-failed

Contributor

Copilot AI left a comment


Pull request overview

The goal of this PR is to unify the quantization operators used by the training framework and the inference engine: the custom per_token_quant and per_token_quant_padding operators are replaced with the standard Paddle API paddle.incubate.nn.functional.fp8_quant_blockwise, so that training and inference behave consistently.

Changes:

  • Added a scale_wrapper function in fastdeploy/model_executor/layers/utils.py that implements the scaling logic for FP8 quantization
  • Updated per_block_cast_to_fp8 to use the new scale_wrapper function
  • Replaced fastdeploy.model_executor.ops.gpu.per_token_quant with paddle.incubate.nn.functional.fp8_quant_blockwise in several MoE-related files (see the sketch after this list)
  • Added a condition to the SiluAndMul activation layer that calls paddle.nn.functional.swiglu directly when there is no bias and quant_scale is -1
  • Updated the corresponding test cases for the new behavior
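
Below is a hedged sketch of the replacement pattern described in the bullets above. The tensor shape is illustrative, and it assumes a GPU build of Paddle that exposes paddle.incubate.nn.functional.fp8_quant_blockwise with the keyword arguments shown in the diff; it is not a copy of the FastDeploy source.

    import paddle

    x = paddle.randn([16, 512]).astype("bfloat16")

    # Before (custom FastDeploy operator, block size passed explicitly):
    #   x_q, x_scale = fastdeploy.model_executor.ops.gpu.per_token_quant(
    #       x, self.quant_config.weight_block_size[0]
    #   )

    # After (standard Paddle API used throughout this PR):
    x_q, x_scale = paddle.incubate.nn.functional.fp8_quant_blockwise(
        x, using_pow2_scale=False, output_scale_transpose=False
    )

    # Several call sites then keep one scale row per activation row, since the
    # returned scale tensor may be padded along the row dimension:
    x_scale = x_scale[: x.shape[0]]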

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 11 comments.

Summary per file:
fastdeploy/model_executor/layers/utils.py: Adds the scale_wrapper function and updates per_block_cast_to_fp8 to use the unified scaling logic
fastdeploy/model_executor/layers/quantization/block_wise_fp8.py: Replaces the custom per_token_quant_padding with the standard Paddle API fp8_quant_blockwise
fastdeploy/model_executor/layers/moe/fused_moe_triton_backend.py: Switches the Triton MoE backend to the unified fp8_quant_blockwise API
fastdeploy/model_executor/layers/moe/fused_moe_deepgemm_backend.py: Switches the DeepGEMM MoE backend to the unified fp8_quant_blockwise API and adjusts the scale tensor handling
fastdeploy/model_executor/layers/activation.py: Adds a fast path to the forward_cuda method of the SiluAndMul layer
tests/layers/test_activation.py: Updates the test to verify the new conditional branch logic

-     x, self.quant_config.weight_block_size[0]
+ x, x_scale_tensor = paddle.incubate.nn.functional.fp8_quant_blockwise(
+     x, using_pow2_scale=False, output_scale_transpose=False
+ )

Copilot AI Jan 23, 2026


This call to paddle.incubate.nn.functional.fp8_quant_blockwise uses output_scale_transpose=False, and the returned scale tensor is then sliced with x_scale_tensor[: x.shape[0]]. The same slicing appears in several places (lines 161, 232, 389, and 433); consider adding a shared comment explaining why the slice is needed, or checking whether every call site actually requires it.

Suggested change
- )
+ )
+ # fp8_quant_blockwise may return an extra padded dimension on the scale tensor
+ # when output_scale_transpose=False. Slice by x.shape[0] to keep only the
+ # valid batch entries so that x_scale_tensor matches the layout expected by EP.

- ffn_in_x, ffn_in_x_scale_tensor = fastdeploy.model_executor.ops.gpu.per_token_quant(
-     ffn_out, self.quant_config.weight_block_size[0]
+ ffn_in_x, ffn_in_x_scale_tensor = paddle.incubate.nn.functional.fp8_quant_blockwise(
+     ffn_out, using_pow2_scale=False

Copilot AI Jan 23, 2026


This call to fp8_quant_blockwise does not specify the output_scale_transpose argument. Based on the surrounding code and the usage elsewhere, it should explicitly pass output_scale_transpose=True (since a .T operation follows) or output_scale_transpose=False. Please specify this argument explicitly to improve readability and consistency.

Suggested change
- ffn_out, using_pow2_scale=False
+ ffn_out,
+ using_pow2_scale=False,
+ output_scale_transpose=True,

Comment on lines +87 to +91
if layer.bias is None and layer.quant_scale == -1:
    self.assertTrue((out.numpy() == 0.73105854).all())
else:
    self.assertTrue((out.numpy() == 1).all())
    mock_fused.assert_called_once()

Copilot AI Jan 23, 2026


The test logic is flawed. In test_forward_cuda the layer is constructed with DummyFDConfig() (no bias argument), so layer.bias is None and layer.quant_scale is -1 (the default). The condition layer.bias is None and layer.quant_scale == -1 is therefore always true, the else branch never runs, and mock_fused.assert_called_once() is never executed.

With this change, when bias is None and quant_scale is -1, forward_cuda calls paddle.nn.functional.swiglu(x) directly instead of fused_bias_act, so the second branch (the else part) of this test is unreachable dead code.

Suggestions (a sketch follows this list):

  1. Split the test into two independent cases: one with a bias or quant_scale != -1, and one with bias None and quant_scale == -1.
  2. Alternatively, create two different layer instances inside this test to exercise both paths.
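
A minimal sketch of how the two paths could be split into separate tests, as suggested above. SiluAndMul, DummyFDConfig, the constructor arguments, and the fused_bias_act patch target are assumptions based on the snippets quoted in this review and may need adjusting to the real test file; running it also requires a CUDA build of Paddle.

    import unittest
    from unittest import mock

    import numpy as np
    import paddle

    from fastdeploy.model_executor.layers.activation import SiluAndMul  # assumed import path
    from tests.layers.test_activation import DummyFDConfig  # assumed helper location


    class TestSiluAndMulBranches(unittest.TestCase):
        def test_swiglu_fast_path(self):
            # No bias and quant_scale == -1: the fast path should call
            # paddle.nn.functional.swiglu and never touch fused_bias_act.
            layer = SiluAndMul(DummyFDConfig())
            x = paddle.ones([2, 8], dtype="float32")
            with mock.patch(
                "fastdeploy.model_executor.layers.activation.fused_bias_act"  # assumed target
            ) as mock_fused:
                out = layer.forward_cuda(x)
            mock_fused.assert_not_called()
            # swiglu on all-ones halves gives silu(1) * 1 ≈ 0.7310586
            np.testing.assert_allclose(out.numpy(), 0.73105854, rtol=1e-6)

        def test_fused_bias_act_path(self):
            # With a bias (or quant_scale != -1) the fused kernel path should run exactly once.
            layer = SiluAndMul(DummyFDConfig(), bias=paddle.zeros([4]))  # assumed signature
            x = paddle.ones([2, 8], dtype="float32")
            with mock.patch(
                "fastdeploy.model_executor.layers.activation.fused_bias_act",  # assumed target
                return_value=paddle.ones([2, 4]),
            ) as mock_fused:
                out = layer.forward_cuda(x)
            mock_fused.assert_called_once()
            self.assertTrue((out.numpy() == 1).all())


    if __name__ == "__main__":
        unittest.main()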

- ffn_in_x, ffn_in_x_scale_tensor = fastdeploy.model_executor.ops.gpu.per_token_quant(
-     ffn_out, self.quant_config.weight_block_size[0]
+ ffn_in_x, ffn_in_x_scale_tensor = paddle.incubate.nn.functional.fp8_quant_blockwise(
+     ffn_out, using_pow2_scale=False

Copilot AI Jan 23, 2026


This call to paddle.incubate.nn.functional.fp8_quant_blockwise does not specify the output_scale_transpose argument. Judging from the other call sites, the default is probably False. Here the returned scale tensor is then transposed (.T) and sliced.

For consistency and readability, consider:

  1. Explicitly passing output_scale_transpose=True (if the default behaviour meets the requirement),
  2. or explicitly passing output_scale_transpose=False and keeping the .T operation,
  3. and comparing with line 229 of block_wise_fp8.py, which uses output_scale_transpose=True and also applies .T, possibly indicating a redundant operation.
Suggested change
- ffn_out, using_pow2_scale=False
+ ffn_out,
+ using_pow2_scale=False,
+ output_scale_transpose=False,

Comment on lines +226 to +244
    Args:
        x_amax (paddle.Tensor): amax tensor (float32 recommended)
        eps (float): epsilon to avoid division by zero
    Returns:
        paddle.Tensor: scale tensor, same shape as x_amax
    """
    fp8_max = 448.0
    float_max = paddle.finfo(paddle.float32).max
    amax_mod = paddle.maximum(
        x_amax,
        paddle.full_like(x_amax, eps),
    )
    scale = fp8_max / amax_mod
    scale = paddle.where(
        amax_mod == 0,
        paddle.ones_like(scale),
        scale,
    )
    scale = paddle.where(

Copilot AI Jan 23, 2026


The logic of scale_wrapper is redundant. At lines 234-237 the code uses paddle.maximum to clamp x_amax to at least eps, and at lines 239-242 it then checks amax_mod == 0. Since amax_mod is already guaranteed to be at least eps, this check can never be true (assuming eps > 0).

Suggestions (a small numerical check follows the suggested change below):

  1. If eps is always greater than 0, remove the check at lines 239-242, since it never triggers.
  2. If eps can be 0, change lines 234-237 to use paddle.clip(x_amax, min=eps) or reorder the condition checks.
  3. Alternatively, change the condition at lines 239-242 to check x_amax == 0 instead of amax_mod == 0.

Also, document the expected range of eps in the function docstring.

Suggested change
-     Args:
-         x_amax (paddle.Tensor): amax tensor (float32 recommended)
-         eps (float): epsilon to avoid division by zero
-     Returns:
-         paddle.Tensor: scale tensor, same shape as x_amax
-     """
-     fp8_max = 448.0
-     float_max = paddle.finfo(paddle.float32).max
-     amax_mod = paddle.maximum(
-         x_amax,
-         paddle.full_like(x_amax, eps),
-     )
-     scale = fp8_max / amax_mod
-     scale = paddle.where(
-         amax_mod == 0,
-         paddle.ones_like(scale),
-         scale,
-     )
-     scale = paddle.where(
+     Args:
+         x_amax (paddle.Tensor): amax tensor (float32 recommended)
+         eps (float): Non-negative epsilon to avoid division by zero.
+             When eps == 0.0 and x_amax contains zeros, the corresponding
+             scale values are set to 1.0 to avoid infinite results.
+     Returns:
+         paddle.Tensor: scale tensor, same shape as x_amax
+     """
+     fp8_max = 448.0
+     float_max = paddle.finfo(paddle.float32).max
+     amax_floor = paddle.full_like(x_amax, eps)
+     amax_mod = paddle.maximum(x_amax, amax_floor)
+     scale = fp8_max / amax_mod
+     # Only apply zero-guard when eps <= 0.0; for eps > 0.0, amax_mod is
+     # already guaranteed to be at least eps, so this condition would be redundant.
+     if eps <= 0.0:
+         scale = paddle.where(
+             amax_mod == 0,
+             paddle.ones_like(scale),
+             scale,
+         )
+     scale = paddle.where(
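
As a small numerical check of the redundancy argument above (a standalone sketch, not the FastDeploy scale_wrapper itself):

    import paddle

    eps = 1e-4
    x_amax = paddle.to_tensor([0.0, 0.5, 3.0], dtype="float32")
    amax_mod = paddle.maximum(x_amax, paddle.full_like(x_amax, eps))
    # amax_mod == [1e-4, 0.5, 3.0]; once eps > 0 no element can be zero,
    # so the (amax_mod == 0) guard can never fire.
    print(bool((amax_mod == 0).any()))  # False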

-     intermediate_cache2, self.quant_config.weight_block_size[0]
+ x_q, x_scale = paddle.incubate.nn.functional.fp8_quant_blockwise(
+     intermediate_cache2, using_pow2_scale=False, output_scale_transpose=False
+ )

Copilot AI Jan 23, 2026


x_scale is sliced here with x_scale[: x_q.shape[0]], consistent with the similar code above. A comment explaining why the slice is needed should be added to improve maintainability.

Suggested change
- )
+ )
+ # Align the activation scale with the quantized activation rows.
+ # fp8_quant_blockwise may return extra scale rows due to block padding,
+ # but the fused Triton kernel expects one scale row per row in x_q.

import paddle

import fastdeploy
from fastdeploy import envs

Copilot AI Jan 23, 2026


The import of the fastdeploy module was removed from block_wise_fp8.py, but other files (such as fused_moe_deepgemm_backend.py) still use other functions from fastdeploy.model_executor.ops.gpu. Please confirm that removing this import does not affect other functionality, and whether other related import statements need to be updated.

  # 2. Dynamic compute blockwise quantization scales
- x, x_scale_tensor = fastdeploy.model_executor.ops.gpu.per_token_quant(
-     x, self.quant_config.weight_block_size[0]
+ x, x_scale_tensor = paddle.incubate.nn.functional.fp8_quant_blockwise(

Copilot AI Jan 23, 2026


The PR description lacks detail. According to the repository's custom guidelines, a PR description should at least explain why the changes are made and what problem they solve. The current "Modifications" section contains only a single sentence ("Unify the quantization operators used by the framework and inference to keep training and inference consistent"). Consider adding:

  1. Which operators are replaced, and what they are replaced with.
  2. Why this unification is needed (for example, performance gains or maintainability improvements).
  3. The impact of this change on existing functionality.
  4. Whether sufficient testing was done to ensure training/inference consistency.

In addition, several items in the Checklist are unchecked, in particular unit tests and accuracy tests; please provide the relevant information or check the corresponding items.

Copilot generated this review using guidance from repository custom instructions.
Comment on lines +228 to +231
x, x_scale_tensor = paddle.incubate.nn.functional.fp8_quant_blockwise(
    x, using_pow2_scale=False, output_scale_transpose=True
)
x_scale_tensor = x_scale_tensor.T

Copilot AI Jan 23, 2026


output_scale_transpose=True is used here, and line 231 then transposes again with x_scale_tensor.T. This looks redundant: if output_scale_transpose=True already returns the transposed result, transposing again restores the original shape.

Suggestions (a tiny sanity check follows the suggested change below):

  1. Check the documentation of paddle.incubate.nn.functional.fp8_quant_blockwise to confirm the exact behaviour of output_scale_transpose=True.
  2. If output_scale_transpose=True already transposes, remove the .T on line 231.
  3. If two transposes are genuinely needed to reach a specific dimension order, add a comment explaining why.
  4. Keep the usage consistent with fused_moe_deepgemm_backend.py.
Suggested change
- x, x_scale_tensor = paddle.incubate.nn.functional.fp8_quant_blockwise(
-     x, using_pow2_scale=False, output_scale_transpose=True
- )
- x_scale_tensor = x_scale_tensor.T
+ # output_scale_transpose=True returns the scale tensor in the layout required by deep_gemm
+ x, x_scale_tensor = paddle.incubate.nn.functional.fp8_quant_blockwise(
+     x, using_pow2_scale=False, output_scale_transpose=True
+ )
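
A tiny standalone sanity check of the double-transpose concern above; the tensor is a plain stand-in for the scale returned by fp8_quant_blockwise, with an illustrative shape:

    import paddle

    scale_t = paddle.rand([4, 16])   # stand-in for a scale that is already transposed
    scale_tt = scale_t.T             # a further .T flips the layout back again
    print(scale_t.shape, scale_tt.shape)               # [4, 16] [16, 4]
    print(bool(paddle.allclose(scale_tt.T, scale_t)))  # True: .T applied twice is a no-op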

-     self.assertTrue((out.numpy() == 1).all())
-     mock_fused.assert_called_once()
+     if layer.bias is None and layer.quant_scale == -1:
+         self.assertTrue((out.numpy() == 0.73105854).all())

Copilot AI Jan 23, 2026


The test uses an exact floating-point comparison, out.numpy() == 0.73105854. Because floating-point arithmetic can introduce precision errors, prefer an approximate comparison such as numpy.allclose or self.assertAlmostEqual instead of ==, for example: numpy.allclose(out.numpy(), 0.73105854, rtol=1e-6)
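
A minimal illustration of this suggestion (standalone, not the actual test; it assumes the paddle.nn.functional.swiglu entry point referenced in this review and a build of Paddle that provides it):

    import numpy as np
    import paddle

    x = paddle.ones([2, 8], dtype="float32")
    out = paddle.nn.functional.swiglu(x).numpy()     # every element is silu(1) ≈ 0.7310586

    print((out == 0.73105854).all())                 # exact comparison; may break on rounding
    print(np.allclose(out, 0.73105854, rtol=1e-6))   # robust approximate comparison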

@yuanlehome yuanlehome merged commit 9a48206 into PaddlePaddle:release/2.4 Jan 24, 2026
21 of 28 checks passed
zhoutianzi666 pushed a commit to zhoutianzi666/FastDeploy that referenced this pull request Jan 27, 2026
gongshaotian added a commit to zhoutianzi666/FastDeploy that referenced this pull request Jan 28, 2026