[Feature] Unify quant ops #6021
Conversation
Thanks for your contribution!

/re-run all-failed
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## release/2.4 #6021 +/- ##
==============================================
Coverage ? 58.70%
==============================================
Files ? 329
Lines ? 41046
Branches ? 6261
==============================================
Hits ? 24094
Misses ? 15065
Partials ? 1887
Flags with carried forward coverage won't be shown.
☔ View full report in Codecov by Sentry.
/re-run all-failed
Pull request overview
This PR unifies the quantization operators used by the training framework and the inference engine, replacing the custom per_token_quant and per_token_quant_padding operators with the standard Paddle API paddle.incubate.nn.functional.fp8_quant_blockwise to keep training and inference consistent.
Changes:
- Added a scale_wrapper function in fastdeploy/model_executor/layers/utils.py that implements the FP8 quantization scaling logic
- Updated per_block_cast_to_fp8 to use the new scale_wrapper function
- Replaced fastdeploy.model_executor.ops.gpu.per_token_quant with paddle.incubate.nn.functional.fp8_quant_blockwise in several MoE-related files
- Added a check to the SiluAndMul activation layer so that paddle.nn.functional.swiglu is used directly when there is no bias and quant_scale is -1 (see the sketch after this list)
- Updated the corresponding test cases to match the new behavior
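A minimal sketch of the fast path described in the last two bullets, assuming it reduces to a plain swiglu (split, silu-gate, multiply); this is not the actual forward_cuda implementation, and the helper name is hypothetical:

```python
import paddle
import paddle.nn.functional as F

def silu_and_mul_fast_path(x: paddle.Tensor) -> paddle.Tensor:
    """Swiglu fast path taken when bias is None and quant_scale == -1."""
    # Split the last dimension in half and gate one half with silu; with no
    # bias and no dequant scale this is all SiluAndMul needs to compute.
    gate, up = paddle.chunk(x, 2, axis=-1)
    return F.silu(gate) * up

# With an all-ones input the output is silu(1) * 1 ≈ 0.7310586, which matches
# (within float32 precision) the 0.73105854 expected by the updated test.
out = silu_and_mul_fast_path(paddle.ones([2, 8], dtype="float32"))
```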
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| fastdeploy/model_executor/layers/utils.py | Adds the scale_wrapper function and updates per_block_cast_to_fp8 to use the unified scaling logic |
| fastdeploy/model_executor/layers/quantization/block_wise_fp8.py | Replaces the custom per_token_quant_padding with the standard Paddle API fp8_quant_blockwise |
| fastdeploy/model_executor/layers/moe/fused_moe_triton_backend.py | Switches the Triton MoE backend to the unified fp8_quant_blockwise API |
| fastdeploy/model_executor/layers/moe/fused_moe_deepgemm_backend.py | Switches the DeepGEMM MoE backend to the unified fp8_quant_blockwise API and adjusts scale tensor handling |
| fastdeploy/model_executor/layers/activation.py | Adds a fast path to the forward_cuda method of the SiluAndMul layer |
| tests/layers/test_activation.py | Updates the test to cover the new conditional branch |
```diff
-            x, self.quant_config.weight_block_size[0]
+        x, x_scale_tensor = paddle.incubate.nn.functional.fp8_quant_blockwise(
+            x, using_pow2_scale=False, output_scale_transpose=False
+        )
```
Copilot AI · Jan 23, 2026
This call to paddle.incubate.nn.functional.fp8_quant_blockwise uses output_scale_transpose=False and then slices the returned scale tensor with x_scale_tensor[: x.shape[0]]. The same slicing appears in several places (lines 161, 232, 389, 433). Consider adding a shared comment explaining why the slice is needed, or double-check whether every call site actually requires it.
```suggestion
        )
        # fp8_quant_blockwise may return an extra padded dimension on the scale tensor
        # when output_scale_transpose=False. Slice by x.shape[0] to keep only the
        # valid batch entries so that x_scale_tensor matches the layout expected by EP.
```
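For context, a minimal sketch of the call-and-slice pattern the comment above refers to; the shapes are illustrative and the padding behavior of the returned scale tensor is an assumption based on this review discussion:

```python
import paddle

# Illustrative input: 6 rows of activations in bfloat16.
x = paddle.randn([6, 256]).astype("bfloat16")
x_q, x_scale_tensor = paddle.incubate.nn.functional.fp8_quant_blockwise(
    x, using_pow2_scale=False, output_scale_transpose=False
)
# The scale tensor may carry extra padded rows, so keep one scale row per
# valid input row before handing both tensors to the downstream kernel.
x_scale_tensor = x_scale_tensor[: x.shape[0]]
```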
```diff
-        ffn_in_x, ffn_in_x_scale_tensor = fastdeploy.model_executor.ops.gpu.per_token_quant(
-            ffn_out, self.quant_config.weight_block_size[0]
+        ffn_in_x, ffn_in_x_scale_tensor = paddle.incubate.nn.functional.fp8_quant_blockwise(
+            ffn_out, using_pow2_scale=False
```
Copilot AI · Jan 23, 2026
This call to fp8_quant_blockwise does not specify the output_scale_transpose argument. Based on the surrounding code and the other call sites, it should explicitly pass either output_scale_transpose=True (since the result is transposed with .T afterwards) or output_scale_transpose=False. Spelling out the argument improves readability and consistency.
```suggestion
            ffn_out,
            using_pow2_scale=False,
            output_scale_transpose=True,
```
```python
        if layer.bias is None and layer.quant_scale == -1:
            self.assertTrue((out.numpy() == 0.73105854).all())
        else:
            self.assertTrue((out.numpy() == 1).all())
            mock_fused.assert_called_once()
```
Copilot AI · Jan 23, 2026
The test logic is flawed. In test_forward_cuda, the layer is constructed with DummyFDConfig() (no bias argument), so layer.bias is None and layer.quant_scale is -1 (the default). The condition layer.bias is None and layer.quant_scale == -1 is therefore always true, the else branch is never executed, and mock_fused.assert_called_once() is never reached.
With this change, when bias is None and quant_scale is -1, forward_cuda calls paddle.nn.functional.swiglu(x) directly instead of fused_bias_act, so the second (else) branch of this test is unreachable dead code.
Suggestions:
- Split this into two independent test cases: one with a bias or quant_scale != -1, and one with bias None and quant_scale == -1
- Or create two different layer instances within this test to cover both paths (see the sketch after this list)
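A minimal sketch of the suggested split, reusing the SiluAndMul, DummyFDConfig, and mocked fused_bias_act fixtures already present in the test file; the constructor's bias argument and the patch target path are assumptions:

```python
import unittest
from unittest import mock

import numpy as np
import paddle


class TestSiluAndMulForwardCuda(unittest.TestCase):
    def test_fast_path_uses_swiglu(self):
        # bias is None and quant_scale defaults to -1, so swiglu runs directly.
        layer = SiluAndMul(DummyFDConfig())
        out = layer.forward_cuda(paddle.ones([2, 8], dtype="float32"))
        np.testing.assert_allclose(out.numpy(), 0.73105854, rtol=1e-6)

    def test_fused_path_calls_fused_bias_act(self):
        # A non-None bias (hypothetical constructor argument) should force the
        # fused kernel path instead of the swiglu shortcut.
        layer = SiluAndMul(DummyFDConfig(), bias=paddle.zeros([8]))
        with mock.patch(
            "fastdeploy.model_executor.layers.activation.fused_bias_act",
            return_value=paddle.ones([2, 4]),
        ) as mock_fused:
            out = layer.forward_cuda(paddle.ones([2, 8], dtype="float32"))
        mock_fused.assert_called_once()
        self.assertTrue((out.numpy() == 1).all())
```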
```diff
-        ffn_in_x, ffn_in_x_scale_tensor = fastdeploy.model_executor.ops.gpu.per_token_quant(
-            ffn_out, self.quant_config.weight_block_size[0]
+        ffn_in_x, ffn_in_x_scale_tensor = paddle.incubate.nn.functional.fp8_quant_blockwise(
+            ffn_out, using_pow2_scale=False
```
Copilot AI · Jan 23, 2026
This call to paddle.incubate.nn.functional.fp8_quant_blockwise does not specify output_scale_transpose. Judging from the other call sites, the default when the argument is omitted is likely False, and here the returned scale tensor is subsequently transposed (.T) and sliced.
For consistency and readability, consider:
- Explicitly passing output_scale_transpose=True (if the default behavior already matches what is needed)
- Or explicitly passing output_scale_transpose=False and keeping the .T operation
- Comparing with the usage at line 229 of block_wise_fp8.py, which passes output_scale_transpose=True and also applies .T, possibly indicating a redundant operation
```suggestion
                ffn_out,
                using_pow2_scale=False,
                output_scale_transpose=False,
```
```python
    Args:
        x_amax (paddle.Tensor): amax tensor (float32 recommended)
        eps (float): epsilon to avoid division by zero
    Returns:
        paddle.Tensor: scale tensor, same shape as x_amax
    """
    fp8_max = 448.0
    float_max = paddle.finfo(paddle.float32).max
    amax_mod = paddle.maximum(
        x_amax,
        paddle.full_like(x_amax, eps),
    )
    scale = fp8_max / amax_mod
    scale = paddle.where(
        amax_mod == 0,
        paddle.ones_like(scale),
        scale,
    )
    scale = paddle.where(
```
Copilot AI · Jan 23, 2026
The logic in scale_wrapper is redundant. At lines 234-237 the code uses paddle.maximum to clamp x_amax to at least eps, and then at lines 239-242 it checks for amax_mod == 0. Since amax_mod is already guaranteed to be at least eps (when eps > 0), that check can never be true (assuming eps > 0).
Suggestions:
- If eps is always greater than 0, remove the check at lines 239-242, because it never triggers
- If eps can be 0, replace the logic at lines 234-237 with paddle.clip(x_amax, min=eps) or reorder the condition checks
- Alternatively, change the condition at lines 239-242 to check x_amax == 0 instead of amax_mod == 0
Also consider documenting the expected range of the eps parameter in the docstring.
```suggestion
    Args:
        x_amax (paddle.Tensor): amax tensor (float32 recommended)
        eps (float): Non-negative epsilon to avoid division by zero.
            When eps == 0.0 and x_amax contains zeros, the corresponding
            scale values are set to 1.0 to avoid infinite results.
    Returns:
        paddle.Tensor: scale tensor, same shape as x_amax
    """
    fp8_max = 448.0
    float_max = paddle.finfo(paddle.float32).max
    amax_floor = paddle.full_like(x_amax, eps)
    amax_mod = paddle.maximum(x_amax, amax_floor)
    scale = fp8_max / amax_mod
    # Only apply zero-guard when eps <= 0.0; for eps > 0.0, amax_mod is
    # already guaranteed to be at least eps, so this condition would be redundant.
    if eps <= 0.0:
        scale = paddle.where(
            amax_mod == 0,
            paddle.ones_like(scale),
            scale,
        )
    scale = paddle.where(
```
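For reference, a minimal sketch of how a scale produced by this convention (scale = 448.0 / amax) is typically applied. The 128-wide block size and the eps value are assumptions, and the inline scale_wrapper below is a simplified stand-in for the proposed utils function (it omits the float_max clamp and zero-guard):

```python
import paddle

def scale_wrapper(x_amax, eps=1e-12):
    # Same convention as the proposed utils.scale_wrapper: scale = 448 / max(amax, eps).
    return 448.0 / paddle.maximum(x_amax, paddle.full_like(x_amax, eps))

# Blockwise scaling of a single row split into 128-wide blocks (block size assumed).
x = paddle.randn([1, 256], dtype="float32")
blocks = x.reshape([-1, 128])
x_amax = blocks.abs().max(axis=-1, keepdim=True)   # per-block amax
scale = scale_wrapper(x_amax)                      # shape [num_blocks, 1]
x_scaled = blocks * scale                          # now bounded by the FP8 E4M3 max of 448
x_back = x_scaled / scale                          # dividing back recovers the input
```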
```diff
-                intermediate_cache2, self.quant_config.weight_block_size[0]
+            x_q, x_scale = paddle.incubate.nn.functional.fp8_quant_blockwise(
+                intermediate_cache2, using_pow2_scale=False, output_scale_transpose=False
+            )
```
Copilot AI · Jan 23, 2026
Here x_scale is sliced with x_scale[: x_q.shape[0]], consistent with the similar code above, but a comment explaining why the slice is needed would make this easier to maintain.
```suggestion
            )
            # Align the activation scale with the quantized activation rows.
            # fp8_quant_blockwise may return extra scale rows due to block padding,
            # but the fused Triton kernel expects one scale row per row in x_q.
```
```diff
 import paddle

-import fastdeploy
 from fastdeploy import envs
```
Copilot AI · Jan 23, 2026
The import of the fastdeploy module was removed from block_wise_fp8.py, but other files (such as fused_moe_deepgemm_backend.py) still use other functions from fastdeploy.model_executor.ops.gpu. Please confirm that removing this import does not affect other functionality and whether any related import statements need to be updated.
```diff
         # 2. Dynamic compute blockwise quantization scales
-        x, x_scale_tensor = fastdeploy.model_executor.ops.gpu.per_token_quant(
-            x, self.quant_config.weight_block_size[0]
+        x, x_scale_tensor = paddle.incubate.nn.functional.fp8_quant_blockwise(
```
Copilot AI · Jan 23, 2026
The PR description lacks detail. Per the contribution guidelines, the description should at least explain why the change is made and what problem it solves. The current "Modifications" section is a single sentence ("Unify the quantization operators used by the framework and inference to keep training and inference consistent"); consider adding:
- Which operators were replaced, and with what
- Why the unification is needed (for example, performance gains or maintainability improvements)
- The impact of this change on existing functionality
- Whether sufficient testing was done to verify training/inference consistency
In addition, several items in the Checklist are unchecked, notably the unit tests and accuracy tests; please add the relevant information or tick the corresponding items.
```python
            x, x_scale_tensor = paddle.incubate.nn.functional.fp8_quant_blockwise(
                x, using_pow2_scale=False, output_scale_transpose=True
            )
            x_scale_tensor = x_scale_tensor.T
```
Copilot AI · Jan 23, 2026
output_scale_transpose=True is used here, and then line 231 transposes again with x_scale_tensor.T. This looks redundant: if output_scale_transpose=True already returns a transposed result, transposing again restores the original shape.
Suggestions:
- Check the documentation of paddle.incubate.nn.functional.fp8_quant_blockwise to confirm the exact behavior of output_scale_transpose=True
- If output_scale_transpose=True already performs the transpose, remove the .T at line 231
- If two transposes are needed to reach a specific dimension order, add a comment explaining why
- Keep this consistent with the usage in fused_moe_deepgemm_backend.py
```suggestion
            # output_scale_transpose=True returns the scale tensor in the layout required by deep_gemm
            x, x_scale_tensor = paddle.incubate.nn.functional.fp8_quant_blockwise(
                x, using_pow2_scale=False, output_scale_transpose=True
            )
```
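If the exact behavior of output_scale_transpose is in doubt, a quick shape probe along these lines makes the redundancy easy to confirm; the shapes are illustrative and the op's dtype and alignment requirements are assumptions:

```python
import paddle

x = paddle.randn([4, 256]).astype("bfloat16")
for transpose in (False, True):
    _, s = paddle.incubate.nn.functional.fp8_quant_blockwise(
        x, using_pow2_scale=False, output_scale_transpose=transpose
    )
    # Comparing the two scale shapes shows whether the caller's extra .T is
    # still needed or simply undoes the transpose performed by the op.
    print(transpose, s.shape)
```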
```diff
-            self.assertTrue((out.numpy() == 1).all())
-            mock_fused.assert_called_once()
+        if layer.bias is None and layer.quant_scale == -1:
+            self.assertTrue((out.numpy() == 0.73105854).all())
```
Copilot AI · Jan 23, 2026
The test uses an exact floating-point comparison, out.numpy() == 0.73105854. Since floating-point arithmetic can introduce precision errors, prefer an approximate comparison with numpy.allclose or self.assertAlmostEqual rather than ==, for example: numpy.allclose(out.numpy(), 0.73105854, rtol=1e-6). See the sketch below.
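A minimal, self-contained sketch of the approximate assertion; silu(1) * 1 = 1 / (1 + e^-1) ≈ 0.7310586, so a small relative tolerance suffices:

```python
import numpy as np
import paddle
import paddle.nn.functional as F

# Reproduce the expected fast-path output for an all-ones input.
gate, up = paddle.chunk(paddle.ones([2, 8], dtype="float32"), 2, axis=-1)
out = F.silu(gate) * up

# Approximate comparison instead of exact float equality.
np.testing.assert_allclose(out.numpy(), 0.73105854, rtol=1e-6)
```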
This reverts commit 9a48206.
This reverts commit da9b356.
Motivation
Modifications
Unify the quantization operators used by the framework and inference to keep training and inference consistent.
Usage or Command
Accuracy Tests
Checklist
Tag list: [[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- Run pre-commit before commit.
- For a release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.