2 changes: 1 addition & 1 deletion .github/workflows/_accuracy_test.yml
@@ -161,7 +161,7 @@ jobs:
-v "${CACHE_DIR}/ConfigDir:/root/.config" \
-e TZ="Asia/Shanghai" \
--gpus '"device='"${DEVICES}"'"' ${docker_image} /bin/bash -xc '
python -m pip install paddlepaddle-gpu==3.3.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
python -m pip install https://paddle-qa.bj.bcebos.com/paddle-pipeline/Release-TagBuild-Training-Linux-Gpu-Cuda12.6-Cudnn9.5-Trt10.5-Mkl-Avx-Gcc11-SelfBuiltPypiUse/latest/paddlepaddle_gpu-0.0.0-cp310-cp310-linux_x86_64.whl

pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

2 changes: 1 addition & 1 deletion .github/workflows/_base_test.yml
@@ -186,7 +186,7 @@ jobs:
-v "${CACHE_DIR}/ConfigDir:/root/.config" \
-e TZ="Asia/Shanghai" \
--gpus '"device='"${DEVICES}"'"' ${docker_image} /bin/bash -xc '
python -m pip install paddlepaddle-gpu==3.3.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
python -m pip install https://paddle-qa.bj.bcebos.com/paddle-pipeline/Release-TagBuild-Training-Linux-Gpu-Cuda12.6-Cudnn9.5-Trt10.5-Mkl-Avx-Gcc11-SelfBuiltPypiUse/latest/paddlepaddle_gpu-0.0.0-cp310-cp310-linux_x86_64.whl

pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

2 changes: 1 addition & 1 deletion .github/workflows/_build_linux.yml
@@ -173,7 +173,7 @@ jobs:
elif [[ "${PADDLEVERSION}" != "" ]];then
python -m pip install paddlepaddle-gpu==${PADDLEVERSION} -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
else
python -m pip install paddlepaddle-gpu==3.3.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
python -m pip install https://paddle-qa.bj.bcebos.com/paddle-pipeline/Release-TagBuild-Training-Linux-Gpu-Cuda12.6-Cudnn9.5-Trt10.5-Mkl-Avx-Gcc11-SelfBuiltPypiUse/latest/paddlepaddle_gpu-0.0.0-cp310-cp310-linux_x86_64.whl
fi

pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
4 changes: 2 additions & 2 deletions .github/workflows/_logprob_test_linux.yml
@@ -156,7 +156,7 @@ jobs:
-v "${CACHE_DIR}/ConfigDir:/root/.config" \
-e TZ="Asia/Shanghai" \
--gpus '"device='"${DEVICES}"'"' ${docker_image} /bin/bash -xc '
python -m pip install paddlepaddle-gpu==3.3.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
python -m pip install https://paddle-qa.bj.bcebos.com/paddle-pipeline/Release-TagBuild-Training-Linux-Gpu-Cuda12.6-Cudnn9.5-Trt10.5-Mkl-Avx-Gcc11-SelfBuiltPypiUse/latest/paddlepaddle_gpu-0.0.0-cp310-cp310-linux_x86_64.whl

pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

@@ -185,7 +185,7 @@ jobs:
-d "{\"messages\": [{\"role\": \"user\", \"content\": \"1+1=?\"}], \"logprobs\": true}"
set +e
rm -rf ./baseline_output
cp -r baseline/ERNIE-4.5-0.3B-Paddle ./baseline_output
cp -r baseline_24/ERNIE-4.5-0.3B-Paddle ./baseline_output
LOGPROB_EXIT_CODE=0
python3.10 lanucher.py --request_template TOKEN_LOGPROB --url http://localhost:${FD_API_PORT}/v1/chat/completions --case ./cases/demo.yaml --concurrency 1 --name demo --exe logprob || LOGPROB_EXIT_CODE=$?
echo "LOGPROB_EXIT_CODE=${LOGPROB_EXIT_CODE}" > /workspace/exit_code.env
2 changes: 1 addition & 1 deletion .github/workflows/_pre_ce_test.yml
@@ -172,7 +172,7 @@ jobs:
--gpus "\"device=${DEVICES}\"" ${docker_image} /bin/bash -c '
git config --global --add safe.directory /workspace/FastDeploy
cd FastDeploy
python -m pip install paddlepaddle-gpu==3.3.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
python -m pip install https://paddle-qa.bj.bcebos.com/paddle-pipeline/Release-TagBuild-Training-Linux-Gpu-Cuda12.6-Cudnn9.5-Trt10.5-Mkl-Avx-Gcc11-SelfBuiltPypiUse/latest/paddlepaddle_gpu-0.0.0-cp310-cp310-linux_x86_64.whl
python -m pip install ${fd_wheel_url}
bash scripts/run_pre_ce.sh
'
2 changes: 1 addition & 1 deletion .github/workflows/_stable_test.yml
@@ -164,7 +164,7 @@ jobs:
-v "${CACHE_DIR}/ConfigDir:/root/.config" \
-e TZ="Asia/Shanghai" \
--gpus '"device='"${DEVICES}"'"' ${docker_image} /bin/bash -xc '
python -m pip install paddlepaddle-gpu==3.3.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
python -m pip install https://paddle-qa.bj.bcebos.com/paddle-pipeline/Release-TagBuild-Training-Linux-Gpu-Cuda12.6-Cudnn9.5-Trt10.5-Mkl-Avx-Gcc11-SelfBuiltPypiUse/latest/paddlepaddle_gpu-0.0.0-cp310-cp310-linux_x86_64.whl

pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

2 changes: 1 addition & 1 deletion .github/workflows/_unit_test_coverage.yml
@@ -203,7 +203,7 @@ jobs:
git config --global --add safe.directory /workspace/FastDeploy
cd FastDeploy
git diff origin/${BASE_REF}..HEAD --unified=0 > diff.txt
python -m pip install paddlepaddle-gpu==3.3.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
python -m pip install https://paddle-qa.bj.bcebos.com/paddle-pipeline/Release-TagBuild-Training-Linux-Gpu-Cuda12.6-Cudnn9.5-Trt10.5-Mkl-Avx-Gcc11-SelfBuiltPypiUse/latest/paddlepaddle_gpu-0.0.0-cp310-cp310-linux_x86_64.whl
pip config set global.extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

python -m pip install -r scripts/unittest_requirement.txt
2 changes: 2 additions & 0 deletions fastdeploy/model_executor/layers/activation.py
@@ -120,6 +120,8 @@ def forward_cuda(self, x: paddle.Tensor) -> paddle.Tensor:
Returns:
Tensor: Output tensor.
"""
if self.bias is None and self.quant_scale == -1:
return paddle.nn.functional.swiglu(x)
return fused_bias_act(
x,
bias=self.bias,
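For reviewers who want to sanity-check the new fast path outside the repo: the snippet below is a minimal, self-contained sketch of what SiluAndMul now returns when bias is None and quant_scale == -1. The reference swiglu here (split the last dimension into gate/up halves, apply SiLU to the gate, multiply) is an assumption about how paddle.nn.functional.swiglu is defined, consistent with the 0.73105854 value asserted in tests/layers/test_activation.py.

import paddle
import paddle.nn.functional as F

def swiglu_reference(x: paddle.Tensor) -> paddle.Tensor:
    # Assumed definition: split the last dim into gate/up halves,
    # apply SiLU to the gate half and multiply element-wise.
    gate, up = paddle.chunk(x, 2, axis=-1)
    return F.silu(gate) * up

x = paddle.ones([2, 2])
out = swiglu_reference(x)
print(out)  # every element ~= silu(1.0) * 1.0 ~= 0.7310586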
27 changes: 15 additions & 12 deletions fastdeploy/model_executor/layers/moe/fused_moe_deepgemm_backend.py
@@ -155,9 +155,10 @@ def apply_ep_prefill(
topk_ids_hookfunc(topk_ids=topk_idx)

# 2. Dynamic compute blockwise quantization scales
x, x_scale_tensor = fastdeploy.model_executor.ops.gpu.per_token_quant(
x, self.quant_config.weight_block_size[0]
x, x_scale_tensor = paddle.incubate.nn.functional.fp8_quant_blockwise(
Copilot AI Jan 23, 2026

The PR description is missing details. According to the repository's custom guidelines, a PR description should at least explain why the changes are being made and what problem they solve. The current "Modifications" section contains only one sentence, "unify the quantization operators used by the framework and by inference to keep training and inference consistent." Please add the following:

  1. Which operators were replaced, and by which (from what operator to what operator)
  2. Why this unification is being done (e.g. performance gains, maintainability improvements)
  3. The impact of this change on existing functionality
  4. Whether sufficient testing was done to ensure training/inference consistency

In addition, several items in the Checklist are unchecked, in particular unit tests and accuracy tests; please add the relevant information or check the corresponding items.
x, using_pow2_scale=False, output_scale_transpose=False
)
Copilot AI Jan 23, 2026

The call to paddle.incubate.nn.functional.fp8_quant_blockwise uses output_scale_transpose=False, and the returned scale tensor is then sliced with x_scale_tensor[: x.shape[0]]. This slicing appears in several places (lines 161, 232, 389, and 433); consider adding a shared comment explaining why the slice is needed, or checking whether every call site actually requires it.

Suggested change:

    )
    # fp8_quant_blockwise may return an extra padded dimension on the scale tensor
    # when output_scale_transpose=False. Slice by x.shape[0] to keep only the
    # valid batch entries so that x_scale_tensor matches the layout expected by EP.
x_scale_tensor = x_scale_tensor[: x.shape[0]]

event = deep_ep.Buffer.capture()
let_another_thread_run()
@@ -225,11 +226,10 @@ def apply_ep_prefill(
ffn_out = paddle.incubate.nn.functional.swiglu(ffn_out, None)

# down_proj
ffn_in_x, ffn_in_x_scale_tensor = fastdeploy.model_executor.ops.gpu.per_token_quant(
ffn_out, self.quant_config.weight_block_size[0]
ffn_in_x, ffn_in_x_scale_tensor = paddle.incubate.nn.functional.fp8_quant_blockwise(
ffn_out, using_pow2_scale=False
Copilot AI Jan 23, 2026

This call to fp8_quant_blockwise does not specify the output_scale_transpose argument. Based on the surrounding code and usage elsewhere, it should explicitly pass output_scale_transpose=True (since a .T follows) or output_scale_transpose=False. Making the argument explicit improves readability and consistency.

Suggested change:

    ffn_out,
    using_pow2_scale=False,
    output_scale_transpose=True,

Copilot AI Jan 23, 2026

paddle.incubate.nn.functional.fp8_quant_blockwise is called without the output_scale_transpose argument. Based on usage elsewhere, the default when it is omitted is likely False. Here, the returned scale tensor is subsequently transposed (.T) and sliced.

For consistency and readability, consider:

  1. Explicitly passing output_scale_transpose=True (if the default behavior satisfies the requirement)
  2. Or explicitly passing output_scale_transpose=False and keeping the .T operation
  3. Comparing with line 229 of block_wise_fp8.py, which uses output_scale_transpose=True and also applies .T; this may indicate a redundant operation

Suggested change:

    ffn_out,
    using_pow2_scale=False,
    output_scale_transpose=False,
)
ffn_in_x_scale_tensor = ffn_in_x_scale_tensor.transpose([1, 0]).contiguous()
ffn_in_x_scale_tensor = ffn_in_x_scale_tensor.transpose([1, 0])
ffn_in_x_scale_tensor = ffn_in_x_scale_tensor.T[: ffn_in_x.shape[0]]

ffn_out = paddle.empty(
(ffn_out.shape[0], getattr(layer, self.added_weight_attrs[1]).shape[1]),
@@ -381,7 +381,12 @@ def apply_tp(

tmp = count_tokens_per_expert_func(topk_ids, layer.num_experts)

recv_x, recv_x_scale = fastdeploy.model_executor.ops.gpu.per_token_quant(x, 128)
recv_x, recv_x_scale = paddle.incubate.nn.functional.fp8_quant_blockwise(
x,
using_pow2_scale=False,
output_scale_transpose=False,
)
recv_x_scale = recv_x_scale[: recv_x.shape[0]]

(
permute_input,
@@ -422,12 +427,10 @@ def apply_tp(
ffn_out = paddle.incubate.nn.functional.swiglu(ffn_out)

# down_proj
ffn_in_x, ffn_in_x_scale_tensor = fastdeploy.model_executor.ops.gpu.per_token_quant(
ffn_out, self.quant_config.weight_block_size[0]
ffn_in_x, ffn_in_x_scale_tensor = paddle.incubate.nn.functional.fp8_quant_blockwise(
ffn_out, using_pow2_scale=False
)

ffn_in_x_scale_tensor = ffn_in_x_scale_tensor.transpose([1, 0]).contiguous()
ffn_in_x_scale_tensor = ffn_in_x_scale_tensor.transpose([1, 0])
ffn_in_x_scale_tensor = ffn_in_x_scale_tensor.T[: ffn_in_x.shape[0]]

ffn_out = paddle.empty(
(ffn_out.shape[0], getattr(layer, self.added_weight_attrs[1]).shape[1]),
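The review comments above repeatedly ask why the scale tensors are sliced back to x.shape[0]. As a purely illustrative aid (NumPy, not Paddle), the sketch below shows the pad-then-slice pattern being discussed; the assumption that fp8_quant_blockwise pads the token dimension to a block multiple is the reviewers' hypothesis, not documented behavior verified here.

import numpy as np

BLOCK = 128  # assumed per-token quantization block size

def quant_scales_padded(x: np.ndarray) -> np.ndarray:
    """Illustrative only: one FP8 dequant scale per token, padded to a
    multiple of BLOCK, i.e. the shape a kernel-friendly quantizer might return."""
    num_tokens = x.shape[0]
    padded = ((num_tokens + BLOCK - 1) // BLOCK) * BLOCK
    amax = np.abs(x).max(axis=1)                      # per-token amax
    scales = np.zeros(padded, dtype=np.float32)
    scales[:num_tokens] = np.maximum(amax, 1e-4) / 448.0
    return scales

x = np.random.randn(5, 256).astype(np.float32)
scales = quant_scales_padded(x)
print(scales.shape)                # (128,)  padded length
print(scales[: x.shape[0]].shape)  # (5,)    the slice used in the diff above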
10 changes: 7 additions & 3 deletions fastdeploy/model_executor/layers/moe/fused_moe_triton_backend.py
@@ -1525,7 +1525,10 @@ def apply(

from .triton_moe_kernels import fused_moe_kernel_paddle

x_q, x_scale = fastdeploy.model_executor.ops.gpu.per_token_quant(x, self.quant_config.weight_block_size[0])
x_q, x_scale = paddle.incubate.nn.functional.fp8_quant_blockwise(
x, using_pow2_scale=False, output_scale_transpose=False
)
Copilot AI Jan 23, 2026

After calling paddle.incubate.nn.functional.fp8_quant_blockwise, x_scale is sliced with x_scale[: x.shape[0]]. Please confirm whether the scale tensor returned by fp8_quant_blockwise can be larger than x.shape[0] and whether this slice could drop data. If the output of fp8_quant_blockwise is padded, that should be stated in a comment. Consider adding a comment explaining why the slice is needed.

Suggested change:

    )
    # fp8_quant_blockwise may pad the leading dimension of x_q/x_scale to
    # a multiple of the internal block size (e.g. BLOCK_SIZE_M). Only the
    # first x.shape[0] entries correspond to real tokens, so we slice here
    # to match the original token dimension. The padded region is handled
    # separately via max_num_tokens_padded and related Triton kernel args.
x_scale = x_scale[: x.shape[0]]

fused_moe_kernel_paddle[grid](
x_q,
@@ -1578,9 +1581,10 @@ def apply(
ceil_div(max_num_tokens_padded, config["BLOCK_SIZE_M"]) * ceil_div(hidden_size, config["BLOCK_SIZE_N"]),
)

x_q, x_scale = fastdeploy.model_executor.ops.gpu.per_token_quant(
intermediate_cache2, self.quant_config.weight_block_size[0]
x_q, x_scale = paddle.incubate.nn.functional.fp8_quant_blockwise(
intermediate_cache2, using_pow2_scale=False, output_scale_transpose=False
)
Copilot AI Jan 23, 2026

x_scale is sliced here with x_scale[: x_q.shape[0]], consistent with the similar code above, but a comment should be added explaining why the slice is needed, to improve maintainability.

Suggested change:

    )
    # Align the activation scale with the quantized activation rows.
    # fp8_quant_blockwise may return extra scale rows due to block padding,
    # but the fused Triton kernel expects one scale row per row in x_q.
x_scale = x_scale[: x_q.shape[0]]

fused_moe_kernel_paddle[grid](
x_q,
block_wise_fp8.py
@@ -18,7 +18,6 @@

import paddle

import fastdeploy
from fastdeploy import envs
Copilot AI Jan 23, 2026

The import of the fastdeploy module was removed from block_wise_fp8.py, but other files (such as fused_moe_deepgemm_backend.py) still use other functions from fastdeploy.model_executor.ops.gpu. Please confirm that removing this import does not affect other functionality and whether other related import statements need to be updated.
from fastdeploy.model_executor.layers.linear import (
MergedColumnParallelLinear,
@@ -226,9 +225,10 @@ def process_prequanted_weights(self, layer, state_dict, is_rearrange: bool = Fal
layer.weight_scale_inv.set_value(weight_scale)

def apply(self, layer, x):
x, x_scale_tensor = fastdeploy.model_executor.ops.gpu.per_token_quant_padding(
x, self.quant_config.weight_block_size[0]
x, x_scale_tensor = paddle.incubate.nn.functional.fp8_quant_blockwise(
x, using_pow2_scale=False, output_scale_transpose=True
)
x_scale_tensor = x_scale_tensor.T
Comment on lines +228 to +231

Copilot AI Jan 23, 2026

output_scale_transpose=True is used here, and then line 231 transposes the result again with x_scale_tensor.T. This looks redundant: if output_scale_transpose=True already returns a transposed result, transposing again restores the original shape.

Suggestions:

  1. Check the documentation of paddle.incubate.nn.functional.fp8_quant_blockwise to confirm the exact behavior of output_scale_transpose=True
  2. If output_scale_transpose=True already transposes, remove the .T at line 231
  3. Or, if two transposes are needed to reach a specific dimension order, add a comment explaining why
  4. Keep the usage consistent with fused_moe_deepgemm_backend.py

Suggested change:

    # output_scale_transpose=True returns the scale tensor in the layout required by deep_gemm
    x, x_scale_tensor = paddle.incubate.nn.functional.fp8_quant_blockwise(
        x, using_pow2_scale=False, output_scale_transpose=True
    )
linear_out = paddle.empty((x.shape[0], layer.output_size), dtype=paddle.bfloat16)
from fastdeploy.model_executor.ops.gpu import deep_gemm

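Regarding the redundant-transpose comment above: if output_scale_transpose=True already returns the scale tensor transposed, following it with .T in apply() simply restores the original 2-D layout, as this trivial check illustrates (NumPy is used only for illustration; whether the Paddle op actually transposes is the open question the reviewer raises):

import numpy as np

scale = np.arange(6, dtype=np.float32).reshape(2, 3)
transposed = scale.T        # what output_scale_transpose=True is presumed to return
round_trip = transposed.T   # the extra .T applied in apply()
assert np.array_equal(round_trip, scale)  # double transpose is the identity on 2-D arrays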
36 changes: 32 additions & 4 deletions fastdeploy/model_executor/layers/utils.py
@@ -220,6 +220,35 @@ def group_wise_int4_weight_quantize(weight: paddle.Tensor, group_size: int = 128
return quant_weight.astype(paddle.int8), weight_scale


def scale_wrapper(x_amax: paddle.Tensor, eps: float = 0.0) -> paddle.Tensor:
"""
Paddle implementation of CUDA ScaleWrapper logic.
Args:
x_amax (paddle.Tensor): amax tensor (float32 recommended)
eps (float): epsilon to avoid division by zero
Returns:
paddle.Tensor: scale tensor, same shape as x_amax
"""
fp8_max = 448.0
float_max = paddle.finfo(paddle.float32).max
amax_mod = paddle.maximum(
x_amax,
paddle.full_like(x_amax, eps),
)
scale = fp8_max / amax_mod
scale = paddle.where(
amax_mod == 0,
paddle.ones_like(scale),
scale,
)
scale = paddle.where(
Comment on lines +226 to +244

Copilot AI Jan 23, 2026

The logic in scale_wrapper is redundant. At lines 234-237 the code uses paddle.maximum to compare x_amax against eps, so amax_mod is guaranteed to be at least eps. Then at lines 239-242 it checks amax_mod == 0. Because amax_mod is already at least eps, that check can never be true when eps > 0.

Suggestions:

  1. If eps is always greater than 0, the check at lines 239-242 can be removed because it never triggers
  2. If eps can be 0, the logic at lines 234-237 should use paddle.clip(x_amax, min=eps) instead, or the order of the checks should be adjusted
  3. Alternatively, change the condition at lines 239-242 to check x_amax == 0 rather than amax_mod == 0

Also consider documenting the expected range of the eps parameter in the docstring.

Suggested change:

    Args:
        x_amax (paddle.Tensor): amax tensor (float32 recommended)
        eps (float): Non-negative epsilon to avoid division by zero.
            When eps == 0.0 and x_amax contains zeros, the corresponding
            scale values are set to 1.0 to avoid infinite results.
    Returns:
        paddle.Tensor: scale tensor, same shape as x_amax
    """
    fp8_max = 448.0
    float_max = paddle.finfo(paddle.float32).max
    amax_floor = paddle.full_like(x_amax, eps)
    amax_mod = paddle.maximum(x_amax, amax_floor)
    scale = fp8_max / amax_mod
    # Only apply the zero-guard when eps <= 0.0; for eps > 0.0, amax_mod is
    # already guaranteed to be at least eps, so this condition would be redundant.
    if eps <= 0.0:
        scale = paddle.where(
            amax_mod == 0,
            paddle.ones_like(scale),
            scale,
        )
    scale = paddle.where(
paddle.isinf(scale),
paddle.full_like(scale, float_max),
scale,
)
return scale


def per_block_cast_to_fp8(x: Tensor, block_size: list = [128, 128]) -> Tuple[Tensor, Tensor]:
"""
Only used in deep_gemm block wise quant weight.
@@ -244,11 +273,10 @@ def per_block_cast_to_fp8(x: Tensor, block_size: list = [128, 128]) -> Tuple[Ten

x_abs = paddle.abs(x_view).astype(paddle.float32)
x_amax = paddle.amax(x_abs, axis=(1, 3), keepdim=True)
x_amax = paddle.clip(x_amax, min=1e-4)
x_scaled = (x_view * (448.0 / x_amax)).astype(paddle.float8_e4m3fn)

scale = scale_wrapper(x_amax)
x_scaled = (x_view * scale).astype(paddle.float8_e4m3fn)
return x_scaled.view_as(x_padded)[:m, :n].contiguous(), (
paddle.view(x_amax / 448.0, (x_view.shape[0], x_view.shape[2]))
paddle.view(1.0 / scale, (x_view.shape[0], x_view.shape[2]))
)


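As a quick numeric check of the new scale_wrapper/per_block_cast_to_fp8 wiring, the standalone sketch below mirrors the scale formula in NumPy (values and shapes are illustrative; the real code operates on Paddle tensors):

import numpy as np

def scale_wrapper_ref(x_amax: np.ndarray, eps: float = 0.0) -> np.ndarray:
    # NumPy mirror of the scale_wrapper added in this diff.
    fp8_max = 448.0
    amax_mod = np.maximum(x_amax, eps)
    with np.errstate(divide="ignore"):
        scale = fp8_max / amax_mod
    scale = np.where(amax_mod == 0, 1.0, scale)                         # zero-amax guard
    scale = np.where(np.isinf(scale), np.finfo(np.float32).max, scale)  # inf guard
    return scale.astype(np.float32)

amax = np.array([0.0, 2.0, 448.0], dtype=np.float32)
quant_scale = scale_wrapper_ref(amax)
print(quant_scale)        # -> [1., 224., 1.]  multiply by this before casting to FP8
print(1.0 / quant_scale)  # dequant scales, what per_block_cast_to_fp8 now returns as 1.0 / scale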
12 changes: 6 additions & 6 deletions tests/ce/server/test_logprobs.py
@@ -25,10 +25,10 @@ def test_unstream_with_logprobs():
# Verify the returned content and logprob information
assert resp_json["choices"][0]["message"]["content"] == "牛顿的"
assert resp_json["choices"][0]["logprobs"]["content"][0]["token"] == "牛顿"
assert resp_json["choices"][0]["logprobs"]["content"][0]["logprob"] == -0.031025361269712448
assert resp_json["choices"][0]["logprobs"]["content"][0]["logprob"] == -0.03113006055355072
assert resp_json["choices"][0]["logprobs"]["content"][0]["top_logprobs"][0] == {
"token": "牛顿",
"logprob": -0.031025361269712448,
"logprob": -0.03113006055355072,
"bytes": [231, 137, 155, 233, 161, 191],
"top_logprobs": None,
}
@@ -102,10 +102,10 @@ def test_stream_with_logprobs():
# Verify the logprob fields
assert result_chunk["choices"][0]["delta"]["content"] == "牛顿"
assert result_chunk["choices"][0]["logprobs"]["content"][0]["token"] == "牛顿"
assert result_chunk["choices"][0]["logprobs"]["content"][0]["logprob"] == -0.031025361269712448
assert result_chunk["choices"][0]["logprobs"]["content"][0]["logprob"] == -0.03113006055355072
assert result_chunk["choices"][0]["logprobs"]["content"][0]["top_logprobs"][0] == {
"token": "牛顿",
"logprob": -0.031025361269712448,
"logprob": -0.03113006055355072,
"bytes": [231, 137, 155, 233, 161, 191],
}

@@ -187,10 +187,10 @@ def test_stream_with_temp_scaled_logprobs():
# Verify the logprob fields
assert result_chunk["choices"][0]["delta"]["content"] == "牛顿"
assert result_chunk["choices"][0]["logprobs"]["content"][0]["token"] == "牛顿"
assert result_chunk["choices"][0]["logprobs"]["content"][0]["logprob"] == -0.006811376195400953
assert result_chunk["choices"][0]["logprobs"]["content"][0]["logprob"] == -0.0068125599063932896
assert result_chunk["choices"][0]["logprobs"]["content"][0]["top_logprobs"][0] == {
"token": "牛顿",
"logprob": -0.006811376195400953,
"logprob": -0.0068125599063932896,
"bytes": [231, 137, 155, 233, 161, 191],
}

4 changes: 2 additions & 2 deletions tests/ci_use/EB_VL_Lite/test_EB_VL_Lite_serving.py
@@ -205,9 +205,9 @@ def test_consistency_between_runs(api_url, headers, consistent_payload):
# base result
base_path = os.getenv("MODEL_PATH")
if base_path:
base_file = os.path.join(base_path, "ernie-4_5-vl-base-tp2-dev")
base_file = os.path.join(base_path, "ernie-4_5-vl-base-tp2-24")
else:
base_file = "ernie-4_5-vl-base-tp2-dev"
base_file = "ernie-4_5-vl-base-tp2-24"
with open(base_file, "r") as f:
content2 = f.read()

4 changes: 2 additions & 2 deletions tests/e2e/test_EB_VL_Lite_serving.py
@@ -204,9 +204,9 @@ def test_consistency_between_runs(api_url, headers, consistent_payload):
# base result
base_path = os.getenv("MODEL_PATH")
if base_path:
base_file = os.path.join(base_path, "ernie-4_5-vl-base-tp2-dev")
base_file = os.path.join(base_path, "ernie-4_5-vl-base-tp2-24")
else:
base_file = "ernie-4_5-vl-base-tp2-dev"
base_file = "ernie-4_5-vl-base-tp2-24"
with open(base_file, "r") as f:
content2 = f.read()

4 changes: 2 additions & 2 deletions tests/e2e/utils/rollout_routing_replay_test_utils.py
@@ -151,9 +151,9 @@ def check_routing_replay_chat_completion(openai_client, moe_layer_num: int, mode
cur_save_routing_path = f"./R3_tmp/routing_replay_output_{model_name}/"
model_path = os.getenv("MODEL_PATH")
if model_path:
baseline_path = os.path.join(model_path, f"R3_BaseLine/routing_replay_output_baseline_{model_name}")
baseline_path = os.path.join(model_path, f"R3_BaseLine_24/routing_replay_output_baseline_{model_name}")
else:
baseline_path = f"./R3_BaseLine/routing_replay_output_baseline_{model_name}"
baseline_path = f"./R3_BaseLine_24/routing_replay_output_baseline_{model_name}"
stream_baseline_path = os.path.join(baseline_path, "r3_chat_completion_stream")

nonstream_baseline_path = os.path.join(baseline_path, "r3_chat_completion_nonstream")
7 changes: 5 additions & 2 deletions tests/layers/test_activation.py
@@ -84,8 +84,11 @@ def test_forward_cuda(self, mock_fused, mock_platform):
layer = SiluAndMul(fd_config)
x = paddle.ones([2, 2])
out = layer.forward(x)
self.assertTrue((out.numpy() == 1).all())
mock_fused.assert_called_once()
if layer.bias is None and layer.quant_scale == -1:
self.assertTrue((out.numpy() == 0.73105854).all())
Copilot AI Jan 23, 2026

The test uses an exact floating-point comparison, out.numpy() == 0.73105854. Because floating-point arithmetic can introduce precision errors, use numpy.allclose or self.assertAlmostEqual for an approximate comparison instead of ==, e.g. numpy.allclose(out.numpy(), 0.73105854, rtol=1e-6).
else:
self.assertTrue((out.numpy() == 1).all())
mock_fused.assert_called_once()
Comment on lines +87 to +91

Copilot AI Jan 23, 2026

There is a problem with the test logic. In test_forward_cuda, the layer is initialized with DummyFDConfig() (no bias argument), so layer.bias is None and layer.quant_scale is -1 (the default). The condition layer.bias is None and layer.quant_scale == -1 is therefore always true, the else branch never executes, and mock_fused.assert_called_once() is never reached.

With this change, when bias is None and quant_scale is -1, forward_cuda calls paddle.nn.functional.swiglu(x) directly instead of fused_bias_act, so the second branch (the else part) of this test is unreachable dead code.

Suggestions:

  1. Split the test into two independent test cases: one covering bias set or quant_scale != -1, and one covering bias None with quant_scale == -1
  2. Or create two different layer instances within this test to exercise both paths

# Test forward computation on GCU platform
@patch(
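Following the review comment above, here is a self-contained illustration of the suggested split into two test cases. It uses a stand-in class rather than the real SiluAndMul so the structure can be checked without FastDeploy installed; the real tests would instead construct SiluAndMul with the DummyFDConfig fixture and the platform/fused_bias_act patches from the existing test module.

import unittest
from unittest.mock import MagicMock

import numpy as np


def silu(v: np.ndarray) -> np.ndarray:
    return v / (1.0 + np.exp(-v))


class StandInSiluAndMul:
    """Stand-in for SiluAndMul, used only to illustrate the two dispatch paths."""

    def __init__(self, bias=None, quant_scale=-1, fused_kernel=None):
        self.bias = bias
        self.quant_scale = quant_scale
        self.fused_kernel = fused_kernel or MagicMock(return_value=np.ones((2, 1)))

    def forward(self, x: np.ndarray) -> np.ndarray:
        if self.bias is None and self.quant_scale == -1:
            gate, up = np.split(x, 2, axis=-1)  # plain swiglu fast path
            return silu(gate) * up
        return self.fused_kernel(x)             # fused_bias_act path


class TestDispatchPaths(unittest.TestCase):
    def test_fast_path(self):
        layer = StandInSiluAndMul()
        out = layer.forward(np.ones((2, 2), dtype=np.float32))
        np.testing.assert_allclose(out, 0.73105854, rtol=1e-6)
        layer.fused_kernel.assert_not_called()

    def test_fused_path(self):
        layer = StandInSiluAndMul(bias=np.zeros(1, dtype=np.float32))
        layer.forward(np.ones((2, 2), dtype=np.float32))
        layer.fused_kernel.assert_called_once()


if __name__ == "__main__":
    unittest.main()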
2 changes: 1 addition & 1 deletion tests/model_loader/test_torch_model.py
@@ -140,7 +140,7 @@ def test_model_against_baseline(

# Get baseline suffix from config
model_config = hugging_face_model_param_map.get(model_name_or_path, {})
baseline_suffix = model_config.get("baseline_suffix", "tp2")
baseline_suffix = model_config.get("baseline_suffix", "tp2-24")
baseline_filename = f"{model_name_or_path}-{baseline_suffix}"

if base_path: