
@iosmers iosmers commented Sep 11, 2025

This PR adapts XPU to support the ERNIE-4.5-300B-A47B-W4A8C8-TP4-Paddle model.
Launch command:
export XPU_VISIBLE_DEVICES="4,5,6,7"
python -m fastdeploy.entrypoints.openai.api_server --model ./PaddlePaddle/ERNIE-4.5-300B-A47B-W4A8C8-TP4-Paddle --port 8188 \
    --tensor-parallel-size 4 --max-model-len 32768 --max-num-seqs 1 --quantization "W4A8" --gpu-memory-utilization 0.9

paddle-bot bot commented Sep 11, 2025

Thanks for your contribution!

CLAassistant commented Sep 25, 2025

CLA assistant check
All committers have signed the CLA.

@iosmers iosmers changed the title from "xpu support w4a8" to "[XPU] Support W4A8C8-TP4-300B Model" Sep 26, 2025
PD_CHECK(ret == 0);
return {out, scale};
} else {
}
Collaborator

Isn't there a code style problem here?


if fd_config.quant_config and hasattr(fd_config.quant_config, "kv_cache_quant_type"):
    self.quant_method: QuantMethodBase = fd_config.quant_config.get_quant_method(self)
    print(f"quant_method: {self.quant_method}")
Collaborator

Shouldn't this be printed through the logger instead?
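
A minimal sketch of what routing this message through a logger could look like. It uses Python's standard `logging` module as an assumption; the project may have its own logger helper, which is not shown in this thread. `DummyQuantMethod` and `log_quant_method` are hypothetical names used only for illustration.

```python
import logging

# Assumption: a standard-library logger; the project's own logging
# utility (if any) would replace this line.
logger = logging.getLogger("quant_sketch")


class DummyQuantMethod:
    """Hypothetical stand-in for the real quant method object."""

    def __repr__(self) -> str:
        return "DummyQuantMethod()"


def log_quant_method(quant_method) -> None:
    # Lazy %-style formatting: the message string is only built
    # when the INFO level is actually enabled for this logger.
    logger.info("quant_method: %s", quant_method)
```

Compared with `print`, this keeps the message under the control of the configured log level and handlers, so it can be silenced or redirected in production runs.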

default_initializer=paddle.nn.initializer.Constant(0),
)

def calculate_md5(self, arr):
Collaborator

Where is this newly added interface used?

XPU compute Fused MoE.
"""
from fastdeploy.model_executor.ops.xpu import xpu_moe_layer
# from fastdeploy.model_executor.ops.xpu import xpu_moe_layer
Collaborator

Please delete the unused code.

    or self.quant_type == XPUKvCacheQuantzationTypes.FP8_ZP
    or self.quant_type == XPUKvCacheQuantzationTypes.BLOCK_WISE_FP8
):
    self.max_bound = 448.0
Collaborator

Shouldn't this raise an error saying the type is not supported?
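
A minimal sketch of the fail-fast behavior this comment asks about. The enum and the table of supported types are hypothetical stand-ins, not the PR's `XPUKvCacheQuantzationTypes` or its real support matrix; which types XPU actually supports, and the `127.0` bound, are assumptions made only for illustration.

```python
from enum import Enum


class KvCacheQuantType(Enum):
    """Hypothetical stand-in for the PR's quantization type enum."""

    INT8 = "int8"
    FP8 = "fp8"
    INT4_ZP = "int4_zp"


# Assumption: only INT8 is treated as supported in this sketch.
SUPPORTED_MAX_BOUND = {KvCacheQuantType.INT8: 127.0}


def max_bound_for(quant_type: KvCacheQuantType) -> float:
    """Return the clamp bound, or raise if the type is unsupported."""
    if quant_type not in SUPPORTED_MAX_BOUND:
        raise NotImplementedError(
            f"KV cache quant type {quant_type.name} is not supported on this backend"
        )
    return SUPPORTED_MAX_BOUND[quant_type]
```

Raising here surfaces a misconfiguration at load time instead of silently assigning a bound that the backend kernels may never honor.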

):
    self.max_bound = 448.0
elif self.quant_type == XPUKvCacheQuantzationTypes.INT4_ZP:
    self.max_bound = 7.0
Collaborator

Same as above.

cache_v_scale_tensor = get_tensor(state_dict.pop(self.cache_v_scale_name)).cast("float32").reshape_([-1])

if self.cache_quant_config.has_zero_point:  # cache_int4_zp
    cache_k_scale = 1.0 / cache_k_scale_tensor
Collaborator

Why is this a reciprocal relationship? Could you add a comment explaining it?
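
One possible reading of the reciprocal, sketched under assumptions the thread does not confirm: the checkpoint stores a dequantization step (real value = quantized value * step), while the write path multiplies by a quantization scale, so the two are reciprocals of each other. All names and the numeric values below are hypothetical.

```python
def quantize(x: float, quant_scale: float, max_bound: float = 127.0) -> float:
    """Scale, round, and clamp a real value into the quantized range."""
    q = round(x * quant_scale)
    return max(-max_bound, min(max_bound, q))


def dequantize(q: float, out_scale: float) -> float:
    """Map a quantized value back to the real domain."""
    return q * out_scale


# Assumption: the stored tensor holds the dequantization step.
stored_step = 0.0625

# Writing to the cache multiplies by the reciprocal of the step...
quant_scale = 1.0 / stored_step

# ...and reading from the cache multiplies by the step again.
out_scale = 1.0 / quant_scale
```

If this is indeed the convention in use, a one-line comment next to the division stating which direction each scale applies to would answer the question.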

use for loader v1
"""
if layer.cache_k_scale._is_initialized():
    layer.cache_k_out_scale.set_value(1 / layer.cache_k_scale)
Collaborator

Please add a comment explaining this.

hong19860320 previously approved these changes Sep 28, 2025
Collaborator

@hong19860320 hong19860320 left a comment

LGTM. Let's merge this version first; any remaining issues can be addressed in follow-up PRs.

Collaborator

@gongshaotian gongshaotian left a comment

LGTM

@XiaoguangHu01 XiaoguangHu01 left a comment

LGTM

@EmmonsCurse EmmonsCurse merged commit 20c7b74 into PaddlePaddle:develop Oct 10, 2025
31 of 45 checks passed

9 participants