-
Notifications
You must be signed in to change notification settings - Fork 690
[XPU] Support W4A8C8-TP4-300B Model #4068
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thanks for your contribution! |
027483b to
aeb4eb7
Compare
aeb4eb7 to
dd28881
Compare
d5ce629 to
01908db
Compare
b4fc62a to
3f64940
Compare
fastdeploy/model_executor/layers/backends/xpu/quantization/kv_cache.py
Outdated
Show resolved
Hide resolved
| PD_CHECK(ret == 0); | ||
| return {out, scale}; | ||
| } else { | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个代码风格有问题吧?
|
|
||
| if fd_config.quant_config and hasattr(fd_config.quant_config, "kv_cache_quant_type"): | ||
| self.quant_method: QuantMethodBase = fd_config.quant_config.get_quant_method(self) | ||
| print(f"quant_method: {self.quant_method}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
是不是得走log 打印?
| default_initializer=paddle.nn.initializer.Constant(0), | ||
| ) | ||
|
|
||
| def calculate_md5(self, arr): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个新增的接口用在哪?
| XPU compute Fused MoE. | ||
| """ | ||
| from fastdeploy.model_executor.ops.xpu import xpu_moe_layer | ||
| # from fastdeploy.model_executor.ops.xpu import xpu_moe_layer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
无用的代码删掉
| or self.quant_type == XPUKvCacheQuantzationTypes.FP8_ZP | ||
| or self.quant_type == XPUKvCacheQuantzationTypes.BLOCK_WISE_FP8 | ||
| ): | ||
| self.max_bound = 448.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
是不是应该报错不支持
| ): | ||
| self.max_bound = 448.0 | ||
| elif self.quant_type == XPUKvCacheQuantzationTypes.INT4_ZP: | ||
| self.max_bound = 7.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
同上
| cache_v_scale_tensor = get_tensor(state_dict.pop(self.cache_v_scale_name)).cast("float32").reshape_([-1]) | ||
|
|
||
| if self.cache_quant_config.has_zero_point: # cache_int4_zp | ||
| cache_k_scale = 1.0 / cache_k_scale_tensor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
为什么是倒数关系?是不是加一下注释。
| use for loader v1 | ||
| """ | ||
| if layer.cache_k_scale._is_initialized(): | ||
| layer.cache_k_out_scale.set_value(1 / layer.cache_k_scale) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
加个注释说明一下吧
hong19860320
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, 先合入一版,有问题在后续PR补上吧
gongshaotian
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
XiaoguangHu01
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
本PR实现XPU适配ERNIE-4.5-300B-A47B-W4A8C8-TP4-Paddle模型
启动命令:
export XPU_VISIBLE_DEVICES="4,5,6,7"
python -m fastdeploy.entrypoints.openai.api_server --model ./PaddlePaddle/ERNIE-4.5-300B-A47B-W4A8C8-TP4-Paddle --port 8188
--tensor-parallel-size 4 --max-model-len 32768 --max-num-seqs 1 --quantization "W4A8" --gpu-memory-utilization 0.9