Conversation

@zeroRains
Contributor

zeroRains commented Aug 20, 2025

pcard-71500

Note: the V1 Loader relies on new behavior of set_value and copy_ in the Paddle develop branch, so it requires Paddle develop, or 3.2 (not yet released) and later.
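
For context, the V1 loader writes checkpoint tensors into the model's existing parameters in place rather than building a second full copy. A minimal sketch of that pattern, using only the two APIs named above (illustrative only, not the loader's actual code):

# in_place_load_sketch.py (illustrative only)
import paddle

# A parameter that already exists on the model.
param = paddle.create_parameter(shape=[4, 4], dtype="float32")
# A tensor read from the checkpoint.
loaded = paddle.randn([4, 4], dtype="float32")

# set_value writes into the parameter's existing storage instead of allocating a
# new tensor, which is what keeps peak memory low; copy_ is the other in-place
# path the note above refers to. Both need the Paddle versions listed above.
param.set_value(loaded)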

This PR adapts the Qwen2 model to the V1 Loader.
Verified under bf16: accuracy is aligned with develop, and both the old and the new load paths continue to work.
With the V1 Loader, loading Qwen2-7B-Instruct under TP4 uses only 29% of the original memory, and load time is unchanged.

Changes:

  1. Add a load_weights method to the Qwen2 model to adapt it to the V1 loader.
  2. Add a bias_loader method to QKVParallelLinear so that when the bias goes through weight_loader there is a matching routine to run (see the sketch after this list).
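
A rough sketch of how the two pieces fit together, assuming the per-parameter weight_loader convention the V1 loader uses; the mapping entries, names, and signatures below are illustrative, not the exact code added in this PR:

# Sketch only: names, mapping entries, and signatures are assumptions, not this PR's exact code.
from typing import Iterable, Tuple

import paddle


def default_weight_loader(param: paddle.Tensor, loaded_weight: paddle.Tensor) -> None:
    # Fallback for parameters without a dedicated loader: plain in-place assignment.
    param.set_value(loaded_weight)


def load_weights(model: paddle.nn.Layer, weights: Iterable[Tuple[str, paddle.Tensor]]) -> None:
    # Checkpoint shard name -> (merged parameter name, shard id) for the fused QKV projection.
    stacked_params_mapping = [
        ("q_proj", "qkv_proj", "q"),
        ("k_proj", "qkv_proj", "k"),
        ("v_proj", "qkv_proj", "v"),
    ]
    params_dict = dict(model.named_parameters())
    for name, loaded_weight in weights:
        for shard_name, merged_name, shard_id in stacked_params_mapping:
            if shard_name in name:
                param = params_dict[name.replace(shard_name, merged_name)]
                # The merged QKV weight carries a weight_loader; with this PR its bias
                # goes through an analogous bias_loader on QKVParallelLinear, so the
                # per-shard in-place write happens for both weight and bias.
                param.weight_loader(param, loaded_weight, shard_id)
                break
        else:
            param = params_dict[name]
            getattr(param, "weight_loader", default_weight_loader)(param, loaded_weight)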

Test records:
Model: Qwen2-7B-Instruct
Performance

Metric                      Old loader    V1 Loader
TP4 memory usage (GB)       17            5
TP4 model load time (s)     4.41          4.41

P.S. Memory usage is the peak memory observed from the start of end-to-end inference with fd until it finishes. Model load time is the maximum of the load times recorded in each rank's worker.log.

Accuracy

Qwen2-7B-Instruct tp1 bf16: outputs match
base bf16 : ? I am a large language model created by Alibaba Cloud. I am called Qwen.\n\nCan you generate a poem about the beauty of nature?
pr   bf16 : ? I am a large language model created by Alibaba Cloud. I am called Qwen.\n\nCan you generate a poem about the beauty of nature?

tp4 bf16: outputs match
base bf16 : ? I am a large language model created by Alibaba Cloud. I am called Qwen.\n\nCan you generate a poem about the beauty of nature?
pr   bf16 : ? I am a large language model created by Alibaba Cloud. I am called Qwen.\n\nCan you generate a poem about the beauty of nature?

Test script:

# run_demo.sh

export CUDA_VISIBLE_DEVICES=0,1,2,3
export PYTHONPATH=/workspace/miniconda3/envs/py310:$PYTHONPATH
export PYTHONPATH=/workspace/FastDeploy:$PYTHONPATH
python offline_demo.py

# offline_demo.py
from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.llm import LLM

model_name_or_path = "/workspace/Models/Qwen2-7B-Instruct"
sampling_params = SamplingParams(temperature=0.1, max_tokens=30, top_p=0)
# load_choices="default_v1" selects the new V1 loader; "default" selects the old one.
llm = LLM(model=model_name_or_path, num_gpu_blocks_override=1024, tensor_parallel_size=4, load_choices="default_v1")
output = llm.generate(prompts="who are you",
                      use_tqdm=True,
                      sampling_params=sampling_params)
print(output)

@paddle-bot

paddle-bot (bot) commented Aug 20, 2025

Thanks for your contribution!

@yuanlehome
Collaborator

Please add a CI unit test.

@zeroRains
Contributor Author

Please add a CI unit test.

Testing this through end-to-end model inference is easier to handle; pulling this part out into a standalone unit test isn't straightforward.

Jiang-Jia-Jun merged commit 79f0dbb into PaddlePaddle:develop Aug 22, 2025
13 of 16 checks passed
zeroRains deleted the qwen branch August 23, 2025 04:49