Conversation

Collaborator

@bukejiyu bukejiyu commented Sep 9, 2025

This PR adds support for loading Qwen-series offline-quantized FP8 weights and ERNIE-series offline-quantized FP8 weights with the following code:

# The original snippet omits the imports and sampling_params; the lines below
# assume FastDeploy-style entry points and use a placeholder model path.
from fastdeploy import LLM, SamplingParams

model_name_or_path = "path/to/offline-quantized-fp8-checkpoint"  # placeholder
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model=model_name_or_path,
    num_gpu_blocks_override=1024,
    tensor_parallel_size=1,
    load_choices="default_v1",
    use_cudagraph=False,
)
output = llm.generate(
    prompts="who are you",
    use_tqdm=True,
    sampling_params=sampling_params,
)


paddle-bot bot commented Sep 9, 2025

Thanks for your contribution!

--ignore=tests/ce
--ignore=tests/operators/test_fused_moe.py
--ignore=tests/operators/test_w4afp8_gemm.py
--ignore=tests/model_loader/test_model_cache.py
Collaborator

Why was this unit test added to the ignore list? If there is an issue, it should be fixed.

**extra_weight_attrs,
"weight_loader": extra_weight_attrs.get("weight_loader", default_weight_loader(layer.fd_config)),
"model_format": extra_weight_attrs.get("model_format", ""),
"weight_need_transpose": extra_weight_attrs.get("model_format") == "torch",
Collaborator

I don't recommend this change. As I understand it, model_format and weight_need_transpose are coupled: once the model formats are unified in the future, this transpose operation will no longer exist. With this change, however, the transpose logic becomes independent of the model format, and a later developer taking over this code after the formats are unified will find it hard to realize that the transpose logic needs to be removed.

Collaborator Author

But currently not every torch-format model needs the transpose. For example, offline-quantized FP8 torch weights do not need to be transposed, so the old check was already incorrect. With this change, weight_need_transpose is still strongly tied to the model format, but the quant method's create_weights can decide, based on the model and quantization type, whether a transpose is actually needed; see the sketch below.
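A minimal sketch of the approach described above, assuming a hypothetical quant-method class; the class name, attribute, and surrounding layout are illustrative and not taken from this PR.

class Fp8OfflineQuantMethod:
    # Assumption for illustration: offline-quantized FP8 checkpoints already
    # store weights in the expected layout, so no transpose is needed for them.
    requires_transpose_for_torch = False

    def create_weights(self, layer, **extra_weight_attrs):
        model_format = extra_weight_attrs.get("model_format", "")
        # The quant method, not the loader, decides whether torch-format
        # weights still need a [1, 0] transpose.
        extra_weight_attrs["weight_need_transpose"] = (
            model_format == "torch" and self.requires_transpose_for_torch
        )
        # ... allocate the quantized weight / scale parameters here ...
        return extra_weight_attrs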

if getattr(model_config, "num_hidden_layers", None) is None:
    raise ValueError("num_hidden_layers is None")

quantization_config = model_config.quantization_config
Collaborator

Take a look at the changes to work_process.py in #4051; there is a conflict.

Collaborator Author

Already handled.

setattr(config_obj, config_attr_name, origin_value)


def rename_offline_ckpt_suffix_to_fd_suffix(
Collaborator

Why is this block of code needed? Please explain the reason and give the full background.

Collaborator Author

The FP8 quantization suffixes do not seem to be consistent across models. For example, Llama's FP8 scale is called weight_scale (vLLM handles this mapping specifically in llama4.py), while Qwen's scale is called weight_scale_inv. So I added a dedicated method that maps checkpoint suffixes to FD suffixes.
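A rough, self-contained sketch of such a suffix mapping; the table contents, function shape, and suffix choices are assumptions for illustration and do not reproduce the PR's actual implementation.

# Hypothetical mapping from checkpoint-specific FP8 scale suffixes to one
# FD-internal suffix; the real table in the PR may differ.
_CKPT_TO_FD_SUFFIX = {
    "weight_scale_inv": "weight_scale",  # e.g. Qwen-style FP8 checkpoints
    "weight_scale": "weight_scale",      # e.g. Llama-style FP8 checkpoints
}


def rename_offline_ckpt_suffix_to_fd_suffix(weight_name: str) -> str:
    # Rename an offline-quantized checkpoint key to the FD-internal suffix.
    for ckpt_suffix, fd_suffix in _CKPT_TO_FD_SUFFIX.items():
        if weight_name.endswith(ckpt_suffix):
            return weight_name[: -len(ckpt_suffix)] + fd_suffix
    return weight_name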

@yuanlehome
Collaborator

Please update the branch; there may be code conflicts.

if weight_need_transpose:
    loaded_weight = get_tensor(loaded_weight)
    loaded_weight = loaded_weight.transpose([1, 0])
    param.weight_need_transpose = False
Collaborator

Is there a particular reason for setting this back to False here?

Collaborator Author
@bukejiyu bukejiyu Sep 15, 2025

Because weights that are fused on disk are loaded via a recursive call, weight_loader is invoked twice; with this change the transpose happens only once, as illustrated below.
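A simplified, self-contained illustration of the double-call scenario described above; the FakeParam class, the gate/up split, and the use of numpy stand in for the real loader machinery and are assumptions for illustration only.

import numpy as np


class FakeParam:
    # Minimal stand-in for a parameter carrying the transpose flag.
    def __init__(self):
        self.weight_need_transpose = True
        self.shards = {}


def weight_loader(param, loaded_weight, shard_id=None):
    # Transpose at most once, even though the fused-on-disk path below
    # re-enters weight_loader once per shard.
    if getattr(param, "weight_need_transpose", False):
        loaded_weight = loaded_weight.transpose()
        param.weight_need_transpose = False
    if shard_id is None:
        # Fused-on-disk case: split the concatenated tensor, then recurse per shard.
        gate, up = np.split(loaded_weight, 2, axis=0)
        weight_loader(param, gate, shard_id="gate")
        weight_loader(param, up, shard_id="up")
        return
    param.shards[shard_id] = loaded_weight


param = FakeParam()
weight_loader(param, np.ones((8, 4)))  # transposed once to (4, 8), then split into two (2, 8) shards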

@Jiang-Jia-Jun Jiang-Jia-Jun merged commit 29ed617 into PaddlePaddle:develop Sep 15, 2025
23 of 28 checks passed