
Conversation

**@bukejiyu** (Collaborator) commented on Oct 22, 2025

Motivation

The original PT weight-loading logic degraded H2D (host-to-device) copy performance severalfold. This PR reworks the PT loading logic to speed up model loading.

Modifications

Overview

For all models except ViT / Resampler, the PT weight-loading logic changes as follows (see the sketch below):
Old logic: load weight -> transpose -> param.copy_(weight)
New logic: create parameters aligned with the checkpoint layout -> param.copy_(weight) -> after_loading_fn performs the transpose
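A minimal sketch of the two flows for a 2-D linear weight. `after_loading_fn` is the hook named in this PR; the function names and arguments around it are illustrative, not the actual FastDeploy code:

```python
import paddle

def load_old(param, weight_cpu):
    # Old flow: transpose on the CPU and force a contiguous copy
    # (scattered reads), then copy host -> device.
    param.copy_(weight_cpu.t().contiguous(), False)

def load_new(param_ckpt_layout, weight_cpu):
    # New flow: the parameter is created with the checkpoint's layout,
    # so the H2D copy is a plain contiguous transfer.
    param_ckpt_layout.copy_(weight_cpu, False)

def after_loading_fn(param):
    # Post-loading hook: perform the transpose on the GPU, where it is cheap.
    return paddle.transpose(param, perm=[1, 0])
```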

Depends on the following Paddle framework PRs:

  1. PR: H2D copy optimization
  2. PR: fix mmap not being released correctly
  3. Fix for contiguous copy on CPU

Details

Changed how PT models from Hugging Face are loaded.
Refactored so far:
1. bf16
2. weight-only
3. deepgemm fp8 online quantization
4. triton backend: fp8 / Wfp8Afp8MoEMethod / triton weight-only

Test Results

⚠️ To get the loading speedup, you must use a paddle develop build from Nov 4 (11.04) or later; earlier develop builds see no improvement, and paddle 3.2 / 3.2.1 see none either.
Also requires `pip install safetensors==0.7.0rc0`.
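For reference, a minimal sketch of loading a checkpoint shard through the safetensors paddle binding; the shard filename is hypothetical, and the device-placement behavior of 0.7.0rc0 may differ from what is shown:

```python
from safetensors.paddle import load_file

# Hypothetical shard name; real ERNIE checkpoints ship many shards.
state_dict = load_file("model-00001-of-000xx.safetensors")
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape), tensor.dtype)
```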

| CPU model | Model | Precision | v1 (new) | v1 (old) | v0 | sglang/vllm |
| --- | --- | --- | --- | --- | --- | --- |
| Intel(R) Xeon(R) Platinum 8468V | ERNIE-4.5-300B-A47B-PT | bf16 | 47.595s | 93.335s | - | 60s |
| Intel(R) Xeon(R) Platinum 8468V | ERNIE-4.5-300B-Paddle | bf16 | 60.060s | 140.328s | 152.71s | - |
| Intel(R) Xeon(R) Platinum 8468V | ERNIE-4.5-VL-424B-A47B-PT | fp8 | 71.386s | 355.4s | - | - |

Usage or Command

Accuracy Tests

ci/ce

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code; run pre-commit before committing.
  • Add unit tests; if none are added, state the reason in this PR.
  • Provide accuracy results.
  • If the current PR targets a release branch, make sure it has also been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

**@paddle-bot** commented on Oct 22, 2025

Thanks for your contribution!

@bukejiyu bukejiyu changed the title [loader] Refactor PT model loading [Loader] Refactor PT model loading Oct 22, 2025
@bukejiyu bukejiyu force-pushed the v1_loader_speed_up branch 2 times, most recently from a526f60 to 233ca08 Compare October 28, 2025 14:05
@bukejiyu bukejiyu force-pushed the v1_loader_speed_up branch 3 times, most recently from 6db5f59 to 5b3e605 Compare November 5, 2025 09:04
@YuanRisheng YuanRisheng requested a review from Copilot November 5, 2025 11:50
**Copilot AI** (Contributor) left a comment


Pull Request Overview

This PR introduces comprehensive support for loading PyTorch model weights in FastDeploy, with a focus on handling weight transposition between PyTorch and Paddle formats and optimizing the model loading pipeline. Key improvements include migrating to safetensors 0.7.0rc0 for direct GPU tensor loading, refactoring weight processing logic, and introducing a new post-loading processing phase.

  • Migrated safetensors dependency to version 0.7.0rc0 with direct framework integration
  • Implemented process_final_after_loading for post-loading weight transformations
  • Refactored weight transpose logic into centralized process_weight_transpose and h2d_copy functions (see the sketch after this list)
  • Updated quantization methods to handle PyTorch vs Paddle weight format differences
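A hypothetical sketch of what the two centralized helpers named above might look like; the actual signatures in fastdeploy/model_executor/utils.py may differ:

```python
import paddle

def h2d_copy(param: paddle.Tensor, weight: paddle.Tensor) -> None:
    # Copy a checkpoint-shaped CPU tensor into the device parameter without
    # transposing first, so the H2D transfer stays contiguous.
    param.copy_(weight, False)

def process_weight_transpose(layer, weight_name: str) -> None:
    # After loading, turn a PyTorch-format [out_features, in_features] weight
    # into Paddle's [in_features, out_features] layout on the GPU.
    weight = getattr(layer, weight_name)
    setattr(layer, weight_name, paddle.transpose(weight, perm=[1, 0]))
```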

Reviewed Changes

Copilot reviewed 31 out of 32 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| requirements.txt | Added safetensors 0.7.0rc0 for improved tensor loading |
| fastdeploy/model_executor/utils.py | Added weight transpose utilities, h2d_copy, and a multi-config context manager |
| fastdeploy/model_executor/load_weight_utils.py | Updated the safetensors loader to use the Paddle framework; modified cache logic |
| fastdeploy/model_executor/layers/quantization/*.py | Refactored quantization methods to handle format-specific weight shapes and transpose logic |
| fastdeploy/model_executor/layers/moe/*.py | Updated MoE layers with format-aware weight handling and transpose operations |
| fastdeploy/model_executor/layers/linear.py | Added transpose processing to linear layers for PyTorch format compatibility |
| fastdeploy/model_executor/layers/lm_head.py | Implemented weight transpose in lm_head for tied embeddings |
| fastdeploy/model_executor/models/*.py | Updated all model load_weights methods to call process_final_after_loading |
| fastdeploy/engine/*.py | Set the OMP_NUM_THREADS environment variable to 3 |

**@YuanRisheng** (Collaborator) commented:

What is the reason for changing the PT weight-loading logic here, and which step accounts for the performance gain?

@bukejiyu bukejiyu force-pushed the v1_loader_speed_up branch 8 times, most recently from f809c64 to 90e3a47 Compare November 10, 2025 18:23
**@bukejiyu** (Collaborator, Author) commented on Nov 11, 2025

> What is the reason for changing the PT weight-loading logic here, and which step accounts for the performance gain?

Old logic: transpose on the CPU, then make the result contiguous, then copy it to the GPU.
New logic: make the tensor contiguous on the CPU as-is, move the data to the GPU, and transpose it there.
On the CPU, transpose followed by contiguous causes a large amount of scattered memory access and is roughly 3-4x slower; doing the transpose on the GPU is fast. Running the same example in torch shows the same degradation as paddle. See the sketch below.
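An illustrative micro-benchmark of this effect; the shape and timing harness are hypothetical, and it assumes a CUDA build of paddle:

```python
import time
import paddle

w = paddle.randn([8192, 8192], dtype="float32").cpu()
paddle.device.cuda.synchronize()

# Old flow: transpose + contiguous on the CPU (scattered reads), then H2D copy.
t0 = time.perf_counter()
gpu_old = w.t().contiguous().cuda()
paddle.device.cuda.synchronize()
t1 = time.perf_counter()

# New flow: contiguous H2D copy first, then transpose on the GPU.
gpu_new = paddle.transpose(w.cuda(), perm=[1, 0])
paddle.device.cuda.synchronize()
t2 = time.perf_counter()

print(f"CPU transpose then copy : {t1 - t0:.3f}s")
print(f"copy then GPU transpose : {t2 - t1:.3f}s")
```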

**@yuanlehome** (Collaborator) left a comment


LGTM

@bukejiyu bukejiyu merged commit b09ebb2 into PaddlePaddle:develop Nov 11, 2025
13 of 16 checks passed
