
Conversation

**@bukejiyu** (Collaborator) commented on Oct 22, 2025

Motivation

The original PT weight-loading logic degraded H2D (host-to-device) copy performance severalfold. This PR reworks the PT loading logic to speed up model loading.

Modifications

Overview

For all models except ViT / Resampler, the PT weight-loading logic changes as follows (see the sketch below):
Old logic: load weight -> transpose -> param.copy_(weight)
New logic: create parameters aligned with the checkpoint layout -> param.copy_(weight) -> after_loading_fn performs the transpose
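A minimal sketch of the two flows for a 2-D linear weight. `after_loading_fn` is the hook named in this PR; the function names and arguments around it are illustrative, not the actual FastDeploy code:

```python
import paddle

def load_old(param, weight_cpu):
    # Old flow: transpose on the CPU and force a contiguous copy
    # (scattered reads), then copy host -> device.
    param.copy_(weight_cpu.t().contiguous(), False)

def load_new(param_ckpt_layout, weight_cpu):
    # New flow: the parameter is created with the checkpoint's layout,
    # so the H2D copy is a plain contiguous transfer.
    param_ckpt_layout.copy_(weight_cpu, False)

def after_loading_fn(param):
    # Post-loading hook: perform the transpose on the GPU, where it is cheap.
    return paddle.transpose(param, perm=[1, 0])
```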

Depends on the following Paddle framework PRs:

  1. PR: H2D copy optimization
  2. PR: fix mmap not being released correctly
  3. Fix for contiguous copy on CPU

Details

Changed how PT models from Hugging Face are loaded.
Refactored so far:
1. bf16
2. weight-only
3. deepgemm fp8 online quantization
4. triton backend: fp8 / Wfp8Afp8MoEMethod / triton weight-only

Test Results

⚠️ To get the loading speedup, you must use a paddle develop build from Nov 4 (11.04) or later; earlier develop builds see no improvement, and paddle 3.2 / 3.2.1 see none either.
Also requires `pip install safetensors==0.7.0rc0`.
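For reference, a minimal sketch of loading a checkpoint shard through the safetensors paddle binding; the shard filename is hypothetical, and the device-placement behavior of 0.7.0rc0 may differ from what is shown:

```python
from safetensors.paddle import load_file

# Hypothetical shard name; real ERNIE checkpoints ship many shards.
state_dict = load_file("model-00001-of-000xx.safetensors")
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape), tensor.dtype)
```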

| CPU model | Model | Precision | v1 (new) | v1 (old) | v0 | sglang/vllm |
| --- | --- | --- | --- | --- | --- | --- |
| Intel(R) Xeon(R) Platinum 8468V | ERNIE-4.5-300B-A47B-PT | bf16 | 47.595s | 93.335s | - | 60s |
| Intel(R) Xeon(R) Platinum 8468V | ERNIE-4.5-300B-Paddle | bf16 | 60.060s | 140.328s | 152.71s | - |
| Intel(R) Xeon(R) Platinum 8468V | ERNIE-4.5-VL-424B-A47B-PT | fp8 | 71.386s | 355.4s | - | - |

Usage or Command

Accuracy Tests

ci/ce

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code; run pre-commit before committing.
  • Add unit tests; if none are added, state the reason in this PR.
  • Provide accuracy results.
  • If the current PR targets a release branch, make sure it has also been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

**@paddle-bot** commented on Oct 22, 2025

Thanks for your contribution!

@bukejiyu bukejiyu changed the title [loader] Refactor PT model loading [Loader] Refactor PT model loading Oct 22, 2025
@bukejiyu bukejiyu force-pushed the v1_loader_speed_up branch 2 times, most recently from a526f60 to 233ca08 Compare October 28, 2025 14:05
@bukejiyu bukejiyu force-pushed the v1_loader_speed_up branch 3 times, most recently from 6db5f59 to 5b3e605 Compare November 5, 2025 09:04
@YuanRisheng YuanRisheng requested a review from Copilot November 5, 2025 11:50
**Copilot AI** (Contributor) left a comment


Pull Request Overview

This PR introduces comprehensive support for loading PyTorch model weights in FastDeploy, with a focus on handling weight transposition between PyTorch and Paddle formats and optimizing the model loading pipeline. Key improvements include migrating to safetensors 0.7.0rc0 for direct GPU tensor loading, refactoring weight processing logic, and introducing a new post-loading processing phase.

  • Migrated safetensors dependency to version 0.7.0rc0 with direct framework integration
  • Implemented process_final_after_loading for post-loading weight transformations
  • Refactored weight transpose logic into centralized process_weight_transpose and h2d_copy functions (see the sketch after this list)
  • Updated quantization methods to handle PyTorch vs Paddle weight format differences
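A hypothetical sketch of what the two centralized helpers named above might look like; the actual signatures in fastdeploy/model_executor/utils.py may differ:

```python
import paddle

def h2d_copy(param: paddle.Tensor, weight: paddle.Tensor) -> None:
    # Copy a checkpoint-shaped CPU tensor into the device parameter without
    # transposing first, so the H2D transfer stays contiguous.
    param.copy_(weight, False)

def process_weight_transpose(layer, weight_name: str) -> None:
    # After loading, turn a PyTorch-format [out_features, in_features] weight
    # into Paddle's [in_features, out_features] layout on the GPU.
    weight = getattr(layer, weight_name)
    setattr(layer, weight_name, paddle.transpose(weight, perm=[1, 0]))
```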

Reviewed Changes

Copilot reviewed 31 out of 32 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| requirements.txt | Added safetensors 0.7.0rc0 for improved tensor loading |
| fastdeploy/model_executor/utils.py | Added weight transpose utilities, h2d_copy, and a multi-config context manager |
| fastdeploy/model_executor/load_weight_utils.py | Updated the safetensors loader to use the Paddle framework; modified cache logic |
| fastdeploy/model_executor/layers/quantization/*.py | Refactored quantization methods to handle format-specific weight shapes and transpose logic |
| fastdeploy/model_executor/layers/moe/*.py | Updated MoE layers with format-aware weight handling and transpose operations |
| fastdeploy/model_executor/layers/linear.py | Added transpose processing to linear layers for PyTorch format compatibility |
| fastdeploy/model_executor/layers/lm_head.py | Implemented weight transpose in lm_head for tied embeddings |
| fastdeploy/model_executor/models/*.py | Updated all model load_weights methods to call process_final_after_loading |
| fastdeploy/engine/*.py | Set the OMP_NUM_THREADS environment variable to 3 |

**@YuanRisheng** (Collaborator) commented:

What is the reason for changing the PT weight-loading logic here, and which step accounts for the performance gain?

@bukejiyu bukejiyu force-pushed the v1_loader_speed_up branch 8 times, most recently from f809c64 to 90e3a47 Compare November 10, 2025 18:23
**@bukejiyu** (Collaborator, Author) commented on Nov 11, 2025

> What is the reason for changing the PT weight-loading logic here, and which step accounts for the performance gain?

Old logic: transpose on the CPU, then make the result contiguous, then copy it to the GPU.
New logic: make the tensor contiguous on the CPU as-is, move the data to the GPU, and transpose it there.
On the CPU, transpose followed by contiguous causes a large amount of scattered memory access and is roughly 3-4x slower; doing the transpose on the GPU is fast. Running the same example in torch shows the same degradation as paddle. See the sketch below.
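An illustrative micro-benchmark of this effect; the shape and timing harness are hypothetical, and it assumes a CUDA build of paddle:

```python
import time
import paddle

w = paddle.randn([8192, 8192], dtype="float32").cpu()
paddle.device.cuda.synchronize()

# Old flow: transpose + contiguous on the CPU (scattered reads), then H2D copy.
t0 = time.perf_counter()
gpu_old = w.t().contiguous().cuda()
paddle.device.cuda.synchronize()
t1 = time.perf_counter()

# New flow: contiguous H2D copy first, then transpose on the GPU.
gpu_new = paddle.transpose(w.cuda(), perm=[1, 0])
paddle.device.cuda.synchronize()
t2 = time.perf_counter()

print(f"CPU transpose then copy : {t1 - t0:.3f}s")
print(f"copy then GPU transpose : {t2 - t1:.3f}s")
```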

**@yuanlehome** (Collaborator) left a comment


LGTM

@bukejiyu bukejiyu merged commit b09ebb2 into PaddlePaddle:develop Nov 11, 2025
13 of 16 checks passed
