[Loader] Refactor PT model loading #4532
Conversation
Thanks for your contribution!
Pull Request Overview
This PR introduces comprehensive support for loading PyTorch model weights in FastDeploy, with a focus on handling weight transposition between PyTorch and Paddle formats and optimizing the model loading pipeline. Key improvements include migrating to safetensors 0.7.0rc0 for direct GPU tensor loading, refactoring weight processing logic, and introducing a new post-loading processing phase.
- Migrated safetensors dependency to version 0.7.0rc0 with direct framework integration
- Implemented `process_final_after_loading` for post-loading weight transformations
- Refactored weight transpose logic into centralized `process_weight_transpose` and `h2d_copy` functions
- Updated quantization methods to handle PyTorch vs Paddle weight format differences
Reviewed Changes
Copilot reviewed 31 out of 32 changed files in this pull request and generated 8 comments.
Summary per file:
| File | Description |
|---|---|
| requirements.txt | Added safetensors 0.7.0rc0 for improved tensor loading |
| fastdeploy/model_executor/utils.py | Added weight transpose utilities, h2d_copy, and multi-config context manager |
| fastdeploy/model_executor/load_weight_utils.py | Updated safetensors loader to use Paddle framework, modified cache logic |
| fastdeploy/model_executor/layers/quantization/*.py | Refactored quantization methods to handle format-specific weight shapes and transpose logic |
| fastdeploy/model_executor/layers/moe/*.py | Updated MoE layers with format-aware weight handling and transpose operations |
| fastdeploy/model_executor/layers/linear.py | Added transpose processing to linear layers for PyTorch format compatibility |
| fastdeploy/model_executor/layers/lm_head.py | Implemented weight transpose in lm_head for tied embeddings |
| fastdeploy/model_executor/models/*.py | Updated all model load_weights methods to call process_final_after_loading |
| fastdeploy/engine/*.py | Set OMP_NUM_THREADS environment variable to 3 |
What is the reason for adjusting the PT weight-loading logic here, and which step helps improve performance?
Original logic: the weight was first transposed on the CPU, then made contiguous (an extra CPU-side copy), and only then copied to the GPU.
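The ordering difference described above can be sketched with numpy stand-ins (names and shapes here are illustrative, not FastDeploy's actual APIs; in FastDeploy the final copy is `param.copy_` and the new-path transpose runs on the GPU after loading):

```python
import numpy as np

# A hypothetical checkpoint weight in PyTorch layout (illustrative shape).
ckpt_weight = np.random.rand(4096, 1024).astype(np.float32)

def old_path(weight):
    # Old logic: transpose on the CPU, then make the result contiguous
    # (a full CPU-side copy of the rearranged buffer) before the H2D copy.
    transposed = np.ascontiguousarray(weight.T)
    return transposed  # this rearranged buffer is then copied host-to-device

def new_path(weight):
    # New logic: copy the checkpoint-shaped, already-contiguous buffer
    # as-is (a single fast memcpy-style H2D copy) ...
    on_device = weight.copy()
    # ... then transpose after loading (on the GPU in FastDeploy).
    return on_device.T

# Both paths produce the same logical tensor; only where and when the
# rearrangement happens differs.
assert old_path(ckpt_weight).shape == new_path(ckpt_weight).shape
```

The point is that the old path pays for the transpose and the contiguity copy on the CPU before a now strided-unfriendly H2D transfer, while the new path keeps the H2D copy a plain contiguous transfer.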
yuanlehome left a comment:
LGTM
Motivation
The original PT weight-loading logic caused a several-fold degradation in H2D (host-to-device) copy performance. This PR changes the PT loading logic to improve model loading performance.
Modifications
Overview of changes
For models other than ViT / Resampler, the PT weight-loading logic is adjusted as follows:
Old logic: load weight -> transpose -> param.copy_(weight)
New logic: create parameters aligned with the checkpoint layout -> param.copy_(weight) -> after_loading_fn performs the transpose
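The new flow above can be sketched as follows. This is a minimal illustration with numpy stand-ins; the `Param` class, `load_weights`, and the lambda are hypothetical, though `process_final_after_loading` is the hook this PR introduces:

```python
import numpy as np

class Param:
    """Illustrative stand-in for a framework parameter."""
    def __init__(self, shape):
        # The parameter is created with the checkpoint-aligned (PT) shape,
        # so the load-time copy needs no rearrangement.
        self.data = np.empty(shape, dtype=np.float32)
        self.after_loading_fn = None

    def copy_(self, weight):
        np.copyto(self.data, weight)

def load_weights(param, ckpt_weight):
    # Plain copy in checkpoint layout; no transpose on this path.
    param.copy_(ckpt_weight)

def process_final_after_loading(param):
    # Post-loading phase: apply the deferred transformation, if any.
    if param.after_loading_fn is not None:
        param.data = param.after_loading_fn(param.data)

# Usage: a 2D weight stored transposed (PT layout) in the checkpoint.
w = np.arange(6, dtype=np.float32).reshape(2, 3)
p = Param(w.shape)
p.after_loading_fn = lambda t: np.ascontiguousarray(t.T)  # to Paddle layout
load_weights(p, w)
process_final_after_loading(p)
assert p.data.shape == (3, 2)
```

Deferring the transpose into a single post-loading pass is what lets the per-weight `copy_` stay a contiguous, checkpoint-shaped transfer.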
Depends on a Paddle framework PR
Changes
Changed the way PT models from Hugging Face are loaded.
Refactored:
1. bf16
2. weight-only
3. deepgemm fp8 online quantization
4. triton backend: fp8 / Wfp8Afp8MoEMethod / triton weight-only
Test Results
Depends on `pip install safetensors==0.7.0rc0`
Usage or Command
None
Accuracy Tests
ci/ce
Checklist
- Choose at least one PR tag from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- If submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.