[Optimize] optimize mask_quant & swiglu #6222
Conversation
Thanks for your contribution!
Codecov Report ❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #6222 +/- ##
==========================================
Coverage ? 67.00%
==========================================
Files ? 385
Lines ? 51283
Branches ? 7998
==========================================
Hits ? 34362
Misses ? 14430
Partials ? 2491
Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
/re-run all-failed
/re-run all-failed
/re-run all-failed
K11OntheBoat left a comment:
LGTM
1dc458d
yongqiangma left a comment:
LGTM
/re-run all-failed
a9846ea
This reverts commit 2ada119.
* optimize mask_quant op speed up 1.5
* fix calculate sequence
* add fused
* rm log
* push kernel code
* add ut
* accuracy ok
* add ue8m0
* add ut
* add merge develop
* rm ut of mask_per_token_quant
* Revert "[Optimize] optimize mask_quant & swiglu (#6222)" — this reverts commit 2ada119.
* add block_size
* pre-commit
Motivation
Modifications
from fastdeploy.model_executor.ops.gpu import group_swiglu_with_masked
from fastdeploy.model_executor.ops.gpu import masked_per_token_quant
Fuse the two operators above into fused_mask_swiglu_fp8_quant (a reference sketch of the fused computation follows below).
Drop fp16 support for now, since nothing currently needs to call it.
Drop support for int64 inputs, likewise unused.
Support the ue8m0 scale format.
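The sketch below is a minimal NumPy reference for what the fused kernel is described as computing: a masked SwiGLU over each expert group followed by per-token FP8 (e4m3) quantization, with an optional ue8m0 (power-of-two) scale mode. The function name mirrors the new op, but the argument list, tensor layout, and the omission of block_size handling are assumptions for illustration, not the actual CUDA kernel interface.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in float8 e4m3


def silu(x):
    return x / (1.0 + np.exp(-x))


def fused_mask_swiglu_fp8_quant_ref(x, valid_tokens, use_ue8m0=False):
    """Reference for the fused kernel (layout and argument names assumed).

    x:            [group_num, group_size, 2 * hidden_dim], gate half first,
                  up half second (assumed layout).
    valid_tokens: [group_num], number of valid rows per expert group; rows
                  beyond this count are masked out and left untouched.
    Returns (q, scales); the real kernel would store q as float8_e4m3.
    """
    group_num, group_size, two_h = x.shape
    hidden = two_h // 2
    q = np.zeros((group_num, group_size, hidden), dtype=np.float32)
    scales = np.ones((group_num, group_size), dtype=np.float32)

    for g in range(group_num):
        n = int(valid_tokens[g])
        gate, up = x[g, :n, :hidden], x[g, :n, hidden:]
        y = silu(gate) * up                    # SwiGLU activation
        amax = np.abs(y).max(axis=-1)          # per-token absolute maximum
        s = np.maximum(amax, 1e-12) / FP8_E4M3_MAX
        if use_ue8m0:                          # ue8m0: power-of-two scales
            s = np.exp2(np.ceil(np.log2(s)))
        q[g, :n] = np.clip(y / s[:, None], -FP8_E4M3_MAX, FP8_E4M3_MAX)
        scales[g, :n] = s
    return q, scales
```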
Accuracy:
bd7b915
This commit adds a test showing that the fused operator is bitwise aligned with the operators it replaces (a test sketch follows below).
The mask_per_token_quant operator is removed. The mask_swiglu operator is still called from other files (custom_ops/gpu_ops/moe/moe_ffn.cu, custom_ops/gpu_ops/moe/moe_expert_ffn_wint2.cu), so it is kept for now.
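A hedged sketch of how such a bitwise-alignment check could look. The op names come from this PR, but their call signatures and the import path of the fused op are assumptions; only the comparison pattern matters here — run the old two-op path and the fused path on the same input and require exact equality.

```python
import numpy as np
import paddle
from fastdeploy.model_executor.ops.gpu import (
    group_swiglu_with_masked,     # existing op, reference path
    masked_per_token_quant,       # existing op, reference path
    fused_mask_swiglu_fp8_quant,  # new fused op from this PR (assumed import path)
)


def test_fused_matches_reference():
    # Shapes follow the configuration used elsewhere in this PR.
    x = paddle.randn([10, 2048, 2 * 7168]).astype("bfloat16")
    valid = paddle.randint(0, 512, [10], dtype="int32")

    # Reference path: the two separate kernels (call signatures assumed).
    act = group_swiglu_with_masked(x, valid)
    ref_q, ref_scale = masked_per_token_quant(act, valid, 128)

    # Fused path introduced by this PR (call signature assumed).
    q, scale = fused_mask_swiglu_fp8_quant(x, valid, 128)

    # Bitwise alignment: cast fp8 payloads to float32 (lossless) and require
    # exact equality of both the quantized values and the per-token scales.
    assert np.array_equal(q.astype("float32").numpy(),
                          ref_q.astype("float32").numpy())
    assert np.array_equal(scale.astype("float32").numpy(),
                          ref_scale.astype("float32").numpy())
```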
Performance conclusion. Test configuration:
self.group_num = 10
self.group_size = 2048
self.hidden_dim = 7168
self.block_size = 128
Each rank has 10 experts, with the number of valid tokens per expert in the range 0-512.
Replacement speedup on H cards: about 1.6x.
Replacement speedup on B cards: about 2x.
(A micro-benchmark sketch of this configuration follows below.)
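For reference, a hedged micro-benchmark sketch using the configuration listed above. The fused op's signature and import path are assumptions; timing is done with explicit device synchronization around repeated launches.

```python
import time

import paddle
from fastdeploy.model_executor.ops.gpu import fused_mask_swiglu_fp8_quant

group_num, group_size, hidden_dim, block_size = 10, 2048, 7168, 128
x = paddle.randn([group_num, group_size, 2 * hidden_dim]).astype("bfloat16")
valid = paddle.randint(0, 512, [group_num], dtype="int32")  # valid tokens per expert

# Warm up, then time repeated launches of the fused kernel.
for _ in range(10):
    fused_mask_swiglu_fp8_quant(x, valid, block_size)
paddle.device.cuda.synchronize()

start = time.perf_counter()
for _ in range(100):
    fused_mask_swiglu_fp8_quant(x, valid, block_size)
paddle.device.cuda.synchronize()
print(f"avg latency: {(time.perf_counter() - start) / 100 * 1e3:.3f} ms")
```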
Usage or Command
Accuracy Tests
Checklist
- PR title tag (one of): [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- For a release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.