[webgpu] Optimize MatMulNBits for f16 Block32 prefill performance #23908
guschmue merged 12 commits into microsoft:main
Conversation
Tests: model_benchmark.exe -i Phi-3.5-mini-instruct-onnx-web -l 1000

@qjia7 @sushraja-msft @jchen10
|
Adding the shader here for easy review.
|
/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline

Azure Pipelines successfully started running 2 pipeline(s).

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models

/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI

Azure Pipelines successfully started running 4 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).

Azure Pipelines successfully started running 4 pipeline(s).
I'm trying to avoid making too many modifications in a single PR, to keep it easier to review and comparable with the previous shader. What are your thoughts?

Yes, we observed quite good performance at accuracy level 4 using the DP4A shader. I'll investigate a similar approach for f16.
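For context, DP4A-style instructions compute the dot product of four 8-bit lanes packed into a pair of 32-bit words and add the result to a 32-bit accumulator, which is why the int8 path performs so well. A minimal Python emulation of the semantics is sketched below; signed lanes and little-endian packing are assumptions of this sketch, not taken from the PR:

```python
import struct

def dp4a(a_packed: int, b_packed: int, acc: int) -> int:
    """Emulate a DP4A operation: dot product of four signed 8-bit
    lanes packed into 32-bit words, added to an accumulator."""
    a_bytes = struct.pack("<I", a_packed & 0xFFFFFFFF)
    b_bytes = struct.pack("<I", b_packed & 0xFFFFFFFF)
    lanes_a = struct.unpack("<4b", a_bytes)  # reinterpret as signed int8
    lanes_b = struct.unpack("<4b", b_bytes)
    return acc + sum(x * y for x, y in zip(lanes_a, lanes_b))

# Four lanes of 1 dotted with four lanes of 2 -> 8
print(dp4a(0x01010101, 0x02020202, 0))
```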

I can capture some perf numbers as well.

That's acceptable. Perhaps rename MatMulNBitsBlock32Program to MatMulNBitsBlockWideTileProgram, land this PR, and then work towards making this the default prefill program on all platforms. I'll review the shader.

Sure. Thanks.

@guschmue thanks, please let me know if there are any issues.
Force-pushed from 8a250db to 74da290.

/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models

Azure Pipelines successfully started running 2 pipeline(s).

/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI

Azure Pipelines successfully started running 4 pipeline(s).

Azure Pipelines successfully started running 4 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).

Resolved the existing comments. Please take another look, thanks.

@guschmue could you have a look as well, and merge this PR?

The CI pipelines changed; can you merge with main?
- Rename to `MatMulNBitsBlockWideTileProgram` for clarity.
- Enforce `M >= kMinMForTileOptimization`.
- Add TODO for future improvements.
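The `M >= kMinMForTileOptimization` check gates the wide-tile program to prefill-sized inputs, where many output rows make the tiling pay off. A hypothetical sketch of that dispatch decision follows; the threshold value and the fallback program name `MatMulNBitsProgram` are assumptions for illustration, not taken from the PR:

```python
# Assumed threshold; the actual value of kMinMForTileOptimization
# lives in the ONNX Runtime source and may differ.
K_MIN_M_FOR_TILE_OPTIMIZATION = 4

def select_program(m: int) -> str:
    """Pick the wide-tile program only when M (the number of output
    rows, large during prefill) is big enough to fill a tile; fall
    back to a per-row program for decode-sized M."""
    if m >= K_MIN_M_FOR_TILE_OPTIMIZATION:
        return "MatMulNBitsBlockWideTileProgram"
    return "MatMulNBitsProgram"  # hypothetical fallback name
```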
Force-pushed from 4d3801f to ca1710a.
Rebased to main. Please help re-trigger the CI, thanks.
|
Lint issue; it wants you to run
Fixed the lint issues.
|
The CI failure logs show it's likely a result of infrastructure instability and does not appear to be related to the changes.
|
/azp run Windows x64 QNN CI Pipeline,Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 5 pipeline(s).
Description
This commit improves the MatMulNBits f16 Block32 prefill performance by increasing the tile size and enhancing memory efficiency. It achieves a more than 2x performance boost on Intel iGPUs for the Phi-3.5-mini f16 model.
Motivation and Context
See above.
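The wide-tile idea behind the speedup can be sketched outside the shader: each workgroup computes a TILE_M x TILE_N block of the output, so every slice of the activation matrix brought into fast memory is reused across many output columns instead of being re-fetched. The NumPy sketch below only illustrates that loop structure; the tile sizes are arbitrary choices for this example and it omits the 4-bit dequantization the real WGSL shader performs:

```python
import numpy as np

TILE_M, TILE_N = 8, 32  # illustrative tile shape, not the shader's

def tiled_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Compute A @ B one TILE_M x TILE_N output block at a time.
    Each A slice is loaded once per tile and reused for TILE_N
    output columns, mirroring the shared-memory reuse in the shader."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for m0 in range(0, M, TILE_M):
        for n0 in range(0, N, TILE_N):
            a_tile = A[m0:m0 + TILE_M, :]        # reused across the tile
            b_tile = B[:, n0:n0 + TILE_N]
            C[m0:m0 + TILE_M, n0:n0 + TILE_N] = a_tile @ b_tile
    return C
```

Widening the tile raises arithmetic intensity (more multiply-adds per byte loaded), which is the "memory efficiency" the description refers to.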