
[WebNN] Support MatMulNBits op #24142

Merged
fdwr merged 3 commits into microsoft:main from Honry:support-matmulnbits
Apr 9, 2025

Conversation

Honry (Contributor) commented Mar 24, 2025

The MatMulNBits op can be emulated simply by DequantizeLinear + Transpose + MatMul; currently only 4-bit quantization is supported.

Thus the B and zero_points (if present) inputs must be known as initializers with data type 'uint8', and we need to register them as 'uint4' WebNN constants.
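(For illustration: the initializer's storage type is 'uint8' while its logical element type is 'uint4' because each byte packs two 4-bit values. A minimal sketch — the low-nibble-first layout here is an assumption for illustration, not taken from this PR:)

```python
byte = 0x98                 # one packed uint8 from a hypothetical B initializer
low = byte & 0x0F           # lower nibble: even-indexed uint4 element -> 8
high = (byte >> 4) & 0x0F   # upper nibble: odd-indexed uint4 element  -> 9
assert (low, high) == (8, 9)
```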

Typically, all initializers are registered as WebNN constants in one step via ModelBuilder::RegisterInitializers before constructing the WebNN graph. However, because WebNN doesn't support casting to 'uint4', we need to defer the registration of these two inputs until MatMulNBitsBuilder::AddToModelBuilderImpl is invoked.
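(For intuition, the emulation described above can be sketched in NumPy: unpack the 4-bit weights from their uint8 packing, dequantize block-wise as (q - zero_point) * scale — i.e. DequantizeLinear — then transpose and matmul. This is a simplified sketch under assumed conventions — low nibble first, per-block scales, default zero point 8 when zero_points is absent — not the EP's actual implementation:)

```python
import numpy as np

def unpack_uint4(packed):
    # Each uint8 byte holds two 4-bit values, low nibble first (assumed layout).
    low = packed & 0x0F
    high = (packed >> 4) & 0x0F
    return np.stack([low, high], axis=-1).reshape(*packed.shape[:-1], -1)

def matmul_nbits_emulated(a, b_packed, scales, n, k, block_size, zero_points=None):
    blocks = k // block_size
    # Unpack the 4-bit weights into [n, blocks, block_size].
    q = unpack_uint4(b_packed).reshape(n, blocks, block_size).astype(np.float32)
    if zero_points is None:
        zp = np.full((n, blocks), 8.0, dtype=np.float32)  # assumed default for uint4
    else:
        zp = unpack_uint4(zero_points).reshape(n, blocks).astype(np.float32)
    s = scales.reshape(n, blocks).astype(np.float32)
    # DequantizeLinear, block-wise along K.
    b = ((q - zp[..., None]) * s[..., None]).reshape(n, k)
    # Transpose + MatMul: B is [N, K], so the product is A @ B^T.
    return a @ b.T

# Demo: rows of B are the uint4 values [8, 9, 10, 11] and [8, 7, 6, 5].
a = np.array([[1., 1., 1., 1.]], dtype=np.float32)
b_packed = np.array([[0x98, 0xBA], [0x78, 0x56]], dtype=np.uint8)
scales = np.array([0.5, 2.0], dtype=np.float32)
out = matmul_nbits_emulated(a, b_packed, scales, n=2, k=4, block_size=4)
assert np.allclose(out, [[3.0, -12.0]])
```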

Honry (Contributor, Author) commented Mar 24, 2025

@fdwr, @guschmue, PTAL, thanks!

guschmue added the ep:WebNN (WebNN execution provider) label Mar 24, 2025
fdwr (Contributor) commented Apr 7, 2025

/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline,Windows GPU WebGPU CI Pipeline,Windows OpenVINO CI Pipeline

fdwr (Contributor) commented Apr 7, 2025

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

fdwr (Contributor) commented Apr 7, 2025

/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI

fdwr (Contributor) commented Apr 7, 2025

/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

fdwr (Contributor) left a comment


Few comments, else LGTM.

Honry added 3 commits April 7, 2025 14:07
Honry force-pushed the support-matmulnbits branch from 7693135 to f372f79 (April 7, 2025 06:42)
Honry (Contributor, Author) commented Apr 7, 2025

@fdwr, thanks for your comments, I've fixed them in the new commit, PTAL again.

fdwr (Contributor) left a comment


👍

fdwr (Contributor) commented Apr 8, 2025

/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline,Windows GPU WebGPU CI Pipeline,Windows OpenVINO CI Pipeline

fdwr (Contributor) commented Apr 8, 2025

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

fdwr (Contributor) commented Apr 8, 2025

/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI

fdwr (Contributor) commented Apr 8, 2025

/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@fdwr fdwr merged commit 5612ce5 into microsoft:main Apr 9, 2025
70 of 76 checks passed
ashrit-ms pushed a commit that referenced this pull request Apr 24, 2025

Labels

ep:WebNN WebNN execution provider

3 participants