[QNN EP] Add Case-2 LPBQ pattern support for Gemm and Matmul nodes #25865

Merged
edgchen1 merged 4 commits into microsoft:main from CodeLinaro:dev/tirupath/lpbq_case2_pattern_support
Jan 9, 2026

Conversation

@quic-tirupath
Contributor

Description

  • The Case-2 LPBQ pattern omits the QuantizeLinear node from the LPBQ packing pattern.
  • Modify the LPBQ fusion logic implemented in the QNN EP for Gemm and MatMul nodes to gracefully handle the optional QuantizeLinear node in the LPBQ packing pattern.
  • Add unit tests to verify Case-2 LPBQ pattern fusion for Gemm and MatMul nodes.

Motivation and Context

  • The QuantizeLinear node in the LowPowerBlockQuantization (LPBQ) encoding packing pattern can be optional: omitting it keeps the weights in an INT datatype, which further reduces the size of the model.
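The two packing patterns differ only in whether a QuantizeLinear node appears in the chain. A minimal sketch of how a fusion matcher can treat that node as optional (hypothetical, simplified node type and chain shape; not the actual QNN EP NodeUnit API or the exact LPBQ pattern):

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified node representation (not the ORT NodeUnit API).
struct Node {
  std::string op_type;
};

// Walk a linear chain of ops, treating "QuantizeLinear" as optional:
//   Case-1: DequantizeLinear -> QuantizeLinear -> target
//   Case-2: DequantizeLinear -> target
// Returns true if the chain matches either form exactly.
bool MatchesLpbqChain(const std::vector<Node>& chain, const std::string& target_op) {
  size_t i = 0;
  if (i >= chain.size() || chain[i].op_type != "DequantizeLinear") return false;
  ++i;
  // Optional QuantizeLinear: absent in the Case-2 pattern.
  if (i < chain.size() && chain[i].op_type == "QuantizeLinear") ++i;
  return i < chain.size() && chain[i].op_type == target_op && i + 1 == chain.size();
}
```

Both forms fuse to the same QNN node group; the matcher simply advances past the QuantizeLinear when it is present.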

@jywu-msft jywu-msft requested a review from Copilot August 27, 2025 16:03
@jywu-msft
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 5 pipeline(s).


Copilot AI left a comment


Pull Request Overview

This PR adds support for the Case-2 LPBQ (Low Power Block Quantization) pattern, in which the QuantizeLinear node is omitted from the LPBQ packing pattern for Gemm and MatMul nodes. This enables keeping weights in an INT datatype to reduce model size.

Key changes:

  • Modified LPBQ fusion logic to handle optional QuantizeLinear nodes
  • Added new unit tests for both MatMul and Gemm Case-2 LPBQ patterns
  • Extended utility functions to support lookup by input name

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

File summary:
  • lpbqmatmul_fusion_without_ql_test.cc: unit test for MatMul LPBQ fusion without QuantizeLinear
  • lpbqgemm_fusion_without_ql_test.cc: unit test for Gemm LPBQ fusion without QuantizeLinear
  • utils.h/.cc: added GetParentOfInputByName utility function
  • lpbqmatmul_fusion.h/.cc: modified to support the optional QuantizeLinear node
  • lpbqgemm_fusion.h/.cc: modified to support the optional QuantizeLinear node
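The lookup-by-input-name utility resolves which node produces a given input of another node. A self-contained sketch of the idea (simplified graph types and a hypothetical free-function signature; the real helper operates on onnxruntime graph objects):

```cpp
#include <string>
#include <vector>

// Simplified stand-ins for graph types; the real utility works on
// onnxruntime Node/GraphViewer objects, not these structs.
struct SimpleNode {
  std::string name;
  std::vector<std::string> inputs;   // names of consumed values
  std::vector<std::string> outputs;  // names of produced values
};

// Return the node that produces `input_name`, which `node` consumes,
// or nullptr if the value is a graph input/initializer or not consumed.
const SimpleNode* GetParentOfInputByName(const std::vector<SimpleNode>& graph,
                                         const SimpleNode& node,
                                         const std::string& input_name) {
  // Verify the node actually consumes this input.
  bool consumes = false;
  for (const auto& in : node.inputs) {
    if (in == input_name) { consumes = true; break; }
  }
  if (!consumes) return nullptr;
  // Find the producer of that value.
  for (const auto& candidate : graph) {
    for (const auto& out : candidate.outputs) {
      if (out == input_name) return &candidate;
    }
  }
  return nullptr;
}
```

Looking up the parent by input name (rather than by a fixed input index) is what lets the fusion tolerate the pattern variant where the QuantizeLinear producer is absent.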
Comments suppressed due to low confidence (2)

onnxruntime/core/providers/qnn/builder/qnn_node_group/lpbqmatmul_fusion.cc:142

  • [nitpick] Magic number index 2 is used to access the target node unit. Consider using a named constant or enum to make the code more self-documenting and less error-prone.
  return node_units_[2];

onnxruntime/core/providers/qnn/builder/qnn_node_group/lpbqgemm_fusion.cc:187

  • [nitpick] Magic number index 4 is used to access the target node unit. Consider using a named constant or enum to make the code more self-documenting and less error-prone.
  return node_units_[4];
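The nitpick about the magic indices can be addressed with named constants, for example (illustrative names only; the PR may resolve the comment differently):

```cpp
#include <cstddef>

// Illustrative index names for the node units captured by the MatMul LPBQ
// fusion; these names are hypothetical, not taken from the actual PR code.
enum MatMulLpbqNodeUnitIndex : std::size_t {
  kScaleDql = 0,
  kWeightDql = 1,
  kTargetMatMul = 2,
};

// The accessor then reads `return node_units_[kTargetMatMul];`
// instead of `return node_units_[2];`.
constexpr std::size_t TargetNodeUnitIndex() { return kTargetMatMul; }
```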


@HectorSVC HectorSVC added the ep:QNN label (issues related to QNN execution provider) Aug 27, 2025
jywu-msft
jywu-msft previously approved these changes Sep 3, 2025

@minfhong-qti minfhong-qti left a comment


const NodeUnitIODef& per_channel_float_def = scale_dql_node_unit.Inputs()[0];

Should it be

const NodeUnitIODef& per_channel_float_def = scale_dql_node_unit.Inputs()[1]

instead?

@jywu-msft
Member

const NodeUnitIODef& per_channel_float_def = scale_dql_node_unit.Inputs()[0];

Should it be

const NodeUnitIODef& per_channel_float_def = scale_dql_node_unit.Inputs()[1]

instead?

@quic-tirupath ?

@quic-tirupath
Contributor Author

const NodeUnitIODef& per_channel_float_def = scale_dql_node_unit.Inputs()[0];

Should it be

const NodeUnitIODef& per_channel_float_def = scale_dql_node_unit.Inputs()[1]

instead?

Nope; this is correct. The DQL node unit contains only one input, input[0]; the other inputs (input[1], input[2]) are wrapped as the quantization params of input[0]. Here we first get the input[0] IODef and then read the per-channel float values from its quantization param's scale.
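The layout the author describes, where a DQL node unit exposes a single IODef whose quantization parameters carry the scale and zero point, can be modeled like this (simplified structs for illustration; the real NodeUnitIODef lives in onnxruntime and differs in detail):

```cpp
#include <optional>
#include <string>
#include <vector>

// Simplified model: the DQL's input[1] (scale) and input[2] (zero point)
// are folded into the quant_param of the single exposed input[0].
struct QuantParam {
  std::string scale_name;       // originally the DQL's input[1]
  std::string zero_point_name;  // originally the DQL's input[2]
};

struct IODef {
  std::string name;                       // the DQL's input[0]
  std::optional<QuantParam> quant_param;  // scale/zero-point wrapped here
};

struct DqlNodeUnit {
  std::vector<IODef> inputs;  // size 1: only input[0] is exposed
  const std::vector<IODef>& Inputs() const { return inputs; }
};
```

Under this model `Inputs()[0]` is the only valid index, and the per-channel float scales are reached through `Inputs()[0].quant_param`, which is why `Inputs()[1]` would be wrong here.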

quic-tirupath and others added 3 commits December 22, 2025 12:35
 - Case-2 LPBQ pattern omits the QuantizeLinear node in the LPBQ packing pattern
 - Modify LPBQ fusion logic in QNN EP implemented for Gemm and MatMul
   nodes to gracefully handle the optional QuantizeLinear node in the LPBQ
   packing pattern.
 - Add unit tests to verify Case-2 LPBQ pattern fusion for Gemm and
   MatMul nodes.
 - Fix review comments
 - Rebase the PR on tip and address the conflicts
@tirupath-qti tirupath-qti force-pushed the dev/tirupath/lpbq_case2_pattern_support branch from 7cf9b91 to b1041ce on December 22, 2025 20:37
@tirupath-qti
Contributor

Hi @edgchen1
Thanks for the review comments. I have responded to all of them and addressed many in a new commit added to this PR.

Could you please review and approve the latest version? Please also help trigger the CI jobs and merge after a successful review.

@edgchen1
Contributor

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).


@tirupath-qti tirupath-qti left a comment


Addressed the review comments in a new commit added to this PR.

@edgchen1
Contributor

edgchen1 commented Jan 8, 2026

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@edgchen1 edgchen1 merged commit c1da27c into microsoft:main Jan 9, 2026
89 checks passed
alex-spacemit pushed a commit to spacemit-com/onnxruntime that referenced this pull request Jan 20, 2026
…icrosoft#25865)

### Description

 - Case-2 LPBQ pattern omits QuantizeLinear node in LPBQ packing pattern
- Modify LPBQ fusion logic in QNN EP implemented for Gemm and MatMul
nodes to gracefully handle the optional QuantizeLinear node in LPBQ
packing pattern.
- Add unit tests to verify Case-2 LPBQ pattern fusion for Gemm and
MatMul nodes.



### Motivation and Context
- The QuantizeLinear node in the LowPowerBlockQuantization encoding packing
pattern can be optional: omitting it keeps the weights in an INT datatype,
which further reduces the size of the model.

---------

Co-authored-by: tirupath-qti <tirupath@qti.qualcomm.com>

Labels

ep:QNN issues related to QNN execution provider

7 participants