fix(Qualcomm): Replace linear op with conv2d in Qualcomm backend #600
chenghuaWang merged 1 commit into UbiquitousLearning:main from
Conversation
📝 Walkthrough

The changes add Conv2D support to the QNN AOT backend by replacing Linear layers with Conv2D in Qwen3 model components, implementing QNN AOT Conv2D visitor patterns and quantization recipes, updating LPBQ quantization handling for Conv2D tensors, and adding Python deployment conversion methods for HWIO layout preparation.

Changes
Sequence Diagram(s)

sequenceDiagram
participant Model as Qwen3 Model
participant Lowering as AOT Lowering Pass
participant Quant as Quantization Recipe
participant Visitor as Conv2D Visitor
participant QNN as QNN Backend
Model->>Lowering: Conv2D operations
activate Lowering
Lowering->>Lowering: Match Conv2DPattern
Lowering->>Quant: Annotated Conv2D ops
deactivate Lowering
activate Quant
Quant->>Quant: Apply LPBQ method
Quant->>Quant: Attach quantization annotations
Quant->>Visitor: Annotated Conv2D with quant_recipe
deactivate Quant
activate Visitor
Visitor->>Visitor: Validate Conv2D attributes
Visitor->>Visitor: Extract weights & bias
Visitor->>QNN: Create QNN Conv2d node
deactivate Visitor
QNN->>QNN: Register with stride & padding params
QNN-->>Model: QNN AOT executable
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
mllm/backends/qnn/aot/QnnWrappersAPI.cpp (1)
161-185: Add guard or separate handling for Linear LPBQ quantization.

The code at lines 161-185 is hardcoded for the Conv2D HWIO (4D) tensor layout, but Linear operations can also be assigned LPBQ quantization via LLMQuantRecipeLinearPattern::rewrite(). When a Linear weight tensor (2D) reaches setupComplexTensorQuantization(), it will crash attempting to access at<float>({0, 0, 0, i}) on a 2D tensor, or fail its assertion checks. Either:

- Add a type check in setupComplexTensorQuantization() to skip LPBQ handling for Linear operations, or
- Implement separate 2D tensor handling for Linear LPBQ with proper indexing, or
- Document and enforce that Linear LPBQ is not supported in the quantization recipe configuration
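As a sketch of the first option, a rank guard could be added at the top of the LPBQ branch. This follows the kLPBQ case quoted later in this review; the rank() accessor and the bool-returning control flow are assumptions, not the actual mllm API:

```cpp
// Hypothetical guard inside setupComplexTensorQuantization(); names follow the
// snippets quoted elsewhere in this review, exact signatures are assumptions.
case ir::linalg::QuantizationSpecType::kLPBQ: {
  // The code below indexes scales as {0, 0, 0, i} (HWIO), so only 4D Conv2D
  // weights are supported here; reject 2D Linear weights up front.
  if (v->tensor_.rank() != 4) {
    MLLM_ERROR("LPBQ handling here only supports 4D HWIO Conv2D weights, got rank {}",
               v->tensor_.rank());
    return false;  // assumption: the enclosing function reports failure via a bool
  }
  // ... existing Conv2D HWIO LPBQ handling ...
  break;
}
```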
🤖 Fix all issues with AI agents
In @mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp:
- Around line 256-258: The section header comment incorrectly reads "Sigmoid
Pattern" while the code implements the Conv2D pattern; update the comment to
accurately reflect the implementation (e.g., change the header to "Conv2D
Pattern" or "Conv2D Pattern / Conv2D lowering") so the comment matches the
implemented Conv2D pattern in this block.
- Around line 317-339: weight_quant_spec can be nullptr when method == "LPBQ"
but precision != "w4a16", causing nullptr to be stored/used; after the precision
branch check weight_quant_spec and bail out instead of proceeding: add a
validation like MLLM_RETURN_FALSE_IF_NOT(weight_quant_spec) (or return false/log
an error) before inserting into annotation_attr->annotation_.weights and before
calling t->setAttr(...). Apply the same fix in
LLMQuantRecipeLinearPattern::rewrite to prevent null propagation for unsupported
precisions.
In @mllm/backends/qnn/aot/QnnWrappersAPI.cpp:
- Around line 167-174: The code assumes 4D HWIO tensors (using
v->tensor_.size(-1) and cfg->scale_level_1_fp.at<float>({0,0,0,i})), which can
crash for lower-rank weights (e.g., 2D linear weights); add a defensive rank
check on cfg->scale_level_1_fp and v->tensor_ before the loop and either assert
the expected rank (==4) or branch to a fallback that reads the scale using only
the last-dimension index (i) so you safely access the scale regardless of
whether scale_level_1_fp is 4D or 1D; update the logic that builds scale_offsets
(the loop over num_scale_offsets and the use of scale_level_1_fp.at<float>(...))
accordingly and keep the existing assertion on cfg->scale_level_0_int.dtype()
intact.
In @mllm/backends/qnn/aot/visitor/Conv2D.cpp:
- Around line 22-26: The local variable names `linear_op` and `real_linear_op`
in Conv2D.cpp are misleading copy/paste artifacts; rename them to reflect Conv2D
(e.g., `conv_op` and `real_conv_op`) wherever `auto linear_op =
op->cast_<mllm::ir::linalg::Conv2DOp>()` and subsequent uses appear (including
checks, error messages like MLLM_ERROR, and any downstream references) so all
identifiers and messages consistently indicate Conv2D instead of Linear.
- Around line 43-48: The lookup of the weight symbol table can return null or an
outputs() vector with no elements, so before calling .front() and
cast_<ir::tensor::TensorValue>(), guard the chain: check that
writer.getContext()->lookupSymbolTable(base_op->getName() + ".weight") is
non-null and that its outputs() is non-empty; if either check fails, handle the
error (e.g., log via processLogger or throw a descriptive exception) and avoid
dereferencing, otherwise proceed to obtain the front() and cast_ as currently
written.
- Around line 36-41: The variable name real_linear_op is misleading for a Conv2D
runtime op; change it to real_conv2d_op where you perform the dynamic_cast from
base_op to mllm::aops::Conv2DOp (the block starting with auto base_op =
linear_op->getAOp(); and the dynamic_cast line) and update all subsequent
references (e.g., the usage referenced around line 58) to use real_conv2d_op so
the identifier correctly reflects the Conv2D type.
In @pymllm/backends/qualcomm/transformers/core/qlinear.py:
- Around line 265-266: The check uses a tensor buffer
self.weight_quant.is_frozen and should use its Python boolean value; replace the
direct tensor truthiness check with self.weight_quant.is_frozen.item() (e.g.,
change the condition in the block that calls freeze_weight() to use .item()) so
the if statement reliably reads the boolean and avoids tensor
truthiness/deprecation issues.
In @pymllm/backends/qualcomm/transformers/qwen3/runner.py:
- Line 48: The code unconditionally calls self.model.cuda(), which fails on
CPU-only systems; update the class/__init__ to accept a device parameter (e.g.,
device) or check torch.cuda.is_available() and choose "cuda" only when
available, then move the model with self.model.to(self.device) (or equivalent)
instead of self.model.cuda(); locate the placement call (self.model.cuda()) and
the constructor (the class __init__) to add the device logic and use
self.model.to(self.device).
🧹 Nitpick comments (5)
examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp (2)
115-120: Consider replacing macro with constexpr or inline function.

The CONV2D_PROPERTY macro works but has drawbacks in header files: macros are not namespace-scoped and can cause name collisions. Consider a safer alternative:

♻️ Suggested refactor using inline function

```diff
 using vi32 = std::vector<int32_t>;
-#define CONV2D_PROPERTY vi32{1, 1}, vi32{1, 1}, vi32{0, 0}, vi32{1, 1}, false, aops::Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G32
+
+// Conv2D properties for Linear replacement: kernel 1x1, stride 1x1, pad 0, dilation 1x1
+struct Conv2DProps {
+  static constexpr auto kernel = vi32{1, 1};
+  static constexpr auto stride = vi32{1, 1};
+  static constexpr auto padding = vi32{0, 0};
+  static constexpr auto dilation = vi32{1, 1};
+  static constexpr bool bias = false;
+  static constexpr auto impl_type = aops::Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G32;
+};
```

Then update registrations to use Conv2DProps::kernel, Conv2DProps::stride, ... or create a helper template/function.
410-432: Remove outdated comments or implement the suggested handling.

The exception throwing for missing KV caches is appropriate, but the comments at lines 415-417 and 428-430 suggest uncertainty:

// This might need adjustment based on your initialization logic

If the current implementation (throwing an exception) is the intended behavior, these comments should be removed. If further handling is needed, consider tracking this as a follow-up task.

♻️ Suggested cleanup

```diff
 } else {
-  // If KV cache doesn't exist, we need to handle this case
-  // For now, we'll create empty tensors or handle it appropriately
-  // This might need adjustment based on your initialization logic
   throw std::runtime_error("Missing KV cache for layer " + std::to_string(i));
 }
```

Apply similar cleanup for both key and value cache loops.
mllm/core/aops/Conv2DOp.hpp (1)
13-16: New enum values are missing from string conversion functions.

The new kQNN_LPBQ_w4a16o16_G32 and kQNN_LPBQ_w4a16o16_G64 enum values are not added to the str2Conv2DOpImplType (line 31) and Conv2DOpImplType2Str (line 41) mapping functions. This will cause them to silently fall back to kDefault during string conversion.

If these types need to be serialized/deserialized or logged, consider adding the mappings:

🔧 Suggested fix

```diff
 inline Conv2DOpImplType str2Conv2DOpImplType(const std::string& str) {
-  static const std::unordered_map<std::string, Conv2DOpImplType> map = {{"Default", Conv2DOpImplType::kDefault}};
+  static const std::unordered_map<std::string, Conv2DOpImplType> map = {
+      {"Default", Conv2DOpImplType::kDefault},
+      {"QNN_LPBQ_w4a16o16_G32", Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G32},
+      {"QNN_LPBQ_w4a16o16_G64", Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G64}
+  };
   auto it = map.find(str);
   if (it != map.end()) { return it->second; }
   // Return default if not found
   return Conv2DOpImplType::kDefault;
 }

 inline std::string Conv2DOpImplType2Str(Conv2DOpImplType type) {
-  static const std::unordered_map<Conv2DOpImplType, std::string> map = {{Conv2DOpImplType::kDefault, "Default"}};
+  static const std::unordered_map<Conv2DOpImplType, std::string> map = {
+      {Conv2DOpImplType::kDefault, "Default"},
+      {Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G32, "QNN_LPBQ_w4a16o16_G32"},
+      {Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G64, "QNN_LPBQ_w4a16o16_G64"}
+  };
   auto it = map.find(type);
   if (it != map.end()) return it->second;
   return "Default";
 }
```

mllm/core/aops/Conv2DOp.cpp (1)
114-117: Consider extracting the DSP impl_type check to avoid duplication.

The condition options_.impl_type == Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G32 || options_.impl_type == Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G64 is repeated twice (lines 80-81 and 114-115). Consider extracting this to a helper or local boolean for clarity and maintainability.

♻️ Suggested refactor

```diff
+  const bool is_dsp_layout = options_.impl_type == Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G32
+                             || options_.impl_type == Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G64;
+
   // CHECK if in Qualcomm DSP shape. Inputs is [N, H, W, C], Filter Weight is [N, H, In, Out]
-  if (options_.impl_type == Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G32
-      || options_.impl_type == Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G64) {
+  if (is_dsp_layout) {
     in_channels = ishape[3];
     in_height = ishape[1];
     in_width = ishape[2];
   }
   ...
   auto new_shape = std::vector<int32_t>{batch, out_channels, h_out, w_out};
-  if (options_.impl_type == Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G32
-      || options_.impl_type == Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G64) {
+  if (is_dsp_layout) {
     new_shape = std::vector<int32_t>{batch, h_out, w_out, out_channels};
   }
```

mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp (1)
264-349: Consider extracting shared quantization logic.

The LLMQuantRecipeConv2DPattern::rewrite implementation is nearly identical to LLMQuantRecipeLinearPattern::rewrite (lines 835-920). Both patterns share:
- Config lookup and regex matching logic
- LPBQ method handling with "w4a16" precision
- Weight tensor registration and quant spec assignment
Consider extracting a shared helper function to reduce code duplication and simplify future maintenance.
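For illustration only, the shared portion might be factored into a helper along these lines. The writer type and the op IR parameter are stand-ins (the exact mllm IR types are not shown in this review), and the annotation_attr bookkeeping is elided:

```cpp
// Hypothetical shared helper for LLMQuantRecipeLinearPattern::rewrite and
// LLMQuantRecipeConv2DPattern::rewrite. It applies the common LPBQ "w4a16"
// weight/output annotation; callers keep their own matching and config lookup.
// "ir::IRWriter" and "ir::op_ptr_t" are assumptions about the real signatures.
static bool annotateLpbqW4A16(ir::IRWriter& writer, const ir::op_ptr_t& op_ir, int32_t block_size) {
  // Weight: 4-bit LPBQ spec, same arguments as both existing rewrite() bodies.
  auto weight_quant_spec = ir::linalg::QuantizationSpecLPBQ::create(
      -8, 7, block_size, 0, 4, kUInt4, kFloat32, Tensor::nil(), Tensor::nil());

  // Output: asymmetric per-tensor uint16, as in both existing rewrite() bodies.
  auto out_quant_spec = ir::linalg::QuantizationSpecAsymPerTensor::create(
      0, 65536 - 1, kUInt16, kFloat32, kInt32, Tensor::nil(), Tensor::nil());
  op_ir->outputs().front()->setAttr("quant_recipe",
                                    writer.create<ir::linalg::LinalgIRQuantizatonSpecAttr>(out_quant_spec));

  // Attach the LPBQ spec to the op's registered weight tensor.
  auto weight_reg = writer.getContext()->lookupSymbolTable(op_ir->getAOp()->getName() + ".weight");
  MLLM_RETURN_FALSE_IF_NOT(weight_reg);
  auto t = weight_reg->outputs().front()->cast_<ir::tensor::TensorValue>();
  t->setAttr("quant_recipe", writer.create<ir::linalg::LinalgIRQuantizatonSpecAttr>(weight_quant_spec));
  return true;
}
```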
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp
- mllm/backends/qnn/QNNUtils.cpp
- mllm/backends/qnn/aot/QnnWrappersAPI.cpp
- mllm/backends/qnn/aot/passes/LLM2QnnLoweringPass.cpp
- mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp
- mllm/backends/qnn/aot/passes/LLMQuantRecipePass.hpp
- mllm/backends/qnn/aot/passes/PTQPass.cpp
- mllm/backends/qnn/aot/visitor/Conv2D.cpp
- mllm/backends/qnn/aot/visitor/Conv2D.hpp
- mllm/core/aops/Conv2DOp.cpp
- mllm/core/aops/Conv2DOp.hpp
- pymllm/backends/qualcomm/transformers/core/qlinear.py
- pymllm/backends/qualcomm/transformers/qwen3/runner.py
🧰 Additional context used
📓 Path-based instructions (5)
{mllm,mllm-cli,pymllm}/**/*
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*: Files must not contain C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, or DEL 0x7F. Horizontal tab (0x09) and line feed (0x0A) are explicitly allowed.
All files must be encoded in UTF-8 without BOM.
Any violation of character set (Rule 1) or encoding (Rule 2) requirements must cause the review to fail.
No line may end with trailing whitespace.
Use Unix line endings (LF).
File and directory names must consist only of printable Unicode characters, excluding C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, and DEL 0x7F.
Only use acceptable file extensions: .c, .cc, .cpp, .cxx, .h, .hh, .hpp, .py, .pyi, .sh, .txt, .md, .yml, .yaml, .json, .toml.
Optional license headers, if present, must comply with character set rules (no C0/C1 control codes except tab and line feed).
Files:
mllm/backends/qnn/QNNUtils.cpp, pymllm/backends/qualcomm/transformers/qwen3/runner.py, mllm/backends/qnn/aot/passes/LLMQuantRecipePass.hpp, mllm/core/aops/Conv2DOp.cpp, mllm/core/aops/Conv2DOp.hpp, mllm/backends/qnn/aot/visitor/Conv2D.hpp, mllm/backends/qnn/aot/passes/PTQPass.cpp, mllm/backends/qnn/aot/passes/LLM2QnnLoweringPass.cpp, mllm/backends/qnn/aot/visitor/Conv2D.cpp, pymllm/backends/qualcomm/transformers/core/qlinear.py, mllm/backends/qnn/aot/QnnWrappersAPI.cpp, mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}: TODO and FIXME comments must be written as 'TODO:' or 'FIXME:' followed by UTF-8 text that adheres to character set rules.
Encourage consistent coding style and patterns with the existing codebase.
Ensure code is portable across supported platforms (e.g., Linux, Windows) unless explicitly platform-specific.
Files:
mllm/backends/qnn/QNNUtils.cpp, pymllm/backends/qualcomm/transformers/qwen3/runner.py, mllm/backends/qnn/aot/passes/LLMQuantRecipePass.hpp, mllm/core/aops/Conv2DOp.cpp, mllm/core/aops/Conv2DOp.hpp, mllm/backends/qnn/aot/visitor/Conv2D.hpp, mllm/backends/qnn/aot/passes/PTQPass.cpp, mllm/backends/qnn/aot/passes/LLM2QnnLoweringPass.cpp, mllm/backends/qnn/aot/visitor/Conv2D.cpp, pymllm/backends/qualcomm/transformers/core/qlinear.py, mllm/backends/qnn/aot/QnnWrappersAPI.cpp, mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}: Prioritize production-ready code quality by evaluating time and space complexity of algorithms and data structures, and suggest more efficient alternatives for operations with high complexity (e.g., O(n^2) or worse) when feasible.
Avoid unnecessary object creation in loops or hot paths.
Check for proper error handling and resource cleanup (e.g., using try-finally, context managers, or RAII).
Ensure functions that can fail return appropriate error codes or raise exceptions.
Validate inputs for public APIs and critical internal functions.
Add comments for complex algorithms or non-obvious logic.
Identify potential security issues (e.g., buffer overflows, injection risks, insecure temporary files) and recommend using secure alternatives (e.g., parameterized queries, secure random generators).
Suggest adding unit tests for untested complex logic or edge cases.
Ensure code is testable by avoiding global state and using dependency injection.
Flag overly complex functions (e.g., high cyclomatic complexity) and suggest breaking them down.
Use named constants instead of magic numbers.
Add appropriate logging (e.g., debug, info, warning, error) for significant events and errors, avoiding sensitive data exposure.
Files:
mllm/backends/qnn/QNNUtils.cpp, pymllm/backends/qualcomm/transformers/qwen3/runner.py, mllm/core/aops/Conv2DOp.cpp, mllm/backends/qnn/aot/passes/PTQPass.cpp, mllm/backends/qnn/aot/passes/LLM2QnnLoweringPass.cpp, mllm/backends/qnn/aot/visitor/Conv2D.cpp, pymllm/backends/qualcomm/transformers/core/qlinear.py, mllm/backends/qnn/aot/QnnWrappersAPI.cpp, mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}: Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.
Adhere to language-specific best practices and idioms (e.g., PEP 8 for Python, Google C++ Style Guide for C++).
Files:
mllm/backends/qnn/QNNUtils.cpp, pymllm/backends/qualcomm/transformers/qwen3/runner.py, mllm/backends/qnn/aot/passes/LLMQuantRecipePass.hpp, mllm/core/aops/Conv2DOp.cpp, mllm/core/aops/Conv2DOp.hpp, mllm/backends/qnn/aot/visitor/Conv2D.hpp, mllm/backends/qnn/aot/passes/PTQPass.cpp, mllm/backends/qnn/aot/passes/LLM2QnnLoweringPass.cpp, mllm/backends/qnn/aot/visitor/Conv2D.cpp, pymllm/backends/qualcomm/transformers/core/qlinear.py, mllm/backends/qnn/aot/QnnWrappersAPI.cpp, mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp
{mllm,mllm-cli,pymllm}/**/*.{sh,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
If a file starts with a shebang, it must be the first line and contain only the interpreter path and optional single argument, encoded in UTF-8.
Files:
pymllm/backends/qualcomm/transformers/qwen3/runner.py, pymllm/backends/qualcomm/transformers/core/qlinear.py
🧬 Code graph analysis (5)
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.hpp (1)
mllm/backends/qnn/aot/visitor/Conv2D.hpp (2)
op (14-14), writer (16-16)
mllm/core/aops/Conv2DOp.cpp (2)
mllm/core/Tensor.cpp (2)
empty (78-82), empty (78-78)
mllm/core/aops/Conv2DOp.hpp (1)
options_(68-68)
examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp (1)
mllm/nn/Module.hpp (8)
name (34-34), name (36-36), name (38-38), name (190-190), reg (83-115), reg (83-83), args (148-170), args (148-148)
mllm/backends/qnn/aot/visitor/Conv2D.cpp (3)
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp (32)
isMatch (259-262), isMatch (259-259), isMatch (354-357), isMatch (354-354), isMatch (366-369), isMatch (366-366), isMatch (379-382), isMatch (379-379), isMatch (391-394), isMatch (391-391), isMatch (437-440), isMatch (437-437), isMatch (465-468), isMatch (465-465), isMatch (501-504), isMatch (501-501), rewrite (264-349), rewrite (264-264), rewrite (359-361), rewrite (359-359), rewrite (371-374), rewrite (371-371), rewrite (384-386), rewrite (384-384), rewrite (396-432), rewrite (396-396), rewrite (442-460), rewrite (442-442), rewrite (470-496), rewrite (470-470), rewrite (506-508), rewrite (506-506)
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.hpp (32)
op (40-40), op (54-54), op (68-68), op (80-80), op (94-94), op (106-106), op (120-120), op (134-134), op (146-146), op (158-158), op (170-170), op (184-184), op (198-198), op (212-212), op (226-226), op (240-240), writer (42-42), writer (56-56), writer (70-70), writer (82-82), writer (96-96), writer (108-108), writer (122-122), writer (136-136), writer (148-148), writer (160-160), writer (172-172), writer (186-186), writer (200-200), writer (214-214), writer (228-228), writer (242-242)
mllm/backends/qnn/QNNUtils.cpp (4)
create (449-467), create (449-450), create (549-553), create (549-551)
pymllm/backends/qualcomm/transformers/core/qlinear.py (1)
pymllm/backends/qualcomm/transformers/core/rms_norm.py (1)
freeze_weight(102-116)
🪛 Clang (14.0.6)
mllm/backends/qnn/QNNUtils.cpp
[error] 1-1: 'cstdint' file not found
(clang-diagnostic-error)
mllm/backends/qnn/aot/visitor/Conv2D.hpp
[error] 6-6: 'mllm/core/OpTypes.hpp' file not found
(clang-diagnostic-error)
mllm/backends/qnn/aot/visitor/Conv2D.cpp
[error] 4-4: 'mllm/backends/qnn/QNNUtils.hpp' file not found
(clang-diagnostic-error)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-macos
- GitHub Check: build-android
- GitHub Check: build-x86
🔇 Additional comments (18)
mllm/backends/qnn/QNNUtils.cpp (1)
1-8: LGTM! Explicit header inclusion improves code clarity.

Making dependencies explicit by directly including standard library headers (<cstdint>, <memory>, <dlfcn.h>, <cstring>) is a C++ best practice. This eliminates reliance on transitive includes and improves compilation robustness across different environments.

The static analysis error "'cstdint' file not found" is a false positive from the analyzer's environment configuration; <cstdint> is a standard C++11 header available in all modern compilers.

examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp (5)
122-158: LGTM!

The Linear to Conv2D replacement in Qwen3MLP is correctly implemented. The view operations properly reshape tensors from [B, S, D] to [B, 1, S, D] format for Conv2D operations and back to 3D for subsequent computations.

161-201: LGTM!

The Conv2D replacements for the attention projections maintain correct input/output dimensions and use consistent configuration via the CONV2D_PROPERTY macro.

211-279: LGTM!

The tensor reshaping in the attention forward pass correctly handles the Conv2D format requirements. The flow properly maintains shape compatibility through the attention computation pipeline.

393-397: LGTM!

The lm_head_ Conv2D registration is consistent with the other projection replacements.

463-466: Verify output shape after lm_head_ operation.

The lm_head_ Conv2D operation produces a 4D output [1, 1, S, vocab_size], but unlike all other Conv2D operations in this file (up_proj_, gate_proj_, down_proj_, o_proj_), there is no .view() to reshape it back to 3D before the trace ends. Confirm whether this 4D output is intentional for the final model output, or if a reshape is needed.

mllm/backends/qnn/aot/passes/LLMQuantRecipePass.hpp (1)
35-47: LGTM!

The LLMQuantRecipeConv2DPattern class follows the established pattern structure used by other pattern classes in this file (LLMQuantRecipeSigmoidPattern, LLMQuantRecipeLinearPattern, etc.) with consistent isMatch, rewrite, and create method signatures.

pymllm/backends/qualcomm/transformers/core/qlinear.py (2)
114-157: LGTM! Well-structured HWIO conversion for W8A16 Per-Channel.

The implementation correctly:

- Guards against double conversion via the deploy_mode check
- Ensures quantization parameters are frozen before conversion
- Applies the correct transformation: [Out, In] → transpose → [In, Out] → reshape → [1, 1, In, Out]
- Reshapes scale/zero_point to [1, 1, 1, Out] for proper broadcasting in HWIO format

284-299: LPBQ scale transformations look correct for HWIO layout.

The scale handling properly transforms:

- Scale2 (per-channel): [Out, 1, 1] → [1, 1, 1, Out]
- Scale1 (per-block): [Out, Blocks, 1] → [1, 1, Blocks, Out]

This maintains the semantic correspondence between weight blocks and their scales after the transpose operation.
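For reference, the layout changes described above can be reproduced with a few standalone PyTorch lines; the shapes and variable names are illustrative and do not refer to the actual qlinear.py attributes:

```python
import torch

out_features, in_features, blocks = 8, 16, 4

# Linear weight [Out, In] -> Conv2D HWIO weight [1, 1, In, Out]
w = torch.randn(out_features, in_features)
w_hwio = w.t().reshape(1, 1, in_features, out_features)

# Per-channel scale / zero_point [Out] -> [1, 1, 1, Out] for HWIO broadcasting
scale = torch.rand(out_features)
scale_hwio = scale.reshape(1, 1, 1, out_features)

# LPBQ per-channel scale2 [Out, 1, 1] -> [1, 1, 1, Out]
scale2 = torch.rand(out_features, 1, 1)
scale2_hwio = scale2.reshape(1, 1, 1, out_features)

# LPBQ per-block scale1 [Out, Blocks, 1] -> [1, 1, Blocks, Out]
scale1 = torch.rand(out_features, blocks, 1)
scale1_hwio = scale1.squeeze(-1).t().reshape(1, 1, blocks, out_features)

print(w_hwio.shape, scale_hwio.shape, scale2_hwio.shape, scale1_hwio.shape)
# torch.Size([1, 1, 16, 8]) torch.Size([1, 1, 1, 8]) torch.Size([1, 1, 1, 8]) torch.Size([1, 1, 4, 8])
```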
pymllm/backends/qualcomm/transformers/qwen3/runner.py (1)
35-38: LGTM! Correctly routes linear layers to HWIO deployment.

The convert_weight function now properly directs QLinearLPBQ and QLinearW8A16_PerChannelSym to use the new convert_to_conv2d_deploy_hwio() method while keeping QRMSNorm on the standard convert_to_deploy() path.
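A minimal sketch of what that routing can look like; the import paths and the exact structure of convert_weight are assumptions based on the files touched in this PR, not a copy of runner.py:

```python
# Hypothetical dispatch mirroring the routing described above.
from pymllm.backends.qualcomm.transformers.core.qlinear import (
    QLinearLPBQ,
    QLinearW8A16_PerChannelSym,
)
from pymllm.backends.qualcomm.transformers.core.rms_norm import QRMSNorm


def convert_weight(model):
    """Switch every quantized submodule into its deployment layout."""
    for module in model.modules():
        if isinstance(module, (QLinearLPBQ, QLinearW8A16_PerChannelSym)):
            # Linear layers become 1x1 Conv2D weights in HWIO layout.
            module.convert_to_conv2d_deploy_hwio()
        elif isinstance(module, QRMSNorm):
            # Norm layers keep the standard deployment conversion.
            module.convert_to_deploy()
```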
mllm/backends/qnn/aot/passes/PTQPass.cpp (1)

125-128: LGTM - Conv2D weight solving correctly reuses Linear's logic.

The approach of reusing solveLinearWeight for Conv2D operations is appropriate since both operations share the same LPBQ weight quantization handling. The inline comment clarifies the intent.

mllm/backends/qnn/aot/passes/LLM2QnnLoweringPass.cpp (2)
15-15: LGTM - Conv2D visitor include added correctly.

The include placement follows the existing pattern of other visitor headers.

38-42: LGTM - Conv2D pattern registration looks correct.

The QnnAOTConv2DPattern is properly added to the pattern registration list, consistent with how other patterns are registered.

mllm/backends/qnn/aot/visitor/Conv2D.hpp (1)
12-21: LGTM - Pattern class declaration follows existing conventions.

The QnnAOTConv2DPattern class properly extends QnnAOTBasePattern and follows the same structure as other AOT patterns in the codebase. The static create() factory method returns the expected pair<OpTypes, shared_ptr> type.

Note: The static analysis error about missing mllm/core/OpTypes.hpp is a false positive due to the sandbox environment not having the complete build system.

mllm/core/aops/Conv2DOp.cpp (1)
79-85: LGTM - DSP/LPBQ input shape reinterpretation is correct.

The NHWC layout handling correctly reinterprets the input dimensions for Qualcomm DSP paths. The comment clearly documents the expected layout transformation.
mllm/backends/qnn/aot/visitor/Conv2D.cpp (1)
70-95: Stride and padding are hardcoded instead of using Conv2D options.

The stride is hardcoded to [1, 1] and the padding to zeros, ignoring the actual values from real_linear_op->options().stride and real_linear_op->options().padding. If this is intentional for the Linear→Conv2D replacement use case (where these values are always fixed), please add a comment explaining this. Otherwise, this should use the actual options values.

Is the hardcoding intentional because Conv2D is only used as a replacement for Linear ops (where stride=1 and padding=0)?

If the hardcoding is intentional, consider adding a comment:

// Stride is always [1,1] for Linear->Conv2D replacement pattern

If not intentional, consider using the actual options:

```diff
 {
   auto stride_param = QNNParamTensorWrapper::create("stride", base_op->getName() + ".stride", QNN_DATATYPE_UINT_32,
                                                     std::vector<uint32_t>{2});
   uint32_t* data = static_cast<uint32_t*>(stride_param->alloc());
-  data[0] = 1;
-  data[1] = 1;
+  const auto& stride = real_conv2d_op->options().stride;
+  data[0] = static_cast<uint32_t>(stride[0]);
+  data[1] = static_cast<uint32_t>(stride[1]);
   qnn_op_node->emplaceParamTensor(stride_param);
 }
```
267-268: Verify config key: Conv2D pattern uses "linear" config path.

The Conv2D pattern fetches configuration from ["quant_recipe"]["builtin_llm_pass"]["linear"], which is the same path used by the Linear pattern. Is this intentional to share configuration, or should Conv2D have its own config key (e.g., "conv2d")?
1089-1089: LGTM!

The Conv2D pattern registration follows the existing conventions with appropriate naming and priority.
```cpp
//===----------------------------------------------------------------------===//
// Sigmoid Pattern
//===----------------------------------------------------------------------===//
```
Incorrect section header comment.
The comment says "Sigmoid Pattern" but this section implements the Conv2D pattern.
📝 Suggested fix
```diff
 //===----------------------------------------------------------------------===//
-// Sigmoid Pattern
+// Conv2D Pattern
 //===----------------------------------------------------------------------===//
 bool LLMQuantRecipeConv2DPattern::isMatch(const mllm::ir::op_ptr_t& op) {
```
🤖 Prompt for AI Agents
In @mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp around lines 256 - 258,
The section header comment incorrectly reads "Sigmoid Pattern" while the code
implements the Conv2D pattern; update the comment to accurately reflect the
implementation (e.g., change the header to "Conv2D Pattern" or "Conv2D Pattern /
Conv2D lowering") so the comment matches the implemented Conv2D pattern in this
block.
```cpp
ir::linalg::QuantizationSpecLPBQ::ptr_t weight_quant_spec = nullptr;

if (precision == "w4a16") {
  weight_quant_spec =
      ir::linalg::QuantizationSpecLPBQ::create(-8, 7, block_size, 0, 4, kUInt4, kFloat32, Tensor::nil(), Tensor::nil());

  // output sym int16
  auto out_quant_spec = ir::linalg::QuantizationSpecAsymPerTensor::create(0, 65536 - 1, kUInt16, kFloat32, kInt32,
                                                                          Tensor::nil(), Tensor::nil());
  conv2d_ir->outputs().front()->setAttr("quant_recipe",
                                        writer.create<ir::linalg::LinalgIRQuantizatonSpecAttr>(out_quant_spec));

  annotation_attr->annotation_.outputs.emplace_back(out_quant_spec);
  annotation_attr->annotation_.weights.insert({"weight", weight_quant_spec});
}

auto weight_name = conv2d_ir->getAOp()->getName() + ".weight";
auto weight_reg_tensor_ir = writer.getContext()->lookupSymbolTable(weight_name);
MLLM_RETURN_FALSE_IF_NOT(weight_reg_tensor_ir);
MLLM_RETURN_FALSE_IF_NOT(weight_reg_tensor_ir->isa_<ir::tensor::RegisterOp>());
MLLM_RETURN_FALSE_IF_NOT(weight_reg_tensor_ir->outputs().front()->isa_<ir::tensor::TensorValue>());
auto t = weight_reg_tensor_ir->outputs().front()->cast_<ir::tensor::TensorValue>();
t->setAttr("quant_recipe", writer.create<ir::linalg::LinalgIRQuantizatonSpecAttr>(weight_quant_spec));
```
Potential null pointer propagation for unsupported precisions.
If method is "LPBQ" but precision is not "w4a16", weight_quant_spec remains nullptr. This leads to:
- Line 330: Inserting nullptr into annotation_.weights
- Line 339: Setting a nullptr quant recipe attribute on the weight tensor
This may cause null dereference issues downstream when the quant recipe is accessed.
🐛 Suggested fix: Add handling for unsupported precisions
if (precision == "w4a16") {
weight_quant_spec =
ir::linalg::QuantizationSpecLPBQ::create(-8, 7, block_size, 0, 4, kUInt4, kFloat32, Tensor::nil(), Tensor::nil());
// output sym int16
auto out_quant_spec = ir::linalg::QuantizationSpecAsymPerTensor::create(0, 65536 - 1, kUInt16, kFloat32, kInt32,
Tensor::nil(), Tensor::nil());
conv2d_ir->outputs().front()->setAttr("quant_recipe",
writer.create<ir::linalg::LinalgIRQuantizatonSpecAttr>(out_quant_spec));
annotation_attr->annotation_.outputs.emplace_back(out_quant_spec);
annotation_attr->annotation_.weights.insert({"weight", weight_quant_spec});
+ } else {
+ MLLM_WARN("Currently not support precision: {} for LPBQ method", precision);
+ return false;
}
auto weight_name = conv2d_ir->getAOp()->getName() + ".weight";
auto weight_reg_tensor_ir = writer.getContext()->lookupSymbolTable(weight_name);
MLLM_RETURN_FALSE_IF_NOT(weight_reg_tensor_ir);
MLLM_RETURN_FALSE_IF_NOT(weight_reg_tensor_ir->isa_<ir::tensor::RegisterOp>());
MLLM_RETURN_FALSE_IF_NOT(weight_reg_tensor_ir->outputs().front()->isa_<ir::tensor::TensorValue>());
auto t = weight_reg_tensor_ir->outputs().front()->cast_<ir::tensor::TensorValue>();
- t->setAttr("quant_recipe", writer.create<ir::linalg::LinalgIRQuantizatonSpecAttr>(weight_quant_spec));
+ if (weight_quant_spec) {
+ t->setAttr("quant_recipe", writer.create<ir::linalg::LinalgIRQuantizatonSpecAttr>(weight_quant_spec));
+ }
} else {Note: The existing LLMQuantRecipeLinearPattern::rewrite (lines 888-910) has the same issue with unsupported precision handling.
🤖 Prompt for AI Agents
In @mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp around lines 317 - 339,
weight_quant_spec can be nullptr when method == "LPBQ" but precision != "w4a16",
causing nullptr to be stored/used; after the precision branch check
weight_quant_spec and bail out instead of proceeding: add a validation like
MLLM_RETURN_FALSE_IF_NOT(weight_quant_spec) (or return false/log an error)
before inserting into annotation_attr->annotation_.weights and before calling
t->setAttr(...). Apply the same fix in LLMQuantRecipeLinearPattern::rewrite to
prevent null propagation for unsupported precisions.
```diff
 auto num_scale_offsets = (uint32_t)v->tensor_.size(-1);
 std::vector<Qnn_ScaleOffset_t> scale_offsets(num_scale_offsets);
-MLLM_RT_ASSERT_EQ(num_scale_offsets, cfg->scale_level_1_fp.size(0));
+MLLM_RT_ASSERT_EQ(num_scale_offsets, cfg->scale_level_1_fp.size(-1));
 MLLM_RT_ASSERT_EQ(cfg->scale_level_0_int.dtype(), kUInt8);
 for (int i = 0; i < num_scale_offsets; ++i) {
-  scale_offsets[i].scale = cfg->scale_level_1_fp.at<float>({i, 0, 0});
+  scale_offsets[i].scale = cfg->scale_level_1_fp.at<float>({0, 0, 0, i});
   scale_offsets[i].offset = 0;
 }
```
Potential runtime error with tensor dimension mismatch.
The scale access at line 172 uses 4D indexing {0, 0, 0, i}, which assumes scale_level_1_fp has been reshaped to [1, 1, 1, Out] (HWIO format). Similarly, line 167 uses size(-1) on v->tensor_ expecting 4D HWIO layout.
If any LPBQ-quantized tensor reaches this code with a different rank (e.g., 2D for Linear weights), this could cause a runtime crash or incorrect behavior.
Consider adding a rank assertion to catch mismatches early:
🛡️ Suggested defensive check
```diff
 case ir::linalg::QuantizationSpecType::kLPBQ: {
   // This LPBQ Type is for Conv2D Only !!! Linear has diff layout cmp with conv2d
+  MLLM_RT_ASSERT_EQ(v->tensor_.rank(), 4);  // Expect HWIO [1, 1, In, Out]
   auto cfg = std::static_pointer_cast<ir::linalg::QuantizationSpecLPBQ>(quant_spec);
```

🤖 Prompt for AI Agents
In @mllm/backends/qnn/aot/QnnWrappersAPI.cpp around lines 167 - 174, The code
assumes 4D HWIO tensors (using v->tensor_.size(-1) and
cfg->scale_level_1_fp.at<float>({0,0,0,i})), which can crash for lower-rank
weights (e.g., 2D linear weights); add a defensive rank check on
cfg->scale_level_1_fp and v->tensor_ before the loop and either assert the
expected rank (==4) or branch to a fallback that reads the scale using only the
last-dimension index (i) so you safely access the scale regardless of whether
scale_level_1_fp is 4D or 1D; update the logic that builds scale_offsets (the
loop over num_scale_offsets and the use of scale_level_1_fp.at<float>(...))
accordingly and keep the existing assertion on cfg->scale_level_0_int.dtype()
intact.
```cpp
auto linear_op = op->cast_<mllm::ir::linalg::Conv2DOp>();
if (!linear_op) {
  MLLM_ERROR("Failed to cast to linalg::Conv2DOp");
  return false;
}
```
🛠️ Refactor suggestion | 🟠 Major
Fix misleading variable names (copy/paste artifact).
The variables linear_op and later real_linear_op are misleadingly named since this is handling Conv2D operations, not Linear. This appears to be a copy/paste artifact from the Linear pattern implementation.
♻️ Proposed fix
```diff
-auto linear_op = op->cast_<mllm::ir::linalg::Conv2DOp>();
-if (!linear_op) {
+auto conv2d_op = op->cast_<mllm::ir::linalg::Conv2DOp>();
+if (!conv2d_op) {
   MLLM_ERROR("Failed to cast to linalg::Conv2DOp");
   return false;
 }
```

🤖 Prompt for AI Agents
In @mllm/backends/qnn/aot/visitor/Conv2D.cpp around lines 22 - 26, The local
variable names `linear_op` and `real_linear_op` in Conv2D.cpp are misleading
copy/paste artifacts; rename them to reflect Conv2D (e.g., `conv_op` and
`real_conv_op`) wherever `auto linear_op =
op->cast_<mllm::ir::linalg::Conv2DOp>()` and subsequent uses appear (including
checks, error messages like MLLM_ERROR, and any downstream references) so all
identifiers and messages consistently indicate Conv2D instead of Linear.
```cpp
auto base_op = linear_op->getAOp();
auto real_linear_op = dynamic_cast<mllm::aops::Conv2DOp*>(base_op);
if (!real_linear_op) {
  MLLM_ERROR("Failed to cast BaseOp to mllm::aops::Conv2DOp");
  return false;
}
```
🛠️ Refactor suggestion | 🟠 Major
Fix misleading variable name for the runtime Conv2D op.
Consistent with the previous comment, real_linear_op should be renamed to reflect that it's a Conv2D operation.
♻️ Proposed fix
```diff
-auto base_op = linear_op->getAOp();
-auto real_linear_op = dynamic_cast<mllm::aops::Conv2DOp*>(base_op);
-if (!real_linear_op) {
+auto base_op = conv2d_op->getAOp();
+auto real_conv2d_op = dynamic_cast<mllm::aops::Conv2DOp*>(base_op);
+if (!real_conv2d_op) {
   MLLM_ERROR("Failed to cast BaseOp to mllm::aops::Conv2DOp");
   return false;
 }
```

And update line 58:

```diff
-if (real_linear_op->options().bias) {
+if (real_conv2d_op->options().bias) {
```

🤖 Prompt for AI Agents
In @mllm/backends/qnn/aot/visitor/Conv2D.cpp around lines 36 - 41, The variable
name real_linear_op is misleading for a Conv2D runtime op; change it to
real_conv2d_op where you perform the dynamic_cast from base_op to
mllm::aops::Conv2DOp (the block starting with auto base_op =
linear_op->getAOp(); and the dynamic_cast line) and update all subsequent
references (e.g., the usage referenced around line 58) to use real_conv2d_op so
the identifier correctly reflects the Conv2D type.
```cpp
// Retrieve weight from symbol table
auto weight_val = writer.getContext()
                      ->lookupSymbolTable(base_op->getName() + ".weight")
                      ->outputs()
                      .front()
                      ->cast_<ir::tensor::TensorValue>();
```
Add null check for weight symbol table lookup.
The lookupSymbolTable call chain could return null or an empty outputs vector. Consider adding validation before accessing .front().
🐛 Proposed fix
```diff
+auto weight_symbol = writer.getContext()->lookupSymbolTable(base_op->getName() + ".weight");
+if (!weight_symbol) {
+  MLLM_ERROR("Failed to find weight symbol for {}", base_op->getName());
+  return false;
+}
 auto weight_val = writer.getContext()
                       ->lookupSymbolTable(base_op->getName() + ".weight")
                       ->outputs()
                       .front()
                       ->cast_<ir::tensor::TensorValue>();
```
In @mllm/backends/qnn/aot/visitor/Conv2D.cpp around lines 43 - 48, The lookup of
the weight symbol table can return null or an outputs() vector with no elements,
so before calling .front() and cast_<ir::tensor::TensorValue>(), guard the
chain: check that writer.getContext()->lookupSymbolTable(base_op->getName() +
".weight") is non-null and that its outputs() is non-empty; if either check
fails, handle the error (e.g., log via processLogger or throw a descriptive
exception) and avoid dereferencing, otherwise proceed to obtain the front() and
cast_ as currently written.
```python
if not self.weight_quant.is_frozen:
    self.freeze_weight()
```
Use .item() to check tensor boolean value.
self.weight_quant.is_frozen is registered as a tensor buffer (line 169: self.register_buffer("is_frozen", torch.tensor(False))). Using if not self.weight_quant.is_frozen: on a tensor can produce unexpected behavior or deprecation warnings.
🔧 Suggested fix
```diff
-if not self.weight_quant.is_frozen:
+if not self.weight_quant.is_frozen.item():
     self.freeze_weight()
```
🤖 Prompt for AI Agents
In @pymllm/backends/qualcomm/transformers/core/qlinear.py around lines 265 -
266, The check uses a tensor buffer self.weight_quant.is_frozen and should use
its Python boolean value; replace the direct tensor truthiness check with
self.weight_quant.is_frozen.item() (e.g., change the condition in the block that
calls freeze_weight() to use .item()) so the if statement reliably reads the
boolean and avoids tensor truthiness/deprecation issues.
```python
    model_path,
    attn_implementation="eager",
)
self.model.cuda()
```
Unconditional CUDA placement may break CPU-only environments.
The self.model.cuda() call assumes CUDA is available. This will raise a runtime error on systems without a GPU or CUDA installation.
Consider adding a CUDA availability check or making device placement configurable:
🔧 Suggested fix with CUDA check
```diff
-self.model.cuda()
+if torch.cuda.is_available():
+    self.model.cuda()
+else:
+    print("Warning: CUDA not available, running on CPU")
```

Alternatively, add a device parameter to __init__:
```python
def __init__(self, model_path: str, mllm_qualcomm_max_length=2048, device="cuda"):
    # ...
    self.device = device if device != "cuda" or torch.cuda.is_available() else "cpu"
    self.model.to(self.device)
```

🤖 Prompt for AI Agents
In @pymllm/backends/qualcomm/transformers/qwen3/runner.py at line 48, The code
unconditionally calls self.model.cuda(), which fails on CPU-only systems; update
the class/__init__ to accept a device parameter (e.g., device) or check
torch.cuda.is_available() and choose "cuda" only when available, then move the
model with self.model.to(self.device) (or equivalent) instead of
self.model.cuda(); locate the placement call (self.model.cuda()) and the
constructor (the class __init__) to add the device logic and use
self.model.to(self.device).
Summary by CodeRabbit
New Features
Improvements
✏️ Tip: You can customize this high-level summary in your review settings.