
fix(Qualcomm): Replace linear op with conv2d in Qualcomm backend #600

Merged
chenghuaWang merged 1 commit into UbiquitousLearning:main from chenghuaWang:wch-main
Jan 12, 2026

Conversation

@chenghuaWang (Collaborator) commented Jan 12, 2026

Summary by CodeRabbit

  • New Features

    • Conv2D operations now supported in QNN backend for enhanced model execution
    • Conv2D quantization support with LPBQ (Low Precision Block Quantization) method
    • Model deployment optimization with HWIO format for improved inference efficiency
  • Improvements

    • Qwen3 model architecture updated with Conv2D-based layers for optimized performance
    • GPU acceleration added during model quantization initialization


@coderabbitai (Bot, Contributor) commented Jan 12, 2026

📝 Walkthrough

The changes add Conv2D support to the QNN AOT backend by replacing Linear layers with Conv2D in Qwen3 model components, implementing QNN AOT Conv2D visitor patterns and quantization recipes, updating LPBQ quantization handling for Conv2D tensors, and adding Python deployment conversion methods for HWIO layout preparation.
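For intuition, the replacement works because a 1x1 Conv2D computes the same matmul as a Linear layer once the [B, S, D] activations are reshaped to 4D. A minimal PyTorch sketch of that equivalence (illustrative shapes only; PyTorch's default NCHW layout is used here, whereas the QNN path in this PR works in NHWC/HWIO):

```python
import torch

B, S, D_in, D_out = 2, 5, 8, 16
lin = torch.nn.Linear(D_in, D_out, bias=False)

# Conv2d weight is [Out, In, kH, kW]; a 1x1 kernel reuses the Linear weight [Out, In]
conv = torch.nn.Conv2d(D_in, D_out, kernel_size=1, bias=False)
conv.weight.data = lin.weight.data.reshape(D_out, D_in, 1, 1)

x = torch.randn(B, S, D_in)
y_lin = lin(x)                       # [B, S, D_out]
# NCHW reshape: [B, S, D] -> [B, D, 1, S], convolve, then back to [B, S, D_out]
y_conv = conv(x.transpose(1, 2).unsqueeze(2)).squeeze(2).transpose(1, 2)
assert torch.allclose(y_lin, y_conv, atol=1e-5)
```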

Changes

Cohort / File(s) Summary
Qwen3 Model Definition
examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp
Replaced Linear layers with Conv2D in Qwen3MLP, Qwen3Attention, and Qwen3ForCausalLM; added reshaping steps and KV cache validation; introduced CONV2D_PROPERTY macro and vi32 type alias
QNN AOT Conv2D Visitor
mllm/backends/qnn/aot/visitor/Conv2D.hpp, mllm/backends/qnn/aot/visitor/Conv2D.cpp
Introduced QnnAOTConv2DPattern class with isMatch and rewrite methods; implements linalg Conv2DOp to QNN Conv2d conversion with stride, padding, weight, and bias handling
QNN AOT Lowering Pass
mllm/backends/qnn/aot/passes/LLM2QnnLoweringPass.cpp
Registered QnnAOTConv2DPattern in lowering pass pattern list to enable Conv2D matching during AOT compilation
QNN Quantization Recipe Pass
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.hpp, mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp
Added LLMQuantRecipeConv2DPattern class supporting LPBQ method for weight and output quantization annotation on Conv2D operations
QNN PTQ Weight Solving
mllm/backends/qnn/aot/passes/PTQPass.cpp
Extended PTQ weight solving to handle Conv2DOp by reusing LinearOp logic
QNN LPBQ Quantization
mllm/backends/qnn/aot/QnnWrappersAPI.cpp
Updated LPBQ handling in setupComplexTensorQuantization to use last tensor axis and adjusted scale indexing from channel-based to position-based for Conv2D-specific layout
QNN Utils
mllm/backends/qnn/QNNUtils.cpp
Cleaned up include statements, added standard library headers and removed redundant mllm.hpp dependency
Core Conv2D Operation
mllm/core/aops/Conv2DOp.hpp, mllm/core/aops/Conv2DOp.cpp
Added Conv2DOpImplType variants kQNN_LPBQ_w4a16o16_G32/G64; updated shape handling for Qualcomm DSP-specific implementations with NHWC format reinterpretation
Python Quantized Linear Converters
pymllm/backends/qualcomm/transformers/core/qlinear.py
Added convert_to_conv2d_deploy_hwio() methods to QLinearW8A16_PerChannelSym and QLinearLPBQ for HWIO [1,1,In,Out] deployment layout conversion
Qwen3 Runner
pymllm/backends/qualcomm/transformers/qwen3/runner.py
Updated weight conversion logic to call convert_to_conv2d_deploy_hwio() for quantized linear layers; added CUDA placement during quantizer initialization

Sequence Diagram(s)

sequenceDiagram
    participant Model as Qwen3 Model
    participant Lowering as AOT Lowering Pass
    participant Quant as Quantization Recipe
    participant Visitor as Conv2D Visitor
    participant QNN as QNN Backend
    
    Model->>Lowering: Conv2D operations
    activate Lowering
    Lowering->>Lowering: Match Conv2DPattern
    Lowering->>Quant: Annotated Conv2D ops
    deactivate Lowering
    
    activate Quant
    Quant->>Quant: Apply LPBQ method
    Quant->>Quant: Attach quantization annotations
    Quant->>Visitor: Annotated Conv2D with quant_recipe
    deactivate Quant
    
    activate Visitor
    Visitor->>Visitor: Validate Conv2D attributes
    Visitor->>Visitor: Extract weights & bias
    Visitor->>QNN: Create QNN Conv2d node
    deactivate Visitor
    
    QNN->>QNN: Register with stride & padding params
    QNN-->>Model: QNN AOT executable

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • PR #577: Adds and registers LLM2QnnLoweringPass pattern infrastructure that the Conv2D pattern now extends
  • PR #595: Modifies LPBQ Conv2D quantization paths and reshape handling in QnnWrappersAPI with related tensor layout adjustments
  • PR #562: Introduces foundational QNN AOT Conv2D visitor and pattern support that this PR builds upon

Suggested reviewers

  • oreomaker
  • liang1232018

🐰 Linear layers hop, but Conv2D's the hop-hop-hop,
LPBQ quantization makes the bitrate drop,
HWIO layout reshapes with a flip and a flop,
Qwen3's now convolved—the compilation won't stop!

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)
  • Description check ⚠️ Warning: the PR description is empty, with no explanation of the changes, rationale, or testing performed. Resolution: add a description explaining why linear ops are replaced with Conv2D, any performance impact, testing details, and migration notes for users.
  • Docstring Coverage ⚠️ Warning: docstring coverage is 15.38%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (1 passed)
  • Title check ✅ Passed: the title clearly summarizes the main change, replacing linear operations with Conv2D in the Qualcomm backend.



@UbiquitousLearning (Owner) left a comment


LGTM

@coderabbitai (Bot, Contributor) left a comment


Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
mllm/backends/qnn/aot/QnnWrappersAPI.cpp (1)

161-185: Add guard or separate handling for Linear LPBQ quantization.

The code at lines 161-185 is hardcoded for Conv2D HWIO (4D) tensor layout, but Linear operations can also be assigned LPBQ quantization via LLMQuantRecipeLinearPattern::rewrite(). When a Linear weight tensor (2D) reaches setupComplexTensorQuantization(), it will crash attempting to access at<float>({0, 0, 0, i}) on a 2D tensor, or hang if assertion checks fail.

Either:

  1. Add a type check in setupComplexTensorQuantization() to skip LPBQ handling for Linear operations, or
  2. Implement separate 2D tensor handling for Linear LPBQ with proper indexing, or
  3. Document and enforce that Linear LPBQ is not supported in the quantization recipe configuration
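For readers unfamiliar with LPBQ, a rough sketch of the two-level scale scheme this code manipulates (block size, shapes, and variable names below are illustrative assumptions, not the repo's exact layout): each block of 4-bit weight codes shares an integer level-0 scale, which is in turn multiplied by a floating-point level-1 scale, mirroring the scale_level_0_int / scale_level_1_fp pair referenced above.

```python
import numpy as np

# Illustrative LPBQ dequantization for a single channel; sizes are assumptions.
block_size, n_blocks = 32, 4
q4 = np.random.randint(-8, 8, size=n_blocks * block_size)     # 4-bit codes in [-8, 7]
scale_level_0_int = np.random.randint(1, 256, size=n_blocks)  # per-block integer scales (uint8)
scale_level_1_fp = np.float32(0.003)                          # per-channel float scale

# Effective scale of an element: its block's int scale times the channel's float scale
per_elem_scale = np.repeat(scale_level_0_int.astype(np.float32), block_size)
w_dequant = q4 * per_elem_scale * scale_level_1_fp            # shape (128,)
```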
🧹 Nitpick comments (5)
examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp (2)

115-120: Consider replacing macro with constexpr or inline function.

The CONV2D_PROPERTY macro works but has drawbacks in header files: macros are not namespace-scoped and can cause name collisions. Consider a safer alternative:

♻️ Suggested refactor using a struct of constants
 using vi32 = std::vector<int32_t>;
-#define CONV2D_PROPERTY vi32{1, 1}, vi32{1, 1}, vi32{0, 0}, vi32{1, 1}, false, aops::Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G32
+
+// Conv2D properties for Linear replacement: kernel 1x1, stride 1x1, pad 0, dilation 1x1
+struct Conv2DProps {
+  // std::vector cannot be a constexpr static member, so use inline const instead
+  inline static const vi32 kernel{1, 1};
+  inline static const vi32 stride{1, 1};
+  inline static const vi32 padding{0, 0};
+  inline static const vi32 dilation{1, 1};
+  static constexpr bool bias = false;
+  static constexpr auto impl_type = aops::Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G32;
+};

Then update registrations to use Conv2DProps::kernel, Conv2DProps::stride, ... or create a helper template/function.


410-432: Remove outdated comments or implement the suggested handling.

The exception throwing for missing KV caches is appropriate, but the comments at lines 415-417 and 428-430 suggest uncertainty:

// This might need adjustment based on your initialization logic

If the current implementation (throwing an exception) is the intended behavior, these comments should be removed. If further handling is needed, consider tracking this as a follow-up task.

♻️ Suggested cleanup
       } else {
-        // If KV cache doesn't exist, we need to handle this case
-        // For now, we'll create empty tensors or handle it appropriately
-        // This might need adjustment based on your initialization logic
         throw std::runtime_error("Missing KV cache for layer " + std::to_string(i));
       }

Apply similar cleanup for both key and value cache loops.

mllm/core/aops/Conv2DOp.hpp (1)

13-16: New enum values are missing from string conversion functions.

The new kQNN_LPBQ_w4a16o16_G32 and kQNN_LPBQ_w4a16o16_G64 enum values are not added to the str2Conv2DOpImplType (line 31) and Conv2DOpImplType2Str (line 41) mapping functions. This will cause them to silently fall back to kDefault during string conversion.

If these types need to be serialized/deserialized or logged, consider adding the mappings:

🔧 Suggested fix
 inline Conv2DOpImplType str2Conv2DOpImplType(const std::string& str) {
-  static const std::unordered_map<std::string, Conv2DOpImplType> map = {{"Default", Conv2DOpImplType::kDefault}};
+  static const std::unordered_map<std::string, Conv2DOpImplType> map = {
+      {"Default", Conv2DOpImplType::kDefault},
+      {"QNN_LPBQ_w4a16o16_G32", Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G32},
+      {"QNN_LPBQ_w4a16o16_G64", Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G64}
+  };

   auto it = map.find(str);
   if (it != map.end()) { return it->second; }

   // Return default if not found
   return Conv2DOpImplType::kDefault;
 }

 inline std::string Conv2DOpImplType2Str(Conv2DOpImplType type) {
-  static const std::unordered_map<Conv2DOpImplType, std::string> map = {{Conv2DOpImplType::kDefault, "Default"}};
+  static const std::unordered_map<Conv2DOpImplType, std::string> map = {
+      {Conv2DOpImplType::kDefault, "Default"},
+      {Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G32, "QNN_LPBQ_w4a16o16_G32"},
+      {Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G64, "QNN_LPBQ_w4a16o16_G64"}
+  };

   auto it = map.find(type);
   if (it != map.end()) return it->second;
   return "Default";
 }
mllm/core/aops/Conv2DOp.cpp (1)

114-117: Consider extracting the DSP impl_type check to avoid duplication.

The condition options_.impl_type == Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G32 || options_.impl_type == Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G64 is repeated twice (lines 80-81 and 114-115). Consider extracting this to a helper or local boolean for clarity and maintainability.

♻️ Suggested refactor
+  const bool is_dsp_layout = options_.impl_type == Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G32
+                          || options_.impl_type == Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G64;
+
   // CHECK if in Qualcomm DSP shape. Inputs is [N, H, W, C], Filter Weight is [N, H, In, Out]
-  if (options_.impl_type == Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G32
-      || options_.impl_type == Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G64) {
+  if (is_dsp_layout) {
     in_channels = ishape[3];
     in_height = ishape[1];
     in_width = ishape[2];
   }
   ...
   auto new_shape = std::vector<int32_t>{batch, out_channels, h_out, w_out};

-  if (options_.impl_type == Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G32
-      || options_.impl_type == Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G64) {
+  if (is_dsp_layout) {
     new_shape = std::vector<int32_t>{batch, h_out, w_out, out_channels};
   }
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp (1)

264-349: Consider extracting shared quantization logic.

The LLMQuantRecipeConv2DPattern::rewrite implementation is nearly identical to LLMQuantRecipeLinearPattern::rewrite (lines 835-920). Both patterns share:

  • Config lookup and regex matching logic
  • LPBQ method handling with "w4a16" precision
  • Weight tensor registration and quant spec assignment

Consider extracting a shared helper function to reduce code duplication and simplify future maintenance.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR, between c91c54a and 877571c.

📒 Files selected for processing (13)
  • examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp
  • mllm/backends/qnn/QNNUtils.cpp
  • mllm/backends/qnn/aot/QnnWrappersAPI.cpp
  • mllm/backends/qnn/aot/passes/LLM2QnnLoweringPass.cpp
  • mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp
  • mllm/backends/qnn/aot/passes/LLMQuantRecipePass.hpp
  • mllm/backends/qnn/aot/passes/PTQPass.cpp
  • mllm/backends/qnn/aot/visitor/Conv2D.cpp
  • mllm/backends/qnn/aot/visitor/Conv2D.hpp
  • mllm/core/aops/Conv2DOp.cpp
  • mllm/core/aops/Conv2DOp.hpp
  • pymllm/backends/qualcomm/transformers/core/qlinear.py
  • pymllm/backends/qualcomm/transformers/qwen3/runner.py
🧰 Additional context used
📓 Path-based instructions (5)
{mllm,mllm-cli,pymllm}/**/*

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*: Files must not contain C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, or DEL 0x7F. Horizontal tab (0x09) and line feed (0x0A) are explicitly allowed.
All files must be encoded in UTF-8 without BOM.
Any violation of character set (Rule 1) or encoding (Rule 2) requirements must cause the review to fail.
No line may end with trailing whitespace.
Use Unix line endings (LF).
File and directory names must consist only of printable Unicode characters, excluding C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, and DEL 0x7F.
Only use acceptable file extensions: .c, .cc, .cpp, .cxx, .h, .hh, .hpp, .py, .pyi, .sh, .txt, .md, .yml, .yaml, .json, .toml.
Optional license headers, if present, must comply with character set rules (no C0/C1 control codes except tab and line feed).

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}: TODO and FIXME comments must be written as 'TODO:' or 'FIXME:' followed by UTF-8 text that adheres to character set rules.
Encourage consistent coding style and patterns with the existing codebase.
Ensure code is portable across supported platforms (e.g., Linux, Windows) unless explicitly platform-specific.

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}: Prioritize production-ready code quality by evaluating time and space complexity of algorithms and data structures, and suggest more efficient alternatives for operations with high complexity (e.g., O(n^2) or worse) when feasible.
Avoid unnecessary object creation in loops or hot paths.
Check for proper error handling and resource cleanup (e.g., using try-finally, context managers, or RAII).
Ensure functions that can fail return appropriate error codes or raise exceptions.
Validate inputs for public APIs and critical internal functions.
Add comments for complex algorithms or non-obvious logic.
Identify potential security issues (e.g., buffer overflows, injection risks, insecure temporary files) and recommend using secure alternatives (e.g., parameterized queries, secure random generators).
Suggest adding unit tests for untested complex logic or edge cases.
Ensure code is testable by avoiding global state and using dependency injection.
Flag overly complex functions (e.g., high cyclomatic complexity) and suggest breaking them down.
Use named constants instead of magic numbers.
Add appropriate logging (e.g., debug, info, warning, error) for significant events and errors, avoiding sensitive data exposure.

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}: Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.
Adhere to language-specific best practices and idioms (e.g., PEP 8 for Python, Google C++ Style Guide for C++).

{mllm,mllm-cli,pymllm}/**/*.{sh,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

If a file starts with a shebang, it must be the first line and contain only the interpreter path and optional single argument, encoded in UTF-8.

🪛 Clang (14.0.6)
mllm/backends/qnn/QNNUtils.cpp

[error] 1-1: 'cstdint' file not found

(clang-diagnostic-error)

mllm/backends/qnn/aot/visitor/Conv2D.hpp

[error] 6-6: 'mllm/core/OpTypes.hpp' file not found

(clang-diagnostic-error)

mllm/backends/qnn/aot/visitor/Conv2D.cpp

[error] 4-4: 'mllm/backends/qnn/QNNUtils.hpp' file not found

(clang-diagnostic-error)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-macos
  • GitHub Check: build-android
  • GitHub Check: build-x86
🔇 Additional comments (18)
mllm/backends/qnn/QNNUtils.cpp (1)

1-8: LGTM! Explicit header inclusion improves code clarity.

Making dependencies explicit by directly including standard library headers (<cstdint>, <memory>, <dlfcn.h>, <cstring>) is a C++ best practice. This eliminates reliance on transitive includes and improves compilation robustness across different environments.

The static analysis error "'cstdint' file not found" is a false positive from the analyzer's environment configuration—<cstdint> is a standard C++11 header available in all modern compilers.

examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp (5)

122-158: LGTM!

The Linear to Conv2D replacement in Qwen3MLP is correctly implemented. The view operations properly reshape tensors from [B, S, D] to [B, 1, S, D] format for Conv2D operations and back to 3D for subsequent computations.


161-201: LGTM!

The Conv2D replacements for the attention projections maintain correct input/output dimensions and use consistent configuration via the CONV2D_PROPERTY macro.


211-279: LGTM!

The tensor reshaping in the attention forward pass correctly handles the Conv2D format requirements. The flow properly maintains shape compatibility through the attention computation pipeline.


393-397: LGTM!

The lm_head_ Conv2D registration is consistent with the other projection replacements.


463-466: Verify output shape after lm_head_ operation.

The lm_head_ Conv2D operation produces a 4D output [1, 1, S, vocab_size], but unlike all other Conv2D operations in this file (up_proj_, gate_proj_, down_proj_, o_proj_), there is no .view() to reshape it back to 3D before the trace ends. Confirm whether this 4D output is intentional for the final model output, or if a reshape is needed.

mllm/backends/qnn/aot/passes/LLMQuantRecipePass.hpp (1)

35-47: LGTM!

The LLMQuantRecipeConv2DPattern class follows the established pattern structure used by other pattern classes in this file (LLMQuantRecipeSigmoidPattern, LLMQuantRecipeLinearPattern, etc.) with consistent isMatch, rewrite, and create method signatures.

pymllm/backends/qualcomm/transformers/core/qlinear.py (2)

114-157: LGTM! Well-structured HWIO conversion for W8A16 Per-Channel.

The implementation correctly:

  1. Guards against double conversion via deploy_mode check
  2. Ensures quantization parameters are frozen before conversion
  3. Applies the correct transformation: [Out, In] → transpose → [In, Out] → reshape → [1, 1, In, Out]
  4. Reshapes scale/zero_point to [1, 1, 1, Out] for proper broadcasting in HWIO format

284-299: LPBQ scale transformations look correct for HWIO layout.

The scale handling properly transforms:

  • Scale2 (per-channel): [Out, 1, 1] → [1, 1, 1, Out]
  • Scale1 (per-block): [Out, Blocks, 1] → [1, 1, Blocks, Out]

This maintains the semantic correspondence between weight blocks and their scales after the transpose operation.
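As a standalone illustration of the layout change described in both comments above (shapes are arbitrary; this is not the qlinear.py implementation), a PyTorch sketch:

```python
import torch

out_f, in_f = 16, 8
w = torch.randn(out_f, in_f)                            # Linear weight [Out, In]
w_hwio = w.transpose(0, 1).reshape(1, 1, in_f, out_f)   # HWIO [1, 1, In, Out]

scale = torch.rand(out_f)                               # per-output-channel scale
scale_hwio = scale.reshape(1, 1, 1, out_f)              # broadcasts over the In axis

# Per-channel scaling commutes with the transpose + reshape, so the HWIO
# weight/scale pair stays consistent with the original [Out, In] pair.
ref = (w * scale.unsqueeze(1)).transpose(0, 1).reshape(1, 1, in_f, out_f)
assert torch.allclose(w_hwio * scale_hwio, ref)
```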

pymllm/backends/qualcomm/transformers/qwen3/runner.py (1)

35-38: LGTM! Correctly routes linear layers to HWIO deployment.

The convert_weight function now properly directs QLinearLPBQ and QLinearW8A16_PerChannelSym to use the new convert_to_conv2d_deploy_hwio() method while keeping QRMSNorm on the standard convert_to_deploy() path.

mllm/backends/qnn/aot/passes/PTQPass.cpp (1)

125-128: LGTM - Conv2D weight solving correctly reuses Linear's logic.

The approach of reusing solveLinearWeight for Conv2D operations is appropriate since both operations share the same LPBQ weight quantization handling. The inline comment clarifies the intent.

mllm/backends/qnn/aot/passes/LLM2QnnLoweringPass.cpp (2)

15-15: LGTM - Conv2D visitor include added correctly.

The include placement follows the existing pattern of other visitor headers.


38-42: LGTM - Conv2D pattern registration looks correct.

The QnnAOTConv2DPattern is properly added to the pattern registration list, consistent with how other patterns are registered.

mllm/backends/qnn/aot/visitor/Conv2D.hpp (1)

12-21: LGTM - Pattern class declaration follows existing conventions.

The QnnAOTConv2DPattern class properly extends QnnAOTBasePattern and follows the same structure as other AOT patterns in the codebase. The static create() factory method returns the expected pair<OpTypes, shared_ptr> type.

Note: The static analysis error about missing mllm/core/OpTypes.hpp is a false positive due to the sandbox environment not having the complete build system.

mllm/core/aops/Conv2DOp.cpp (1)

79-85: LGTM - DSP/LPBQ input shape reinterpretation is correct.

The NHWC layout handling correctly reinterprets the input dimensions for Qualcomm DSP paths. The comment clearly documents the expected layout transformation.
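As a sanity check of that reinterpretation, the standard convolution output-shape formula applied to a hypothetical NHWC input (dimensions below are made up for illustration) reproduces the [batch, h_out, w_out, out_channels] new_shape the code builds:

```python
# Hypothetical NHWC input and the 1x1 settings used for the Linear replacement.
N, H, W, C = 1, 1, 128, 2048
kernel, stride, pad, dil = (1, 1), (1, 1), (0, 0), (1, 1)
out_channels = 3072

h_out = (H + 2 * pad[0] - dil[0] * (kernel[0] - 1) - 1) // stride[0] + 1
w_out = (W + 2 * pad[1] - dil[1] * (kernel[1] - 1) - 1) // stride[1] + 1
print([N, h_out, w_out, out_channels])  # [1, 1, 128, 3072] in NHWC
```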

mllm/backends/qnn/aot/visitor/Conv2D.cpp (1)

70-95: Stride and padding are hardcoded instead of using Conv2D options.

The stride is hardcoded to [1, 1] and padding to zeros, ignoring the actual values from real_linear_op->options().stride and real_linear_op->options().padding. If this is intentional for the Linear→Conv2D replacement use case (where these values are always fixed), please add a comment explaining this. Otherwise, this should use the actual options values.

Is the hardcoding intentional because Conv2D is only used as a replacement for Linear ops (where stride=1 and padding=0)?

If the hardcoding is intentional, consider adding a comment:

// Stride is always [1,1] for Linear->Conv2D replacement pattern

If not intentional, consider using the actual options:

   {
     auto stride_param =
         QNNParamTensorWrapper::create("stride", base_op->getName() + ".stride", QNN_DATATYPE_UINT_32, std::vector<uint32_t>{2});
     uint32_t* data = static_cast<uint32_t*>(stride_param->alloc());
-    data[0] = 1;
-    data[1] = 1;
+    const auto& stride = real_conv2d_op->options().stride;
+    data[0] = static_cast<uint32_t>(stride[0]);
+    data[1] = static_cast<uint32_t>(stride[1]);
     qnn_op_node->emplaceParamTensor(stride_param);
   }
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp (2)

267-268: Verify config key: Conv2D pattern uses "linear" config path.

The Conv2D pattern fetches configuration from ["quant_recipe"]["builtin_llm_pass"]["linear"], which is the same path used by the Linear pattern. Is this intentional to share configuration, or should Conv2D have its own config key (e.g., "conv2d")?


1089-1089: LGTM!

The Conv2D pattern registration follows the existing conventions with appropriate naming and priority.

Comment on lines +256 to +258
//===----------------------------------------------------------------------===//
// Sigmoid Pattern
//===----------------------------------------------------------------------===//

⚠️ Potential issue | 🟡 Minor

Incorrect section header comment.

The comment says "Sigmoid Pattern" but this section implements the Conv2D pattern.

📝 Suggested fix
 //===----------------------------------------------------------------------===//
-// Sigmoid Pattern
+// Conv2D Pattern
 //===----------------------------------------------------------------------===//
 bool LLMQuantRecipeConv2DPattern::isMatch(const mllm::ir::op_ptr_t& op) {
🤖 Prompt for AI Agents
In @mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp around lines 256 - 258,
The section header comment incorrectly reads "Sigmoid Pattern" while the code
implements the Conv2D pattern; update the comment to accurately reflect the
implementation (e.g., change the header to "Conv2D Pattern" or "Conv2D Pattern /
Conv2D lowering") so the comment matches the implemented Conv2D pattern in this
block.

Comment on lines +317 to +339
ir::linalg::QuantizationSpecLPBQ::ptr_t weight_quant_spec = nullptr;

if (precision == "w4a16") {
  weight_quant_spec =
      ir::linalg::QuantizationSpecLPBQ::create(-8, 7, block_size, 0, 4, kUInt4, kFloat32, Tensor::nil(), Tensor::nil());

  // output sym int16
  auto out_quant_spec = ir::linalg::QuantizationSpecAsymPerTensor::create(0, 65536 - 1, kUInt16, kFloat32, kInt32,
                                                                          Tensor::nil(), Tensor::nil());
  conv2d_ir->outputs().front()->setAttr("quant_recipe",
                                        writer.create<ir::linalg::LinalgIRQuantizatonSpecAttr>(out_quant_spec));

  annotation_attr->annotation_.outputs.emplace_back(out_quant_spec);
  annotation_attr->annotation_.weights.insert({"weight", weight_quant_spec});
}

auto weight_name = conv2d_ir->getAOp()->getName() + ".weight";
auto weight_reg_tensor_ir = writer.getContext()->lookupSymbolTable(weight_name);
MLLM_RETURN_FALSE_IF_NOT(weight_reg_tensor_ir);
MLLM_RETURN_FALSE_IF_NOT(weight_reg_tensor_ir->isa_<ir::tensor::RegisterOp>());
MLLM_RETURN_FALSE_IF_NOT(weight_reg_tensor_ir->outputs().front()->isa_<ir::tensor::TensorValue>());
auto t = weight_reg_tensor_ir->outputs().front()->cast_<ir::tensor::TensorValue>();
t->setAttr("quant_recipe", writer.create<ir::linalg::LinalgIRQuantizatonSpecAttr>(weight_quant_spec));

⚠️ Potential issue | 🟠 Major

Potential null pointer propagation for unsupported precisions.

If method is "LPBQ" but precision is not "w4a16", weight_quant_spec remains nullptr. This leads to:

  1. Line 330: Inserting nullptr into annotation_.weights
  2. Line 339: Setting a nullptr quant recipe attribute on the weight tensor

This may cause null dereference issues downstream when the quant recipe is accessed.

🐛 Suggested fix: Add handling for unsupported precisions
       if (precision == "w4a16") {
         weight_quant_spec =
             ir::linalg::QuantizationSpecLPBQ::create(-8, 7, block_size, 0, 4, kUInt4, kFloat32, Tensor::nil(), Tensor::nil());

         // output sym int16
         auto out_quant_spec = ir::linalg::QuantizationSpecAsymPerTensor::create(0, 65536 - 1, kUInt16, kFloat32, kInt32,
                                                                                 Tensor::nil(), Tensor::nil());
         conv2d_ir->outputs().front()->setAttr("quant_recipe",
                                               writer.create<ir::linalg::LinalgIRQuantizatonSpecAttr>(out_quant_spec));

         annotation_attr->annotation_.outputs.emplace_back(out_quant_spec);
         annotation_attr->annotation_.weights.insert({"weight", weight_quant_spec});
+      } else {
+        MLLM_WARN("Currently not support precision: {} for LPBQ method", precision);
+        return false;
       }

       auto weight_name = conv2d_ir->getAOp()->getName() + ".weight";
       auto weight_reg_tensor_ir = writer.getContext()->lookupSymbolTable(weight_name);
       MLLM_RETURN_FALSE_IF_NOT(weight_reg_tensor_ir);
       MLLM_RETURN_FALSE_IF_NOT(weight_reg_tensor_ir->isa_<ir::tensor::RegisterOp>());
       MLLM_RETURN_FALSE_IF_NOT(weight_reg_tensor_ir->outputs().front()->isa_<ir::tensor::TensorValue>());
       auto t = weight_reg_tensor_ir->outputs().front()->cast_<ir::tensor::TensorValue>();
-      t->setAttr("quant_recipe", writer.create<ir::linalg::LinalgIRQuantizatonSpecAttr>(weight_quant_spec));
+      if (weight_quant_spec) {
+        t->setAttr("quant_recipe", writer.create<ir::linalg::LinalgIRQuantizatonSpecAttr>(weight_quant_spec));
+      }
     } else {

Note: The existing LLMQuantRecipeLinearPattern::rewrite (lines 888-910) has the same issue with unsupported precision handling.

🤖 Prompt for AI Agents
In @mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp around lines 317 - 339,
weight_quant_spec can be nullptr when method == "LPBQ" but precision != "w4a16",
causing nullptr to be stored/used; after the precision branch check
weight_quant_spec and bail out instead of proceeding: add a validation like
MLLM_RETURN_FALSE_IF_NOT(weight_quant_spec) (or return false/log an error)
before inserting into annotation_attr->annotation_.weights and before calling
t->setAttr(...). Apply the same fix in LLMQuantRecipeLinearPattern::rewrite to
prevent null propagation for unsupported precisions.

Comment on lines +167 to 174
   auto num_scale_offsets = (uint32_t)v->tensor_.size(-1);
   std::vector<Qnn_ScaleOffset_t> scale_offsets(num_scale_offsets);
-  MLLM_RT_ASSERT_EQ(num_scale_offsets, cfg->scale_level_1_fp.size(0));
+  MLLM_RT_ASSERT_EQ(num_scale_offsets, cfg->scale_level_1_fp.size(-1));
   MLLM_RT_ASSERT_EQ(cfg->scale_level_0_int.dtype(), kUInt8);
   for (int i = 0; i < num_scale_offsets; ++i) {
-    scale_offsets[i].scale = cfg->scale_level_1_fp.at<float>({i, 0, 0});
+    scale_offsets[i].scale = cfg->scale_level_1_fp.at<float>({0, 0, 0, i});
     scale_offsets[i].offset = 0;
   }

⚠️ Potential issue | 🟡 Minor

Potential runtime error with tensor dimension mismatch.

The scale access at line 172 uses 4D indexing {0, 0, 0, i}, which assumes scale_level_1_fp has been reshaped to [1, 1, 1, Out] (HWIO format). Similarly, line 167 uses size(-1) on v->tensor_ expecting 4D HWIO layout.

If any LPBQ-quantized tensor reaches this code with a different rank (e.g., 2D for Linear weights), this could cause a runtime crash or incorrect behavior.

Consider adding a rank assertion to catch mismatches early:

🛡️ Suggested defensive check
     case ir::linalg::QuantizationSpecType::kLPBQ: {
       // This LPBQ Type is for Conv2D Only !!! Linear has diff layout cmp with conv2d
+      MLLM_RT_ASSERT_EQ(v->tensor_.rank(), 4);  // Expect HWIO [1, 1, In, Out]

       auto cfg = std::static_pointer_cast<ir::linalg::QuantizationSpecLPBQ>(quant_spec);
🤖 Prompt for AI Agents
In @mllm/backends/qnn/aot/QnnWrappersAPI.cpp around lines 167 - 174, The code
assumes 4D HWIO tensors (using v->tensor_.size(-1) and
cfg->scale_level_1_fp.at<float>({0,0,0,i})), which can crash for lower-rank
weights (e.g., 2D linear weights); add a defensive rank check on
cfg->scale_level_1_fp and v->tensor_ before the loop and either assert the
expected rank (==4) or branch to a fallback that reads the scale using only the
last-dimension index (i) so you safely access the scale regardless of whether
scale_level_1_fp is 4D or 1D; update the logic that builds scale_offsets (the
loop over num_scale_offsets and the use of scale_level_1_fp.at<float>(...))
accordingly and keep the existing assertion on cfg->scale_level_0_int.dtype()
intact.

Comment on lines +22 to +26
auto linear_op = op->cast_<mllm::ir::linalg::Conv2DOp>();
if (!linear_op) {
  MLLM_ERROR("Failed to cast to linalg::Conv2DOp");
  return false;
}

🛠️ Refactor suggestion | 🟠 Major

Fix misleading variable names (copy/paste artifact).

The variables linear_op and later real_linear_op are misleadingly named since this is handling Conv2D operations, not Linear. This appears to be a copy/paste artifact from the Linear pattern implementation.

♻️ Proposed fix
-  auto linear_op = op->cast_<mllm::ir::linalg::Conv2DOp>();
-  if (!linear_op) {
+  auto conv2d_op = op->cast_<mllm::ir::linalg::Conv2DOp>();
+  if (!conv2d_op) {
     MLLM_ERROR("Failed to cast to linalg::Conv2DOp");
     return false;
   }
🤖 Prompt for AI Agents
In @mllm/backends/qnn/aot/visitor/Conv2D.cpp around lines 22 - 26, The local
variable names `linear_op` and `real_linear_op` in Conv2D.cpp are misleading
copy/paste artifacts; rename them to reflect Conv2D (e.g., `conv_op` and
`real_conv_op`) wherever `auto linear_op =
op->cast_<mllm::ir::linalg::Conv2DOp>()` and subsequent uses appear (including
checks, error messages like MLLM_ERROR, and any downstream references) so all
identifiers and messages consistently indicate Conv2D instead of Linear.

Comment on lines +36 to +41
auto base_op = linear_op->getAOp();
auto real_linear_op = dynamic_cast<mllm::aops::Conv2DOp*>(base_op);
if (!real_linear_op) {
  MLLM_ERROR("Failed to cast BaseOp to mllm::aops::Conv2DOp");
  return false;
}

🛠️ Refactor suggestion | 🟠 Major

Fix misleading variable name for the runtime Conv2D op.

Consistent with the previous comment, real_linear_op should be renamed to reflect that it's a Conv2D operation.

♻️ Proposed fix
-  auto base_op = linear_op->getAOp();
-  auto real_linear_op = dynamic_cast<mllm::aops::Conv2DOp*>(base_op);
-  if (!real_linear_op) {
+  auto base_op = conv2d_op->getAOp();
+  auto real_conv2d_op = dynamic_cast<mllm::aops::Conv2DOp*>(base_op);
+  if (!real_conv2d_op) {
     MLLM_ERROR("Failed to cast BaseOp to mllm::aops::Conv2DOp");
     return false;
   }

And update line 58:

-  if (real_linear_op->options().bias) {
+  if (real_conv2d_op->options().bias) {
🤖 Prompt for AI Agents
In @mllm/backends/qnn/aot/visitor/Conv2D.cpp around lines 36 - 41, The variable
name real_linear_op is misleading for a Conv2D runtime op; change it to
real_conv2d_op where you perform the dynamic_cast from base_op to
mllm::aops::Conv2DOp (the block starting with auto base_op =
linear_op->getAOp(); and the dynamic_cast line) and update all subsequent
references (e.g., the usage referenced around line 58) to use real_conv2d_op so
the identifier correctly reflects the Conv2D type.

Comment on lines +43 to +48
// Retrieve weight from symbol table
auto weight_val = writer.getContext()
                      ->lookupSymbolTable(base_op->getName() + ".weight")
                      ->outputs()
                      .front()
                      ->cast_<ir::tensor::TensorValue>();

⚠️ Potential issue | 🟡 Minor

Add null check for weight symbol table lookup.

The lookupSymbolTable call chain could return null or an empty outputs vector. Consider adding validation before accessing .front().

🐛 Proposed fix
+  auto weight_symbol = writer.getContext()->lookupSymbolTable(base_op->getName() + ".weight");
+  if (!weight_symbol) {
+    MLLM_ERROR("Failed to find weight symbol for {}", base_op->getName());
+    return false;
+  }
   auto weight_val = writer.getContext()
                         ->lookupSymbolTable(base_op->getName() + ".weight")
                         ->outputs()
                         .front()
                         ->cast_<ir::tensor::TensorValue>();
🤖 Prompt for AI Agents
In @mllm/backends/qnn/aot/visitor/Conv2D.cpp around lines 43 - 48, The lookup of
the weight symbol table can return null or an outputs() vector with no elements,
so before calling .front() and cast_<ir::tensor::TensorValue>(), guard the
chain: check that writer.getContext()->lookupSymbolTable(base_op->getName() +
".weight") is non-null and that its outputs() is non-empty; if either check
fails, handle the error (e.g., log via processLogger or throw a descriptive
exception) and avoid dereferencing, otherwise proceed to obtain the front() and
cast_ as currently written.

Comment on lines +265 to +266
if not self.weight_quant.is_frozen:
    self.freeze_weight()

⚠️ Potential issue | 🟡 Minor

Use .item() to check tensor boolean value.

self.weight_quant.is_frozen is registered as a tensor buffer (line 169: self.register_buffer("is_frozen", torch.tensor(False))). Using if not self.weight_quant.is_frozen: on a tensor can produce unexpected behavior or deprecation warnings.

🔧 Suggested fix
-        if not self.weight_quant.is_frozen:
+        if not self.weight_quant.is_frozen.item():
             self.freeze_weight()
🤖 Prompt for AI Agents
In @pymllm/backends/qualcomm/transformers/core/qlinear.py around lines 265 -
266, The check uses a tensor buffer self.weight_quant.is_frozen and should use
its Python boolean value; replace the direct tensor truthiness check with
self.weight_quant.is_frozen.item() (e.g., change the condition in the block that
calls freeze_weight() to use .item()) so the if statement reliably reads the
boolean and avoids tensor truthiness/deprecation issues.

    model_path,
    attn_implementation="eager",
)
self.model.cuda()

⚠️ Potential issue | 🟡 Minor

Unconditional CUDA placement may break CPU-only environments.

The self.model.cuda() call assumes CUDA is available. This will raise a runtime error on systems without a GPU or CUDA installation.

Consider adding a CUDA availability check or making device placement configurable:

🔧 Suggested fix with CUDA check
-        self.model.cuda()
+        if torch.cuda.is_available():
+            self.model.cuda()
+        else:
+            print("Warning: CUDA not available, running on CPU")

Alternatively, add a device parameter to __init__:

def __init__(self, model_path: str, mllm_qualcomm_max_length=2048, device="cuda"):
    # ...
    self.device = device if device != "cuda" or torch.cuda.is_available() else "cpu"
    self.model.to(self.device)
🤖 Prompt for AI Agents
In @pymllm/backends/qualcomm/transformers/qwen3/runner.py at line 48, The code
unconditionally calls self.model.cuda(), which fails on CPU-only systems; update
the class/__init__ to accept a device parameter (e.g., device) or check
torch.cuda.is_available() and choose "cuda" only when available, then move the
model with self.model.to(self.device) (or equivalent) instead of
self.model.cuda(); locate the placement call (self.model.cuda()) and the
constructor (the class __init__) to add the device logic and use
self.model.to(self.device).

@chenghuaWang merged commit 44a7736 into UbiquitousLearning:main on Jan 12, 2026
4 checks passed