fix(Qualcomm): Replace linear op with conv2d in Qualcomm backend #600
chenghuaWang merged 1 commit into UbiquitousLearning:main from
Conversation
📝 Walkthrough

The changes add Conv2D support to the QNN AOT backend by replacing Linear layers with Conv2D in Qwen3 model components, implementing QNN AOT Conv2D visitor patterns and quantization recipes, updating LPBQ quantization handling for Conv2D tensors, and adding Python deployment conversion methods for HWIO layout preparation.

Changes
Sequence Diagram(s)

sequenceDiagram
participant Model as Qwen3 Model
participant Lowering as AOT Lowering Pass
participant Quant as Quantization Recipe
participant Visitor as Conv2D Visitor
participant QNN as QNN Backend
Model->>Lowering: Conv2D operations
activate Lowering
Lowering->>Lowering: Match Conv2DPattern
Lowering->>Quant: Annotated Conv2D ops
deactivate Lowering
activate Quant
Quant->>Quant: Apply LPBQ method
Quant->>Quant: Attach quantization annotations
Quant->>Visitor: Annotated Conv2D with quant_recipe
deactivate Quant
activate Visitor
Visitor->>Visitor: Validate Conv2D attributes
Visitor->>Visitor: Extract weights & bias
Visitor->>QNN: Create QNN Conv2d node
deactivate Visitor
QNN->>QNN: Register with stride & padding params
QNN-->>Model: QNN AOT executable
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
mllm/backends/qnn/aot/QnnWrappersAPI.cpp (1)
161-185: Add guard or separate handling for Linear LPBQ quantization.

The code at lines 161-185 is hardcoded for the Conv2D HWIO (4D) tensor layout, but Linear operations can also be assigned LPBQ quantization via LLMQuantRecipeLinearPattern::rewrite(). When a Linear weight tensor (2D) reaches setupComplexTensorQuantization(), it will crash attempting to access at<float>({0, 0, 0, i}) on a 2D tensor, or fail its assertion checks. Either:

- Add a type check in setupComplexTensorQuantization() to skip LPBQ handling for Linear operations, or
- Implement separate 2D tensor handling for Linear LPBQ with proper indexing, or
- Document and enforce that Linear LPBQ is not supported in the quantization recipe configuration
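As a sketch of the first option, a rank guard could be added at the top of the LPBQ branch. This follows the kLPBQ case quoted later in this review; the rank() accessor and the bool-returning control flow are assumptions, not the actual mllm API:

```cpp
// Hypothetical guard inside setupComplexTensorQuantization(); names follow the
// snippets quoted elsewhere in this review, exact signatures are assumptions.
case ir::linalg::QuantizationSpecType::kLPBQ: {
  // The code below indexes scales as {0, 0, 0, i} (HWIO), so only 4D Conv2D
  // weights are supported here; reject 2D Linear weights up front.
  if (v->tensor_.rank() != 4) {
    MLLM_ERROR("LPBQ handling here only supports 4D HWIO Conv2D weights, got rank {}",
               v->tensor_.rank());
    return false;  // assumption: the enclosing function reports failure via a bool
  }
  // ... existing Conv2D HWIO LPBQ handling ...
  break;
}
```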
🤖 Fix all issues with AI agents
In @mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp:
- Around line 256-258: The section header comment incorrectly reads "Sigmoid
Pattern" while the code implements the Conv2D pattern; update the comment to
accurately reflect the implementation (e.g., change the header to "Conv2D
Pattern" or "Conv2D Pattern / Conv2D lowering") so the comment matches the
implemented Conv2D pattern in this block.
- Around line 317-339: weight_quant_spec can be nullptr when method == "LPBQ"
but precision != "w4a16", causing nullptr to be stored/used; after the precision
branch check weight_quant_spec and bail out instead of proceeding: add a
validation like MLLM_RETURN_FALSE_IF_NOT(weight_quant_spec) (or return false/log
an error) before inserting into annotation_attr->annotation_.weights and before
calling t->setAttr(...). Apply the same fix in
LLMQuantRecipeLinearPattern::rewrite to prevent null propagation for unsupported
precisions.
In @mllm/backends/qnn/aot/QnnWrappersAPI.cpp:
- Around line 167-174: The code assumes 4D HWIO tensors (using
v->tensor_.size(-1) and cfg->scale_level_1_fp.at<float>({0,0,0,i})), which can
crash for lower-rank weights (e.g., 2D linear weights); add a defensive rank
check on cfg->scale_level_1_fp and v->tensor_ before the loop and either assert
the expected rank (==4) or branch to a fallback that reads the scale using only
the last-dimension index (i) so you safely access the scale regardless of
whether scale_level_1_fp is 4D or 1D; update the logic that builds scale_offsets
(the loop over num_scale_offsets and the use of scale_level_1_fp.at<float>(...))
accordingly and keep the existing assertion on cfg->scale_level_0_int.dtype()
intact.
In @mllm/backends/qnn/aot/visitor/Conv2D.cpp:
- Around line 22-26: The local variable names `linear_op` and `real_linear_op`
in Conv2D.cpp are misleading copy/paste artifacts; rename them to reflect Conv2D
(e.g., `conv_op` and `real_conv_op`) wherever `auto linear_op =
op->cast_<mllm::ir::linalg::Conv2DOp>()` and subsequent uses appear (including
checks, error messages like MLLM_ERROR, and any downstream references) so all
identifiers and messages consistently indicate Conv2D instead of Linear.
- Around line 43-48: The lookup of the weight symbol table can return null or an
outputs() vector with no elements, so before calling .front() and
cast_<ir::tensor::TensorValue>(), guard the chain: check that
writer.getContext()->lookupSymbolTable(base_op->getName() + ".weight") is
non-null and that its outputs() is non-empty; if either check fails, handle the
error (e.g., log via processLogger or throw a descriptive exception) and avoid
dereferencing, otherwise proceed to obtain the front() and cast_ as currently
written.
- Around line 36-41: The variable name real_linear_op is misleading for a Conv2D
runtime op; change it to real_conv2d_op where you perform the dynamic_cast from
base_op to mllm::aops::Conv2DOp (the block starting with auto base_op =
linear_op->getAOp(); and the dynamic_cast line) and update all subsequent
references (e.g., the usage referenced around line 58) to use real_conv2d_op so
the identifier correctly reflects the Conv2D type.
In @pymllm/backends/qualcomm/transformers/core/qlinear.py:
- Around line 265-266: The check uses a tensor buffer
self.weight_quant.is_frozen and should use its Python boolean value; replace the
direct tensor truthiness check with self.weight_quant.is_frozen.item() (e.g.,
change the condition in the block that calls freeze_weight() to use .item()) so
the if statement reliably reads the boolean and avoids tensor
truthiness/deprecation issues.
In @pymllm/backends/qualcomm/transformers/qwen3/runner.py:
- Line 48: The code unconditionally calls self.model.cuda(), which fails on
CPU-only systems; update the class/__init__ to accept a device parameter (e.g.,
device) or check torch.cuda.is_available() and choose "cuda" only when
available, then move the model with self.model.to(self.device) (or equivalent)
instead of self.model.cuda(); locate the placement call (self.model.cuda()) and
the constructor (the class __init__) to add the device logic and use
self.model.to(self.device).
🧹 Nitpick comments (5)
examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp (2)
115-120: Consider replacing macro with constexpr or inline function.

The CONV2D_PROPERTY macro works but has drawbacks in header files: macros are not namespace-scoped and can cause name collisions. Consider a safer alternative:

♻️ Suggested refactor using inline function

```diff
 using vi32 = std::vector<int32_t>;
-#define CONV2D_PROPERTY vi32{1, 1}, vi32{1, 1}, vi32{0, 0}, vi32{1, 1}, false, aops::Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G32
+
+// Conv2D properties for Linear replacement: kernel 1x1, stride 1x1, pad 0, dilation 1x1
+struct Conv2DProps {
+  static constexpr auto kernel = vi32{1, 1};
+  static constexpr auto stride = vi32{1, 1};
+  static constexpr auto padding = vi32{0, 0};
+  static constexpr auto dilation = vi32{1, 1};
+  static constexpr bool bias = false;
+  static constexpr auto impl_type = aops::Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G32;
+};
```

Then update registrations to use Conv2DProps::kernel, Conv2DProps::stride, ... or create a helper template/function.
410-432: Remove outdated comments or implement the suggested handling.

The exception throwing for missing KV caches is appropriate, but the comments at lines 415-417 and 428-430 suggest uncertainty:

// This might need adjustment based on your initialization logic

If the current implementation (throwing an exception) is the intended behavior, these comments should be removed. If further handling is needed, consider tracking this as a follow-up task.

♻️ Suggested cleanup

```diff
 } else {
-  // If KV cache doesn't exist, we need to handle this case
-  // For now, we'll create empty tensors or handle it appropriately
-  // This might need adjustment based on your initialization logic
   throw std::runtime_error("Missing KV cache for layer " + std::to_string(i));
 }
```

Apply similar cleanup for both key and value cache loops.
mllm/core/aops/Conv2DOp.hpp (1)
13-16: New enum values are missing from string conversion functions.

The new kQNN_LPBQ_w4a16o16_G32 and kQNN_LPBQ_w4a16o16_G64 enum values are not added to the str2Conv2DOpImplType (line 31) and Conv2DOpImplType2Str (line 41) mapping functions. This will cause them to silently fall back to kDefault during string conversion.

If these types need to be serialized/deserialized or logged, consider adding the mappings:

🔧 Suggested fix

```diff
 inline Conv2DOpImplType str2Conv2DOpImplType(const std::string& str) {
-  static const std::unordered_map<std::string, Conv2DOpImplType> map = {{"Default", Conv2DOpImplType::kDefault}};
+  static const std::unordered_map<std::string, Conv2DOpImplType> map = {
+      {"Default", Conv2DOpImplType::kDefault},
+      {"QNN_LPBQ_w4a16o16_G32", Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G32},
+      {"QNN_LPBQ_w4a16o16_G64", Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G64}
+  };
   auto it = map.find(str);
   if (it != map.end()) { return it->second; }
   // Return default if not found
   return Conv2DOpImplType::kDefault;
 }

 inline std::string Conv2DOpImplType2Str(Conv2DOpImplType type) {
-  static const std::unordered_map<Conv2DOpImplType, std::string> map = {{Conv2DOpImplType::kDefault, "Default"}};
+  static const std::unordered_map<Conv2DOpImplType, std::string> map = {
+      {Conv2DOpImplType::kDefault, "Default"},
+      {Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G32, "QNN_LPBQ_w4a16o16_G32"},
+      {Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G64, "QNN_LPBQ_w4a16o16_G64"}
+  };
   auto it = map.find(type);
   if (it != map.end()) return it->second;
   return "Default";
 }
```

mllm/core/aops/Conv2DOp.cpp (1)
114-117: Consider extracting the DSP impl_type check to avoid duplication.

The condition options_.impl_type == Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G32 || options_.impl_type == Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G64 is repeated twice (lines 80-81 and 114-115). Consider extracting this to a helper or local boolean for clarity and maintainability.

♻️ Suggested refactor

```diff
+  const bool is_dsp_layout = options_.impl_type == Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G32
+                             || options_.impl_type == Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G64;
+
   // CHECK if in Qualcomm DSP shape. Inputs is [N, H, W, C], Filter Weight is [N, H, In, Out]
-  if (options_.impl_type == Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G32
-      || options_.impl_type == Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G64) {
+  if (is_dsp_layout) {
     in_channels = ishape[3];
     in_height = ishape[1];
     in_width = ishape[2];
   }
   ...
   auto new_shape = std::vector<int32_t>{batch, out_channels, h_out, w_out};
-  if (options_.impl_type == Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G32
-      || options_.impl_type == Conv2DOpImplType::kQNN_LPBQ_w4a16o16_G64) {
+  if (is_dsp_layout) {
     new_shape = std::vector<int32_t>{batch, h_out, w_out, out_channels};
   }
```

mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp (1)
264-349: Consider extracting shared quantization logic.

The LLMQuantRecipeConv2DPattern::rewrite implementation is nearly identical to LLMQuantRecipeLinearPattern::rewrite (lines 835-920). Both patterns share:
- Config lookup and regex matching logic
- LPBQ method handling with "w4a16" precision
- Weight tensor registration and quant spec assignment
Consider extracting a shared helper function to reduce code duplication and simplify future maintenance.
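For illustration only, the shared portion might be factored into a helper along these lines. The writer type and the op IR parameter are stand-ins (the exact mllm IR types are not shown in this review), and the annotation_attr bookkeeping is elided:

```cpp
// Hypothetical shared helper for LLMQuantRecipeLinearPattern::rewrite and
// LLMQuantRecipeConv2DPattern::rewrite. It applies the common LPBQ "w4a16"
// weight/output annotation; callers keep their own matching and config lookup.
// "ir::IRWriter" and "ir::op_ptr_t" are assumptions about the real signatures.
static bool annotateLpbqW4A16(ir::IRWriter& writer, const ir::op_ptr_t& op_ir, int32_t block_size) {
  // Weight: 4-bit LPBQ spec, same arguments as both existing rewrite() bodies.
  auto weight_quant_spec = ir::linalg::QuantizationSpecLPBQ::create(
      -8, 7, block_size, 0, 4, kUInt4, kFloat32, Tensor::nil(), Tensor::nil());

  // Output: asymmetric per-tensor uint16, as in both existing rewrite() bodies.
  auto out_quant_spec = ir::linalg::QuantizationSpecAsymPerTensor::create(
      0, 65536 - 1, kUInt16, kFloat32, kInt32, Tensor::nil(), Tensor::nil());
  op_ir->outputs().front()->setAttr("quant_recipe",
                                    writer.create<ir::linalg::LinalgIRQuantizatonSpecAttr>(out_quant_spec));

  // Attach the LPBQ spec to the op's registered weight tensor.
  auto weight_reg = writer.getContext()->lookupSymbolTable(op_ir->getAOp()->getName() + ".weight");
  MLLM_RETURN_FALSE_IF_NOT(weight_reg);
  auto t = weight_reg->outputs().front()->cast_<ir::tensor::TensorValue>();
  t->setAttr("quant_recipe", writer.create<ir::linalg::LinalgIRQuantizatonSpecAttr>(weight_quant_spec));
  return true;
}
```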
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp
- mllm/backends/qnn/QNNUtils.cpp
- mllm/backends/qnn/aot/QnnWrappersAPI.cpp
- mllm/backends/qnn/aot/passes/LLM2QnnLoweringPass.cpp
- mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp
- mllm/backends/qnn/aot/passes/LLMQuantRecipePass.hpp
- mllm/backends/qnn/aot/passes/PTQPass.cpp
- mllm/backends/qnn/aot/visitor/Conv2D.cpp
- mllm/backends/qnn/aot/visitor/Conv2D.hpp
- mllm/core/aops/Conv2DOp.cpp
- mllm/core/aops/Conv2DOp.hpp
- pymllm/backends/qualcomm/transformers/core/qlinear.py
- pymllm/backends/qualcomm/transformers/qwen3/runner.py
🧰 Additional context used
📓 Path-based instructions (5)
{mllm,mllm-cli,pymllm}/**/*
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*: Files must not contain C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, or DEL 0x7F. Horizontal tab (0x09) and line feed (0x0A) are explicitly allowed.
All files must be encoded in UTF-8 without BOM.
Any violation of character set (Rule 1) or encoding (Rule 2) requirements must cause the review to fail.
No line may end with trailing whitespace.
Use Unix line endings (LF).
File and directory names must consist only of printable Unicode characters, excluding C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, and DEL 0x7F.
Only use acceptable file extensions: .c, .cc, .cpp, .cxx, .h, .hh, .hpp, .py, .pyi, .sh, .txt, .md, .yml, .yaml, .json, .toml.
Optional license headers, if present, must comply with character set rules (no C0/C1 control codes except tab and line feed).
Files:
mllm/backends/qnn/QNNUtils.cpp, pymllm/backends/qualcomm/transformers/qwen3/runner.py, mllm/backends/qnn/aot/passes/LLMQuantRecipePass.hpp, mllm/core/aops/Conv2DOp.cpp, mllm/core/aops/Conv2DOp.hpp, mllm/backends/qnn/aot/visitor/Conv2D.hpp, mllm/backends/qnn/aot/passes/PTQPass.cpp, mllm/backends/qnn/aot/passes/LLM2QnnLoweringPass.cpp, mllm/backends/qnn/aot/visitor/Conv2D.cpp, pymllm/backends/qualcomm/transformers/core/qlinear.py, mllm/backends/qnn/aot/QnnWrappersAPI.cpp, mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}: TODO and FIXME comments must be written as 'TODO:' or 'FIXME:' followed by UTF-8 text that adheres to character set rules.
Encourage consistent coding style and patterns with the existing codebase.
Ensure code is portable across supported platforms (e.g., Linux, Windows) unless explicitly platform-specific.
Files:
mllm/backends/qnn/QNNUtils.cpp, pymllm/backends/qualcomm/transformers/qwen3/runner.py, mllm/backends/qnn/aot/passes/LLMQuantRecipePass.hpp, mllm/core/aops/Conv2DOp.cpp, mllm/core/aops/Conv2DOp.hpp, mllm/backends/qnn/aot/visitor/Conv2D.hpp, mllm/backends/qnn/aot/passes/PTQPass.cpp, mllm/backends/qnn/aot/passes/LLM2QnnLoweringPass.cpp, mllm/backends/qnn/aot/visitor/Conv2D.cpp, pymllm/backends/qualcomm/transformers/core/qlinear.py, mllm/backends/qnn/aot/QnnWrappersAPI.cpp, mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}: Prioritize production-ready code quality by evaluating time and space complexity of algorithms and data structures, and suggest more efficient alternatives for operations with high complexity (e.g., O(n^2) or worse) when feasible.
Avoid unnecessary object creation in loops or hot paths.
Check for proper error handling and resource cleanup (e.g., using try-finally, context managers, or RAII).
Ensure functions that can fail return appropriate error codes or raise exceptions.
Validate inputs for public APIs and critical internal functions.
Add comments for complex algorithms or non-obvious logic.
Identify potential security issues (e.g., buffer overflows, injection risks, insecure temporary files) and recommend using secure alternatives (e.g., parameterized queries, secure random generators).
Suggest adding unit tests for untested complex logic or edge cases.
Ensure code is testable by avoiding global state and using dependency injection.
Flag overly complex functions (e.g., high cyclomatic complexity) and suggest breaking them down.
Use named constants instead of magic numbers.
Add appropriate logging (e.g., debug, info, warning, error) for significant events and errors, avoiding sensitive data exposure.
Files:
mllm/backends/qnn/QNNUtils.cpp, pymllm/backends/qualcomm/transformers/qwen3/runner.py, mllm/core/aops/Conv2DOp.cpp, mllm/backends/qnn/aot/passes/PTQPass.cpp, mllm/backends/qnn/aot/passes/LLM2QnnLoweringPass.cpp, mllm/backends/qnn/aot/visitor/Conv2D.cpp, pymllm/backends/qualcomm/transformers/core/qlinear.py, mllm/backends/qnn/aot/QnnWrappersAPI.cpp, mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}: Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.
Adhere to language-specific best practices and idioms (e.g., PEP 8 for Python, Google C++ Style Guide for C++).
Files:
mllm/backends/qnn/QNNUtils.cpp, pymllm/backends/qualcomm/transformers/qwen3/runner.py, mllm/backends/qnn/aot/passes/LLMQuantRecipePass.hpp, mllm/core/aops/Conv2DOp.cpp, mllm/core/aops/Conv2DOp.hpp, mllm/backends/qnn/aot/visitor/Conv2D.hpp, mllm/backends/qnn/aot/passes/PTQPass.cpp, mllm/backends/qnn/aot/passes/LLM2QnnLoweringPass.cpp, mllm/backends/qnn/aot/visitor/Conv2D.cpp, pymllm/backends/qualcomm/transformers/core/qlinear.py, mllm/backends/qnn/aot/QnnWrappersAPI.cpp, mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp
{mllm,mllm-cli,pymllm}/**/*.{sh,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
If a file starts with a shebang, it must be the first line and contain only the interpreter path and optional single argument, encoded in UTF-8.
Files:
pymllm/backends/qualcomm/transformers/qwen3/runner.py, pymllm/backends/qualcomm/transformers/core/qlinear.py
🧬 Code graph analysis (5)
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.hpp (1)
mllm/backends/qnn/aot/visitor/Conv2D.hpp (2)
op (14-14), writer (16-16)
mllm/core/aops/Conv2DOp.cpp (2)
mllm/core/Tensor.cpp (2)
empty (78-82), empty (78-78)
mllm/core/aops/Conv2DOp.hpp (1)
options_(68-68)
examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp (1)
mllm/nn/Module.hpp (8)
name (34-34), name (36-36), name (38-38), name (190-190), reg (83-115), reg (83-83), args (148-170), args (148-148)
mllm/backends/qnn/aot/visitor/Conv2D.cpp (3)
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp (32)
isMatch (259-262), isMatch (259-259), isMatch (354-357), isMatch (354-354), isMatch (366-369), isMatch (366-366), isMatch (379-382), isMatch (379-379), isMatch (391-394), isMatch (391-391), isMatch (437-440), isMatch (437-437), isMatch (465-468), isMatch (465-465), isMatch (501-504), isMatch (501-501), rewrite (264-349), rewrite (264-264), rewrite (359-361), rewrite (359-359), rewrite (371-374), rewrite (371-371), rewrite (384-386), rewrite (384-384), rewrite (396-432), rewrite (396-396), rewrite (442-460), rewrite (442-442), rewrite (470-496), rewrite (470-470), rewrite (506-508), rewrite (506-506)
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.hpp (32)
op (40-40), op (54-54), op (68-68), op (80-80), op (94-94), op (106-106), op (120-120), op (134-134), op (146-146), op (158-158), op (170-170), op (184-184), op (198-198), op (212-212), op (226-226), op (240-240), writer (42-42), writer (56-56), writer (70-70), writer (82-82), writer (96-96), writer (108-108), writer (122-122), writer (136-136), writer (148-148), writer (160-160), writer (172-172), writer (186-186), writer (200-200), writer (214-214), writer (228-228), writer (242-242)
mllm/backends/qnn/QNNUtils.cpp (4)
create (449-467), create (449-450), create (549-553), create (549-551)
pymllm/backends/qualcomm/transformers/core/qlinear.py (1)
pymllm/backends/qualcomm/transformers/core/rms_norm.py (1)
freeze_weight(102-116)
🪛 Clang (14.0.6)
mllm/backends/qnn/QNNUtils.cpp
[error] 1-1: 'cstdint' file not found
(clang-diagnostic-error)
mllm/backends/qnn/aot/visitor/Conv2D.hpp
[error] 6-6: 'mllm/core/OpTypes.hpp' file not found
(clang-diagnostic-error)
mllm/backends/qnn/aot/visitor/Conv2D.cpp
[error] 4-4: 'mllm/backends/qnn/QNNUtils.hpp' file not found
(clang-diagnostic-error)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-macos
- GitHub Check: build-android
- GitHub Check: build-x86
🔇 Additional comments (18)
mllm/backends/qnn/QNNUtils.cpp (1)
1-8: LGTM! Explicit header inclusion improves code clarity.

Making dependencies explicit by directly including standard library headers (<cstdint>, <memory>, <dlfcn.h>, <cstring>) is a C++ best practice. This eliminates reliance on transitive includes and improves compilation robustness across different environments.

The static analysis error "'cstdint' file not found" is a false positive from the analyzer's environment configuration; <cstdint> is a standard C++11 header available in all modern compilers.

examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp (5)
122-158: LGTM!

The Linear to Conv2D replacement in Qwen3MLP is correctly implemented. The view operations properly reshape tensors from [B, S, D] to [B, 1, S, D] format for Conv2D operations and back to 3D for subsequent computations.

161-201: LGTM!

The Conv2D replacements for the attention projections maintain correct input/output dimensions and use consistent configuration via the CONV2D_PROPERTY macro.

211-279: LGTM!

The tensor reshaping in the attention forward pass correctly handles the Conv2D format requirements. The flow properly maintains shape compatibility through the attention computation pipeline.

393-397: LGTM!

The lm_head_ Conv2D registration is consistent with the other projection replacements.

463-466: Verify output shape after lm_head_ operation.

The lm_head_ Conv2D operation produces a 4D output [1, 1, S, vocab_size], but unlike all other Conv2D operations in this file (up_proj_, gate_proj_, down_proj_, o_proj_), there is no .view() to reshape it back to 3D before the trace ends. Confirm whether this 4D output is intentional for the final model output, or if a reshape is needed.

mllm/backends/qnn/aot/passes/LLMQuantRecipePass.hpp (1)
35-47: LGTM!

The LLMQuantRecipeConv2DPattern class follows the established pattern structure used by other pattern classes in this file (LLMQuantRecipeSigmoidPattern, LLMQuantRecipeLinearPattern, etc.) with consistent isMatch, rewrite, and create method signatures.

pymllm/backends/qualcomm/transformers/core/qlinear.py (2)
114-157: LGTM! Well-structured HWIO conversion for W8A16 Per-Channel.

The implementation correctly:

- Guards against double conversion via the deploy_mode check
- Ensures quantization parameters are frozen before conversion
- Applies the correct transformation: [Out, In] → transpose → [In, Out] → reshape → [1, 1, In, Out]
- Reshapes scale/zero_point to [1, 1, 1, Out] for proper broadcasting in HWIO format

284-299: LPBQ scale transformations look correct for HWIO layout.

The scale handling properly transforms:

- Scale2 (per-channel): [Out, 1, 1] → [1, 1, 1, Out]
- Scale1 (per-block): [Out, Blocks, 1] → [1, 1, Blocks, Out]

This maintains the semantic correspondence between weight blocks and their scales after the transpose operation.
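For reference, the layout changes described above can be reproduced with a few standalone PyTorch lines; the shapes and variable names are illustrative and do not refer to the actual qlinear.py attributes:

```python
import torch

out_features, in_features, blocks = 8, 16, 4

# Linear weight [Out, In] -> Conv2D HWIO weight [1, 1, In, Out]
w = torch.randn(out_features, in_features)
w_hwio = w.t().reshape(1, 1, in_features, out_features)

# Per-channel scale / zero_point [Out] -> [1, 1, 1, Out] for HWIO broadcasting
scale = torch.rand(out_features)
scale_hwio = scale.reshape(1, 1, 1, out_features)

# LPBQ per-channel scale2 [Out, 1, 1] -> [1, 1, 1, Out]
scale2 = torch.rand(out_features, 1, 1)
scale2_hwio = scale2.reshape(1, 1, 1, out_features)

# LPBQ per-block scale1 [Out, Blocks, 1] -> [1, 1, Blocks, Out]
scale1 = torch.rand(out_features, blocks, 1)
scale1_hwio = scale1.squeeze(-1).t().reshape(1, 1, blocks, out_features)

print(w_hwio.shape, scale_hwio.shape, scale2_hwio.shape, scale1_hwio.shape)
# torch.Size([1, 1, 16, 8]) torch.Size([1, 1, 1, 8]) torch.Size([1, 1, 1, 8]) torch.Size([1, 1, 4, 8])
```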
pymllm/backends/qualcomm/transformers/qwen3/runner.py (1)
35-38: LGTM! Correctly routes linear layers to HWIO deployment.

The convert_weight function now properly directs QLinearLPBQ and QLinearW8A16_PerChannelSym to use the new convert_to_conv2d_deploy_hwio() method while keeping QRMSNorm on the standard convert_to_deploy() path.
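A minimal sketch of what that routing can look like; the import paths and the exact structure of convert_weight are assumptions based on the files touched in this PR, not a copy of runner.py:

```python
# Hypothetical dispatch mirroring the routing described above.
from pymllm.backends.qualcomm.transformers.core.qlinear import (
    QLinearLPBQ,
    QLinearW8A16_PerChannelSym,
)
from pymllm.backends.qualcomm.transformers.core.rms_norm import QRMSNorm


def convert_weight(model):
    """Switch every quantized submodule into its deployment layout."""
    for module in model.modules():
        if isinstance(module, (QLinearLPBQ, QLinearW8A16_PerChannelSym)):
            # Linear layers become 1x1 Conv2D weights in HWIO layout.
            module.convert_to_conv2d_deploy_hwio()
        elif isinstance(module, QRMSNorm):
            # Norm layers keep the standard deployment conversion.
            module.convert_to_deploy()
```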
mllm/backends/qnn/aot/passes/PTQPass.cpp (1)

125-128: LGTM - Conv2D weight solving correctly reuses Linear's logic.

The approach of reusing solveLinearWeight for Conv2D operations is appropriate since both operations share the same LPBQ weight quantization handling. The inline comment clarifies the intent.

mllm/backends/qnn/aot/passes/LLM2QnnLoweringPass.cpp (2)
15-15: LGTM - Conv2D visitor include added correctly.

The include placement follows the existing pattern of other visitor headers.

38-42: LGTM - Conv2D pattern registration looks correct.

The QnnAOTConv2DPattern is properly added to the pattern registration list, consistent with how other patterns are registered.

mllm/backends/qnn/aot/visitor/Conv2D.hpp (1)
12-21: LGTM - Pattern class declaration follows existing conventions.

The QnnAOTConv2DPattern class properly extends QnnAOTBasePattern and follows the same structure as other AOT patterns in the codebase. The static create() factory method returns the expected pair<OpTypes, shared_ptr> type.

Note: The static analysis error about missing mllm/core/OpTypes.hpp is a false positive due to the sandbox environment not having the complete build system.

mllm/core/aops/Conv2DOp.cpp (1)
79-85: LGTM - DSP/LPBQ input shape reinterpretation is correct.

The NHWC layout handling correctly reinterprets the input dimensions for Qualcomm DSP paths. The comment clearly documents the expected layout transformation.
mllm/backends/qnn/aot/visitor/Conv2D.cpp (1)
70-95: Stride and padding are hardcoded instead of using Conv2D options.

The stride is hardcoded to [1, 1] and the padding to zeros, ignoring the actual values from real_linear_op->options().stride and real_linear_op->options().padding. If this is intentional for the Linear→Conv2D replacement use case (where these values are always fixed), please add a comment explaining this. Otherwise, this should use the actual options values.

Is the hardcoding intentional because Conv2D is only used as a replacement for Linear ops (where stride=1 and padding=0)?

If the hardcoding is intentional, consider adding a comment:

// Stride is always [1,1] for Linear->Conv2D replacement pattern

If not intentional, consider using the actual options:

```diff
 {
   auto stride_param = QNNParamTensorWrapper::create("stride", base_op->getName() + ".stride", QNN_DATATYPE_UINT_32,
                                                     std::vector<uint32_t>{2});
   uint32_t* data = static_cast<uint32_t*>(stride_param->alloc());
-  data[0] = 1;
-  data[1] = 1;
+  const auto& stride = real_conv2d_op->options().stride;
+  data[0] = static_cast<uint32_t>(stride[0]);
+  data[1] = static_cast<uint32_t>(stride[1]);
   qnn_op_node->emplaceParamTensor(stride_param);
 }
```
267-268: Verify config key: Conv2D pattern uses "linear" config path.

The Conv2D pattern fetches configuration from ["quant_recipe"]["builtin_llm_pass"]["linear"], which is the same path used by the Linear pattern. Is this intentional to share configuration, or should Conv2D have its own config key (e.g., "conv2d")?
1089-1089: LGTM!

The Conv2D pattern registration follows the existing conventions with appropriate naming and priority.
```cpp
//===----------------------------------------------------------------------===//
// Sigmoid Pattern
//===----------------------------------------------------------------------===//
```
Incorrect section header comment.
The comment says "Sigmoid Pattern" but this section implements the Conv2D pattern.
📝 Suggested fix
```diff
 //===----------------------------------------------------------------------===//
-// Sigmoid Pattern
+// Conv2D Pattern
 //===----------------------------------------------------------------------===//
 bool LLMQuantRecipeConv2DPattern::isMatch(const mllm::ir::op_ptr_t& op) {
```
🤖 Prompt for AI Agents
In @mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp around lines 256 - 258,
The section header comment incorrectly reads "Sigmoid Pattern" while the code
implements the Conv2D pattern; update the comment to accurately reflect the
implementation (e.g., change the header to "Conv2D Pattern" or "Conv2D Pattern /
Conv2D lowering") so the comment matches the implemented Conv2D pattern in this
block.
```cpp
ir::linalg::QuantizationSpecLPBQ::ptr_t weight_quant_spec = nullptr;

if (precision == "w4a16") {
  weight_quant_spec =
      ir::linalg::QuantizationSpecLPBQ::create(-8, 7, block_size, 0, 4, kUInt4, kFloat32, Tensor::nil(), Tensor::nil());

  // output sym int16
  auto out_quant_spec = ir::linalg::QuantizationSpecAsymPerTensor::create(0, 65536 - 1, kUInt16, kFloat32, kInt32,
                                                                          Tensor::nil(), Tensor::nil());
  conv2d_ir->outputs().front()->setAttr("quant_recipe",
                                        writer.create<ir::linalg::LinalgIRQuantizatonSpecAttr>(out_quant_spec));

  annotation_attr->annotation_.outputs.emplace_back(out_quant_spec);
  annotation_attr->annotation_.weights.insert({"weight", weight_quant_spec});
}

auto weight_name = conv2d_ir->getAOp()->getName() + ".weight";
auto weight_reg_tensor_ir = writer.getContext()->lookupSymbolTable(weight_name);
MLLM_RETURN_FALSE_IF_NOT(weight_reg_tensor_ir);
MLLM_RETURN_FALSE_IF_NOT(weight_reg_tensor_ir->isa_<ir::tensor::RegisterOp>());
MLLM_RETURN_FALSE_IF_NOT(weight_reg_tensor_ir->outputs().front()->isa_<ir::tensor::TensorValue>());
auto t = weight_reg_tensor_ir->outputs().front()->cast_<ir::tensor::TensorValue>();
t->setAttr("quant_recipe", writer.create<ir::linalg::LinalgIRQuantizatonSpecAttr>(weight_quant_spec));
```
Potential null pointer propagation for unsupported precisions.
If method is "LPBQ" but precision is not "w4a16", weight_quant_spec remains nullptr. This leads to:
- Line 330: Inserting nullptr into annotation_.weights
- Line 339: Setting a nullptr quant recipe attribute on the weight tensor
This may cause null dereference issues downstream when the quant recipe is accessed.
🐛 Suggested fix: Add handling for unsupported precisions
if (precision == "w4a16") {
weight_quant_spec =
ir::linalg::QuantizationSpecLPBQ::create(-8, 7, block_size, 0, 4, kUInt4, kFloat32, Tensor::nil(), Tensor::nil());
// output sym int16
auto out_quant_spec = ir::linalg::QuantizationSpecAsymPerTensor::create(0, 65536 - 1, kUInt16, kFloat32, kInt32,
Tensor::nil(), Tensor::nil());
conv2d_ir->outputs().front()->setAttr("quant_recipe",
writer.create<ir::linalg::LinalgIRQuantizatonSpecAttr>(out_quant_spec));
annotation_attr->annotation_.outputs.emplace_back(out_quant_spec);
annotation_attr->annotation_.weights.insert({"weight", weight_quant_spec});
+ } else {
+ MLLM_WARN("Currently not support precision: {} for LPBQ method", precision);
+ return false;
}
auto weight_name = conv2d_ir->getAOp()->getName() + ".weight";
auto weight_reg_tensor_ir = writer.getContext()->lookupSymbolTable(weight_name);
MLLM_RETURN_FALSE_IF_NOT(weight_reg_tensor_ir);
MLLM_RETURN_FALSE_IF_NOT(weight_reg_tensor_ir->isa_<ir::tensor::RegisterOp>());
MLLM_RETURN_FALSE_IF_NOT(weight_reg_tensor_ir->outputs().front()->isa_<ir::tensor::TensorValue>());
auto t = weight_reg_tensor_ir->outputs().front()->cast_<ir::tensor::TensorValue>();
- t->setAttr("quant_recipe", writer.create<ir::linalg::LinalgIRQuantizatonSpecAttr>(weight_quant_spec));
+ if (weight_quant_spec) {
+ t->setAttr("quant_recipe", writer.create<ir::linalg::LinalgIRQuantizatonSpecAttr>(weight_quant_spec));
+ }
} else {Note: The existing LLMQuantRecipeLinearPattern::rewrite (lines 888-910) has the same issue with unsupported precision handling.
🤖 Prompt for AI Agents
In @mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp around lines 317 - 339,
weight_quant_spec can be nullptr when method == "LPBQ" but precision != "w4a16",
causing nullptr to be stored/used; after the precision branch check
weight_quant_spec and bail out instead of proceeding: add a validation like
MLLM_RETURN_FALSE_IF_NOT(weight_quant_spec) (or return false/log an error)
before inserting into annotation_attr->annotation_.weights and before calling
t->setAttr(...). Apply the same fix in LLMQuantRecipeLinearPattern::rewrite to
prevent null propagation for unsupported precisions.
```diff
 auto num_scale_offsets = (uint32_t)v->tensor_.size(-1);
 std::vector<Qnn_ScaleOffset_t> scale_offsets(num_scale_offsets);
-MLLM_RT_ASSERT_EQ(num_scale_offsets, cfg->scale_level_1_fp.size(0));
+MLLM_RT_ASSERT_EQ(num_scale_offsets, cfg->scale_level_1_fp.size(-1));
 MLLM_RT_ASSERT_EQ(cfg->scale_level_0_int.dtype(), kUInt8);
 for (int i = 0; i < num_scale_offsets; ++i) {
-  scale_offsets[i].scale = cfg->scale_level_1_fp.at<float>({i, 0, 0});
+  scale_offsets[i].scale = cfg->scale_level_1_fp.at<float>({0, 0, 0, i});
   scale_offsets[i].offset = 0;
 }
```
Potential runtime error with tensor dimension mismatch.
The scale access at line 172 uses 4D indexing {0, 0, 0, i}, which assumes scale_level_1_fp has been reshaped to [1, 1, 1, Out] (HWIO format). Similarly, line 167 uses size(-1) on v->tensor_ expecting 4D HWIO layout.
If any LPBQ-quantized tensor reaches this code with a different rank (e.g., 2D for Linear weights), this could cause a runtime crash or incorrect behavior.
Consider adding a rank assertion to catch mismatches early:
🛡️ Suggested defensive check
```diff
 case ir::linalg::QuantizationSpecType::kLPBQ: {
   // This LPBQ Type is for Conv2D Only !!! Linear has diff layout cmp with conv2d
+  MLLM_RT_ASSERT_EQ(v->tensor_.rank(), 4);  // Expect HWIO [1, 1, In, Out]
   auto cfg = std::static_pointer_cast<ir::linalg::QuantizationSpecLPBQ>(quant_spec);
```

🤖 Prompt for AI Agents
In @mllm/backends/qnn/aot/QnnWrappersAPI.cpp around lines 167 - 174, The code
assumes 4D HWIO tensors (using v->tensor_.size(-1) and
cfg->scale_level_1_fp.at<float>({0,0,0,i})), which can crash for lower-rank
weights (e.g., 2D linear weights); add a defensive rank check on
cfg->scale_level_1_fp and v->tensor_ before the loop and either assert the
expected rank (==4) or branch to a fallback that reads the scale using only the
last-dimension index (i) so you safely access the scale regardless of whether
scale_level_1_fp is 4D or 1D; update the logic that builds scale_offsets (the
loop over num_scale_offsets and the use of scale_level_1_fp.at<float>(...))
accordingly and keep the existing assertion on cfg->scale_level_0_int.dtype()
intact.
```cpp
auto linear_op = op->cast_<mllm::ir::linalg::Conv2DOp>();
if (!linear_op) {
  MLLM_ERROR("Failed to cast to linalg::Conv2DOp");
  return false;
}
```
🛠️ Refactor suggestion | 🟠 Major
Fix misleading variable names (copy/paste artifact).
The variables linear_op and later real_linear_op are misleadingly named since this is handling Conv2D operations, not Linear. This appears to be a copy/paste artifact from the Linear pattern implementation.
♻️ Proposed fix
```diff
-auto linear_op = op->cast_<mllm::ir::linalg::Conv2DOp>();
-if (!linear_op) {
+auto conv2d_op = op->cast_<mllm::ir::linalg::Conv2DOp>();
+if (!conv2d_op) {
   MLLM_ERROR("Failed to cast to linalg::Conv2DOp");
   return false;
 }
```

🤖 Prompt for AI Agents
In @mllm/backends/qnn/aot/visitor/Conv2D.cpp around lines 22 - 26, The local
variable names `linear_op` and `real_linear_op` in Conv2D.cpp are misleading
copy/paste artifacts; rename them to reflect Conv2D (e.g., `conv_op` and
`real_conv_op`) wherever `auto linear_op =
op->cast_<mllm::ir::linalg::Conv2DOp>()` and subsequent uses appear (including
checks, error messages like MLLM_ERROR, and any downstream references) so all
identifiers and messages consistently indicate Conv2D instead of Linear.
```cpp
auto base_op = linear_op->getAOp();
auto real_linear_op = dynamic_cast<mllm::aops::Conv2DOp*>(base_op);
if (!real_linear_op) {
  MLLM_ERROR("Failed to cast BaseOp to mllm::aops::Conv2DOp");
  return false;
}
```
🛠️ Refactor suggestion | 🟠 Major
Fix misleading variable name for the runtime Conv2D op.
Consistent with the previous comment, real_linear_op should be renamed to reflect that it's a Conv2D operation.
♻️ Proposed fix
```diff
-auto base_op = linear_op->getAOp();
-auto real_linear_op = dynamic_cast<mllm::aops::Conv2DOp*>(base_op);
-if (!real_linear_op) {
+auto base_op = conv2d_op->getAOp();
+auto real_conv2d_op = dynamic_cast<mllm::aops::Conv2DOp*>(base_op);
+if (!real_conv2d_op) {
   MLLM_ERROR("Failed to cast BaseOp to mllm::aops::Conv2DOp");
   return false;
 }
```

And update line 58:

```diff
-if (real_linear_op->options().bias) {
+if (real_conv2d_op->options().bias) {
```

🤖 Prompt for AI Agents
In @mllm/backends/qnn/aot/visitor/Conv2D.cpp around lines 36 - 41, The variable
name real_linear_op is misleading for a Conv2D runtime op; change it to
real_conv2d_op where you perform the dynamic_cast from base_op to
mllm::aops::Conv2DOp (the block starting with auto base_op =
linear_op->getAOp(); and the dynamic_cast line) and update all subsequent
references (e.g., the usage referenced around line 58) to use real_conv2d_op so
the identifier correctly reflects the Conv2D type.
```cpp
// Retrieve weight from symbol table
auto weight_val = writer.getContext()
                      ->lookupSymbolTable(base_op->getName() + ".weight")
                      ->outputs()
                      .front()
                      ->cast_<ir::tensor::TensorValue>();
```
Add null check for weight symbol table lookup.
The lookupSymbolTable call chain could return null or an empty outputs vector. Consider adding validation before accessing .front().
🐛 Proposed fix
```diff
+auto weight_symbol = writer.getContext()->lookupSymbolTable(base_op->getName() + ".weight");
+if (!weight_symbol) {
+  MLLM_ERROR("Failed to find weight symbol for {}", base_op->getName());
+  return false;
+}
 auto weight_val = writer.getContext()
                       ->lookupSymbolTable(base_op->getName() + ".weight")
                       ->outputs()
                       .front()
                       ->cast_<ir::tensor::TensorValue>();
```
In @mllm/backends/qnn/aot/visitor/Conv2D.cpp around lines 43 - 48, The lookup of
the weight symbol table can return null or an outputs() vector with no elements,
so before calling .front() and cast_<ir::tensor::TensorValue>(), guard the
chain: check that writer.getContext()->lookupSymbolTable(base_op->getName() +
".weight") is non-null and that its outputs() is non-empty; if either check
fails, handle the error (e.g., log via processLogger or throw a descriptive
exception) and avoid dereferencing, otherwise proceed to obtain the front() and
cast_ as currently written.
```python
if not self.weight_quant.is_frozen:
    self.freeze_weight()
```
Use .item() to check tensor boolean value.
self.weight_quant.is_frozen is registered as a tensor buffer (line 169: self.register_buffer("is_frozen", torch.tensor(False))). Using if not self.weight_quant.is_frozen: on a tensor can produce unexpected behavior or deprecation warnings.
🔧 Suggested fix
```diff
-if not self.weight_quant.is_frozen:
+if not self.weight_quant.is_frozen.item():
     self.freeze_weight()
```
🤖 Prompt for AI Agents
In @pymllm/backends/qualcomm/transformers/core/qlinear.py around lines 265 -
266, The check uses a tensor buffer self.weight_quant.is_frozen and should use
its Python boolean value; replace the direct tensor truthiness check with
self.weight_quant.is_frozen.item() (e.g., change the condition in the block that
calls freeze_weight() to use .item()) so the if statement reliably reads the
boolean and avoids tensor truthiness/deprecation issues.
```python
    model_path,
    attn_implementation="eager",
)
self.model.cuda()
```
Unconditional CUDA placement may break CPU-only environments.
The self.model.cuda() call assumes CUDA is available. This will raise a runtime error on systems without a GPU or CUDA installation.
Consider adding a CUDA availability check or making device placement configurable:
🔧 Suggested fix with CUDA check
```diff
-self.model.cuda()
+if torch.cuda.is_available():
+    self.model.cuda()
+else:
+    print("Warning: CUDA not available, running on CPU")
```

Alternatively, add a device parameter to __init__:
```python
def __init__(self, model_path: str, mllm_qualcomm_max_length=2048, device="cuda"):
    # ...
    self.device = device if device != "cuda" or torch.cuda.is_available() else "cpu"
    self.model.to(self.device)
```

🤖 Prompt for AI Agents
In @pymllm/backends/qualcomm/transformers/qwen3/runner.py at line 48, The code
unconditionally calls self.model.cuda(), which fails on CPU-only systems; update
the class/__init__ to accept a device parameter (e.g., device) or check
torch.cuda.is_available() and choose "cuda" only when available, then move the
model with self.model.to(self.device) (or equivalent) instead of
self.model.cuda(); locate the placement call (self.model.cuda()) and the
constructor (the class __init__) to add the device logic and use
self.model.to(self.device).
Summary by CodeRabbit
New Features
Improvements
✏️ Tip: You can customize this high-level summary in your review settings.