
fix: LPBQ return shape fellow qnn spec #595

Merged
chenghuaWang merged 3 commits into UbiquitousLearning:main from chenghuaWang:wch-main
Jan 9, 2026

Conversation

@chenghuaWang
Collaborator

@chenghuaWang commented Jan 9, 2026

Summary by CodeRabbit

  • New Features

    • Added quantization canonicalization support for improved compiler optimization of quantized models.
  • Bug Fixes

    • Improved quantization scale representation handling for more efficient model compilation.
    • Enhanced weight tensor processing in quantization pipeline.
    • Added error handling for missing cache states during inference.
  • Refactor

    • Optimized tensor shape handling in quantization operations for better performance.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jan 9, 2026

📝 Walkthrough

Walkthrough

This PR introduces shape canonicalization for quantized tensors in QNN AOT compilation. It adds a new LPBQ canonical IR pass for tensor reshaping, modifies Qwen3 model MLP/Attention components with explicit size tracking, adjusts CPU linear operation shape handling, updates quantization bitwidth parameters, and removes unnecessary type conversions in weight handling.

Changes

Qwen3 Model Architecture — examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp
  Added hidden_size_ and intermediate_size_ members to Qwen3MLP and Qwen3Attention, initialized from config; adjusted the forward pass with view() reshaping for outputs; updated linear layer registrations with the cfg.linear_impl_type parameter; added runtime error handling for missing KV caches.

QNN AOT Quantization Pass — mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.hpp, LPBQCanonicalizePass.cpp
  New pass implementation for LPBQ quantization canonicalization; traverses IR graphs with recursive utilities (visitSubGraph, visitCallGraph); reshapes quantized tensor outputs from [B, H, S0, S1] to [B*H, S0, S1] and inserts a ViewOp for remapping; propagates changes to downstream consumers.

Quantization Configuration — mllm/backends/qnn/aot/QnnWrappersAPI.cpp
  Changed blockScaleBitwidth from 12 to 4 bits for the LPBQ per-channel quantization setup.

Operator Implementations — mllm/backends/cpu/ops/LinearOp.cpp, mllm/backends/qnn/aot/visitor/Linear.cpp
  CPU LinearOp::reshape removes the leading dimension when o_shape[0] == 1 for specific LinearImplTypes; the QNN Linear visitor removes the UInt8 weight tensor conversion.
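The [B, H, S0, S1] → [B*H, S0, S1] remapping the pass performs can be sketched as a pure shape computation. This is a minimal illustration only; `lpbqCanonicalShape` is a hypothetical name, not part of the actual mllm API:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper illustrating the LPBQ canonical reshape: a 4-D
// quantized output [B, H, S0, S1] collapses to 3-D [B*H, S0, S1], matching
// the tensor rank the QNN spec expects for LPBQ outputs. Tensors of other
// ranks pass through unchanged.
std::vector<int64_t> lpbqCanonicalShape(const std::vector<int64_t>& shape) {
  if (shape.size() != 4) return shape;  // only 4-D tensors are remapped
  return {shape[0] * shape[1], shape[2], shape[3]};
}
```

A downstream ViewOp, as the pass description notes, then restores the original 4-D view for consumers that still expect it.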

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • oreomaker
  • liang1232018
  • yirongjie

Poem

🐰 Shapes are shuffled, tensors dance,
With views remapped in ViewOp's trance,
LPBQ passes through the graph,
Canonicalized paths on AOT's behalf,
Bits reduced, sizes tracked with care! ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
  • Description check ⚠️ Warning — The pull request has no description provided by the author, despite the repository providing a description template that requires contributions to include clear and complete descriptions. Resolution: add a detailed pull request description explaining the LPBQ shape fix, including what was changed, why it was necessary to align with the QNN spec, and any testing performed.
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 7.69%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
  • Title check ✅ Passed — The title 'fix: LPBQ return shape fellow qnn spec' clearly refers to the primary changes in the PR: adjusting return shapes for LPBQ quantization to align with QNN specifications, which is evident across multiple files (modeling changes, canonicalization pass, and reshape operations).

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings


Owner

@UbiquitousLearning left a comment


LPBQ

Contributor

@coderabbitai (bot) left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.cpp:
- Around line 13-15: The file contains a duplicate include of
mllm/engine/Context.hpp; remove the redundant include so the header is only
included once (leave the single remaining #include "mllm/engine/Context.hpp" and
delete the second occurrence) to eliminate the duplicate include directive in
LPBQCanonicalizePass.cpp.
🧹 Nitpick comments (4)
examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp (1)

398-421: Good defensive error handling for missing KV caches.

The explicit error handling with descriptive messages is appropriate for AOT compilation where KV caches must be provided.

Consider removing the redundant comments (lines 403-405, 416-418) since the throw statements are self-explanatory:

♻️ Optional cleanup
       if (input.count(past_key_name)) {
         kv_caches.push_back(input.at(past_key_name));
       } else {
-        // If KV cache doesn't exist, we need to handle this case
-        // For now, we'll create empty tensors or handle it appropriately
-        // This might need adjustment based on your initialization logic
         throw std::runtime_error("Missing KV cache for layer " + std::to_string(i));
       }
mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.hpp (1)

11-18: Consider explicitly deleting copy/move operations for clarity.

The static analysis flags that the class defines a destructor but not other special member functions. If the base class ir::Pass already handles this (common for polymorphic pass types), this is benign. However, being explicit improves clarity and suppresses warnings.

♻️ Suggested explicit deletion
 class LPBQCanonicalizePass final : public ir::Pass {
  public:
   LPBQCanonicalizePass() = default;
+  LPBQCanonicalizePass(const LPBQCanonicalizePass&) = delete;
+  LPBQCanonicalizePass& operator=(const LPBQCanonicalizePass&) = delete;
+  LPBQCanonicalizePass(LPBQCanonicalizePass&&) = delete;
+  LPBQCanonicalizePass& operator=(LPBQCanonicalizePass&&) = delete;
 
   ~LPBQCanonicalizePass() override = default;
 
   uint8_t run(const ir::node_ptr_t& op) override;
 };
mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.cpp (1)

70-70: Use a descriptive variable name.

cc_help_me is non-descriptive. Consider renaming to consumer_op or similar for clarity.

-            for (auto cc_help_me : consumer_ops) {
-              MLLM_RT_ASSERT(cc_help_me->isa_<ir::Op>());
+            for (auto consumer_op : consumer_ops) {
+              MLLM_RT_ASSERT(consumer_op->isa_<ir::Op>());
mllm/backends/cpu/ops/LinearOp.cpp (1)

361-366: Add explanatory comment for shape canonicalization.

The conditional removal of the leading dimension when o_shape[0] == 1 for LPBQ types implements shape canonicalization per the QNN LPBQ specification and is safe—the input is guaranteed to have at least 2 dimensions from the upstream assertion at line 105. A comment explaining the spec requirement would improve code clarity:

 case aops::LinearImplTypes::kQNN_LPBQ_w4a16o16_G32:
 case aops::LinearImplTypes::kQNN_LPBQ_w4a16o16_G64: {
+  // Canonicalize shape to match QNN LPBQ spec: squeeze leading dimension when batch=1
   if (o_shape[0] == 1) { o_shape.erase(o_shape.begin()); }
   o_dtype = kUInt16PerTensorAsy;
   break;
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 68c7c6c and 42744dc.

📒 Files selected for processing (6)
  • examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp
  • mllm/backends/cpu/ops/LinearOp.cpp
  • mllm/backends/qnn/aot/QnnWrappersAPI.cpp
  • mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.cpp
  • mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.hpp
  • mllm/backends/qnn/aot/visitor/Linear.cpp
💤 Files with no reviewable changes (1)
  • mllm/backends/qnn/aot/visitor/Linear.cpp
🧰 Additional context used
📓 Path-based instructions (4)
{mllm,mllm-cli,pymllm}/**/*

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*: Files must not contain C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, or DEL 0x7F. Horizontal tab (0x09) and line feed (0x0A) are explicitly allowed.
All files must be encoded in UTF-8 without BOM.
Any violation of character set (Rule 1) or encoding (Rule 2) requirements must cause the review to fail.
No line may end with trailing whitespace.
Use Unix line endings (LF).
File and directory names must consist only of printable Unicode characters, excluding C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, and DEL 0x7F.
Only use acceptable file extensions: .c, .cc, .cpp, .cxx, .h, .hh, .hpp, .py, .pyi, .sh, .txt, .md, .yml, .yaml, .json, .toml.
Optional license headers, if present, must comply with character set rules (no C0/C1 control codes except tab and line feed).

Files:

  • mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.hpp
  • mllm/backends/cpu/ops/LinearOp.cpp
  • mllm/backends/qnn/aot/QnnWrappersAPI.cpp
  • mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}: TODO and FIXME comments must be written as 'TODO:' or 'FIXME:' followed by UTF-8 text that adheres to character set rules.
Encourage consistent coding style and patterns with the existing codebase.
Ensure code is portable across supported platforms (e.g., Linux, Windows) unless explicitly platform-specific.

Files:

  • mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.hpp
  • mllm/backends/cpu/ops/LinearOp.cpp
  • mllm/backends/qnn/aot/QnnWrappersAPI.cpp
  • mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}: Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.
Adhere to language-specific best practices and idioms (e.g., PEP 8 for Python, Google C++ Style Guide for C++).

Files:

  • mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.hpp
  • mllm/backends/cpu/ops/LinearOp.cpp
  • mllm/backends/qnn/aot/QnnWrappersAPI.cpp
  • mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}: Prioritize production-ready code quality by evaluating time and space complexity of algorithms and data structures, and suggest more efficient alternatives for operations with high complexity (e.g., O(n^2) or worse) when feasible.
Avoid unnecessary object creation in loops or hot paths.
Check for proper error handling and resource cleanup (e.g., using try-finally, context managers, or RAII).
Ensure functions that can fail return appropriate error codes or raise exceptions.
Validate inputs for public APIs and critical internal functions.
Add comments for complex algorithms or non-obvious logic.
Identify potential security issues (e.g., buffer overflows, injection risks, insecure temporary files) and recommend using secure alternatives (e.g., parameterized queries, secure random generators).
Suggest adding unit tests for untested complex logic or edge cases.
Ensure code is testable by avoiding global state and using dependency injection.
Flag overly complex functions (e.g., high cyclomatic complexity) and suggest breaking them down.
Use named constants instead of magic numbers.
Add appropriate logging (e.g., debug, info, warning, error) for significant events and errors, avoiding sensitive data exposure.

Files:

  • mllm/backends/cpu/ops/LinearOp.cpp
  • mllm/backends/qnn/aot/QnnWrappersAPI.cpp
  • mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.cpp
🧬 Code graph analysis (2)
examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp (4)
mllm/core/aops/ViewOp.hpp (3)
  • inputs (24-24)
  • inputs (26-26)
  • inputs (28-28)
mllm/nn/Module.hpp (14)
  • inputs (174-174)
  • inputs (178-178)
  • inputs (182-182)
  • inputs (184-184)
  • inputs (186-186)
  • inputs (216-220)
  • inputs (216-216)
  • inputs (240-244)
  • inputs (240-240)
  • inputs (263-267)
  • args (148-170)
  • args (148-148)
  • reg (83-115)
  • reg (83-83)
mllm/nn/Functional.hpp (1)
  • sigmoid (163-163)
mllm/nn/Functional.cpp (2)
  • sigmoid (210-213)
  • sigmoid (210-210)
mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.cpp (1)
mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.hpp (1)
  • op (17-17)
🪛 Clang (14.0.6)
mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.hpp

[error] 11-11: class 'LPBQCanonicalizePass' defines a default destructor but does not define a copy constructor, a copy assignment operator, a move constructor or a move assignment operator

(cppcoreguidelines-special-member-functions,-warnings-as-errors)

examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp

[error] 134-134: 2 adjacent parameters of 'forward' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 135-135: variable name 'x' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 144-144: variable name 'o' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)

mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.cpp

[error] 15-15: duplicate include

(readability-duplicate-include,-warnings-as-errors)


[error] 23-23: 2 adjacent parameters of 'visitCallGraph' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 31-31: 2 adjacent parameters of 'visitSubGraph' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-android
  • GitHub Check: build-x86
  • GitHub Check: build-macos
🔇 Additional comments (7)
examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp (4)

120-131: LGTM! Member variables added for shape canonicalization.

The addition of hidden_size_ and intermediate_size_ to cache configuration values for use in the view() operations is appropriate. Initialization from cfg after linear layer registration is correct since these values are only used in forward().


137-145: Shape canonicalization for QNN spec compliance looks correct.

The view() operations ensure tensors maintain the expected 3D shape {batch, seq, feature} after linear projections. The hardcoded batch size of 1 is typical for QNN AOT compilation.

Please verify that the batch size constraint of 1 is documented or enforced at the model entry point, to avoid runtime shape mismatches if this model is invoked with different batch sizes.
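The view() targets discussed here, such as {1, -1, hidden_size}, use a single -1 wildcard dimension. As a rough illustration of how such a target resolves under standard view() semantics (`resolveView` is a hypothetical helper, not the mllm implementation):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Resolve a view target containing at most one -1 entry: the wildcard
// dimension is inferred from the total element count divided by the product
// of the known dimensions. Assumes numel is evenly divisible, as view()
// semantics require.
std::vector<int64_t> resolveView(std::vector<int64_t> target, int64_t numel) {
  int64_t known = 1;
  int infer = -1;
  for (int i = 0; i < static_cast<int>(target.size()); ++i) {
    if (target[i] == -1) {
      infer = i;            // remember which dimension to infer
    } else {
      known *= target[i];
    }
  }
  if (infer >= 0) target[infer] = numel / known;
  return target;
}
```

With batch fixed at 1 as in this model, {1, -1, hidden_size} always recovers {1, seq, hidden_size}, which is why the reviewer asks for the batch-size-1 constraint to be enforced at the model entry point.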


180-183: Linear layer registrations correctly updated with implementation type.

The addition of cfg.linear_impl_type as a parameter to all Linear layer registrations is consistent with the pattern used in Qwen3MLP (lines 126, 128, 129) and the lm_head_ registration (line 384).


266-269: LGTM! Attention output shape canonicalization.

The view({1, -1, hidden_size_}) after o_proj_ ensures the attention output maintains the expected 3D shape, consistent with the MLP pattern. The hidden_size_ value correctly matches the o_proj_ output dimension.

mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.cpp (2)

91-112: LGTM for the pass entry point logic.

The run method correctly validates the ModuleOp input, finds the main CallGraphOp entry point, and delegates to visitCallGraph. The assertion at line 102 enforces a single-entry-point invariant which is appropriate for this pass structure.


23-29: Traversal logic looks correct.

The visitCallGraph function properly resolves the subgraph symbol and delegates to visitSubGraph. The null check assertion provides good safety.

mllm/backends/qnn/aot/QnnWrappersAPI.cpp (1)

174-182: LPBQ blockwise quantization configuration is correct for 4-bit weight quantization.

The blockScaleBitwidth = 4 setting is consistent with the 4-bit weight quantization scheme used throughout the codebase (e.g., kQNN_LPBQ_w4a16o16_G32/G64 types). The combination with QNN_BLOCKWISE_EXPANSION_BITWIDTH_SCALE_STORAGE_8 and blocksScale8 reflects the standard pattern of packing two 4-bit scale values into each 8-bit storage byte, which is explicitly documented in PTQPass.cpp ("using uint8 to store uint4").

Comment on lines +13 to +15
#include "mllm/engine/Context.hpp"
#include "mllm/utils/Common.hpp"
#include "mllm/engine/Context.hpp"
Contributor


⚠️ Potential issue | 🟡 Minor

Remove duplicate include.

mllm/engine/Context.hpp is included twice (lines 13 and 15).

🔧 Proposed fix
 #include "mllm/core/aops/ViewOp.hpp"
 #include "mllm/engine/Context.hpp"
 #include "mllm/utils/Common.hpp"
-#include "mllm/engine/Context.hpp"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-#include "mllm/engine/Context.hpp"
-#include "mllm/utils/Common.hpp"
-#include "mllm/engine/Context.hpp"
+#include "mllm/engine/Context.hpp"
+#include "mllm/utils/Common.hpp"
🧰 Tools
🪛 Clang (14.0.6)

[error] 15-15: duplicate include

(readability-duplicate-include,-warnings-as-errors)

🤖 Prompt for AI Agents
In @mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.cpp around lines 13 - 15,
The file contains a duplicate include of mllm/engine/Context.hpp; remove the
redundant include so the header is only included once (leave the single
remaining #include "mllm/engine/Context.hpp" and delete the second occurrence)
to eliminate the duplicate include directive in LPBQCanonicalizePass.cpp.

Comment on lines +68 to +75
// Find all operators that eats original o, and set them with new_o
auto consumer_ops = o->consumerOps();
for (auto cc_help_me : consumer_ops) {
MLLM_RT_ASSERT(cc_help_me->isa_<ir::Op>());
auto& inputs = cc_help_me->inputs();
auto& outputs = cc_help_me->outputs();
}
}
Contributor


⚠️ Potential issue | 🔴 Critical

Incomplete consumer rewiring logic - pass will not function correctly.

The loop iterates over consumer_ops and obtains their inputs() and outputs(), but never actually replaces references to o with new_o. This means the ViewOp is created but downstream operations will still consume the reshaped tensor o instead of the restored-shape new_o, breaking the intended canonicalization.

🐛 Proposed fix to complete the rewiring
             // Find all operators that eats original o, and set them with new_o
             auto consumer_ops = o->consumerOps();
             for (auto cc_help_me : consumer_ops) {
+              // Skip the newly created ViewOp
+              if (cc_help_me == view_op) continue;
+              
               MLLM_RT_ASSERT(cc_help_me->isa_<ir::Op>());
               auto& inputs = cc_help_me->inputs();
-              auto& outputs = cc_help_me->outputs();
+              for (size_t i = 0; i < inputs.size(); ++i) {
+                if (inputs[i] == o) {
+                  cc_help_me->replaceInput(i, new_o);
+                }
+              }
             }

Note: The exact API for replacing inputs depends on the ir::Op interface. Please verify replaceInput or equivalent method exists and adjust accordingly.

Committable suggestion skipped: line range outside the PR's diff.
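The rewiring this comment asks for can be modeled on toy IR types. The `Value`/`Op` structs below are illustrative stand-ins for the real ir:: interfaces, whose exact replace-input API the reviewer notes is unverified:

```cpp
#include <cassert>
#include <vector>

// Minimal stand-ins for IR values and operations.
struct Value { int id; };

struct Op {
  std::vector<Value*> inputs;
};

// Repoint every operand that reads `from` at `to` across the given consumer
// ops, returning how many operands were rewritten. This is the missing step
// in the loop flagged above: without it the ViewOp result is never consumed.
int replaceAllUses(std::vector<Op*>& consumers, Value* from, Value* to) {
  int replaced = 0;
  for (Op* op : consumers) {
    for (Value*& in : op->inputs) {
      if (in == from) {
        in = to;
        ++replaced;
      }
    }
  }
  return replaced;
}
```

In the real pass, the newly created ViewOp itself must be skipped during this walk, since it legitimately reads the original value.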

@chenghuaWang chenghuaWang merged commit 9400984 into UbiquitousLearning:main Jan 9, 2026
4 checks passed