
fix: LPBQ return shape fellow qnn spec #595

Merged
chenghuaWang merged 3 commits into UbiquitousLearning:main from chenghuaWang:wch-main
Jan 9, 2026

Conversation

@chenghuaWang
Collaborator

@chenghuaWang commented Jan 9, 2026

Summary by CodeRabbit

  • New Features

    • Added quantization canonicalization support for improved compiler optimization of quantized models.
  • Bug Fixes

    • Improved quantization scale representation handling for more efficient model compilation.
    • Enhanced weight tensor processing in quantization pipeline.
    • Added error handling for missing cache states during inference.
  • Refactor

    • Optimized tensor shape handling in quantization operations for better performance.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jan 9, 2026

📝 Walkthrough

Walkthrough

This PR introduces shape canonicalization for quantized tensors in QNN AOT compilation. It adds a new LPBQ canonical IR pass for tensor reshaping, modifies Qwen3 model MLP/Attention components with explicit size tracking, adjusts CPU linear operation shape handling, updates quantization bitwidth parameters, and removes unnecessary type conversions in weight handling.

Changes

Qwen3 Model Architecture — examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp
  Added hidden_size_ and intermediate_size_ members to Qwen3MLP and Qwen3Attention, initialized from config; adjusted the forward pass with view() reshaping for outputs; updated linear layer registrations with the cfg.linear_impl_type parameter; added runtime error handling for missing KV caches.

QNN AOT Quantization Pass — mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.hpp, LPBQCanonicalizePass.cpp
  New pass implementation for LPBQ quantization canonicalization; traverses IR graphs with recursive utilities (visitSubGraph, visitCallGraph); reshapes quantized tensor outputs from [B, H, S0, S1] to [B*H, S0, S1] and inserts a ViewOp for remapping; propagates changes to downstream consumers.

Quantization Configuration — mllm/backends/qnn/aot/QnnWrappersAPI.cpp
  Changed blockScaleBitwidth from 12 to 4 bits for the LPBQ per-channel quantization setup.

Operator Implementations — mllm/backends/cpu/ops/LinearOp.cpp, mllm/backends/qnn/aot/visitor/Linear.cpp
  CPU LinearOp::reshape removes the leading dimension when o_shape[0] == 1 for specific LinearImplTypes; the QNN Linear visitor removes the UInt8 weight tensor conversion.
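The [B, H, S0, S1] → [B*H, S0, S1] remapping the pass performs can be sketched as a pure shape computation. This is a minimal illustration only; `lpbqCanonicalShape` is a hypothetical name, not part of the actual mllm API:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper illustrating the LPBQ canonical reshape: a 4-D
// quantized output [B, H, S0, S1] collapses to 3-D [B*H, S0, S1], matching
// the tensor rank the QNN spec expects for LPBQ outputs. Tensors of other
// ranks pass through unchanged.
std::vector<int64_t> lpbqCanonicalShape(const std::vector<int64_t>& shape) {
  if (shape.size() != 4) return shape;  // only 4-D tensors are remapped
  return {shape[0] * shape[1], shape[2], shape[3]};
}
```

A downstream ViewOp, as the pass description notes, then restores the original 4-D view for consumers that still expect it.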

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • oreomaker
  • liang1232018
  • yirongjie

Poem

🐰 Shapes are shuffled, tensors dance,
With views remapped in ViewOp's trance,
LPBQ passes through the graph,
Canonicalized paths on AOT's behalf,
Bits reduced, sizes tracked with care! ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
  • Description check ⚠️ Warning — The pull request has no description provided by the author, despite the repository providing a description template that requires contributions to include clear and complete descriptions. Resolution: add a detailed pull request description explaining the LPBQ shape fix, including what was changed, why it was necessary to align with the QNN spec, and any testing performed.
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 7.69%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
  • Title check ✅ Passed — The title 'fix: LPBQ return shape fellow qnn spec' clearly refers to the primary changes in the PR: adjusting return shapes for LPBQ quantization to align with QNN specifications, which is evident across multiple files (modeling changes, canonicalization pass, and reshape operations).

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings


Owner

@UbiquitousLearning left a comment


LPBQ

Contributor

@coderabbitai (bot) left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.cpp:
- Around line 13-15: The file contains a duplicate include of
mllm/engine/Context.hpp; remove the redundant include so the header is only
included once (leave the single remaining #include "mllm/engine/Context.hpp" and
delete the second occurrence) to eliminate the duplicate include directive in
LPBQCanonicalizePass.cpp.
🧹 Nitpick comments (4)
examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp (1)

398-421: Good defensive error handling for missing KV caches.

The explicit error handling with descriptive messages is appropriate for AOT compilation where KV caches must be provided.

Consider removing the redundant comments (lines 403-405, 416-418) since the throw statements are self-explanatory:

♻️ Optional cleanup
       if (input.count(past_key_name)) {
         kv_caches.push_back(input.at(past_key_name));
       } else {
-        // If KV cache doesn't exist, we need to handle this case
-        // For now, we'll create empty tensors or handle it appropriately
-        // This might need adjustment based on your initialization logic
         throw std::runtime_error("Missing KV cache for layer " + std::to_string(i));
       }
mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.hpp (1)

11-18: Consider explicitly deleting copy/move operations for clarity.

The static analysis flags that the class defines a destructor but not other special member functions. If the base class ir::Pass already handles this (common for polymorphic pass types), this is benign. However, being explicit improves clarity and suppresses warnings.

♻️ Suggested explicit deletion
 class LPBQCanonicalizePass final : public ir::Pass {
  public:
   LPBQCanonicalizePass() = default;
+  LPBQCanonicalizePass(const LPBQCanonicalizePass&) = delete;
+  LPBQCanonicalizePass& operator=(const LPBQCanonicalizePass&) = delete;
+  LPBQCanonicalizePass(LPBQCanonicalizePass&&) = delete;
+  LPBQCanonicalizePass& operator=(LPBQCanonicalizePass&&) = delete;
 
   ~LPBQCanonicalizePass() override = default;
 
   uint8_t run(const ir::node_ptr_t& op) override;
 };
mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.cpp (1)

70-70: Use a descriptive variable name.

cc_help_me is non-descriptive. Consider renaming to consumer_op or similar for clarity.

-            for (auto cc_help_me : consumer_ops) {
-              MLLM_RT_ASSERT(cc_help_me->isa_<ir::Op>());
+            for (auto consumer_op : consumer_ops) {
+              MLLM_RT_ASSERT(consumer_op->isa_<ir::Op>());
mllm/backends/cpu/ops/LinearOp.cpp (1)

361-366: Add explanatory comment for shape canonicalization.

The conditional removal of the leading dimension when o_shape[0] == 1 for LPBQ types implements shape canonicalization per the QNN LPBQ specification and is safe—the input is guaranteed to have at least 2 dimensions from the upstream assertion at line 105. A comment explaining the spec requirement would improve code clarity:

 case aops::LinearImplTypes::kQNN_LPBQ_w4a16o16_G32:
 case aops::LinearImplTypes::kQNN_LPBQ_w4a16o16_G64: {
+  // Canonicalize shape to match QNN LPBQ spec: squeeze leading dimension when batch=1
   if (o_shape[0] == 1) { o_shape.erase(o_shape.begin()); }
   o_dtype = kUInt16PerTensorAsy;
   break;
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 68c7c6c and 42744dc.

📒 Files selected for processing (6)
  • examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp
  • mllm/backends/cpu/ops/LinearOp.cpp
  • mllm/backends/qnn/aot/QnnWrappersAPI.cpp
  • mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.cpp
  • mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.hpp
  • mllm/backends/qnn/aot/visitor/Linear.cpp
💤 Files with no reviewable changes (1)
  • mllm/backends/qnn/aot/visitor/Linear.cpp
🧰 Additional context used
📓 Path-based instructions (4)
{mllm,mllm-cli,pymllm}/**/*

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*: Files must not contain C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, or DEL 0x7F. Horizontal tab (0x09) and line feed (0x0A) are explicitly allowed.
All files must be encoded in UTF-8 without BOM.
Any violation of character set (Rule 1) or encoding (Rule 2) requirements must cause the review to fail.
No line may end with trailing whitespace.
Use Unix line endings (LF).
File and directory names must consist only of printable Unicode characters, excluding C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, and DEL 0x7F.
Only use acceptable file extensions: .c, .cc, .cpp, .cxx, .h, .hh, .hpp, .py, .pyi, .sh, .txt, .md, .yml, .yaml, .json, .toml.
Optional license headers, if present, must comply with character set rules (no C0/C1 control codes except tab and line feed).

Files:

  • mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.hpp
  • mllm/backends/cpu/ops/LinearOp.cpp
  • mllm/backends/qnn/aot/QnnWrappersAPI.cpp
  • mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}: TODO and FIXME comments must be written as 'TODO:' or 'FIXME:' followed by UTF-8 text that adheres to character set rules.
Encourage consistent coding style and patterns with the existing codebase.
Ensure code is portable across supported platforms (e.g., Linux, Windows) unless explicitly platform-specific.

Files:

  • mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.hpp
  • mllm/backends/cpu/ops/LinearOp.cpp
  • mllm/backends/qnn/aot/QnnWrappersAPI.cpp
  • mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}: Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.
Adhere to language-specific best practices and idioms (e.g., PEP 8 for Python, Google C++ Style Guide for C++).

Files:

  • mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.hpp
  • mllm/backends/cpu/ops/LinearOp.cpp
  • mllm/backends/qnn/aot/QnnWrappersAPI.cpp
  • mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}: Prioritize production-ready code quality by evaluating time and space complexity of algorithms and data structures, and suggest more efficient alternatives for operations with high complexity (e.g., O(n^2) or worse) when feasible.
Avoid unnecessary object creation in loops or hot paths.
Check for proper error handling and resource cleanup (e.g., using try-finally, context managers, or RAII).
Ensure functions that can fail return appropriate error codes or raise exceptions.
Validate inputs for public APIs and critical internal functions.
Add comments for complex algorithms or non-obvious logic.
Identify potential security issues (e.g., buffer overflows, injection risks, insecure temporary files) and recommend using secure alternatives (e.g., parameterized queries, secure random generators).
Suggest adding unit tests for untested complex logic or edge cases.
Ensure code is testable by avoiding global state and using dependency injection.
Flag overly complex functions (e.g., high cyclomatic complexity) and suggest breaking them down.
Use named constants instead of magic numbers.
Add appropriate logging (e.g., debug, info, warning, error) for significant events and errors, avoiding sensitive data exposure.

Files:

  • mllm/backends/cpu/ops/LinearOp.cpp
  • mllm/backends/qnn/aot/QnnWrappersAPI.cpp
  • mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.cpp
🧬 Code graph analysis (2)
examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp (4)
mllm/core/aops/ViewOp.hpp (3)
  • inputs (24-24)
  • inputs (26-26)
  • inputs (28-28)
mllm/nn/Module.hpp (14)
  • inputs (174-174)
  • inputs (178-178)
  • inputs (182-182)
  • inputs (184-184)
  • inputs (186-186)
  • inputs (216-220)
  • inputs (216-216)
  • inputs (240-244)
  • inputs (240-240)
  • inputs (263-267)
  • args (148-170)
  • args (148-148)
  • reg (83-115)
  • reg (83-83)
mllm/nn/Functional.hpp (1)
  • sigmoid (163-163)
mllm/nn/Functional.cpp (2)
  • sigmoid (210-213)
  • sigmoid (210-210)
mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.cpp (1)
mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.hpp (1)
  • op (17-17)
🪛 Clang (14.0.6)
mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.hpp

[error] 11-11: class 'LPBQCanonicalizePass' defines a default destructor but does not define a copy constructor, a copy assignment operator, a move constructor or a move assignment operator

(cppcoreguidelines-special-member-functions,-warnings-as-errors)

examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp

[error] 134-134: 2 adjacent parameters of 'forward' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 135-135: variable name 'x' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 144-144: variable name 'o' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)

mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.cpp

[error] 15-15: duplicate include

(readability-duplicate-include,-warnings-as-errors)


[error] 23-23: 2 adjacent parameters of 'visitCallGraph' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 31-31: 2 adjacent parameters of 'visitSubGraph' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-android
  • GitHub Check: build-x86
  • GitHub Check: build-macos
🔇 Additional comments (7)
examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp (4)

120-131: LGTM! Member variables added for shape canonicalization.

The addition of hidden_size_ and intermediate_size_ to cache configuration values for use in the view() operations is appropriate. Initialization from cfg after linear layer registration is correct since these values are only used in forward().


137-145: Shape canonicalization for QNN spec compliance looks correct.

The view() operations ensure tensors maintain the expected 3D shape {batch, seq, feature} after linear projections. The hardcoded batch size of 1 is typical for QNN AOT compilation.

Please verify that the batch size constraint of 1 is documented or enforced at the model entry point, to avoid runtime shape mismatches if this model is invoked with different batch sizes.
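The view() targets discussed here, such as {1, -1, hidden_size}, use a single -1 wildcard dimension. As a rough illustration of how such a target resolves under standard view() semantics (`resolveView` is a hypothetical helper, not the mllm implementation):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Resolve a view target containing at most one -1 entry: the wildcard
// dimension is inferred from the total element count divided by the product
// of the known dimensions. Assumes numel is evenly divisible, as view()
// semantics require.
std::vector<int64_t> resolveView(std::vector<int64_t> target, int64_t numel) {
  int64_t known = 1;
  int infer = -1;
  for (int i = 0; i < static_cast<int>(target.size()); ++i) {
    if (target[i] == -1) {
      infer = i;            // remember which dimension to infer
    } else {
      known *= target[i];
    }
  }
  if (infer >= 0) target[infer] = numel / known;
  return target;
}
```

With batch fixed at 1 as in this model, {1, -1, hidden_size} always recovers {1, seq, hidden_size}, which is why the reviewer asks for the batch-size-1 constraint to be enforced at the model entry point.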


180-183: Linear layer registrations correctly updated with implementation type.

The addition of cfg.linear_impl_type as a parameter to all Linear layer registrations is consistent with the pattern used in Qwen3MLP (lines 126, 128, 129) and the lm_head_ registration (line 384).


266-269: LGTM! Attention output shape canonicalization.

The view({1, -1, hidden_size_}) after o_proj_ ensures the attention output maintains the expected 3D shape, consistent with the MLP pattern. The hidden_size_ value correctly matches the o_proj_ output dimension.

mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.cpp (2)

91-112: LGTM for the pass entry point logic.

The run method correctly validates the ModuleOp input, finds the main CallGraphOp entry point, and delegates to visitCallGraph. The assertion at line 102 enforces a single-entry-point invariant which is appropriate for this pass structure.


23-29: Traversal logic looks correct.

The visitCallGraph function properly resolves the subgraph symbol and delegates to visitSubGraph. The null check assertion provides good safety.

mllm/backends/qnn/aot/QnnWrappersAPI.cpp (1)

174-182: LPBQ blockwise quantization configuration is correct for 4-bit weight quantization.

The blockScaleBitwidth = 4 setting is consistent with the 4-bit weight quantization scheme used throughout the codebase (e.g., kQNN_LPBQ_w4a16o16_G32/G64 types). The combination with QNN_BLOCKWISE_EXPANSION_BITWIDTH_SCALE_STORAGE_8 and blocksScale8 reflects the standard pattern of packing two 4-bit scale values into each 8-bit storage byte, which is explicitly documented in PTQPass.cpp ("using uint8 to store uint4").

Comment on lines +13 to +15
#include "mllm/engine/Context.hpp"
#include "mllm/utils/Common.hpp"
#include "mllm/engine/Context.hpp"
Contributor


⚠️ Potential issue | 🟡 Minor

Remove duplicate include.

mllm/engine/Context.hpp is included twice (lines 13 and 15).

🔧 Proposed fix
 #include "mllm/core/aops/ViewOp.hpp"
 #include "mllm/engine/Context.hpp"
 #include "mllm/utils/Common.hpp"
-#include "mllm/engine/Context.hpp"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-#include "mllm/engine/Context.hpp"
-#include "mllm/utils/Common.hpp"
-#include "mllm/engine/Context.hpp"
+#include "mllm/engine/Context.hpp"
+#include "mllm/utils/Common.hpp"
🧰 Tools
🪛 Clang (14.0.6)

[error] 15-15: duplicate include

(readability-duplicate-include,-warnings-as-errors)

🤖 Prompt for AI Agents
In @mllm/backends/qnn/aot/passes/LPBQCanonicalizePass.cpp around lines 13 - 15,
The file contains a duplicate include of mllm/engine/Context.hpp; remove the
redundant include so the header is only included once (leave the single
remaining #include "mllm/engine/Context.hpp" and delete the second occurrence)
to eliminate the duplicate include directive in LPBQCanonicalizePass.cpp.

Comment on lines +68 to +75
// Find all operators that eats original o, and set them with new_o
auto consumer_ops = o->consumerOps();
for (auto cc_help_me : consumer_ops) {
MLLM_RT_ASSERT(cc_help_me->isa_<ir::Op>());
auto& inputs = cc_help_me->inputs();
auto& outputs = cc_help_me->outputs();
}
}
Contributor


⚠️ Potential issue | 🔴 Critical

Incomplete consumer rewiring logic - pass will not function correctly.

The loop iterates over consumer_ops and obtains their inputs() and outputs(), but never actually replaces references to o with new_o. This means the ViewOp is created but downstream operations will still consume the reshaped tensor o instead of the restored-shape new_o, breaking the intended canonicalization.

🐛 Proposed fix to complete the rewiring
             // Find all operators that eats original o, and set them with new_o
             auto consumer_ops = o->consumerOps();
             for (auto cc_help_me : consumer_ops) {
+              // Skip the newly created ViewOp
+              if (cc_help_me == view_op) continue;
+              
               MLLM_RT_ASSERT(cc_help_me->isa_<ir::Op>());
               auto& inputs = cc_help_me->inputs();
-              auto& outputs = cc_help_me->outputs();
+              for (size_t i = 0; i < inputs.size(); ++i) {
+                if (inputs[i] == o) {
+                  cc_help_me->replaceInput(i, new_o);
+                }
+              }
             }

Note: The exact API for replacing inputs depends on the ir::Op interface. Please verify replaceInput or equivalent method exists and adjust accordingly.

Committable suggestion skipped: line range outside the PR's diff.
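The rewiring this comment asks for can be modeled on toy IR types. The `Value`/`Op` structs below are illustrative stand-ins for the real ir:: interfaces, whose exact replace-input API the reviewer notes is unverified:

```cpp
#include <cassert>
#include <vector>

// Minimal stand-ins for IR values and operations.
struct Value { int id; };

struct Op {
  std::vector<Value*> inputs;
};

// Repoint every operand that reads `from` at `to` across the given consumer
// ops, returning how many operands were rewritten. This is the missing step
// in the loop flagged above: without it the ViewOp result is never consumed.
int replaceAllUses(std::vector<Op*>& consumers, Value* from, Value* to) {
  int replaced = 0;
  for (Op* op : consumers) {
    for (Value*& in : op->inputs) {
      if (in == from) {
        in = to;
        ++replaced;
      }
    }
  }
  return replaced;
}
```

In the real pass, the newly created ViewOp itself must be skipped during this walk, since it legitimately reads the original value.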

@chenghuaWang chenghuaWang merged commit 9400984 into UbiquitousLearning:main Jan 9, 2026
4 checks passed