feat(qnn): Enhance QNNBackend initialization with improved logging and error handling; update default log level to verbose. Add QEmbedding class for quantized embedding operations in PyTorch. Introduce build tasks for Android and x86 QNN AOT SDKs. #609

chenghuaWang merged 3 commits into UbiquitousLearning:main
Conversation
- …d error handling; update default log level to verbose. Add QEmbedding class for quantized embedding operations in PyTorch. Introduce build tasks for Android and x86 QNN AOT SDKs.
- …es; ensure position-independent code for flatbuffers. Enhance context creation with existing context checks and improve weight quantization specifications.
📝 Walkthrough

Enhanced QNN backend logging and context caching, compiler-aware build guards and PIC for flatbuffers, embedding quantization unified to uint16 per-tensor asymmetric, a new PyTorch QEmbedding with a quant/deploy lifecycle, a new Android ARM64 QNN build pipeline, and minor formatting tweaks.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Python as Python (pymllm / QEmbedding)
    participant Runner as Runner / PTQ
    participant AOT as QNN AOT (QnnWrappersAPI)
    participant FS as File System
    participant QNN as QNN Runtime (backend/device/system)
    Python->>Runner: instantiate model with QEmbedding
    Runner->>Runner: freeze / convert_to_deploy (QEmbedding.freeze -> observer -> scale/zp)
    Runner->>AOT: request context load/create (context_path?)
    AOT->>FS: check context file exists
    alt context exists
        FS-->>AOT: file present
        AOT->>QNN: load context from binary
        QNN-->>AOT: success / runtime created
    else no context
        AOT->>QNN: create context from graph/binary
        QNN-->>AOT: success / runtime created
    end
    QNN->>QNN: probe providers / create device
    QNN-->>Runner: runtime + device ready
    Runner-->>Python: model ready for inference
```
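The load-or-create branch in the diagram can be sketched as follows. This is a minimal illustration of the cached-context flow, not the actual `QnnWrappersAPI` implementation; `load_fn` and `create_fn` are hypothetical stand-ins for the QNN context API calls.

```python
import os

def get_or_create_context(contexts, name, context_path, create_fn, load_fn):
    """Cached load-or-create flow (sketch): reuse a cached context,
    else load it from an existing binary, else create it from the graph."""
    if name in contexts:
        return contexts[name]  # existing context: skip creation entirely
    if context_path and os.path.exists(context_path):
        ctx = load_fn(context_path)  # load context from binary
    else:
        ctx = create_fn(name)  # create context from graph/binary
    contexts[name] = ctx
    return ctx
```

Caching by name is what makes the later `contexts_.count(name)` assertion in the C++ code redundant: the early return guarantees the key is absent at insertion time.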
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~55 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (2 warnings)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp (1)
989-1022: Add explicit dtype update for embedding output to align with weight quantization spec.

The output tensor `o_0` receives the weight's `quant_recipe` attribute (which specifies `kUInt16PerTensorAsy`), but its actual tensor dtype is never explicitly updated. This creates a mismatch: the quantization recipe specifies `kUInt16PerTensorAsy`, but `o_0->tensor_.dtype()` may not reflect this. Add `o_0->cast_<ir::tensor::TensorValue>()->tensor_ = o_0->cast_<ir::tensor::TensorValue>()->tensor_.__unsafeSetDType(kUInt16PerTensorAsy);` before setting the `quant_recipe` attribute to ensure consistency.
🤖 Fix all issues with AI agents
In `@mllm/backends/qnn/aot/passes/PTQPass.cpp`:
- Around line 114-128: The per-tensor branch must enforce that scale and
zero_point are single-element scalars and that zero_point is within [quant_min,
quant_max]; update the QuantizationSpecAsymPerTensor handling (use symbols
this_spec, scale, zero_point, this_spec->quant_min/quant_max,
weight_spec->solved) to: validate scale and zero_point have exactly one element
(e.g., rank==0 or rank==1 with size==1) instead of allowing arbitrary length,
extract the scalar values (scale_val = scale.item<float>(), zp_val =
zero_point.item<int32_t>()), assert scale_val > 0 and dtype checks, clamp zp_val
to the range [this_spec->quant_min, this_spec->quant_max] before assigning
this_spec->zero_point, and then set this_spec->scale and mark
weight_spec->solved = true; keep existing checkTypeLimits usage for the weight
tensor.
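The validation rules listed in the prompt above can be sketched in Python. This is illustrative pseudologic for the C++ `QuantizationSpecAsymPerTensor` handling; the function name and signature are assumptions, not the actual PTQPass API.

```python
def solve_per_tensor_asym(scale, zero_point, quant_min, quant_max):
    """Validate a per-tensor asymmetric quant spec (sketch of the review's rules):
    scale and zero_point must be single-element, scale positive,
    zero_point clamped into [quant_min, quant_max]."""
    scale = list(scale) if isinstance(scale, (list, tuple)) else [scale]
    zero_point = list(zero_point) if isinstance(zero_point, (list, tuple)) else [zero_point]
    # Per-tensor means exactly one element, not an arbitrary-length vector.
    if len(scale) != 1 or len(zero_point) != 1:
        raise ValueError("per-tensor spec requires scalar scale and zero_point")
    scale_val = float(scale[0])
    if scale_val <= 0.0:
        raise ValueError("scale must be positive")
    # Clamp zero_point into the representable range before assigning it.
    zp_val = max(quant_min, min(int(zero_point[0]), quant_max))
    return scale_val, zp_val  # caller then marks weight_spec->solved = true
```

The clamp matters for uint16 asymmetric specs: an observer can legitimately produce a zero point just outside [0, 65535] after rounding, and assigning it unclamped would corrupt the spec.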
In `@mllm/core/aops/EmbeddingOp.cpp`:
- Around line 73-76: In reshape(), add a guard to ensure weight_ is initialized
before calling weight_.dtype(): check if (!weight_) then log
MLLM_ERROR("EmbeddingOp::reshape: weight not loaded") and return; after that
safely compute out_dtype (preserving the kUInt16 → kUInt16PerTensorAsy mapping)
and proceed to create the output tensor (outputs.emplace_back(...)). This
mirrors the existing guard used in trace() and prevents dereferencing an
uninitialized weight_ when reshape() is invoked before load().
In `@pymllm/backends/qualcomm/transformers/core/embedding.py`:
- Around line 99-106: Replace the print() call in convert_to_deploy() with a
logging call using the module-level logger (use logging.getLogger(__name__)) and
an appropriate level (info/debug) to record class name, instance name, weight
dtype and zero_point; additionally, modify disable_quant() to guard against
missing attributes by checking for hasattr(self, "weight_fake_quant") before
accessing or deleting it (and handle the case where convert_to_deploy() already
removed it), ensuring calls to disable_quant() after deployment conversion do
not raise AttributeError.
🧹 Nitpick comments (5)
pymllm/backends/qualcomm/transformers/core/embedding.py (3)
6-38: Add input validation and class docstring.

Per coding guidelines, public APIs should validate inputs and have clear docstrings. Consider adding:

- A class-level docstring explaining purpose, parameters, and usage.
- Validation for `num_embeddings` and `embedding_dim` (must be positive integers).

Suggested improvements

```diff
 class QEmbedding(nn.Module):
+    """
+    Quantized embedding layer with per-tensor affine quantization.
+
+    Args:
+        num_embeddings: Size of the embedding dictionary.
+        embedding_dim: Size of each embedding vector.
+        padding_idx: If specified, entries at this index are zeroed.
+        quant_bits: Number of bits for quantization (default: 16).
+    """
     def __init__(
         self,
         num_embeddings,
         embedding_dim,
         padding_idx=None,
         quant_bits=16,
     ):
         super().__init__()
+        if num_embeddings <= 0:
+            raise ValueError(f"num_embeddings must be positive, got {num_embeddings}")
+        if embedding_dim <= 0:
+            raise ValueError(f"embedding_dim must be positive, got {embedding_dim}")
+        if not (1 <= quant_bits <= 32):
+            raise ValueError(f"quant_bits must be in [1, 32], got {quant_bits}")
         self.num_embeddings = num_embeddings
```
119-123: Replace `print()` with proper logging.

Consistent with the previous comment, use `logging` instead of `print()` for calibration status messages.

Suggested fix

```diff
-        class_name = self.__class__.__name__
-        instance_class_name = type(self).__name__
-        print(
-            f"Class: {class_name}, Instance: {instance_class_name}, Weight Quantized: scale={self.weight_fake_quant.scale}, zp={self.weight_fake_quant.zero_point}"
-        )
+        logger.info(
+            f"{self.__class__.__name__}: Weight quantized, "
+            f"scale={self.weight_fake_quant.scale}, zp={self.weight_fake_quant.zero_point}"
+        )
```
129-133: Consider including `quant_bits` in the representation.

For debugging and inspection, including the quantization bit-width would be helpful.

Suggested enhancement

```diff
     def extra_repr(self):
         s = f"{self.num_embeddings}, {self.embedding_dim}"
         if self.padding_idx is not None:
             s += f", padding_idx={self.padding_idx}"
+        s += f", quant_bits={self.quant_bits}"
         return s
```

mllm/backends/qnn/aot/QnnWrappersAPI.cpp (1)
439-443: Context caching logic looks correct; consider removing redundant assertion.

The early return for existing contexts is a good addition for preventing duplicate context creation. However, the assertion at line 496 (`MLLM_RT_ASSERT_EQ(contexts_.count(name), 0)`) is now redundant: it can never fail because the function returns early when the context exists. Consider removing it to avoid confusion.

♻️ Suggested cleanup

```diff
   MLLM_RT_ASSERT_EQ(QNN_CONTEXT_NO_ERROR, status);
 }

 // 5. Register MLLM's Qnn Opset
 // clang-format off
 { // FIXME(wch): we need to register our own opset of qnn.
 }
 // clang-format on

-MLLM_RT_ASSERT_EQ(contexts_.count(name), 0);
 contexts_[name] = context;
 return context;
```

mllm/backends/qnn/QNNBackend.cpp (1)
287-287: Remove dead code or clarify intention for OpPackage registration.

The empty `opPackages` vector at line 287 creates unreachable code in the registration loop (lines 290-297). While custom QNN operations (DequantizeAdd, SiLU, RMSNorm, etc.) function correctly via the plugin system's `loadOpPackage()` mechanism, this registration block is dead code. Either remove it if the QNN backend does not support direct OpPackage registration, or populate the vector if this feature should be enabled.
```cpp
// Output dtype should match weight dtype (e.g., uint16 for AsymPerTensor quantization)
auto out_dtype = weight_.dtype();
if (weight_.dtype() == kUInt16) { out_dtype = kUInt16PerTensorAsy; }
outputs.emplace_back(Tensor::empty(o_shape, out_dtype, i.device()));
```
🧩 Analysis chain

Verification summary: the reviewer ran a series of repository searches (locating the `EmbeddingOp` sources, comparing the unguarded `reshape()` against the guarded `trace()`, tracing `Tensor::dtype()` through `TensorViewImpl`, and inspecting the op lifecycle in `BaseOp`, `Layer`, and the task executor) to confirm that `reshape()` can dereference an uninitialized `weight_` when invoked before `load()`.
Add guard against uninitialized weight_ in reshape().
The `reshape()` method calls `weight_.dtype()` without checking whether `weight_` has been initialized, but `trace()` explicitly guards with `if (weight_ && ...)` before using it. If `reshape()` is called before `load()`, dereferencing a nil tensor will crash. Add the same guard:

```cpp
if (!weight_) {
  MLLM_ERROR("EmbeddingOp::reshape: weight not loaded");
  return;
}
auto out_dtype = weight_.dtype();
```

The special-case mapping `kUInt16` → `kUInt16PerTensorAsy` for quantized embeddings is intentional and correct.
🤖 Prompt for AI Agents
In `@mllm/core/aops/EmbeddingOp.cpp` around lines 73 - 76, In reshape(), add a
guard to ensure weight_ is initialized before calling weight_.dtype(): check if
(!weight_) then log MLLM_ERROR("EmbeddingOp::reshape: weight not loaded") and
return; after that safely compute out_dtype (preserving the kUInt16 →
kUInt16PerTensorAsy mapping) and proceed to create the output tensor
(outputs.emplace_back(...)). This mirrors the existing guard used in trace() and
prevents dereferencing an uninitialized weight_ when reshape() is invoked before
load().
```python
if hasattr(self, "weight_fake_quant"):
    del self.weight_fake_quant

class_name = self.__class__.__name__
instance_class_name = type(self).__name__
print(
    f"Class: {class_name}, Instance: {instance_class_name}, Deploy Mode Activated. 'weight' is now {self.weight.dtype} buffer. zp is {zero_point}"
)
```
🛠️ Refactor suggestion | 🟠 Major
Replace `print()` with proper logging; `disable_quant()` will fail after deployment conversion.

- Per coding guidelines, use the `logging` module instead of `print()` for significant events.
- After `convert_to_deploy()` deletes `weight_fake_quant`, calling `disable_quant()` will raise an `AttributeError`.
Suggested fix
+import logging
+
+logger = logging.getLogger(__name__)
+
# ... in convert_to_deploy:
# D. Clean up unnecessary modules
if hasattr(self, "weight_fake_quant"):
del self.weight_fake_quant
- class_name = self.__class__.__name__
- instance_class_name = type(self).__name__
- print(
- f"Class: {class_name}, Instance: {instance_class_name}, Deploy Mode Activated. 'weight' is now {self.weight.dtype} buffer. zp is {zero_point}"
- )
+ logger.info(
+ f"{self.__class__.__name__}: Deploy mode activated. "
+ f"'weight' is now {self.weight.dtype} buffer, zp={zero_point}"
+ )

Also consider adding a guard in `disable_quant()`:
def disable_quant(self):
"""Completely turn off quantization noise and return to floating point mode"""
+ if not hasattr(self, "weight_fake_quant"):
+ logger.warning("Cannot disable quant: already converted to deploy mode")
+ return
self.weight_fake_quant.disable_fakequant()🤖 Prompt for AI Agents
In `@pymllm/backends/qualcomm/transformers/core/embedding.py` around lines 99 -
106, Replace the print() call in convert_to_deploy() with a logging call using
the module-level logger (use logging.getLogger(__name__)) and an appropriate
level (info/debug) to record class name, instance name, weight dtype and
zero_point; additionally, modify disable_quant() to guard against missing
attributes by checking for hasattr(self, "weight_fake_quant") before accessing
or deleting it (and handle the case where convert_to_deploy() already removed
it), ensuring calls to disable_quant() after deployment conversion do not raise
AttributeError.
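The lifecycle the prompt describes can be shown with a minimal runnable sketch. The class and attribute names mirror the review discussion; `FakeQuant` is a hypothetical stand-in for the real fake-quant observer module, not mllm's implementation.

```python
import logging

logger = logging.getLogger(__name__)

class FakeQuant:
    """Stand-in for the fake-quant observer (assumption, not the real module)."""
    def __init__(self):
        self.enabled = True

    def disable_fakequant(self):
        self.enabled = False

class QEmbeddingSketch:
    """Minimal sketch of the guarded quant/deploy lifecycle."""
    def __init__(self):
        self.weight_fake_quant = FakeQuant()

    def convert_to_deploy(self):
        # Clean up the observer; log instead of print().
        if hasattr(self, "weight_fake_quant"):
            del self.weight_fake_quant
        logger.info("%s: deploy mode activated", type(self).__name__)

    def disable_quant(self):
        # Guard: after convert_to_deploy() the observer no longer exists.
        if not hasattr(self, "weight_fake_quant"):
            logger.warning("Cannot disable quant: already converted to deploy mode")
            return
        self.weight_fake_quant.disable_fakequant()
```

With the guard in place, calling `disable_quant()` after `convert_to_deploy()` is a logged no-op rather than an `AttributeError`.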
… input layer normalization handling in Qwen3DecoderLayer. Update weight conversion logic in training script to address model compatibility issues.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pymllm/backends/qualcomm/transformers/qwen3/train.py (1)
39-58: Fail fast for unsupported 1.7B before loading the model.

The guard runs after `Qwen3Quantizer` construction, which loads the model and can waste GPU memory or OOM. Move the check right after `args` parsing to fail early.

🔧 Proposed fix

```diff
     args = parser.parse_args()

-    m = Qwen3Quantizer(args.model_path, mllm_qualcomm_max_length=args.max_length)
-
-    # FIXME: Should disable or not.
-    m.disable_fake_quant()
-    m.calibrate(num_samples=args.num_samples, max_seq_length=args.max_length)
-    m.enable_fake_quant()
-    m.infer(args.infer_text)
-
-    # !!!
-    # Things below is for deploy. We will turn all fp32 weights and some buffers(rope) to quantized dtype.
-    # !!!
-    # This line maybe error. we need use quantized weight!!! not embed_tokens.weight!!!
-    # m.model.lm_head.weight = torch.nn.Parameter(
-    #     m.model.model.embed_tokens.weight.clone()
-    # )
     if "1.7B" in args.model_path:
         raise ValueError(
             "1.7B model is not supported for now due to tied embedding weights is not supported."
         )
+
+    m = Qwen3Quantizer(args.model_path, mllm_qualcomm_max_length=args.max_length)
+
+    # FIXME: Should disable or not.
+    m.disable_fake_quant()
+    m.calibrate(num_samples=args.num_samples, max_seq_length=args.max_length)
+    m.enable_fake_quant()
+    m.infer(args.infer_text)
```
🧹 Nitpick comments (1)
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py (1)
397-399: Add a comment explaining why layer 0 skips `input_layernorm_input_qdq` initialization.

The code safely skips creating and using `input_layernorm_input_qdq` for layer 0 only, with matching conditions in both initialization (line 398) and forward (line 418). No other code in the codebase accesses this attribute. However, adding a brief comment explaining the rationale for this special case would improve code clarity for future maintainers.
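The matching-condition pattern the comment asks to document can be sketched as follows. The layer structure here is purely illustrative, not the actual `Qwen3DecoderLayer`; only the `input_layernorm_input_qdq` attribute name comes from the review.

```python
class DecoderLayerSketch:
    def __init__(self, layer_idx):
        self.layer_idx = layer_idx
        # Layer 0 intentionally skips the input-layernorm input QDQ stub;
        # the same condition must appear in __init__ and forward, or forward
        # would touch an attribute that was never created.
        if layer_idx != 0:
            self.input_layernorm_input_qdq = lambda x: x  # identity stand-in

    def forward(self, x):
        if self.layer_idx != 0:  # mirrors the init-time condition exactly
            x = self.input_layernorm_input_qdq(x)
        return x
```

Keeping the two conditions literally identical (rather than, say, `hasattr` in forward) makes the invariant explicit and easy to grep for.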
Summary by CodeRabbit
New Features
Bug Fixes & Improvements
Chores
✏️ Tip: You can customize this high-level summary in your review settings.