
feat(qnn): Enhance QNNBackend initialization with improved logging and error handling; update default log level to verbose. Add QEmbedding class for quantized embedding operations in PyTorch. Introduce build tasks for Android and x86 QNN AOT SDKs. (#609)

Merged

chenghuaWang merged 3 commits into UbiquitousLearning:main from chenghuaWang:wch-main on Jan 23, 2026

Conversation

@chenghuaWang (Collaborator) commented on Jan 23, 2026

Summary by CodeRabbit

  • New Features

    • Android ARM64 QNN build pipeline
    • Quantized embedding module with deploy/freeze utilities
  • Bug Fixes & Improvements

    • Compiler/version-aware build options for broader compatibility
    • Stronger validation and reuse checks for quantization and runtime contexts
    • Increased default backend log verbosity and richer runtime diagnostics
    • Unified embedding weight/output quantization and output dtype alignment
  • Chores

    • Updated build configurations and .gitignore entries


…d error handling; update default log level to verbose. Add QEmbedding class for quantized embedding operations in PyTorch. Introduce build tasks for Android and x86 QNN AOT SDKs.
…es; ensure position-independent code for flatbuffers. Enhance context creation with existing context checks and improve weight quantization specifications.
@coderabbitai Bot commented on Jan 23, 2026

📝 Walkthrough

Enhanced QNN backend logging and context caching; compiler-aware build guards and PIC for flatbuffers; embedding quantization unified to uint16 per-tensor asymmetric; a new PyTorch QEmbedding with a quant/deploy lifecycle; a new Android ARM64 QNN build pipeline; and minor formatting tweaks.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Build & Tooling**<br>`.gitignore`, `CMakeLists.txt`, `mllm/CMakeLists.txt`, `tasks/*build_sdk*_*.yaml` | Added `mllm-install-*/` to the ignore list; enabled `POSITION_INDEPENDENT_CODE` for flatbuffers; made `-Wno-comma-subscript` compiler-aware (unconditional on Clang, guarded for GCC >= 10); added an Android arm64 QNN build pipeline and changed the x86 QNN build type to ReleaseWithDebInfo. |
| **QNN Backend Core**<br>`mllm/backends/qnn/QNNBackend.cpp`, `mllm/backends/qnn/QNNBackend.hpp`, `mllm/backends/qnn/Register.cpp` | Raised the default QNN log level to VERBOSE; added extensive success/error logging across symbol loading, runtime/device creation, provider discovery, profiling hooks, and the context lifecycle; added a memory-manager registration log. |
| **QNN AOT / Quantization Pipeline**<br>`mllm/backends/qnn/aot/QnnWrappersAPI.cpp`, `mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp`, `mllm/backends/qnn/aot/passes/PTQPass.cpp` | Added runtime guards/validation for quant params (scale/zero_point); deduplicated pre-existing contexts before creation; unified embedding weights to uint16 per-tensor asymmetric quantization and propagated the weight spec to outputs; added AsymPerTensor handling and logging in PTQ flows. |
| **Embedding Layers & Ops**<br>`mllm/core/aops/EmbeddingOp.cpp`, `pymllm/backends/qualcomm/transformers/core/embedding.py`, `pymllm/backends/qualcomm/transformers/qwen3/*` | C++ Embedding output dtype now derives from the weight dtype (uint16 → kUInt16PerTensorAsy); introduced the new PyTorch QEmbedding (quant lifecycle: freeze, convert_to_deploy, disable — see the sketch after this table); wired QEmbedding into the Qwen3 model and runner weight conversion. |
| **Minor / Formatting**<br>`mllm/backends/qnn/aot_rt/PromptProcessor.cpp` | Removed an extra leading blank line (formatting only). |
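
For orientation, here is a minimal, self-contained sketch of the freeze → convert_to_deploy lifecycle and the uint16 per-tensor asymmetric scheme summarized above. Only the method names and the quantization scheme come from this PR; the class body, parameter names, and the int32 stand-in for uint16 storage are illustrative assumptions, not the repository's actual QEmbedding implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyQEmbedding(nn.Module):
    """Illustrative per-tensor asymmetric fake-quant embedding (not the PR's QEmbedding)."""

    def __init__(self, num_embeddings: int, embedding_dim: int, quant_bits: int = 16):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_embeddings, embedding_dim))
        self.qmin, self.qmax = 0, (1 << quant_bits) - 1  # 0..65535 for uint16
        self.scale, self.zero_point = None, None
        self.deployed = False

    def freeze(self):
        # Derive a single (per-tensor) scale/zero_point pair from the weight range.
        w_min, w_max = self.weight.min().item(), self.weight.max().item()
        self.scale = max(w_max - w_min, 1e-8) / (self.qmax - self.qmin)
        self.zero_point = int(round(self.qmin - w_min / self.scale))

    def convert_to_deploy(self):
        # Replace the float parameter with an integer buffer for export.
        q = torch.clamp(
            torch.round(self.weight.detach() / self.scale) + self.zero_point,
            self.qmin, self.qmax,
        ).to(torch.int32)  # torch lacks general uint16 storage; int32 stands in
        del self.weight                     # removes the nn.Parameter
        self.register_buffer("weight", q)   # same name, now an integer buffer
        self.deployed = True

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        w = self.weight
        if self.deployed:
            # Dequantize on the fly so the float path can be sanity-checked.
            w = (w.float() - self.zero_point) * self.scale
        return F.embedding(ids, w)

emb = TinyQEmbedding(100, 8)
emb.freeze()
emb.convert_to_deploy()
print(emb(torch.tensor([1, 2, 3])).shape)  # torch.Size([3, 8])
```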

Sequence Diagram(s)

sequenceDiagram
    participant Python as Python (pymllm / QEmbedding)
    participant Runner as Runner / PTQ
    participant AOT as QNN AOT (QnnWrappersAPI)
    participant FS as File System
    participant QNN as QNN Runtime (backend/device/system)

    Python->>Runner: instantiate model with QEmbedding
    Runner->>Runner: freeze / convert_to_deploy (QEmbedding.freeze -> observer -> scale/zp)
    Runner->>AOT: request context load/create (context_path?)
    AOT->>FS: check context file exists
    alt context exists
        FS-->>AOT: file present
        AOT->>QNN: load context from binary
        QNN-->>AOT: success / runtime created
    else no context
        AOT->>QNN: create context from graph/binary
        QNN-->>AOT: success / runtime created
    end
    QNN->>QNN: probe providers / create device
    QNN-->>Runner: runtime + device ready
    Runner-->>Python: model ready for inference
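The load-or-create branch in the diagram above (and the context deduplication discussed in the review comments below) amounts to a keyed cache with a filesystem fallback. A schematic rendering follows; every name in it (ContextCache, get_or_create, the helper methods) is an illustrative assumption, not the QnnWrappersAPI implementation:

```python
import os

class ContextCache:
    """Schematic load-or-create cache; names are illustrative, not QnnWrappersAPI."""

    def __init__(self):
        self._contexts = {}

    def get_or_create(self, name: str, context_path: str | None = None):
        # Reuse: return early if this context was already materialized.
        if name in self._contexts:
            return self._contexts[name]
        if context_path and os.path.exists(context_path):
            ctx = self._load_from_binary(context_path)  # cached AOT blob on disk
        else:
            ctx = self._create_from_graph(name)         # fresh compile
        self._contexts[name] = ctx
        return ctx

    def _load_from_binary(self, path: str):
        return f"context<{os.path.basename(path)}>"

    def _create_from_graph(self, name: str):
        return f"context<{name}>"

cache = ContextCache()
a = cache.get_or_create("decoder")
b = cache.get_or_create("decoder")  # hits the cache, no duplicate creation
assert a is b
```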

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~55 minutes


Suggested reviewers

  • liang1232018
  • oreomaker
  • yirongjie

Poem

🐰
I hopped through logs and quantized beds of green,
Turned weights to sixteen bits where they’d once been seen,
I cached the context, fixed build flags with cheer,
Now QNN hums loud and embeddings draw near —
Tiny rabbit, big changes, one carrot to cheer.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | No pull request description was provided; the template requires a meaningful description, but the author left it empty. | Add a detailed description following the repository template, explaining the motivation, implementation details, and testing approach for the changes. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 16.00%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title comprehensively summarizes the three main changes: QNNBackend logging/error-handling enhancements, the QEmbedding class addition, and new build tasks. |



@coderabbitai Bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp (1)

989-1022: Add explicit dtype update for embedding output to align with weight quantization spec.

The output tensor o_0 receives the weight's quant_recipe attribute (which specifies kUInt16PerTensorAsy), but its actual tensor dtype is never explicitly updated. This creates a mismatch: the quantization recipe specifies kUInt16PerTensorAsy, but o_0->tensor_.dtype() may not reflect this. Add o_0->cast_<ir::tensor::TensorValue>()->tensor_ = o_0->cast_<ir::tensor::TensorValue>()->tensor_.__unsafeSetDType(kUInt16PerTensorAsy); before setting the quant_recipe attribute to ensure consistency.

🤖 Fix all issues with AI agents
In `@mllm/backends/qnn/aot/passes/PTQPass.cpp`:
- Around line 114-128: The per-tensor branch must enforce that scale and
zero_point are single-element scalars and that zero_point is within [quant_min,
quant_max]; update the QuantizationSpecAsymPerTensor handling (use symbols
this_spec, scale, zero_point, this_spec->quant_min/quant_max,
weight_spec->solved) to: validate scale and zero_point have exactly one element
(e.g., rank==0 or rank==1 with size==1) instead of allowing arbitrary length,
extract the scalar values (scale_val = scale.item<float>(), zp_val =
zero_point.item<int32_t>()), assert scale_val > 0 and dtype checks, clamp zp_val
to the range [this_spec->quant_min, this_spec->quant_max] before assigning
this_spec->zero_point, and then set this_spec->scale and mark
weight_spec->solved = true; keep existing checkTypeLimits usage for the weight
tensor.

In `@mllm/core/aops/EmbeddingOp.cpp`:
- Around line 73-76: In reshape(), add a guard to ensure weight_ is initialized
before calling weight_.dtype(): check if (!weight_) then log
MLLM_ERROR("EmbeddingOp::reshape: weight not loaded") and return; after that
safely compute out_dtype (preserving the kUInt16 → kUInt16PerTensorAsy mapping)
and proceed to create the output tensor (outputs.emplace_back(...)). This
mirrors the existing guard used in trace() and prevents dereferencing an
uninitialized weight_ when reshape() is invoked before load().

In `@pymllm/backends/qualcomm/transformers/core/embedding.py`:
- Around line 99-106: Replace the print() call in convert_to_deploy() with a
logging call using the module-level logger (use logging.getLogger(__name__)) and
an appropriate level (info/debug) to record class name, instance name, weight
dtype and zero_point; additionally, modify disable_quant() to guard against
missing attributes by checking for hasattr(self, "weight_fake_quant") before
accessing or deleting it (and handle the case where convert_to_deploy() already
removed it), ensuring calls to disable_quant() after deployment conversion do
not raise AttributeError.
🧹 Nitpick comments (5)
pymllm/backends/qualcomm/transformers/core/embedding.py (3)

6-38: Add input validation and class docstring.

Per coding guidelines, public APIs should validate inputs and have clear docstrings. Consider adding:

  1. A class-level docstring explaining purpose, parameters, and usage.
  2. Validation for num_embeddings and embedding_dim (must be positive integers).
Suggested improvements
 class QEmbedding(nn.Module):
+    """
+    Quantized embedding layer with per-tensor affine quantization.
+
+    Args:
+        num_embeddings: Size of the embedding dictionary.
+        embedding_dim: Size of each embedding vector.
+        padding_idx: If specified, entries at this index are zeroed.
+        quant_bits: Number of bits for quantization (default: 16).
+    """
     def __init__(
         self,
         num_embeddings,
         embedding_dim,
         padding_idx=None,
         quant_bits=16,
     ):
         super().__init__()
+        if num_embeddings <= 0:
+            raise ValueError(f"num_embeddings must be positive, got {num_embeddings}")
+        if embedding_dim <= 0:
+            raise ValueError(f"embedding_dim must be positive, got {embedding_dim}")
+        if not (1 <= quant_bits <= 32):
+            raise ValueError(f"quant_bits must be in [1, 32], got {quant_bits}")
         self.num_embeddings = num_embeddings

119-123: Replace print() with proper logging.

Consistent with the previous comment, use logging instead of print() for calibration status messages.

Suggested fix
-        class_name = self.__class__.__name__
-        instance_class_name = type(self).__name__
-        print(
-            f"Class: {class_name}, Instance: {instance_class_name}, Weight Quantized: scale={self.weight_fake_quant.scale}, zp={self.weight_fake_quant.zero_point}"
-        )
+        logger.info(
+            f"{self.__class__.__name__}: Weight quantized, "
+            f"scale={self.weight_fake_quant.scale}, zp={self.weight_fake_quant.zero_point}"
+        )

129-133: Consider including quant_bits in the representation.

For debugging and inspection, including the quantization bit-width would be helpful.

Suggested enhancement
     def extra_repr(self):
         s = f"{self.num_embeddings}, {self.embedding_dim}"
         if self.padding_idx is not None:
             s += f", padding_idx={self.padding_idx}"
+        s += f", quant_bits={self.quant_bits}"
         return s
mllm/backends/qnn/aot/QnnWrappersAPI.cpp (1)

439-443: Context caching logic looks correct; consider removing redundant assertion.

The early return for existing contexts is a good addition for preventing duplicate context creation. However, the assertion at line 496 (MLLM_RT_ASSERT_EQ(contexts_.count(name), 0)) is now redundant—it will never fail because we return early if the context exists. Consider removing it to avoid confusion.

♻️ Suggested cleanup
   MLLM_RT_ASSERT_EQ(QNN_CONTEXT_NO_ERROR, status);
   }

   // 5. Register MLLM's Qnn Opset
   // clang-format off
   {
     // FIXME(wch): we need to register our own opset of qnn.
   }
   // clang-format on

-  MLLM_RT_ASSERT_EQ(contexts_.count(name), 0);
   contexts_[name] = context;
   return context;
mllm/backends/qnn/QNNBackend.cpp (1)

287-287: Remove dead code or clarify intention for OpPackage registration.

The empty opPackages vector at line 287 creates unreachable code in the registration loop (lines 290-297). While custom QNN operations (DequantizeAdd, SiLU, RMSNorm, etc.) function correctly via the plugin system's loadOpPackage() mechanism, this registration block is dead code. Either remove it if QNN backend does not support direct OpPackage registration, or populate the vector if this feature should be enabled.

Comment thread mllm/core/aops/EmbeddingOp.cpp
Comment on lines +73 to +76
// Output dtype should match weight dtype (e.g., uint16 for AsymPerTensor quantization)
auto out_dtype = weight_.dtype();
if (weight_.dtype() == kUInt16) { out_dtype = kUInt16PerTensorAsy; }
outputs.emplace_back(Tensor::empty(o_shape, out_dtype, i.device()));

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain (collapsed)

The bot ran a long series of repository-inspection scripts against UbiquitousLearning/mllm: locating the EmbeddingOp sources, reading trace(), load(), and reshape() in mllm/core/aops/EmbeddingOp.cpp, tracing the op lifecycle through BaseOp, Layer, and Task execution, and checking how Tensor::dtype() behaves on an uninitialized (nil) tensor. The scripts confirmed the asymmetry that motivates the comment below: trace() guards weight_ before use, while reshape() does not.


Add guard against uninitialized weight_ in reshape().

The reshape() method calls weight_.dtype() without checking if weight_ has been initialized, but trace() explicitly guards with if (weight_ && ...) before using it. If reshape() is called before load(), dereferencing a nil tensor will crash. Add the same guard:

if (!weight_) {
  MLLM_ERROR("EmbeddingOp::reshape: weight not loaded");
  return;
}
auto out_dtype = weight_.dtype();

The special-case mapping kUInt16 → kUInt16PerTensorAsy for quantized embeddings is intentional and correct.


Comment thread pymllm/backends/qualcomm/transformers/core/embedding.py
Comment on lines +99 to +106
if hasattr(self, "weight_fake_quant"):
del self.weight_fake_quant

class_name = self.__class__.__name__
instance_class_name = type(self).__name__
print(
f"Class: {class_name}, Instance: {instance_class_name}, Deploy Mode Activated. 'weight' is now {self.weight.dtype} buffer. zp is {zero_point}"
)

🛠️ Refactor suggestion | 🟠 Major

Replace print() with proper logging; disable_quant() will fail after deployment conversion.

  1. Per coding guidelines, use the logging module instead of print() for significant events.
  2. After convert_to_deploy() deletes weight_fake_quant, calling disable_quant() will raise an AttributeError.
Suggested fix
+import logging
+
+logger = logging.getLogger(__name__)
+
 # ... in convert_to_deploy:
         # D. Clean up unnecessary modules
         if hasattr(self, "weight_fake_quant"):
             del self.weight_fake_quant
 
-        class_name = self.__class__.__name__
-        instance_class_name = type(self).__name__
-        print(
-            f"Class: {class_name}, Instance: {instance_class_name}, Deploy Mode Activated. 'weight' is now {self.weight.dtype} buffer. zp is {zero_point}"
-        )
+        logger.info(
+            f"{self.__class__.__name__}: Deploy mode activated. "
+            f"'weight' is now {self.weight.dtype} buffer, zp={zero_point}"
+        )

Also consider adding a guard in disable_quant():

     def disable_quant(self):
         """Completely turn off quantization noise and return to floating point mode"""
+        if not hasattr(self, "weight_fake_quant"):
+            logger.warning("Cannot disable quant: already converted to deploy mode")
+            return
         self.weight_fake_quant.disable_fakequant()

… input layer normalization handling in Qwen3DecoderLayer. Update weight conversion logic in training script to address model compatibility issues.

@coderabbitai Bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pymllm/backends/qualcomm/transformers/qwen3/train.py (1)

39-58: Fail fast for unsupported 1.7B before loading the model.

The guard runs after Qwen3Quantizer construction, which loads the model and can waste GPU memory or OOM. Move the check right after args parsing to fail early.

🔧 Proposed fix
     args = parser.parse_args()
 
-    m = Qwen3Quantizer(args.model_path, mllm_qualcomm_max_length=args.max_length)
-
-    # FIXME: Should disable or not.
-    m.disable_fake_quant()
-    m.calibrate(num_samples=args.num_samples, max_seq_length=args.max_length)
-    m.enable_fake_quant()
-    m.infer(args.infer_text)
-
-    # !!!
-    # Things below is for deploy. We will turn all fp32 weights and some buffers(rope) to quantized dtype.
-    # !!!
-    # This line maybe error. we need use quantized weight!!! not embed_tokens.weight!!!
-    # m.model.lm_head.weight = torch.nn.Parameter(
-    #     m.model.model.embed_tokens.weight.clone()
-    # )
     if "1.7B" in args.model_path:
         raise ValueError(
             "1.7B model is not supported for now due to tied embedding weights is not supported."
         )
+
+    m = Qwen3Quantizer(args.model_path, mllm_qualcomm_max_length=args.max_length)
+
+    # FIXME: Should disable or not.
+    m.disable_fake_quant()
+    m.calibrate(num_samples=args.num_samples, max_seq_length=args.max_length)
+    m.enable_fake_quant()
+    m.infer(args.infer_text)
🧹 Nitpick comments (1)
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py (1)

397-399: Add a comment explaining why layer 0 skips input_layernorm_input_qdq initialization.

The code safely skips creating and using input_layernorm_input_qdq for layer 0 only, with matching conditions in both initialization (line 398) and forward (line 418). No other code in the codebase accesses this attribute. However, adding a brief comment explaining the rationale for this special case would improve code clarity for future maintainers.

@UbiquitousLearning (Owner) left a comment

LGTM

@chenghuaWang merged commit 74a7a3e into UbiquitousLearning:main on Jan 23, 2026
4 checks passed
@coderabbitai Bot mentioned this pull request on Jan 31, 2026