feat(qnn): Enhance QNNBackend initialization with improved logging and error handling; update default log level to verbose. Add QEmbedding class for quantized embedding operations in PyTorch. Introduce build tasks for Android and x86 QNN AOT SDKs. #609

chenghuaWang merged 3 commits into UbiquitousLearning:main
Conversation
- …d error handling; update default log level to verbose. Add QEmbedding class for quantized embedding operations in PyTorch. Introduce build tasks for Android and x86 QNN AOT SDKs.
- …es; ensure position-independent code for flatbuffers. Enhance context creation with existing context checks and improve weight quantization specifications.
📝 Walkthrough

Enhanced QNN backend logging and context caching, compiler-aware build guards and PIC for flatbuffers, embedding quantization unified to uint16 per-tensor asymmetric, a new PyTorch QEmbedding with a quant/deploy lifecycle, a new Android ARM64 QNN build pipeline, and minor formatting tweaks.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Python as Python (pymllm / QEmbedding)
    participant Runner as Runner / PTQ
    participant AOT as QNN AOT (QnnWrappersAPI)
    participant FS as File System
    participant QNN as QNN Runtime (backend/device/system)
    Python->>Runner: instantiate model with QEmbedding
    Runner->>Runner: freeze / convert_to_deploy (QEmbedding.freeze -> observer -> scale/zp)
    Runner->>AOT: request context load/create (context_path?)
    AOT->>FS: check context file exists
    alt context exists
        FS-->>AOT: file present
        AOT->>QNN: load context from binary
        QNN-->>AOT: success / runtime created
    else no context
        AOT->>QNN: create context from graph/binary
        QNN-->>AOT: success / runtime created
    end
    QNN->>QNN: probe providers / create device
    QNN-->>Runner: runtime + device ready
    Runner-->>Python: model ready for inference
```
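The load-or-create branch in the diagram can be sketched as follows. This is a minimal illustration of the cached-context flow, not the actual `QnnWrappersAPI` implementation; `load_fn` and `create_fn` are hypothetical stand-ins for the QNN context API calls.

```python
import os

def get_or_create_context(contexts, name, context_path, create_fn, load_fn):
    """Cached load-or-create flow (sketch): reuse a cached context,
    else load it from an existing binary, else create it from the graph."""
    if name in contexts:
        return contexts[name]  # existing context: skip creation entirely
    if context_path and os.path.exists(context_path):
        ctx = load_fn(context_path)  # load context from binary
    else:
        ctx = create_fn(name)  # create context from graph/binary
    contexts[name] = ctx
    return ctx
```

Caching by name is what makes the later `contexts_.count(name)` assertion in the C++ code redundant: the early return guarantees the key is absent at insertion time.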
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~55 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (2 warnings)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp (1)
989-1022: Add explicit dtype update for embedding output to align with weight quantization spec.

The output tensor `o_0` receives the weight's `quant_recipe` attribute (which specifies `kUInt16PerTensorAsy`), but its actual tensor dtype is never explicitly updated. This creates a mismatch: the quantization recipe specifies `kUInt16PerTensorAsy`, but `o_0->tensor_.dtype()` may not reflect this. Add `o_0->cast_<ir::tensor::TensorValue>()->tensor_ = o_0->cast_<ir::tensor::TensorValue>()->tensor_.__unsafeSetDType(kUInt16PerTensorAsy);` before setting the `quant_recipe` attribute to ensure consistency.
🤖 Fix all issues with AI agents
In `@mllm/backends/qnn/aot/passes/PTQPass.cpp`:
- Around line 114-128: The per-tensor branch must enforce that scale and
zero_point are single-element scalars and that zero_point is within [quant_min,
quant_max]; update the QuantizationSpecAsymPerTensor handling (use symbols
this_spec, scale, zero_point, this_spec->quant_min/quant_max,
weight_spec->solved) to: validate scale and zero_point have exactly one element
(e.g., rank==0 or rank==1 with size==1) instead of allowing arbitrary length,
extract the scalar values (scale_val = scale.item<float>(), zp_val =
zero_point.item<int32_t>()), assert scale_val > 0 and dtype checks, clamp zp_val
to the range [this_spec->quant_min, this_spec->quant_max] before assigning
this_spec->zero_point, and then set this_spec->scale and mark
weight_spec->solved = true; keep existing checkTypeLimits usage for the weight
tensor.
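The validation rules listed in the prompt above can be sketched in Python. This is illustrative pseudologic for the C++ `QuantizationSpecAsymPerTensor` handling; the function name and signature are assumptions, not the actual PTQPass API.

```python
def solve_per_tensor_asym(scale, zero_point, quant_min, quant_max):
    """Validate a per-tensor asymmetric quant spec (sketch of the review's rules):
    scale and zero_point must be single-element, scale positive,
    zero_point clamped into [quant_min, quant_max]."""
    scale = list(scale) if isinstance(scale, (list, tuple)) else [scale]
    zero_point = list(zero_point) if isinstance(zero_point, (list, tuple)) else [zero_point]
    # Per-tensor means exactly one element, not an arbitrary-length vector.
    if len(scale) != 1 or len(zero_point) != 1:
        raise ValueError("per-tensor spec requires scalar scale and zero_point")
    scale_val = float(scale[0])
    if scale_val <= 0.0:
        raise ValueError("scale must be positive")
    # Clamp zero_point into the representable range before assigning it.
    zp_val = max(quant_min, min(int(zero_point[0]), quant_max))
    return scale_val, zp_val  # caller then marks weight_spec->solved = true
```

The clamp matters for uint16 asymmetric specs: an observer can legitimately produce a zero point just outside [0, 65535] after rounding, and assigning it unclamped would corrupt the spec.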
In `@mllm/core/aops/EmbeddingOp.cpp`:
- Around line 73-76: In reshape(), add a guard to ensure weight_ is initialized
before calling weight_.dtype(): check if (!weight_) then log
MLLM_ERROR("EmbeddingOp::reshape: weight not loaded") and return; after that
safely compute out_dtype (preserving the kUInt16 → kUInt16PerTensorAsy mapping)
and proceed to create the output tensor (outputs.emplace_back(...)). This
mirrors the existing guard used in trace() and prevents dereferencing an
uninitialized weight_ when reshape() is invoked before load().
In `@pymllm/backends/qualcomm/transformers/core/embedding.py`:
- Around line 99-106: Replace the print() call in convert_to_deploy() with a
logging call using the module-level logger (use logging.getLogger(__name__)) and
an appropriate level (info/debug) to record class name, instance name, weight
dtype and zero_point; additionally, modify disable_quant() to guard against
missing attributes by checking for hasattr(self, "weight_fake_quant") before
accessing or deleting it (and handle the case where convert_to_deploy() already
removed it), ensuring calls to disable_quant() after deployment conversion do
not raise AttributeError.
🧹 Nitpick comments (5)
pymllm/backends/qualcomm/transformers/core/embedding.py (3)
6-38: Add input validation and class docstring.

Per coding guidelines, public APIs should validate inputs and have clear docstrings. Consider adding:

- A class-level docstring explaining purpose, parameters, and usage.
- Validation for `num_embeddings` and `embedding_dim` (must be positive integers).

Suggested improvements

```diff
 class QEmbedding(nn.Module):
+    """
+    Quantized embedding layer with per-tensor affine quantization.
+
+    Args:
+        num_embeddings: Size of the embedding dictionary.
+        embedding_dim: Size of each embedding vector.
+        padding_idx: If specified, entries at this index are zeroed.
+        quant_bits: Number of bits for quantization (default: 16).
+    """
     def __init__(
         self,
         num_embeddings,
         embedding_dim,
         padding_idx=None,
         quant_bits=16,
     ):
         super().__init__()
+        if num_embeddings <= 0:
+            raise ValueError(f"num_embeddings must be positive, got {num_embeddings}")
+        if embedding_dim <= 0:
+            raise ValueError(f"embedding_dim must be positive, got {embedding_dim}")
+        if not (1 <= quant_bits <= 32):
+            raise ValueError(f"quant_bits must be in [1, 32], got {quant_bits}")
         self.num_embeddings = num_embeddings
```
119-123: Replace `print()` with proper logging.

Consistent with the previous comment, use `logging` instead of `print()` for calibration status messages.

Suggested fix

```diff
-        class_name = self.__class__.__name__
-        instance_class_name = type(self).__name__
-        print(
-            f"Class: {class_name}, Instance: {instance_class_name}, Weight Quantized: scale={self.weight_fake_quant.scale}, zp={self.weight_fake_quant.zero_point}"
-        )
+        logger.info(
+            f"{self.__class__.__name__}: Weight quantized, "
+            f"scale={self.weight_fake_quant.scale}, zp={self.weight_fake_quant.zero_point}"
+        )
```
129-133: Consider including `quant_bits` in the representation.

For debugging and inspection, including the quantization bit-width would be helpful.

Suggested enhancement

```diff
     def extra_repr(self):
         s = f"{self.num_embeddings}, {self.embedding_dim}"
         if self.padding_idx is not None:
             s += f", padding_idx={self.padding_idx}"
+        s += f", quant_bits={self.quant_bits}"
         return s
```

mllm/backends/qnn/aot/QnnWrappersAPI.cpp (1)
439-443: Context caching logic looks correct; consider removing redundant assertion.

The early return for existing contexts is a good addition for preventing duplicate context creation. However, the assertion at line 496 (`MLLM_RT_ASSERT_EQ(contexts_.count(name), 0)`) is now redundant: it can never fail because the function returns early when the context exists. Consider removing it to avoid confusion.

♻️ Suggested cleanup

```diff
   MLLM_RT_ASSERT_EQ(QNN_CONTEXT_NO_ERROR, status);
 }

 // 5. Register MLLM's Qnn Opset
 // clang-format off
 { // FIXME(wch): we need to register our own opset of qnn.
 }
 // clang-format on

-MLLM_RT_ASSERT_EQ(contexts_.count(name), 0);
 contexts_[name] = context;
 return context;
```

mllm/backends/qnn/QNNBackend.cpp (1)
287-287: Remove dead code or clarify intention for OpPackage registration.

The empty `opPackages` vector at line 287 creates unreachable code in the registration loop (lines 290-297). While custom QNN operations (DequantizeAdd, SiLU, RMSNorm, etc.) function correctly via the plugin system's `loadOpPackage()` mechanism, this registration block is dead code. Either remove it if the QNN backend does not support direct OpPackage registration, or populate the vector if this feature should be enabled.
```cpp
// Output dtype should match weight dtype (e.g., uint16 for AsymPerTensor quantization)
auto out_dtype = weight_.dtype();
if (weight_.dtype() == kUInt16) { out_dtype = kUInt16PerTensorAsy; }
outputs.emplace_back(Tensor::empty(o_shape, out_dtype, i.device()));
```
🧩 Analysis chain

Verification summary: the reviewer ran a series of repository searches (locating the `EmbeddingOp` sources, comparing the unguarded `reshape()` against the guarded `trace()`, tracing `Tensor::dtype()` through `TensorViewImpl`, and inspecting the op lifecycle in `BaseOp`, `Layer`, and the task executor) to confirm that `reshape()` can dereference an uninitialized `weight_` when invoked before `load()`.
Add guard against uninitialized weight_ in reshape().
The `reshape()` method calls `weight_.dtype()` without checking whether `weight_` has been initialized, but `trace()` explicitly guards with `if (weight_ && ...)` before using it. If `reshape()` is called before `load()`, dereferencing a nil tensor will crash. Add the same guard:

```cpp
if (!weight_) {
  MLLM_ERROR("EmbeddingOp::reshape: weight not loaded");
  return;
}
auto out_dtype = weight_.dtype();
```

The special-case mapping `kUInt16` → `kUInt16PerTensorAsy` for quantized embeddings is intentional and correct.
🤖 Prompt for AI Agents
In `@mllm/core/aops/EmbeddingOp.cpp` around lines 73 - 76, In reshape(), add a
guard to ensure weight_ is initialized before calling weight_.dtype(): check if
(!weight_) then log MLLM_ERROR("EmbeddingOp::reshape: weight not loaded") and
return; after that safely compute out_dtype (preserving the kUInt16 →
kUInt16PerTensorAsy mapping) and proceed to create the output tensor
(outputs.emplace_back(...)). This mirrors the existing guard used in trace() and
prevents dereferencing an uninitialized weight_ when reshape() is invoked before
load().
```python
if hasattr(self, "weight_fake_quant"):
    del self.weight_fake_quant

class_name = self.__class__.__name__
instance_class_name = type(self).__name__
print(
    f"Class: {class_name}, Instance: {instance_class_name}, Deploy Mode Activated. 'weight' is now {self.weight.dtype} buffer. zp is {zero_point}"
)
```
🛠️ Refactor suggestion | 🟠 Major
Replace `print()` with proper logging; `disable_quant()` will fail after deployment conversion.

- Per coding guidelines, use the `logging` module instead of `print()` for significant events.
- After `convert_to_deploy()` deletes `weight_fake_quant`, calling `disable_quant()` will raise an `AttributeError`.
Suggested fix
+import logging
+
+logger = logging.getLogger(__name__)
+
# ... in convert_to_deploy:
# D. Clean up unnecessary modules
if hasattr(self, "weight_fake_quant"):
del self.weight_fake_quant
- class_name = self.__class__.__name__
- instance_class_name = type(self).__name__
- print(
- f"Class: {class_name}, Instance: {instance_class_name}, Deploy Mode Activated. 'weight' is now {self.weight.dtype} buffer. zp is {zero_point}"
- )
+ logger.info(
+ f"{self.__class__.__name__}: Deploy mode activated. "
+ f"'weight' is now {self.weight.dtype} buffer, zp={zero_point}"
+ )

Also consider adding a guard in `disable_quant()`:
def disable_quant(self):
"""Completely turn off quantization noise and return to floating point mode"""
+ if not hasattr(self, "weight_fake_quant"):
+ logger.warning("Cannot disable quant: already converted to deploy mode")
+ return
self.weight_fake_quant.disable_fakequant()🤖 Prompt for AI Agents
In `@pymllm/backends/qualcomm/transformers/core/embedding.py` around lines 99 -
106, Replace the print() call in convert_to_deploy() with a logging call using
the module-level logger (use logging.getLogger(__name__)) and an appropriate
level (info/debug) to record class name, instance name, weight dtype and
zero_point; additionally, modify disable_quant() to guard against missing
attributes by checking for hasattr(self, "weight_fake_quant") before accessing
or deleting it (and handle the case where convert_to_deploy() already removed
it), ensuring calls to disable_quant() after deployment conversion do not raise
AttributeError.
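The lifecycle the prompt describes can be shown with a minimal runnable sketch. The class and attribute names mirror the review discussion; `FakeQuant` is a hypothetical stand-in for the real fake-quant observer module, not mllm's implementation.

```python
import logging

logger = logging.getLogger(__name__)

class FakeQuant:
    """Stand-in for the fake-quant observer (assumption, not the real module)."""
    def __init__(self):
        self.enabled = True

    def disable_fakequant(self):
        self.enabled = False

class QEmbeddingSketch:
    """Minimal sketch of the guarded quant/deploy lifecycle."""
    def __init__(self):
        self.weight_fake_quant = FakeQuant()

    def convert_to_deploy(self):
        # Clean up the observer; log instead of print().
        if hasattr(self, "weight_fake_quant"):
            del self.weight_fake_quant
        logger.info("%s: deploy mode activated", type(self).__name__)

    def disable_quant(self):
        # Guard: after convert_to_deploy() the observer no longer exists.
        if not hasattr(self, "weight_fake_quant"):
            logger.warning("Cannot disable quant: already converted to deploy mode")
            return
        self.weight_fake_quant.disable_fakequant()
```

With the guard in place, calling `disable_quant()` after `convert_to_deploy()` is a logged no-op rather than an `AttributeError`.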
… input layer normalization handling in Qwen3DecoderLayer. Update weight conversion logic in training script to address model compatibility issues.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pymllm/backends/qualcomm/transformers/qwen3/train.py (1)
39-58: Fail fast for unsupported 1.7B before loading the model.

The guard runs after `Qwen3Quantizer` construction, which loads the model and can waste GPU memory or OOM. Move the check right after `args` parsing to fail early.

🔧 Proposed fix

```diff
     args = parser.parse_args()

-    m = Qwen3Quantizer(args.model_path, mllm_qualcomm_max_length=args.max_length)
-
-    # FIXME: Should disable or not.
-    m.disable_fake_quant()
-    m.calibrate(num_samples=args.num_samples, max_seq_length=args.max_length)
-    m.enable_fake_quant()
-    m.infer(args.infer_text)
-
-    # !!!
-    # Things below is for deploy. We will turn all fp32 weights and some buffers(rope) to quantized dtype.
-    # !!!
-    # This line maybe error. we need use quantized weight!!! not embed_tokens.weight!!!
-    # m.model.lm_head.weight = torch.nn.Parameter(
-    #     m.model.model.embed_tokens.weight.clone()
-    # )
     if "1.7B" in args.model_path:
         raise ValueError(
             "1.7B model is not supported for now due to tied embedding weights is not supported."
         )
+
+    m = Qwen3Quantizer(args.model_path, mllm_qualcomm_max_length=args.max_length)
+
+    # FIXME: Should disable or not.
+    m.disable_fake_quant()
+    m.calibrate(num_samples=args.num_samples, max_seq_length=args.max_length)
+    m.enable_fake_quant()
+    m.infer(args.infer_text)
```
🧹 Nitpick comments (1)
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py (1)
397-399: Add a comment explaining why layer 0 skips `input_layernorm_input_qdq` initialization.

The code safely skips creating and using `input_layernorm_input_qdq` for layer 0 only, with matching conditions in both initialization (line 398) and forward (line 418). No other code in the codebase accesses this attribute. However, adding a brief comment explaining the rationale for this special case would improve code clarity for future maintainers.
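The matching-condition pattern the comment asks to document can be sketched as follows. The layer structure here is purely illustrative, not the actual `Qwen3DecoderLayer`; only the `input_layernorm_input_qdq` attribute name comes from the review.

```python
class DecoderLayerSketch:
    def __init__(self, layer_idx):
        self.layer_idx = layer_idx
        # Layer 0 intentionally skips the input-layernorm input QDQ stub;
        # the same condition must appear in __init__ and forward, or forward
        # would touch an attribute that was never created.
        if layer_idx != 0:
            self.input_layernorm_input_qdq = lambda x: x  # identity stand-in

    def forward(self, x):
        if self.layer_idx != 0:  # mirrors the init-time condition exactly
            x = self.input_layernorm_input_qdq(x)
        return x
```

Keeping the two conditions literally identical (rather than, say, `hasattr` in forward) makes the invariant explicit and easy to grep for.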
Summary by CodeRabbit
New Features
Bug Fixes & Improvements
Chores
✏️ Tip: You can customize this high-level summary in your review settings.