fix: Qualcomm QNN AOT Pass #579
Conversation
📝 Walkthrough
This PR adds PTQ integration into the QNN AOT lowering pipeline, introduces Sigmoid IR/op/backend support, expands quantization types and tensor view handling, implements PTQ solver passes and QDQ wrappers for Qwen3, and adds Qualcomm PTQ tooling and Python workflows for Qwen3.

Changes
Sequence Diagram(s)

sequenceDiagram
participant User as compile.cpp
participant AOT as createQnnAOTLoweringPipeline
participant AOTCtx as AOTCompileContext
participant PTQ as PTQPass
participant ParamFile as ParameterFile
participant IR as ModuleOp/IRGraph
participant Specs as QuantizationSpec
User->>AOT: call createQnnAOTLoweringPipeline(env, cfg, params)
AOT->>AOTCtx: ctx.setParamFile(params)
AOT->>PTQ: insert PTQPass into pipeline (llm_recipe path)
AOT->>User: return pipeline
Note over PTQ,IR: PTQ run traverses IR and resolves quant specs
PTQ->>IR: run(ModuleOp)
PTQ->>PTQ: recursiveSolveWeights(subgraph)
PTQ->>ParamFile: read weight scales/zeros for layer
ParamFile-->>PTQ: return scale/zero_point
PTQ->>Specs: populate spec (scale/zero_point) and mark solved
PTQ->>PTQ: recursiveSolveNormal(tensors)
PTQ->>IR: read attachedViews (scale/zero_point tensors)
IR-->>PTQ: provide attached views
PTQ->>Specs: populate/specify & mark solved
PTQ-->>IR: PASS_RET_SUCCESS
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
📜 Recent review details
Configuration used: defaults | Review profile: CHILL | Plan: Pro
📒 Files selected for processing (1)
🧰 Additional context used
📓 Path-based instructions (4), sourced from the CodeRabbit inference engine (.github/copilot-instructions.md):
- {mllm,mllm-cli,pymllm}/**/*
- {mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}
- {mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}
- {mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}
🧬 Code graph analysis (1)
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp (5)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
🔇 Additional comments (9)
Actionable comments posted: 17
Note
Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
In @examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp:
- Around line 32-38: Remove the dead if/else by computing a single prefix from
m->getModuleName() and then assign scale_name and zp_name once; e.g., build
prefix = empty ? "" : moduleName + "." using m->getModuleName(), then set
scale_name = prefix + qdq_name_in_pytorch + ".fake_quant.scale" and zp_name =
prefix + qdq_name_in_pytorch + ".fake_quant.zero_point" (replace the duplicated
if/else that currently references scale_name, zp_name, m->getModuleName(), and
qdq_name_in_pytorch).
In @mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp:
- Around line 390-393: The quantization spec uses an out-of-range max value:
change the call to ir::linalg::QuantizationSpecAsymPerTensor::create(...) used
when constructing weight_spec_attr so that the quant_max argument is 65535
(uint16 max) instead of 65536; update the value passed to
QuantizationSpecAsymPerTensor::create in the weight_spec_attr initialization
(leave the FIXME about hardcoded dtype for later).
- Around line 865-872: The output tensor handling block in
LLMQuantRecipePass.cpp incorrectly appends an existing quant_recipe to
annotation_attr->annotation_.inputs instead of outputs; locate the branch that
checks o_0->getAttr("quant_recipe") (the o_0/output handling near
genSimpleQuantizationSpecAttr and annotation_attr) and change the emplace_back
call to push the spec onto annotation_attr->annotation_.outputs (using the same
cast to ir::linalg::LinalgIRQuantizatonSpecAttr and ->spec_) so existing output
specs are recorded under outputs instead of inputs.
- Around line 770-771: The QuantizationSpecAsymPerTensor creation uses an
off-by-one max for uint16; update the call that constructs out_quant_spec
(ir::linalg::QuantizationSpecAsymPerTensor::create) to use 65535 instead of
65536 for quant_max so the per-tensor asymmetric uint16 range is correct (match
the fix used in the RMSNorm pattern).
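The off-by-one flagged above (65536 vs 65535) is easy to check outside the codebase: an asymmetric uint16 grid tops out at 2**16 - 1. A minimal Python sketch (helper names are illustrative, not the mllm API):

```python
def uint16_quant_range():
    """Representable range of an asymmetric uint16 quantization spec."""
    bits = 16
    quant_min = 0
    quant_max = (1 << bits) - 1  # 65535, not 65536
    return quant_min, quant_max

def quantize_u16(x, scale, zero_point):
    """Quantize a float onto the asymmetric uint16 grid with clamping."""
    qmin, qmax = uint16_quant_range()
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))
```

Passing 65536 as quant_max would admit a value that does not fit in a uint16 at all.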
In @mllm/backends/qnn/aot/passes/PTQPass.cpp:
- Line 135: The function name __recursiveSolveNormalImpl uses a reserved
double-underscore prefix; rename it to recursiveSolveNormalImpl (or another
single-underscore-free identifier) and update its definition and all call sites
in this file to the new name, including any forward declarations or references
that invoke __recursiveSolveNormalImpl so linkage and compilation remain
consistent.
In @mllm/core/aops/SigmoidOp.cpp:
- Around line 27-33: SigmoidOp::reshape currently accesses inputs[0] without
checking inputs.size(); add validation at the start of reshape to ensure inputs
is non-empty (e.g., if inputs.empty() throw or return an error), and also
validate that outputs is empty or sized appropriately before emplacing; after
the check, proceed to use inputs[0] as before to construct outputs (either
in-place or empty tensor), so you avoid out-of-bounds access when inputs is
empty.
- Around line 16-21: SigmoidOp::trace currently assumes inputs and outputs are
non-empty; add the same validation used in reshape() to check that inputs.size()
> 0 and outputs.size() > 0 at the start of SigmoidOp::trace and handle the
failure path (e.g., throw an exception or return early) before calling
ir::tensor::wrapTensors2TensorIR and ir_ctx->create<ir::linalg::SigmoidOp>(...),
so you never pass empty vectors into the IR wrapping/creation logic.
- Around line 23-25: The SigmoidOp::forward currently calls NYI and will crash
if the base aops::SigmoidOp is ever used; replace the NYI with a proper
implementation or a clear runtime error and/or documentation. If a generic
fallback is desired, implement element-wise sigmoid using inputs[0] ->
outputs[0] (apply 1/(1+exp(-x)) across the tensor elements) or delegate to
available element-wise utilities; otherwise change the NYI to throw a
descriptive exception (e.g., "SigmoidOp::forward must be implemented by
backend-specific subclass like CPUSigmoidOp") and add a comment indicating the
base class is abstract and not to be instantiated directly. Ensure you reference
SigmoidOp::forward and the inputs/outputs vectors when making the change.
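The three SigmoidOp comments above share one shape: validate inputs/outputs before touching element 0, and either compute element-wise sigmoid or fail loudly. A Python analogue of that fallback (not the C++ SigmoidOp API; the list-of-lists "tensors" are a stand-in):

```python
import math

def sigmoid_forward(inputs, outputs):
    """Reference element-wise sigmoid with the validation the review asks for."""
    if not inputs or not outputs:
        raise ValueError("SigmoidOp requires at least one input and one output tensor")
    # 1 / (1 + exp(-x)) applied element-wise, written into the output buffer
    outputs[0][:] = [1.0 / (1.0 + math.exp(-x)) for x in inputs[0]]
    return outputs
```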
In @mllm/core/OpTypes.hpp:
- Around line 98-99: The optype2Str function is missing a switch case for the
new enum value kSigmoid, causing it to fall through to "Unknown"; update the
switch in optype2Str to add a case for kSigmoid that returns the appropriate
string (e.g., "kSigmoid") so logging/diagnostics report the correct op type;
locate the switch handling OpType enum values in optype2Str and insert the new
case alongside the other cases for consistency.
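The failure mode above (a new enum member silently stringifying as "Unknown") is easy to guard against with an exhaustiveness check. A Python sketch of the same enum-to-string pattern (OpType members here are illustrative, not the real mllm enum):

```python
from enum import Enum, auto

class OpType(Enum):
    kLinear = auto()
    kSoftmax = auto()
    kSigmoid = auto()  # a newly added value must get a string case too

_OPTYPE_NAMES = {
    OpType.kLinear: "kLinear",
    OpType.kSoftmax: "kSoftmax",
    OpType.kSigmoid: "kSigmoid",
}

def optype2str(op):
    """Mirror of the C++ optype2Str switch: fall through to 'Unknown'."""
    return _OPTYPE_NAMES.get(op, "Unknown")
```

In C++ the equivalent safety net is a compiler warning for an unhandled switch case over the enum.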
In @mllm/core/Tensor.cpp:
- Around line 33-36: The current Tensor::operator delete and delete_() call
impl_.reset() before accessing impl_->attachedViews(), causing a use-after-free;
fix by first obtaining a copy or reference of the attached views (e.g., auto
views = ((Tensor*)ptr)->impl_->attachedViews() or check impl_ and iterate its
attachedViews()), iterate that collection to reset each view->second, and only
after cleaning up the attached views call impl_.reset(); also add a null check
on ptr and on impl_ before accessing to be safe.
- Around line 78-91: The switch in Tensor::constant mis-casts
kUInt16PerTensorAsy to int16_t causing wrong values for unsigned 16-bit; update
the kUInt16PerTensorAsy case to write through rhs_tensor.ptr<uint16_t>() and
cast x to uint16_t (or use static_cast<uint16_t>(x)), ensuring the
dtype-specific pointer type matches the unsigned 16-bit DataTypes enum; verify
similar patterns for kInt16PerTensorSym remain int16_t.
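To see why the int16_t mis-cast matters: converting an out-of-range float to int16_t is undefined behavior in C++, and modeling the common two's-complement wrap (an assumption about the platform, not guaranteed) shows values past INT16_MAX going wrong while the uint16 path stays correct:

```python
def narrow_to_int16(x):
    """Model float -> int16_t conversion as a two's-complement wrap."""
    return ((int(x) + 2**15) % 2**16) - 2**15

def narrow_to_uint16(x):
    """Model float -> uint16_t conversion (modulo 2**16)."""
    return int(x) % 2**16
```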
In @mllm/core/Tensor.hpp:
- Around line 470-476: The public method name __unsafeSetDType uses a
double-underscore prefix reserved by C++; rename it to a non-reserved identifier
(e.g., unsafeSetDType) across the Tensor API: update the declaration in class
Tensor (replace __unsafeSetDType with unsafeSetDType), adjust all
implementations, usages, and any related tests or headers to the new name, and
ensure symbols in documentation and bindings are updated to avoid the
reserved-identifier conflict.
In @pymllm/backends/qualcomm/transformers/.gitignore:
- Line 1: The .gitignore contains an unclear placeholder entry
"static_one_more_thing.py"; either remove this leftover if it’s not needed, or
replace it with a descriptive name (e.g., "static_qnn_debug.py") and add an
inline comment in the same .gitignore explaining its purpose and why it should
be ignored so future maintainers understand why "static_one_more_thing.py" (or
the new name) is excluded.
In @pymllm/backends/qualcomm/transformers/core/qdq.py:
- Around line 18-19: The code sets self.dtype = torch.qint32 which FakeQuantize
does not support; update the simulation dtype to a supported type (e.g., set
self.dtype = torch.qint8 for symmetric or torch.quint8 for asymmetric
quantization) and adjust the MinMaxObserver initialization used by FakeQuantize
(e.g., observer dtype/quant_min/quant_max or dtype parameter) to match the
chosen qint8/quint8 type so dtype validation passes at runtime; locate and
change the self.dtype assignment in qdq.py and any places that construct
MinMaxObserver or pass dtype to FakeQuantize to use the chosen 8-bit qtype
consistently.
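Independent of which torch qtype is chosen, the math FakeQuantize simulates is a quantize-dequantize round trip over a fixed integer grid. A pure-Python sketch for the asymmetric quint8 case (this is not torch code; the defaults 0/255 encode the quint8 range):

```python
def fake_quantize(x, scale, zero_point, quant_min=0, quant_max=255):
    """Quantize-dequantize round trip on an asymmetric 8-bit grid."""
    q = max(quant_min, min(quant_max, round(x / scale) + zero_point))
    return (q - zero_point) * scale
```

The dtype passed to FakeQuantize must describe this same grid, which is why an unsupported qint32 fails validation while qint8/quint8 work.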
In @pymllm/backends/qualcomm/transformers/core/qlinear.py:
- Around line 169-180: DoubleQuantizer.forward currently leaves the frozen-path
reconstruction unimplemented: when self.is_frozen is True and
self.w_recon_cached is None you must reconstruct the full-precision weights from
the stored quantized buffers (e.g., weight_q, scale_1, scale_2) instead of
falling back to quantize_dequantize(w). Implement logic in
DoubleQuantizer.forward to read the stored buffers (self.weight_q, self.scale_1,
self.scale_2), apply the inverse quantization/scale math used by this quantizer
to produce the reconstructed tensor, assign it to self.w_recon_cached, and
return it; ensure you preserve original dtype/shape and handle missing buffers
by raising a clear error rather than silently calling quantize_dequantize; keep
the existing behavior for the non-frozen path (return
self.quantize_dequantize(w)).
- Around line 193-207: The convert_to_deploy method must ensure the quantizer is
frozen before accessing its buffers: mirror the pattern used in
QLinearW8A16_PerChannelSym.convert_to_deploy by checking the quantizer's frozen
state (or calling its freeze() method) and validating that
self.weight_quant.weight_q is not None before deleting/registering buffers; if
weight_q is None raise a clear RuntimeError explaining the quantizer must be
frozen, otherwise proceed to register_buffer("weight", ...), scale1 and scale2
and then delete self.weight_quant and set deploy_mode.
In @pymllm/backends/qualcomm/transformers/core/rms_norm.py:
- Around line 122-123: The extra_repr method in the RMSNorm class references a
nonexistent attribute self.variance_epsilon causing an AttributeError; change
the representation to use the existing attribute self.eps (or create an alias
self.variance_epsilon = self.eps in __init__), e.g., update extra_repr to return
f"{tuple(self.weight.shape)}, eps={self.eps}" so it uses the defined attribute.
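The fix above is a one-liner; a minimal stand-in class (not the real RMSNorm module) shows the corrected extra_repr using the attribute that actually exists:

```python
class RMSNormRepr:
    """Stand-in: report the existing `eps` attribute, not a
    nonexistent `variance_epsilon`."""

    def __init__(self, weight_shape, eps=1e-6):
        self.weight_shape = weight_shape
        self.eps = eps

    def extra_repr(self):
        return f"{tuple(self.weight_shape)}, eps={self.eps}"
```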
In @pymllm/backends/qualcomm/transformers/core/test_qlinear.py:
- Around line 41-48: The test constructs QLinearLPBQ with unsupported kwargs
which triggers a TypeError; remove the invalid parameters
already_quantized_weight and already_quantized_activation from the
QLinearLPBQ(...) call in the test (keep only in_features, out_features, bias,
block_size) or if those flags are required update QLinearLPBQ.__init__ to accept
and store them; locate the test instantiation of QLinearLPBQ in test_qlinear.py
and adjust the constructor call or update QLinearLPBQ.__init__ accordingly.
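The TypeError above is ordinary Python behavior for unexpected keyword arguments. A small stub (not the real QLinearLPBQ; only the parameter names from the review are assumed) demonstrates why the test fails and what the two possible fixes look like:

```python
class QLinearStub:
    """Constructor limited to the supported parameters named in the review."""

    def __init__(self, in_features, out_features, bias=True, block_size=16):
        self.in_features, self.out_features = in_features, out_features
        self.bias, self.block_size = bias, block_size

def try_construct(**kwargs):
    """Return 'ok' or 'TypeError' for a given constructor call."""
    try:
        QLinearStub(**kwargs)
        return "ok"
    except TypeError:
        return "TypeError"
```

Either drop the unknown kwargs at the call site, or add them (with defaults) to __init__ so they are accepted and stored.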
In @pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py:
- Around line 583-596: The code assumes input_ids is always set when computing
seq_len and causal_mask, but input_ids can be None when inputs_embeds is
provided; update the logic in the causal mask block (the code around computing
seq_len and causal_mask) to derive seq_len from position_ids if available,
otherwise from inputs_embeds.shape[1], and only fall back to input_ids.shape[1]
if needed; also obtain the device from whichever tensor you use (position_ids,
inputs_embeds, or input_ids) and keep the existing dtype and unsqueeze behavior
so causal_mask creation remains unchanged.
- Around line 558-566: The code uses assert to check mllm_qualcomm_max_length
when both mllm_max_sin_embedding and mllm_max_cos_embedding are None; replace
the assert with an explicit exception (e.g., raise ValueError) that includes a
clear message mentioning mllm_qualcomm_max_length so the check cannot be
bypassed with -O; keep the surrounding logic that reads kwargs and constructs
max_position_ids (using position_ids.dtype and position_ids.device) but change
the assert line to a guarded if that raises the exception when
mllm_qualcomm_max_length is None.
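The seq_len fallback chain described in the first comment can be sketched as follows. Shapes are plain tuples here; the real code reads .shape from torch tensors and also picks the device from the same tensor it uses:

```python
def derive_seq_len(position_ids=None, inputs_embeds=None, input_ids=None):
    """Prefer position_ids, then inputs_embeds, then input_ids; shape[1] is
    the sequence dimension in all three."""
    for shape in (position_ids, inputs_embeds, input_ids):
        if shape is not None:
            return shape[1]
    raise ValueError("need position_ids, inputs_embeds, or input_ids")
```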
In @pymllm/backends/qualcomm/transformers/qwen3/runner.py:
- Around line 125-132: Replace the mixed-language comment above the
MsDataset.load call: remove the Chinese "几十G" and rewrite in English to explain
that streaming=True is used to avoid downloading the full dataset (which is tens
of gigabytes) and to document why "modelscope/wikitext" with
subset_name="wikitext-103-v1" and split="train" was chosen for calibration;
update the comment to explicitly state calibration assumptions (e.g., using
English Wikipedia-like text, streaming processes examples on the fly, not
shuffled) and ensure the comment sits immediately above the MsDataset.load
invocation and references the dataset and streaming behavior.
In @pymllm/backends/qualcomm/transformers/qwen3/train.py:
- Around line 1-7: The argparse definition for the output directory is missing a
required flag; update the ArgumentParser.add_argument call for "--output_dir"
(and any other add_argument entries in the same blocks that are used later
without defaults) to include required=True (or provide a sensible default) so
the script fails early instead of raising a runtime error when output_dir is
absent; look for the argparse.ArgumentParser and add_argument("--output_dir",
...) call in train.py and change it to required=True (and mirror this fix for
the other add_argument calls referenced in the same sections).
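The required=True fix fails fast at parse time instead of raising later when output_dir is first used. A self-contained sketch (parser contents reduced to the one argument the review names):

```python
import argparse

def build_parser():
    """Mark --output_dir required so a missing value aborts at parse time."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--output_dir",
        type=str,
        required=True,
        help="Directory to save the quantized model",
    )
    return parser
```

argparse reports the missing argument and exits with SystemExit, which is the early failure the review asks for.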
In @requirements-qnn-aot.txt:
- Around line 1-4: Replace the vulnerable transformers pin
"transformers==4.57.3" with a patched release (e.g., the nearest fixed 4.x
release) or with "transformers==5.0.0rc1" after you validate compatibility;
update the requirements entry for "transformers==4.57.3" accordingly, run the
project's test suite and CI to confirm nothing breaks, and regenerate any
lockfiles or dependency artifacts to ensure the updated package is used.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
pymllm/utils/mllm_convertor.py (1)
113-113: Update the message to reflect the new handling logic. The message "No pipeline specified" may be misleading since the code now handles the case when no pipeline is specified (using the "_raw" pipeline at lines 69-86). Consider updating this message or clarifying what conditions trigger this else branch.
mllm/backends/qnn/aot/passes/AOTPipeline.cpp (1)
33-36: Empty pipeline returned for non-llm_recipe configs. When the config doesn't contain quant_recipe.llm_recipe = true, the function logs a warning but returns an empty vector. This means calling code receives no passes, which may silently fail. Consider whether this should be an error or if there should be a fallback pipeline.
🟡 Minor comments (7)
pymllm/backends/qualcomm/transformers/.gitignore-1-1 (1)
1-1: Clarify the purpose of this .gitignore entry: the name appears to be a placeholder. The entry static_one_more_thing.py has an atypical name that suggests it may be a debug or temporary file. If this is intentional, consider renaming it to something descriptive (e.g., static_qnn_debug.py) and adding a comment to .gitignore explaining its purpose. If it's a leftover from development, remove it.
mllm/core/Tensor.cpp-78-91 (1)
78-91: Type mismatch for kUInt16PerTensorAsy. Line 87 casts to int16_t* for an unsigned type (kUInt16PerTensorAsy), which can cause incorrect values for values exceeding INT16_MAX.
🔎 Proposed fix
- case kUInt16PerTensorAsy: *(rhs_tensor.ptr<int16_t>()) = x; break;
+ case kUInt16PerTensorAsy: *(rhs_tensor.ptr<uint16_t>()) = static_cast<uint16_t>(x); break;
mllm/core/Tensor.hpp-470-476 (1)
470-476: Reserved identifier in public API. The method __unsafeSetDType uses a double-underscore prefix which is reserved by the C++ standard for implementation use. This could cause conflicts with compiler internals. Consider renaming to unsafeSetDType or similar.
🔎 Proposed fix
- [[nodiscard]] Tensor __unsafeSetDType(DataTypes dt) const;
+ [[nodiscard]] Tensor unsafeSetDType(DataTypes dtype) const;
pymllm/backends/qualcomm/transformers/qwen3/train.py-1-7 (1)
1-7: Add required flag for output_dir argument. The output_dir parameter is used without a default value but is not marked as required. This will cause a runtime error when the argument is not provided.
🔎 Proposed fix
  parser.add_argument(
      "--output_dir",
      type=str,
+     required=True,
      help="Directory to save the quantized model",
  )
Also applies to: 8-36, 52-54
pymllm/backends/qualcomm/transformers/qwen3/runner.py-125-132 (1)
125-132: Fix mixed-language comment and document calibration assumptions. The comment contains Chinese characters ("几十G", i.e. tens of gigabytes) which should be English-only per coding guidelines. Also, consider documenting the dataset choice and streaming behavior.
🔎 Proposed fix
- # 2. Load Wikipedia dataset (English version example)
- # Use streaming=True to download and process on the fly, without downloading the full几十G dataset
+ # 2. Load Wikipedia dataset (English WikiText-103)
+ # Uses ModelScope dataset loader; streaming avoids downloading the full dataset (~GB scale)
  dataset = MsDataset.load(
examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp-32-38 (1)
32-38: Remove dead code: redundant assignments around the if/else. The initial assignments to scale_name and zp_name duplicate the else branch (lines 33-34 and 36-37 are identical), so the if (m->getModuleName().empty()) check only matters for the empty-name case; computing a single prefix removes the duplication.
🔎 Proposed fix
  Tensor QDQ(nn::Module* m, Tensor in, const std::string& qdq_name_in_pytorch) {
-   std::string scale_name = m->getModuleName() + "." + qdq_name_in_pytorch + ".fake_quant.scale";
-   std::string zp_name = m->getModuleName() + "." + qdq_name_in_pytorch + ".fake_quant.zero_point";
-
-   if (m->getModuleName().empty()) {
-     scale_name = qdq_name_in_pytorch + ".fake_quant.scale";
-     zp_name = qdq_name_in_pytorch + ".fake_quant.zero_point";
-   } else {
-     scale_name = m->getModuleName() + "." + qdq_name_in_pytorch + ".fake_quant.scale";
-     zp_name = m->getModuleName() + "." + qdq_name_in_pytorch + ".fake_quant.zero_point";
-   }
+   std::string prefix = m->getModuleName().empty() ? "" : m->getModuleName() + ".";
+   std::string scale_name = prefix + qdq_name_in_pytorch + ".fake_quant.scale";
+   std::string zp_name = prefix + qdq_name_in_pytorch + ".fake_quant.zero_point";
Committable suggestion skipped: line range outside the PR's diff.
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py-558-566 (1)
558-566: Replace assert with explicit exception for production code. Using assert for input validation can be disabled with the -O flag. Use an explicit exception for more robust error handling.
🔎 Proposed fix
  if self.mllm_max_sin_embedding is None and self.mllm_max_cos_embedding is None:
      mllm_qualcomm_max_length = kwargs.get("mllm_qualcomm_max_length", None)
-     assert mllm_qualcomm_max_length is not None
+     if mllm_qualcomm_max_length is None:
+         raise ValueError("mllm_qualcomm_max_length must be provided when initializing embeddings")
      max_position_ids = torch.arange(
          0,
          mllm_qualcomm_max_length,
🧹 Nitpick comments (30)
pymllm/convertor/mllm_type_mapping.py (1)
94-94: Consider relocating for better organization. The torch.uint16 mapping is placed after the quantized type mappings section, but it's not a quantized type. For improved code organization and maintainability, consider moving this entry to line 89 (after torch.uint8: 129) to group unsigned integer types together.
🔎 Proposed reorganization
      torch.int64: 132,  # kInt64
      torch.uint8: 129,  # kUInt8
+     torch.uint16: 130,  # kUInt16
      torch.bool: 129,  # kUInt8 (Boolean type in PyTorch is usually represented as uint8)
      # Quantized type mappings
      torch.qint8: 16,  # kInt8
      torch.quint8: 129,  # kUInt8
      torch.qint32: 18,  # kInt32
-     torch.uint16: 130,  # kUInt16
  }
)
pymllm/quantize/pipeline.py (1)
23-25: Add docstring to clarify the purpose of the raw pipeline. The build_raw_pipeline() function is identical to build_cast2fp32_pipeline() (lines 18-20), both returning an empty QuantizeSolver with no passes. While they may serve different semantic purposes based on how they're used in the convertor (with different cast_left_2_fp32 values), adding a docstring would clarify the intent and use case for the "raw" pipeline.
💡 Suggested docstring
  def build_raw_pipeline() -> QuantizeSolver:
+     """Create a raw quantization pipeline with no passes registered.
+
+     This pipeline performs no transformations on parameters and is used
+     when no configuration or custom pipeline is specified.
+
+     Returns:
+         QuantizeSolver: An empty solver with no registered passes.
+     """
      ret = QuantizeSolver()
      return ret
pymllm/utils/mllm_convertor.py (1)
69-86: Consider refactoring to reduce code duplication. This block is nearly identical to the branches at lines 51-68 and 87-111, with only minor differences in pipeline selection and the cast_left_2_fp32 parameter value. Extracting the common quantization logic into a helper function would improve maintainability and reduce duplication.
💡 Refactoring suggestion
Consider extracting common logic:
def _perform_streaming_quantization(
    cfg, pipeline_name, params, output_path, model_name, cast_left_2_fp32, verbose
):
    """Perform streaming quantization with the specified pipeline."""
    pipeline: QuantizeSolver = BUILTIN_QUANTIZE_PIPELINE[pipeline_name]()
    old_param_size = len(params)
    new_param_size = pipeline.stream_quantize_params_size(cfg, params)
    print(f"Params Num: Before: {old_param_size}, After: {new_param_size}")
    pipeline.stream_quantize(
        cfg,
        params,
        writer=ModelFileV2(
            output_path,
            model_name,
            "Streaming",
            max_params_descriptor_buffer_num=new_param_size,
        ),
        cast_left_2_fp32=cast_left_2_fp32,
        verbose=verbose,
    )
Then simplify the branches:
if args.cfg_path is None and args.pipeline is not None and args.format == "v2":
    _perform_streaming_quantization(
        None, args.pipeline, params, args.output_path, args.model_name, True, args.verbose
    )
elif args.cfg_path is None and args.pipeline is None and args.format == "v2":
    _perform_streaming_quantization(
        None, "_raw", params, args.output_path, args.model_name, False, args.verbose
    )
mllm/core/TensorViewImpl.hpp (2)
92-92: Consider refactoring to provide controlled access instead of exposing internal map. Returning a non-const reference to the internal attached_views_ map breaks encapsulation and allows external code to modify internal state without validation or synchronization.
🔎 Suggested refactor with controlled access
Instead of exposing the raw map, consider providing controlled accessors:
// Add a view
inline void attachView(const std::string& name, bool exclude_from_hash, const TensorViewImpl::ptr_t& view) {
  attached_views_[name] = {exclude_from_hash, view};
}

// Get a view (returns nullptr if absent; return by value to avoid a dangling reference)
inline TensorViewImpl::ptr_t getAttachedView(const std::string& name) const {
  auto it = attached_views_.find(name);
  return (it != attached_views_.end()) ? it->second.second : nullptr;
}

// Check if view exists
inline bool hasAttachedView(const std::string& name) const {
  return attached_views_.find(name) != attached_views_.end();
}

// Get exclude_from_hash flag
inline bool getViewExcludeFlag(const std::string& name) const {
  auto it = attached_views_.find(name);
  return (it != attached_views_.end()) ? it->second.first : false;
}

// If mutable access is truly needed, provide const access by default
inline const std::unordered_map<std::string, std::pair<bool, TensorViewImpl::ptr_t>>& attachedViews() const {
  return attached_views_;
}
This approach maintains encapsulation while still allowing necessary access patterns.
101-102: Document the semantics of the bool flag in the pair. The comment explains the bool is for "judge if this tensor should be considered in hashing," but the semantics could be clearer. Consider documenting:
- What does true vs false mean?
- When should views be excluded from hashing?
- How does this affect tensor equality/comparison?
mllm/core/Tensor.cpp (2)
295-314: Consider passing rhs by const reference. The addConstant, subConstant, and mulConstant methods take Tensor rhs by value, incurring a copy. If the copy is not required for ownership transfer, passing by const Tensor& would be more efficient.
508-511: Reserved identifier prefix and const-correctness concern. The __unsafeSetDType name uses a reserved identifier prefix (__). The method is marked const but mutates internal state, which is intentionally unsafe. Consider:
- Renaming to avoid the reserved prefix (e.g., unsafeSetDType_)
- Adding a comment explaining the safety requirements for callers
pymllm/backends/qualcomm/transformers/core/qlinear.py (1)
27-30: Redundant null check. Line 29 checks self.weight_quant is not None but this is already guaranteed by the outer condition on line 25.
🔎 Proposed fix
  if self.weight_quant is not None:
      # Compatible with official FakeQuantize module
-     if (
-         isinstance(self.weight_quant, FakeQuantize)
-         and self.weight_quant is not None
-     ):
+     if isinstance(self.weight_quant, FakeQuantize):
          _ = self.weight_quant(self.weight)
pymllm/backends/qualcomm/transformers/core/test_qlinear.py (1)
80-89: Consider using assertions for test framework integration. The test returns True/False instead of using assertions. For integration with pytest or unittest, consider using assert statements or framework-specific assertions that provide better error reporting.
🔎 Proposed fix
- if relative_error < tolerance:
-     print(f"\n✓ TEST PASSED: Relative error {relative_error:.6e} < {tolerance}")
-     return True
- else:
-     print(f"\n✗ TEST FAILED: Relative error {relative_error:.6e} >= {tolerance}")
-     return False
+ assert relative_error < tolerance, (
+     f"Relative error {relative_error:.6e} >= tolerance {tolerance}"
+ )
+ print(f"\n✓ TEST PASSED: Relative error {relative_error:.6e} < {tolerance}")
229-230: SigmoidOp RTTI macro follows established pattern. The new RTTI_RK_OP_LINALGIROP_SIGMOIDOP_IMPL macro is consistent with other operation macros in this auto-generated file. While the static analysis tool suggests using a constexpr template function instead of macros, this change should be applied at the code generator level (likely mllm/compile/ir/rtti_kind_gen.py) to affect all generated macros consistently.
mllm/backends/qnn/aot/passes/AOTCompileContext.hpp (1)
39-39: Consider explicit member initialization in the constructor.While
params_(smart pointer) defaults to nullptr andconfig_(nlohmann::json) has a default constructor, explicit initialization improves clarity and aligns with modern C++ best practices.🔎 Proposed fix
private: // Private constructor - AOTCompileContext() = default; + AOTCompileContext() : env_(nullptr), config_(), params_(nullptr) {}mllm/backends/qnn/aot/passes/PTQPass.hpp (1)
17-27: Consider explicitly deleting copy and move operations. Pass objects are typically not copyable since they operate on IR graphs with side effects. Consider adding deleted copy/move constructors and assignment operators to make the intent explicit and prevent accidental copying.
🔎 Proposed fix
  class PTQPass final : public ir::Pass {
   public:
    PTQPass() = default;
    ~PTQPass() override = default;

+   // Delete copy and move operations
+   PTQPass(const PTQPass&) = delete;
+   PTQPass& operator=(const PTQPass&) = delete;
+   PTQPass(PTQPass&&) = delete;
+   PTQPass& operator=(PTQPass&&) = delete;
+
    // Run the PTQ pass on the given operation
    // Expected input: ModuleOp containing the computation graph
    // Output: Modified IR with PTQ transformations applied
    uint8_t run(const ir::node_ptr_t& op) override;
  };
39-50: Consider adding error handling for the quantization workflow. The script lacks error handling for critical operations like model loading, calibration, and conversion. Consider wrapping the workflow in try-except blocks to provide meaningful error messages if any step fails.
🔎 Proposed enhancement
  args = parser.parse_args()

+ try:
-     m = Qwen3Quantizer(args.model_path, mllm_qualcomm_max_length=args.max_length)
-     m.calibrate(num_samples=args.num_samples, max_seq_length=args.max_length)
-     # m.compile()
-     m.infer(args.infer_text)
-
-     # !!!
-     # Things below is for deploy. We will turn all fp32 weights and some buffers(rope) to quantized dtype.
-     # !!!
-     m.model.lm_head.weight = torch.nn.Parameter(
-         m.model.model.embed_tokens.weight.clone()
-     )
-     m.convert()
+     m = Qwen3Quantizer(args.model_path, mllm_qualcomm_max_length=args.max_length)
+     m.calibrate(num_samples=args.num_samples, max_seq_length=args.max_length)
+     # m.compile()
+     m.infer(args.infer_text)
+
+     # !!!
+     # Things below is for deploy. We will turn all fp32 weights and some buffers(rope) to quantized dtype.
+     # !!!
+     m.model.lm_head.weight = torch.nn.Parameter(
+         m.model.model.embed_tokens.weight.clone()
+     )
+     m.convert()
+ except Exception as e:
+     print(f"Error during quantization workflow: {e}")
+     raise
39-40: Consider validating params before use. The params pointer is used in multiple places (lines 67-71 for attaching quantization metadata, line 82 for pipeline creation), but there's no validation that mllm::load() succeeded. If the model file is missing or corrupt, params could be null or invalid, leading to crashes when calling pull().
🔎 Suggested validation
  auto params = mllm::load(model_path.get(), mllm::ModelFileVersion::kV2);
+ if (!params) {
+   MLLM_ERROR_EXIT(mllm::ExitCode::kCoreError, "Failed to load model parameters from {}", model_path.get());
+   return -1;
+ }
  model.load(params);
4-4: Unused include. The <cstring> header appears unused in this file. Consider removing it.
-#include <cstring>
12-14: Consider adding input validation. The function assumes inputs has at least one element and outputs has a valid pre-allocated tensor. For robustness, consider adding assertions or checks.
🔎 Suggested validation
  void CPUSigmoidOp::forward(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) {
+   MLLM_ASSERT(!inputs.empty() && !outputs.empty(), "SigmoidOp requires at least one input and output tensor");
    const auto& X = inputs[0];
    auto& Y = outputs[0];
109-109: Inconsistent error message format. The lanesOfType function now includes the dtype value in the error message (line 55: "Unknown data type {}" with (int32_t)dtype), but bytesOfType (line 109) and nameOfType (line 163) still use the old format without the value. Consider updating these for consistency.
🔎 Suggested fix
  // In bytesOfType:
- default: NYI("Unknown data type");
+ default: NYI("Unknown data type {}", (int32_t)dtype);
  // In nameOfType:
- default: NYI("Unknown data type");
+ default: NYI("Unknown data type {}", (int32_t)dtype);
Also applies to: 163-163
mllm/backends/cpu/kernels/x86/sigmoid.cpp (2)
19-20: Potential type mismatch between vector_size and loop variables.

hn::Lanes(d) returns size_t, but aligned_len, len, and loop counters are int. This could cause issues on large tensors where len exceeds INT_MAX, or subtle bugs from signed/unsigned comparisons.

🔎 Suggested fix

-void sigmoid_fp32(const mllm_fp32_t* __restrict X, mllm_fp32_t* __restrict Y, int len, int thread_count) {
+void sigmoid_fp32(const mllm_fp32_t* __restrict X, mllm_fp32_t* __restrict Y, size_t len, int thread_count) {
   using D = hn::ScalableTag<float>;
   const D d;
-  const auto vector_size = hn::Lanes(d);
-  const int aligned_len = len - (len % vector_size);
+  const size_t vector_size = hn::Lanes(d);
+  const size_t aligned_len = len - (len % vector_size);

Note: This would require updating the header declaration as well.

31-31: Consider adding <cmath> include for std::exp.

The scalar fallback uses std::exp() but the file doesn't explicitly include <cmath>. While it may work via transitive includes from Highway, explicit includes are safer and more portable.

 #include "mllm/backends/cpu/kernels/x86/sigmoid.hpp"
 #include "mllm/core/Parallel.hpp"
+#include <cmath>

Also applies to: 41-41
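A scalar reference is useful when unit-testing either kernel path. The sketch below is a plain-Python ground truth for the same math the kernel computes; the helper name `sigmoid_ref` is illustrative, not part of the codebase:

```python
import math

def sigmoid_ref(xs):
    # Ground-truth scalar sigmoid: Y[i] = 1 / (1 + exp(-X[i])),
    # matching the kernel's scalar fallback.
    return [1.0 / (1.0 + math.exp(-x)) for x in xs]

# Sigmoid maps all reals into (0, 1) and is symmetric about (0, 0.5).
vals = sigmoid_ref([-2.0, 0.0, 2.0])
```

Comparing the vectorized output against such a reference within a small tolerance catches both the signed/unsigned loop issue and any tail-handling bugs.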
mllm/backends/cpu/kernels/arm/sigmoid.cpp (1)
13-86: Consider extracting common vectorized loop logic to reduce duplication.

The multi-threaded and single-threaded paths share nearly identical vectorized processing logic (lines 17-50 vs 53-84). While functionally correct, this duplication increases maintenance burden.

🔎 Suggested approach

Extract the tail-handling loop (8-wide, 4-wide, scalar) into a helper function that can be reused by both paths:

static inline void sigmoid_fp32_tail(const mllm_fp32_t* X, mllm_fp32_t* Y, int start, int len) {
  int i = start;
  for (; i <= len - 8; i += 8) {
    // 8-wide processing...
  }
  for (; i <= len - 4; i += 4) {
    // 4-wide processing...
  }
  for (; i < len; i++) {
    Y[i] = 1.0f / (1.0f + std::exp(-X[i]));
  }
}

pymllm/backends/qualcomm/transformers/qwen3/runner.py (3)
19-21: Simplify multiple isinstance checks using a tuple.

Multiple isinstance calls can be combined for readability.

🔎 Proposed fix

 def freeze_qwen3_linear_weight(m):
-    if isinstance(m, QLinearLPBQ) or isinstance(m, QLinearW8A16_PerChannelSym):
+    if isinstance(m, (QLinearLPBQ, QLinearW8A16_PerChannelSym)):
         m.freeze_weight()

Similarly for convert_weight at lines 35-40:

 def convert_weight(m):
-    if (
-        isinstance(m, QLinearLPBQ)
-        or isinstance(m, QLinearW8A16_PerChannelSym)
-        or isinstance(m, QRMSNorm)
-    ):
+    if isinstance(m, (QLinearLPBQ, QLinearW8A16_PerChannelSym, QRMSNorm)):
         m.convert_to_deploy()
94-99: Extract magic number 151668 to a named constant.

The token ID for </think> should be a named constant for clarity and maintainability. As per coding guidelines, use named constants instead of magic numbers.

🔎 Proposed fix

+# Token ID for </think> end marker
+THINK_END_TOKEN_ID = 151668
+
 class Qwen3Quantizer:
     ...
     def infer(self, prompt: str):
         ...
         # parsing thinking content
         try:
             # rindex finding 151668 (</think>)
-            index = len(output_ids) - output_ids[::-1].index(151668)
+            index = len(output_ids) - output_ids[::-1].index(THINK_END_TOKEN_ID)
         except ValueError:
             index = 0
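For reference, the reversed-list idiom above finds the index just past the last occurrence of the token. A small self-contained demonstration (the helper name is illustrative):

```python
# Token id for </think>; 151668 comes from the reviewed code.
THINK_END_TOKEN_ID = 151668

def split_thinking(output_ids):
    # "rindex" via the reversed list: the position just past the LAST
    # occurrence of the end-of-thinking token.
    try:
        index = len(output_ids) - output_ids[::-1].index(THINK_END_TOKEN_ID)
    except ValueError:
        index = 0  # no </think> present: everything is the answer
    return output_ids[:index], output_ids[index:]

think, answer = split_thinking([1, THINK_END_TOKEN_ID, 2, THINK_END_TOKEN_ID, 3, 4])
```

Because `list.index` returns the first match in the reversed list, the split correctly lands after the final `</think>` even when the token appears more than once.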
143-145: Consider making the minimum text length configurable.

The hardcoded 1024 character minimum may not be suitable for all calibration scenarios. Consider making this a parameter or named constant.

-    def calibrate(self, num_samples=64, max_seq_length=512):
+    def calibrate(self, num_samples=64, max_seq_length=512, min_text_length=1024):
         ...
-            if len(entry["text"].strip()) < 1024:
+            if len(entry["text"].strip()) < min_text_length:
                 continue

mllm/backends/qnn/aot/passes/PTQPass.cpp (3)
173-174: Remove unused pf parameter from recursiveSolveNormal.

The pf parameter is declared but never used in the function body. This was likely carried over from recursiveSolveWeights but isn't needed here.

🔎 Proposed fix

-void recursiveSolveNormal(const std::shared_ptr<ir::IRContext>& ir_ctx, const ir::graph::SubGraphOp::ptr_t& call_op,
-                          const ParameterFile::ptr_t& pf) {
+void recursiveSolveNormal(const std::shared_ptr<ir::IRContext>& ir_ctx, const ir::graph::SubGraphOp::ptr_t& call_op) {

Update the recursive call at line 189:

-      recursiveSolveNormal(w.getContext(), w.getContext()->lookupSymbolTable(ns)->cast_<ir::graph::SubGraphOp>(), pf);
+      recursiveSolveNormal(w.getContext(), w.getContext()->lookupSymbolTable(ns)->cast_<ir::graph::SubGraphOp>());

And the call at lines 221-223:

   recursiveSolveNormal(writer.getContext(),
-      getCtx()->lookupSymbolTable(call_main_graph_op->getSymbolAttr()->str())->cast_<ir::graph::SubGraphOp>(),
-      pf);
+      getCtx()->lookupSymbolTable(call_main_graph_op->getSymbolAttr()->str())->cast_<ir::graph::SubGraphOp>());
24-30: Potential integer type mismatch in loop.

The loop index i is int but numel() returns size_t. For very large tensors, this could cause issues. Consider using size_t for the loop variable.

🔎 Proposed fix

 template<typename T>
 void checkTypeLimits(Tensor in, int quant_min, int quant_max) {  // NOLINT
   auto numel = in.numel();
-  for (int i = 0; i < numel; ++i) {
+  for (size_t i = 0; i < numel; ++i) {
     MLLM_RT_ASSERT(*(in.ptr<T>() + i) >= quant_min);
     MLLM_RT_ASSERT(*(in.ptr<T>() + i) <= quant_max);
   }
 }
48-50: Address the FIXME comment regarding int8 vs uint8.

The FIXME indicates uncertainty about whether QNN expects int8 or uint8. This should be resolved before release to ensure correct quantization behavior.
Would you like me to help investigate QNN's expected data types for quantized weights, or open an issue to track this?
examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp (1)
217-226: Complex nested QDQ wrapping in RoPE application.

The RoPE application involves deeply nested QDQ calls. While functionally correct, consider extracting this into a helper function for readability:
// Lines 219-226 could become:
query_states = applyRoPEWithQDQ(this, query_states, cos, sin, "q_rope");
key_states = applyRoPEWithQDQ(this, key_states, cos, sin, "k_rope");
97-99: Consider using the logging module instead of print.

Use the logging module for consistency with production logging practices. As per coding guidelines, appropriate logging should be used for significant events.

🔎 Proposed fix

Add imports at the top of the file:

import logging

logger = logging.getLogger(__name__)

Then replace the print statements:

-            print(
-                f"Class: {class_name}, Instance: {instance_class_name}, Deploy Mode Activated. 'weight' is now {self.weight.dtype} buffer. zp is {zero_point}"
-            )
+            logger.info(
+                f"Class: {class_name}, Instance: {instance_class_name}, Deploy Mode Activated. 'weight' is now {self.weight.dtype} buffer. zp is {zero_point}"
+            )

Similarly for lines 114-116.
pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py (2)
279-297: Consider caching constant tensors to avoid repeated allocation.

The torch.ones(1, ...) tensors for scaling and mask fill values are created on every forward pass. For better efficiency, consider registering these as buffers during initialization.

🔎 Example approach

In __init__:

self.register_buffer("_neg_20", torch.tensor(-20.0, dtype=torch.bfloat16))

In forward:

self.neg_20_qdq(self._neg_20.to(device=value_states.device))

This is a minor optimization since PTQ calibration prioritizes accuracy over speed.
733-740: Consider sorting __all__ for consistency.

Static analysis indicates __all__ is unsorted. Alphabetical sorting improves maintainability.

🔎 Proposed fix (isort-style)

 __all__ = [
     "Qwen3ForCausalLM",
     "Qwen3ForQuestionAnswering",
+    "Qwen3ForSequenceClassification",
+    "Qwen3ForTokenClassification",
+    "Qwen3Model",
     "Qwen3PreTrainedModel",
-    "Qwen3Model",
-    "Qwen3ForSequenceClassification",
-    "Qwen3ForTokenClassification",
 ]
Actionable comments posted: 5
Fix all issues with AI Agents 🤖
In @mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp:
- Around line 857-863: The input quantization spec generated for i_0 is not
appended to annotation_attr->annotation_.inputs when i_0 lacks a quant_recipe;
currently only the else branch appends the spec. Modify the first branch that
creates i_0_spec via genSimpleQuantizationSpecAttr(writer.getContext(),
i_0->cast_<ir::tensor::TensorValue>()) to also push that spec into
annotation_attr->annotation_.inputs (the same way the else branch does, using
i_0_spec->spec_ or the appropriate cast to
ir::linalg::LinalgIRQuantizatonSpecAttr), so both branches add the input spec to
the annotation.
In @mllm/backends/qnn/aot/passes/PTQPass.cpp:
- Line 135: The function name _recursiveSolveNormalImpl uses a reserved
identifier pattern; rename it to recursiveSolveNormalImpl (remove the leading
underscore) and update all call sites that invoke _recursiveSolveNormalImpl
(e.g., the calls at the former lines 183-184) to the new name so declarations,
definitions, and invocations match.
🧹 Nitpick comments (3)
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp (2)
390-393: Previous off-by-one error resolved.

The uint16 maximum value is now correctly set to 65536 - 1 (i.e., 65535). The FIXME comment appropriately acknowledges the hardcoded dtype limitation for future work.

Optionally, consider using the literal 65535 for improved clarity:

🔎 Optional clarity improvement

 auto weight_spec_attr = writer.create<ir::linalg::LinalgIRQuantizatonSpecAttr>(
-    ir::linalg::QuantizationSpecAsymPerTensor::create(0, 65536 - 1, kUInt16, kFloat32, kInt32, Tensor::nil(), Tensor::nil()));
+    ir::linalg::QuantizationSpecAsymPerTensor::create(0, 65535, kUInt16, kFloat32, kInt32, Tensor::nil(), Tensor::nil()));
770-771: Previous off-by-one error resolved.

The uint16 maximum value is now correctly set to 65536 - 1 (i.e., 65535).

Optionally, consider using the literal 65535 for improved clarity:

🔎 Optional clarity improvement

-  auto out_quant_spec = ir::linalg::QuantizationSpecAsymPerTensor::create(0, 65536 - 1, kUInt16, kFloat32, kInt32,
+  auto out_quant_spec = ir::linalg::QuantizationSpecAsymPerTensor::create(0, 65535, kUInt16, kFloat32, kInt32,
                                                                          Tensor::nil(), Tensor::nil());

mllm/backends/qnn/aot/passes/PTQPass.cpp (1)
49-50: Consider using named constants for quantization limits.

The hardcoded values (-8, 7, 0, 16) represent quantization bounds but lack semantic meaning. Named constants would improve readability and maintainability.
🔎 Example: Define constants at file or namespace scope
namespace {
constexpr int32_t kInt4Min = -8;
constexpr int32_t kInt4Max = 7;
constexpr int32_t kUInt4Min = 0;
constexpr int32_t kUInt4Max = 16;  // Note: actual uint4 max is 15, verify if 16 is intentional
}  // namespace

// Then use:
checkTypeLimits<int8_t>(weight, kInt4Min, kInt4Max);
checkTypeLimits<uint8_t>(scale1, kUInt4Min, kUInt4Max);
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cppmllm/backends/qnn/aot/passes/PTQPass.cppmllm/core/OpTypes.hpppymllm/backends/qualcomm/transformers/core/rms_norm.py
🚧 Files skipped from review as they are similar to previous changes (1)
- mllm/core/OpTypes.hpp
🧰 Additional context used
📓 Path-based instructions (5)
{mllm,mllm-cli,pymllm}/**/*
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*: Files must not contain C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, or DEL 0x7F. Horizontal tab (0x09) and line feed (0x0A) are explicitly allowed.
All files must be encoded in UTF-8 without BOM.
Any violation of character set (Rule 1) or encoding (Rule 2) requirements must cause the review to fail.
No line may end with trailing whitespace.
Use Unix line endings (LF).
File and directory names must consist only of printable Unicode characters, excluding C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, and DEL 0x7F.
Only use acceptable file extensions: .c, .cc, .cpp, .cxx, .h, .hh, .hpp, .py, .pyi, .sh, .txt, .md, .yml, .yaml, .json, .toml.
Optional license headers, if present, must comply with character set rules (no C0/C1 control codes except tab and line feed).
Files:
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cppmllm/backends/qnn/aot/passes/PTQPass.cpppymllm/backends/qualcomm/transformers/core/rms_norm.py
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}: TODO and FIXME comments must be written as 'TODO:' or 'FIXME:' followed by UTF-8 text that adheres to character set rules.
Encourage consistent coding style and patterns with the existing codebase.
Ensure code is portable across supported platforms (e.g., Linux, Windows) unless explicitly platform-specific.
Files:
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cppmllm/backends/qnn/aot/passes/PTQPass.cpppymllm/backends/qualcomm/transformers/core/rms_norm.py
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}: Prioritize production-ready code quality by evaluating time and space complexity of algorithms and data structures, and suggest more efficient alternatives for operations with high complexity (e.g., O(n^2) or worse) when feasible.
Avoid unnecessary object creation in loops or hot paths.
Check for proper error handling and resource cleanup (e.g., using try-finally, context managers, or RAII).
Ensure functions that can fail return appropriate error codes or raise exceptions.
Validate inputs for public APIs and critical internal functions.
Add comments for complex algorithms or non-obvious logic.
Identify potential security issues (e.g., buffer overflows, injection risks, insecure temporary files) and recommend using secure alternatives (e.g., parameterized queries, secure random generators).
Suggest adding unit tests for untested complex logic or edge cases.
Ensure code is testable by avoiding global state and using dependency injection.
Flag overly complex functions (e.g., high cyclomatic complexity) and suggest breaking them down.
Use named constants instead of magic numbers.
Add appropriate logging (e.g., debug, info, warning, error) for significant events and errors, avoiding sensitive data exposure.
Files:
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cppmllm/backends/qnn/aot/passes/PTQPass.cpppymllm/backends/qualcomm/transformers/core/rms_norm.py
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}: Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.
Adhere to language-specific best practices and idioms (e.g., PEP 8 for Python, Google C++ Style Guide for C++).
Files:
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cppmllm/backends/qnn/aot/passes/PTQPass.cpppymllm/backends/qualcomm/transformers/core/rms_norm.py
{mllm,mllm-cli,pymllm}/**/*.{sh,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
If a file starts with a shebang, it must be the first line and contain only the interpreter path and optional single argument, encoded in UTF-8.
Files:
pymllm/backends/qualcomm/transformers/core/rms_norm.py
🧬 Code graph analysis (3)
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp (4)
mllm/compile/ir/linalg/Attribute.hpp (14)
spec(100-105)spec(130-135)spec(160-165)spec(192-197)spec(226-231)spec(260-265)spec(297-302)create(77-77)node(27-27)node(27-27)node(327-327)node(327-327)node(348-348)node(348-348)mllm/backends/qnn/aot/visitor/Elewise.cpp (2)
isMatch(13-15)isMatch(13-13)mllm/backends/qnn/aot/passes/LLMQuantRecipePass.hpp (19)
op(40-40)op(54-54)op(66-66)op(80-80)op(92-92)op(106-106)op(120-120)op(132-132)op(144-144)op(156-156)op(170-170)op(184-184)op(198-198)op(212-212)op(226-226)op(238-238)shareQuantSpecSingleInputToSingleOutputAndSetOpQuantAnnoAttr(27-28)genSimpleQuantizationSpecAttr(24-25)cloneQuantizationSpecType(32-33)mllm/compile/ir/builtin/Attribute.hpp (10)
node(24-24)node(24-24)node(44-44)node(44-44)node(64-64)node(64-64)node(84-84)node(84-84)node(104-104)node(104-104)
mllm/backends/qnn/aot/passes/PTQPass.cpp (3)
mllm/compile/ir/linalg/Attribute.hpp (14)
quant_min(87-98)quant_min(87-88)quant_min(116-128)quant_min(116-117)quant_min(146-158)quant_min(146-147)quant_min(177-190)quant_min(177-178)quant_min(210-224)quant_min(210-211)quant_min(244-258)quant_min(244-245)quant_min(279-295)quant_min(279-281)mllm/backends/qnn/aot/passes/PTQPass.hpp (1)
op(26-26)mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp (2)
run(983-1013)run(983-983)
pymllm/backends/qualcomm/transformers/core/rms_norm.py (1)
pymllm/backends/qualcomm/transformers/core/qdq.py (4)
forward(50-53)disable_observer(60-62)disable_fakequant(68-70)extra_repr(72-74)
🪛 Clang (14.0.6)
mllm/backends/qnn/aot/passes/PTQPass.cpp
[error] 32-32: 3 adjacent parameters of 'solveLinearWeight' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 32-32: parameter name 'pf' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 33-33: parameter name 'op' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 64-64: 3 adjacent parameters of 'solveRMSNormWeight' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 64-64: parameter name 'pf' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 65-65: parameter name 'op' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 100-100: 3 adjacent parameters of 'solveEmbeddingWeight' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 100-100: parameter name 'pf' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 101-101: parameter name 'op' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 120-120: 3 adjacent parameters of 'recursiveSolveWeights' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 121-121: parameter name 'pf' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 135-135: parameter name 'v' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 137-137: variable name 'tv' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 173-173: 3 adjacent parameters of 'recursiveSolveNormal' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 174-174: parameter name 'pf' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-android
- GitHub Check: build-x86
- GitHub Check: build-macos
🔇 Additional comments (21)
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp (9)
41-46: LGTM! Enhanced logging for pattern processing.

The nested loop correctly matches patterns to their names for improved observability during quantization pass execution.

97-101: LGTM! Correct asymmetric uint16 quantization support.

The implementation correctly uses the uint16 maximum value (65535) and properly includes zero_point parameters for asymmetric per-tensor quantization.

256-267: LGTM! Correct sigmoid quantization pattern.

The pattern appropriately uses noSharingSingleInAndSingleOutQuantAnnoAttr since sigmoid transforms value ranges and should not share input/output quantization specs.

268-279: LGTM! Correct negation quantization pattern.

The pattern correctly shares quantization specs between input and output, which is appropriate for negation operations that preserve the value range.

425-434: LGTM! Improved index pattern with helper reuse.

The refactored implementation correctly ensures the input has a quantization recipe before delegating to the shared helper function, improving code consistency.

436-447: LGTM! Correct slice quantization pattern.

The pattern appropriately shares quantization specs between input and output, which is correct for slice operations that preserve value ranges.

470-472: LGTM! Proper quantization spec cloning for constants.

The use of cloneQuantizationSpecType correctly creates a new quantization spec with the same type as the first input, which is appropriate for constant operands.

874-886: LGTM! Correct embedding weight quantization handling.

The implementation properly looks up the weight tensor, generates an appropriate quantization spec, and correctly adds it to the annotation's weights map.

961-963: LGTM! Correct pattern registration.

The three new patterns (Neg, Slice, Sigmoid) are properly registered with descriptive names and consistent priority values.
pymllm/backends/qualcomm/transformers/core/rms_norm.py (4)
32-44: LGTM!

The forward method correctly implements RMS normalization with numerical stability (float32 casting) and properly applies fake quantization to the weight parameter.

101-116: LGTM!

The freeze_weight method correctly implements the PTQ calibration pattern: it observes the weight, calculates quantization parameters, locks them, and provides useful logging.

118-120: LGTM!

The disable_quant method correctly delegates to disable_fakequant() to turn off quantization noise.

122-123: LGTM!

The extra_repr method now correctly references self.eps, which resolves the issue from the previous review.

mllm/backends/qnn/aot/passes/PTQPass.cpp (8)
23-30: LGTM - Helper function logic is correct.

The bounds-checking logic correctly validates quantization limits for each tensor element.

64-98: LGTM - RMSNorm weight solving logic is correct.

The function properly handles both raw and asymmetric per-tensor quantization, with thorough validation of scale and zero_point properties.

100-118: LGTM - Embedding weight solver correctly handles raw quantization.

The function appropriately marks raw embeddings as solved and uses NYI for unsupported quantization types.

120-133: LGTM - Graph traversal and weight solving dispatch is correct.

The function properly walks the IR graph, dispatches to type-specific solvers, and handles nested subgraphs recursively.

136-171: LGTM - Tensor quantization solving logic is correct.

The function properly handles asymmetric per-tensor, symmetric per-tensor, and raw quantization specs, with appropriate early returns when required attached views are missing.

173-193: LGTM - Normal tensor solving traversal is correct.

The function properly walks LinalgIROps and their input/output tensors, delegating to the solver implementation and handling nested subgraphs.

197-226: LGTM - Pass entry point follows expected structure.

The run method correctly retrieves the parameter file, locates the main graph, and executes the two-stage solving process (weights, then normal tensors). The structure is consistent with similar passes in the codebase.
228-228: LGTM - Factory function is correct.
// FIXME weight maybe error, Check qnn eats int8 or uint8. Here weight using int8 to store int4.
checkTypeLimits<int8_t>(weight, -8, 7);      // Int4
checkTypeLimits<uint8_t>(scale1, 0, 16);     // UInt4
🧩 Analysis chain
🏁 Script executed:
# First, locate and examine the PTQPass.cpp file
find . -name "PTQPass.cpp" -type fRepository: UbiquitousLearning/mllm
Length of output: 109
🏁 Script executed:
# Check file size and read PTQPass.cpp
wc -l ./mllm/backends/qnn/aot/passes/PTQPass.cppRepository: UbiquitousLearning/mllm
Length of output: 113
🏁 Script executed:
# Read PTQPass.cpp with focus on the relevant section
cat -n ./mllm/backends/qnn/aot/passes/PTQPass.cpp | head -70Repository: UbiquitousLearning/mllm
Length of output: 3133
🏁 Script executed:
# Search for LPBQ-related code and tests
rg -i "lpbq|int4|uint4" --type cpp --type py -B 2 -A 2 mllm/backends/qnn/ | head -100Repository: UbiquitousLearning/mllm
Length of output: 5795
🏁 Script executed:
# Search for weight usage in QNN wrapper API
rg -B 5 -A 5 "scale_level_0_int|weight" mllm/backends/qnn/aot/QnnWrappersAPI.cpp | head -150Repository: UbiquitousLearning/mllm
Length of output: 5734
🏁 Script executed:
# Search for how weights are prepared/handled in LPBQ quantization
rg -B 3 -A 3 "\.weight\"|prepare.*weight" mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp | head -100Repository: UbiquitousLearning/mllm
Length of output: 1193
🏁 Script executed:
# Look for the LPBQ weight quantization spec creation with detailed context
rg -B 10 -A 10 "create\(-8, 7" mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cppRepository: UbiquitousLearning/mllm
Length of output: 1136
🏁 Script executed:
# Search for tests related to LPBQ quantization
find . -path "*/test*" -name "*.cpp" -o -name "*.py" | xargs rg -l "LPBQ|int4.*weight" | head -5Repository: UbiquitousLearning/mllm
Length of output: 242
🏁 Script executed:
# Check the QLinear implementation to understand weight storage
cat -n ./pymllm/backends/qualcomm/transformers/core/qlinear.py | head -120Repository: UbiquitousLearning/mllm
Length of output: 5156
🏁 Script executed:
# Continue reading QLinear.py to find LPBQ weight handling
cat -n ./pymllm/backends/qualcomm/transformers/core/qlinear.py | sed -n '115,250p'Repository: UbiquitousLearning/mllm
Length of output: 4110
🏁 Script executed:
# Check if there's any special reason for using 16 as upper bound (maybe loose validation)
rg -B 3 -A 3 "0, 16" mllm/backends/qnn/aot/Repository: UbiquitousLearning/mllm
Length of output: 660
🏁 Script executed:
# Double-check the Python quantization range for scale1 one more time
rg -B 2 -A 2 "clamp.*15" ./pymllm/backends/qualcomm/transformers/core/qlinear.pyRepository: UbiquitousLearning/mllm
Length of output: 223
Remove incorrect FIXME and fix uint4 validation range.
The FIXME comment is unfounded—int8_t is the correct choice for storing signed int4 quantized weights, as confirmed by the quantization spec creating weights with range -8 to 7. However, line 50 has an off-by-one error: uint4 range is 0–15, not 0–16. Additionally, extract the magic numbers (-8, 7, 0, 15) into named constants per coding guidelines.
Fix
- // FIXME weight maybe error, Check qnn eats int8 or uint8. Here weight using int8 to store int4.
checkTypeLimits<int8_t>(weight, -8, 7); // Int4
- checkTypeLimits<uint8_t>(scale1, 0, 16); // UInt4
+ checkTypeLimits<uint8_t>(scale1, 0, 15);     // UInt4

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Before:

// FIXME weight maybe error, Check qnn eats int8 or uint8. Here weight using int8 to store int4.
checkTypeLimits<int8_t>(weight, -8, 7);      // Int4
checkTypeLimits<uint8_t>(scale1, 0, 16);     // UInt4

After:

checkTypeLimits<int8_t>(weight, -8, 7);      // Int4
checkTypeLimits<uint8_t>(scale1, 0, 15);     // UInt4
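The range arithmetic behind this fix can be sketched in a few lines: an n-bit unsigned integer tops out at 2^n - 1, so uint4's maximum is 15, while signed two's-complement int4 spans [-8, 7].

```python
def int_range(bits, signed):
    # Representable range of a bits-wide integer: two's complement
    # when signed, plain binary otherwise.
    if signed:
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1

int4_range = int_range(4, signed=True)      # (-8, 7)
uint4_range = int_range(4, signed=False)    # (0, 15) -- 16 is out of range
uint16_range = int_range(16, signed=False)  # (0, 65535), as used elsewhere in the pass
```

This matches both the int4 bounds already validated for `weight` and the corrected uint4 bound for `scale1`.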
  });
}

void _recursiveSolveNormalImpl(const ir::Val::ptr_t& v) {
Rename function to avoid reserved identifier.
The function name _recursiveSolveNormalImpl begins with an underscore at namespace scope; identifiers starting with an underscore are reserved for the implementation in the global namespace in C++ (and an underscore followed by an uppercase letter is reserved everywhere). Using them can cause undefined behavior or conflicts with implementation-defined identifiers.
🔎 Proposed fix
-void _recursiveSolveNormalImpl(const ir::Val::ptr_t& v) {
+void recursiveSolveNormalImpl(const ir::Val::ptr_t& val) {Also update the call sites at lines 183-184:
- for (auto iii : inputs) { _recursiveSolveNormalImpl(iii->cast_<ir::Val>()); }
- for (auto ooo : outputs) { _recursiveSolveNormalImpl(ooo->cast_<ir::Val>()); }
+ for (auto iii : inputs) { recursiveSolveNormalImpl(iii->cast_<ir::Val>()); }
+ for (auto ooo : outputs) { recursiveSolveNormalImpl(ooo->cast_<ir::Val>()); }Note: This issue was flagged in a previous review but remains unresolved in the current code.
Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Clang (14.0.6)
[error] 135-135: parameter name 'v' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
🤖 Prompt for AI Agents
In @mllm/backends/qnn/aot/passes/PTQPass.cpp around line 135, The function name
_recursiveSolveNormalImpl uses a reserved identifier pattern; rename it to
recursiveSolveNormalImpl (remove the leading underscore) and update all call
sites that invoke _recursiveSolveNormalImpl (e.g., the calls at the former lines
183-184) to the new name so declarations, definitions, and invocations match.
# Quantization configuration for Weight
self.weight_fake_quant = FakeQuantize(
    observer=MinMaxObserver.with_args(
        qscheme=torch.per_tensor_affine, dtype=torch.qint32
    ),
    quant_min=0,
    quant_max=2 ** (quant_bits) - 1,
    dtype=torch.qint32,
    qscheme=torch.per_tensor_affine,
)
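For context, the per-tensor affine parameters a min-max observer derives follow a standard calculation; a simplified pure-Python sketch under that assumption (the real MinMaxObserver also handles degenerate ranges and dtype-specific clamping):

```python
def minmax_affine_qparams(x_min, x_max, quant_min, quant_max):
    # Extend the observed range to include zero so that 0.0 is exactly
    # representable, then derive scale and zero_point.
    x_min = min(x_min, 0.0)
    x_max = max(x_max, 0.0)
    scale = (x_max - x_min) / (quant_max - quant_min)
    zero_point = int(round(quant_min - x_min / scale))
    zero_point = max(quant_min, min(quant_max, zero_point))
    return scale, zero_point

# Weights observed in [-1, 3], quantized to the unsigned 16-bit grid [0, 65535]
scale, zero_point = minmax_affine_qparams(-1.0, 3.0, 0, 2**16 - 1)
```

The `quant_min=0, quant_max=2**quant_bits - 1` pair in the config above feeds exactly this kind of calculation, which is why the unsigned range must agree with the dtype the weights are later stored in.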
🧩 Analysis chain
🏁 Script executed:
# First, let's read the rms_norm.py file to see the actual implementation
cd /repo && find . -name "rms_norm.py" -path "*/qualcomm/*" | head -5Repository: UbiquitousLearning/mllm
Length of output: 122
🏁 Script executed:
# Search for the file in common locations
fd "rms_norm.py" . --type f 2>/dev/null | grep -i qualcommRepository: UbiquitousLearning/mllm
Length of output: 123
🏁 Script executed:
# Broader search
fd "rms_norm" . --type f 2>/dev/nullRepository: UbiquitousLearning/mllm
Length of output: 123
🏁 Script executed:
cat -n ./pymllm/backends/qualcomm/transformers/core/rms_norm.pyRepository: UbiquitousLearning/mllm
Length of output: 5318
🌐 Web query:
PyTorch torch.quint8 torch.quint16 quantization dtype support
💡 Result:
Short answer: PyTorch's built-in quantized dtypes are torch.quint8, torch.qint8 and torch.qint32. torch.quint16 is not provided as a native quantized dtype (and is not supported by torch.quantize_per_tensor / the core quantization APIs). New lower-bit/int dtypes (uint1–uint8, int1–int8) and other numeric types are being discussed/added in torch/torch.ao, but that is separate from the classic quantized-dtype API. [1][2][3]
Sources:
- PyTorch quantization API (lists supported quantized dtypes: quint8, qint8, qint32). [1]
- torch.quantize_per_tensor docs (dtype parameter accepted values: torch.quint8, torch.qint8, torch.qint32). [2]
- torch.ao / developer discussion on adding broader integer dtypes and roadmap (notes about uint/int variants and that quantized-tensor support is evolving). [3]
Fix dtype and quantization range mismatch between training and deployment.
The code uses an unsigned quantization range [0, 2^quant_bits - 1] with dtype=torch.qint32 during training, but converts weights to signed integer dtypes (int8, int16, int32) during deployment. This creates a critical mismatch:
- 8-bit case: Quantizes to unsigned range [0, 255], but converts to torch.int8 (signed range [-128, 127]), causing overflow for values 128–255.
- 16-bit case: Quantizes to unsigned range [0, 65535], but converts to torch.int16 (signed range [-32768, 32767]), causing overflow for values 32768–65535.
PyTorch's quantization API only supports torch.quint8, torch.qint8, and torch.qint32—not torch.quint16.
Solution: Use signed quantization ranges that match the deployment dtypes. For 8-bit, use torch.quint8 and adjust the range to [0, 255] (unsigned), or use torch.qint8 with a signed range [-128, 127]. For 16-bit, use signed qint32 with a signed range or find an alternative approach since PyTorch doesn't provide quint16.
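The 8-bit overflow described above is easy to reproduce outside the module; here is a minimal sketch using NumPy in place of the torch cast (the wrap-around is the same two's-complement truncation either way):

```python
import numpy as np

# Asymmetric 8-bit quantization clamps to the unsigned range [0, 255].
quant_min, quant_max = 0, 255
w_int = np.array([0.0, 100.0, 200.0, 255.0]).clip(quant_min, quant_max)

# Casting those unsigned-range values to signed int8 wraps everything
# above 127 via two's-complement truncation: 200 -> -56, 255 -> -1.
as_int8 = w_int.astype(np.int64).astype(np.int8)
print(as_int8.tolist())  # [0, 100, -56, -1]

# An unsigned deployment dtype preserves the full range instead.
as_uint8 = w_int.astype(np.int64).astype(np.uint8)
print(as_uint8.tolist())  # [0, 100, 200, 255]
```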
```python
# 2. Calculate integer values
# w_int = round(w / s + zp)
w_int = torch.round(self.weight / scale + zero_point).clamp(
    quant_min, quant_max
)

# 3. Set target integer type
if self.quant_bits <= 8:
    target_dtype = torch.int8
elif self.quant_bits <= 16:
    target_dtype = torch.int16
else:
    target_dtype = torch.int32

w_int = w_int.to(target_dtype)
```
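For context, the scale and zero-point that feed the rounding step in the snippet above come from the standard asymmetric formula; the following pure-Python helper is hypothetical (names are not taken from the module) and just illustrates how the unsigned-range parameters are derived:

```python
def asymmetric_qparams(w_min, w_max, quant_bits=8):
    """Derive scale and zero-point for the unsigned asymmetric range
    [0, 2**quant_bits - 1] used by the training path."""
    quant_min, quant_max = 0, 2 ** quant_bits - 1
    # Widen the range so that real-valued zero is exactly representable.
    w_min, w_max = min(w_min, 0.0), max(w_max, 0.0)
    scale = (w_max - w_min) / (quant_max - quant_min)
    zero_point = round(quant_min - w_min / scale)
    return scale, zero_point

scale, zp = asymmetric_qparams(-1.0, 3.0, quant_bits=8)
# With w_int = round(w / scale + zp): w = -1.0 maps to 0, w = 3.0 maps to 255
print(scale, zp)
```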
🧩 Analysis chain
🏁 Script executed:
cat -n pymllm/backends/qualcomm/transformers/core/rms_norm.py | head -100
Length of output: 4106
🏁 Script executed:
# Search for quant_min and quant_max definitions in the file
rg -n "quant_min|quant_max" pymllm/backends/qualcomm/transformers/core/rms_norm.py -B2 -A2Repository: UbiquitousLearning/mllm
Length of output: 744
🏁 Script executed:
# Search for the __init__ method to see the quantization scheme initialization
rg -n "__init__|quant_min|quant_max" pymllm/backends/qualcomm/transformers/core/rms_norm.py -B1 -A1Repository: UbiquitousLearning/mllm
Length of output: 633
🏁 Script executed:
# Look for QNN backend quantization examples or documentation
fd -e py -e md -e txt . pymllm/backends/qualcomm/ | head -20
Length of output: 690
🏁 Script executed:
# Check if there are other quantization classes that might show the intended pattern
rg -n "quant_min|quant_max|int8|int16" pymllm/backends/qualcomm/transformers/core/ -A3 -B3Repository: UbiquitousLearning/mllm
Length of output: 8119
🏁 Script executed:
# Look at the QDQ (Quantize-Dequantize) module
cat -n pymllm/backends/qualcomm/transformers/core/qdq.py
Length of output: 3446
🏁 Script executed:
# Look at the QLinear module
cat -n pymllm/backends/qualcomm/transformers/core/qlinear.py | head -150
Length of output: 6582
🏁 Script executed:
# Search for tests related to QRMSNorm
fd -e py . pymllm/ | xargs grep -l "QRMSNorm" 2>/dev/null
Length of output: 237
🏁 Script executed:
# Look at the README for quantization context
cat pymllm/backends/qualcomm/transformers/README.md
Length of output: 263
Critical: Signed/unsigned dtype mismatch will corrupt quantized weights.
The quantization uses an unsigned range [0, 2^quant_bits - 1] (asymmetric scheme), but convert_to_deploy casts to signed integer types:
- 8-bit: w_int clamped to [0, 255] but cast to int8 with range [-128, 127] → values above 127 will overflow
- 16-bit: w_int clamped to [0, 65535] but cast to int16 with range [-32768, 32767] → values above 32767 will overflow
Fix by either using unsigned dtypes (if the QNN backend supports them) or switching to symmetric quantization with signed ranges:
```diff
 # 3. Set target integer type
 if self.quant_bits <= 8:
-    target_dtype = torch.int8
+    target_dtype = torch.uint8
 elif self.quant_bits <= 16:
-    target_dtype = torch.int16
+    target_dtype = torch.uint16
 else:
     target_dtype = torch.int32
```

Or in `__init__`, change to symmetric quantization matching the other quantization modules in the codebase:

```diff
-quant_min=0,
-quant_max=2 ** (quant_bits) - 1,
+quant_min=-(2 ** (quant_bits - 1)),
+quant_max=2 ** (quant_bits - 1) - 1,
```

Committable suggestion skipped: line range outside the PR's diff.
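As a quick sanity check on the symmetric alternative: with a signed range, the clamp guarantees every value fits the deployment dtype by construction, so the cast becomes lossless. A small sketch (helper name is assumed for illustration):

```python
import numpy as np

def symmetric_range(quant_bits):
    # Signed range matching int8/int16 exactly, e.g. [-128, 127] for 8 bits.
    return -(2 ** (quant_bits - 1)), 2 ** (quant_bits - 1) - 1

quant_min, quant_max = symmetric_range(8)
w_int = np.array([-500, -128, 0, 127, 500]).clip(quant_min, quant_max)

# The signed cast is now lossless: every clamped value is within int8's range.
assert np.array_equal(w_int.astype(np.int8), w_int)
print(w_int.astype(np.int8).tolist())  # [-128, -128, 0, 127, 127]
```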
Summary by CodeRabbit
New Features
Improvements
Chores