fix: Qualcomm QNN AOT Pass #579

Merged
chenghuaWang merged 14 commits into UbiquitousLearning:main from chenghuaWang:wch-main
Jan 5, 2026

Conversation


@chenghuaWang chenghuaWang commented Jan 5, 2026

Summary by CodeRabbit

  • New Features

    • Sigmoid activation added to CPU backend (x86 & ARM)
    • Qwen3 transformer with integrated quantization and PTQ workflow
    • Activation QDQ and quantized linear/RMSNorm modules for deployment
  • Improvements

    • Broader quantization pipeline and pattern recognition (PTQ/LPBQ)
    • Support for additional unsigned integer data types (UInt8/UInt16)
    • Improved deploy-time conversion and conversion tooling
  • Chores

    • Added reproducible Python requirements for QNN AOT workflows



coderabbitai Bot commented Jan 5, 2026

📝 Walkthrough

This PR adds PTQ integration into the QNN AOT lowering pipeline, introduces Sigmoid IR/op/backend support, expands quantization types and tensor view handling, implements PTQ solver passes and QDQ wrappers for Qwen3, and adds Qualcomm PTQ tooling and Python workflows for Qwen3.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **QNN AOT pipeline & context**<br>`mllm/backends/qnn/aot/passes/AOTPipeline.hpp`, `.../AOTPipeline.cpp`, `.../AOTCompileContext.hpp`, `.../AOTCompileContext.cpp` | createQnnAOTLoweringPipeline now accepts a ParameterFile::ptr_t and sets it into AOTCompileContext (set/getParamFile). PTQ pass insertion added. |
| **PTQ pass & patterns**<br>`mllm/backends/qnn/aot/passes/PTQPass.hpp`, `.../PTQPass.cpp`, `.../LLMQuantRecipePass.hpp`, `.../LLMQuantRecipePass.cpp` | New PTQPass implementing recursive weight/tensor quant solvers; added Sigmoid/Neg/Slice quant patterns and registration changes in LLMQuantRecipePass. |
| **QNN wrappers & utils**<br>`mllm/backends/qnn/QNNUtils.hpp`, `mllm/backends/qnn/aot/QnnWrappersAPI.cpp` | Adjusted attachedViews access to pair semantics (.second) for quant scale; create context tensor call added for static-weight capture. |
| **Examples: Qwen3 QNN AOT**<br>`examples/qwen3_qnn_aot/compile.cpp`, `examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp` | Removed pre-generation of RoPE sin/cos, switched sequence tensor to int32, changed KV cache dtypes to unsigned variants, attached per-tensor quant metadata to KV cache, added rotateHalf and QDQ wrappers, updated lowering pipeline invocation to pass params. |
| **Tensor & view refactor**<br>`mllm/core/Tensor.hpp`, `mllm/core/Tensor.cpp`, `mllm/core/TensorViewImpl.hpp` | attachedViews now stores `pair<bool, impl>`; attach gained exclude_from_hash flag; added constant/addConstant/subConstant/mulConstant and `_unsafeSetDType`; moved attached_views into TensorViewImpl with accessor. |
| **Quantization spec & attributes**<br>`mllm/compile/ir/linalg/Attribute.hpp`, `.../Attribute.cpp` | Added solved flag to QuantizationSpec and include solved in dumps. |
| **Op/IR: Sigmoid support**<br>`mllm/core/OpTypes.hpp`, `mllm/core/aops/SigmoidOp.hpp`, `.../SigmoidOp.cpp`, `mllm/nn/Functional.hpp`, `mllm/nn/Functional.cpp`, `mllm/compile/ir/GeneratedRTTIKind.hpp`, `.../NodeRTTIClassOfImpl.hpp`, `mllm/compile/ir/linalg/Op.hpp`, `mllm/compile/ir/linalg/Op.cpp`, `mllm/compile/ir/rtti_kind_gen.py` | Added OpTypes::kSigmoid, aops SigmoidOp, functional wrapper, and corresponding linalg/RTTI registrations. |
| **CPU sigmoid kernels & op**<br>`mllm/backends/cpu/kernels/.../sigmoid.{hpp,cpp}`, `mllm/backends/cpu/ops/SigmoidOp.{hpp,cpp}`, `mllm/backends/cpu/Kernels.hpp`, `mllm/backends/cpu/CPUBackend.cpp` | New architecture-specific sigmoid kernels (x86 fp32, ARM fp32/fp16) and CPUSigmoidOp with factory registration in CPUBackend. |
| **Linear op dtype tweak**<br>`mllm/backends/cpu/ops/LinearOp.cpp` | For LPBQ w4a16o16 impls, output dtype changed from kInt16PerTensorSym to kUInt16PerTensorAsy. |
| **DataTypes mapping**<br>`mllm/core/DataTypes.cpp`, `pymllm/convertor/mllm_type_mapping.py` | Added uint8/uint16 per-tensor types handling and torch.uint16 mapping in Python convertor. |
| **Module ParameterFile plumbing**<br>`mllm/nn/Module.hpp`, `mllm/nn/Module.cpp` | resources_mapped_files_ now stores ParameterFile::ptr_t; added getTopParameterFile() on Module/ModuleImpl. |
| **Compile IR RTTI / Op registration**<br>`mllm/compile/ir/...` | RTTI and op registration updated for new Sigmoid/linalg op kinds. |
| **AOps / ParamOp change**<br>`mllm/core/aops/ParamOp.cpp` | ParamOp::trace now registers weight parameter; reshape emits weight_ as output. |
| **Elementwise op registration**<br>`mllm/core/aops/ElewiseOps.cpp` | Neg operator moved from binary to unary registration. |
| **Python Qualcomm PTQ tooling**<br>`pymllm/backends/qualcomm/transformers/core/{qdq.py,qlinear.py,rms_norm.py}`, `.../qwen3/{modeling_qwen3.py,runner.py,train.py}`, `pymllm/backends/qualcomm/transformers/README.md`, `static_qwen3.py`, `requirements-qnn-aot.txt` | Added ActivationQDQ, QLinear families (LPBQ, per-channel), QRMSNorm, full Qwen3 Qualcomm model, Qwen3Quantizer runner and training script, doc updates, removed some torch/torchao imports in static_qwen3.py, and added requirements for QNN AOT. |
| **Quantize pipeline & convertor**<br>`pymllm/quantize/pipeline.py`, `pymllm/utils/mllm_convertor.py` | Exposed build_raw_pipeline() and integrated a raw pipeline path for v2 format conversion (cast_left_2_fp32=False). |
| **Miscellaneous**<br>`mllm/backends/qnn/aot/passes/AOTCompileContext.cpp`, `mllm/backends/qnn/aot/QnnWrappersAPI.cpp`, `requirements-qnn-aot.txt`, `.gitignore` | Added AOTCompileContext param accessors, qnn context tensor creation for static tensors, new requirements file and gitignore entry. |

Sequence Diagram(s)

sequenceDiagram
    participant User as compile.cpp
    participant AOT as createQnnAOTLoweringPipeline
    participant AOTCtx as AOTCompileContext
    participant PTQ as PTQPass
    participant ParamFile as ParameterFile
    participant IR as ModuleOp/IRGraph
    participant Specs as QuantizationSpec

    User->>AOT: call createQnnAOTLoweringPipeline(env, cfg, params)
    AOT->>AOTCtx: ctx.setParamFile(params)
    AOT->>PTQ: insert PTQPass into pipeline (llm_recipe path)
    AOT->>User: return pipeline

    Note over PTQ,IR: PTQ run traverses IR and resolves quant specs
    PTQ->>IR: run(ModuleOp)
    PTQ->>PTQ: recursiveSolveWeights(subgraph)
    PTQ->>ParamFile: read weight scales/zeros for layer
    ParamFile-->>PTQ: return scale/zero_point
    PTQ->>Specs: populate spec (scale/zero_point) and mark solved
    PTQ->>PTQ: recursiveSolveNormal(tensors)
    PTQ->>IR: read attachedViews (scale/zero_point tensors)
    IR-->>PTQ: provide attached views
    PTQ->>Specs: populate/specify & mark solved
    PTQ-->>IR: PASS_RET_SUCCESS

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • liang1232018
  • oreomaker
  • yirongjie

Poem

🐰 Whiskers twitch at sigmoid's grace,
Quant-spec blossoms in each place,
PTQ pass solves weights with care,
QDQ wraps Qwen3 fair!
UInt16 hops, tensors bound anew—hoppity-hop through IR and view!

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | No pull request description was provided by the author. The PR template requires a clear description, but the author submitted an empty description. | Add a comprehensive description explaining the purpose of the changes, key modifications to the QNN AOT pass, and any testing performed. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 12.79%, which is insufficient; the required threshold is 80.00%. | Run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (1 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately reflects the main change: fixing the Qualcomm QNN AOT Pass with updates to pipeline invocation, quantization patterns, and related infrastructure. |
✨ Finishing touches
  • 📝 Generate docstrings

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 54d9927 and fb40759.

📒 Files selected for processing (1)
  • mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp
🧰 Additional context used
📓 Path-based instructions (4)
{mllm,mllm-cli,pymllm}/**/*

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*: Files must not contain C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x80–0x9F, or DEL 0x7F. Horizontal tab (0x09) and line feed (0x0A) are explicitly allowed.
All files must be encoded in UTF-8 without BOM.
Any violation of character set (Rule 1) or encoding (Rule 2) requirements must cause the review to fail.
No line may end with trailing whitespace.
Use Unix line endings (LF).
File and directory names must consist only of printable Unicode characters, excluding C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x80–0x9F, and DEL 0x7F.
Only use acceptable file extensions: .c, .cc, .cpp, .cxx, .h, .hh, .hpp, .py, .pyi, .sh, .txt, .md, .yml, .yaml, .json, .toml.
Optional license headers, if present, must comply with character set rules (no C0/C1 control codes except tab and line feed).

Files:

  • mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}: TODO and FIXME comments must be written as 'TODO:' or 'FIXME:' followed by UTF-8 text that adheres to character set rules.
Encourage consistent coding style and patterns with the existing codebase.
Ensure code is portable across supported platforms (e.g., Linux, Windows) unless explicitly platform-specific.

Files:

  • mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}: Prioritize production-ready code quality by evaluating time and space complexity of algorithms and data structures, and suggest more efficient alternatives for operations with high complexity (e.g., O(n^2) or worse) when feasible.
Avoid unnecessary object creation in loops or hot paths.
Check for proper error handling and resource cleanup (e.g., using try-finally, context managers, or RAII).
Ensure functions that can fail return appropriate error codes or raise exceptions.
Validate inputs for public APIs and critical internal functions.
Add comments for complex algorithms or non-obvious logic.
Identify potential security issues (e.g., buffer overflows, injection risks, insecure temporary files) and recommend using secure alternatives (e.g., parameterized queries, secure random generators).
Suggest adding unit tests for untested complex logic or edge cases.
Ensure code is testable by avoiding global state and using dependency injection.
Flag overly complex functions (e.g., high cyclomatic complexity) and suggest breaking them down.
Use named constants instead of magic numbers.
Add appropriate logging (e.g., debug, info, warning, error) for significant events and errors, avoiding sensitive data exposure.

Files:

  • mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}: Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.
Adhere to language-specific best practices and idioms (e.g., PEP 8 for Python, Google C++ Style Guide for C++).

Files:

  • mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp
🧬 Code graph analysis (1)
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp (5)
mllm/compile/ir/linalg/Attribute.hpp (14)
  • spec (100-105)
  • spec (130-135)
  • spec (160-165)
  • spec (192-197)
  • spec (226-231)
  • spec (260-265)
  • spec (297-302)
  • create (77-77)
  • node (27-27)
  • node (27-27)
  • node (327-327)
  • node (327-327)
  • node (348-348)
  • node (348-348)
mllm/backends/qnn/aot/QnnWrappersAPI.hpp (2)
  • create (48-50)
  • create (48-48)
mllm/backends/qnn/aot/visitor/Elewise.cpp (2)
  • isMatch (13-15)
  • isMatch (13-13)
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.hpp (16)
  • op (40-40)
  • op (54-54)
  • op (66-66)
  • op (80-80)
  • op (92-92)
  • op (106-106)
  • op (120-120)
  • op (132-132)
  • op (144-144)
  • op (156-156)
  • op (170-170)
  • op (184-184)
  • op (198-198)
  • op (212-212)
  • op (226-226)
  • op (238-238)
mllm/compile/ir/builtin/Attribute.hpp (10)
  • node (24-24)
  • node (24-24)
  • node (44-44)
  • node (44-44)
  • node (64-64)
  • node (64-64)
  • node (84-84)
  • node (84-84)
  • node (104-104)
  • node (104-104)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-macos
  • GitHub Check: build-android
  • GitHub Check: build-x86
🔇 Additional comments (9)
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp (9)

41-46: LGTM! Useful debugging enhancement.

The logging addition helps track which pattern is processing each operation, improving observability during the quantization recipe pass.


97-101: LGTM! Correct uint16 asymmetric quantization support.

The implementation correctly uses 65535 as the maximum value for uint16 asymmetric per-tensor quantization and follows the established pattern for other quantization types.
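
For reference, a minimal sketch of the affine per-tensor round trip this range implies (illustrative Python, not code from the PR):

import numpy as np

QMIN, QMAX = 0, 65535  # uint16 asymmetric range

def fake_quant_uint16(x: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    # quantize: q = clamp(round(x / scale) + zp, 0, 65535), then dequantize
    q = np.clip(np.round(x / scale) + zero_point, QMIN, QMAX).astype(np.uint16)
    return (q.astype(np.float32) - zero_point) * scale

Using 65536 as quant_max would fall outside the representable uint16 range, which is why the related fixes later in this review settle on 65535.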


256-279: LGTM! New patterns follow established conventions.

The Sigmoid and Neg pattern implementations are structurally sound and make appropriate quantization decisions:

  • Sigmoid uses noSharingSingleInAndSingleOutQuantAnnoAttr (correct for non-linear activation)
  • Neg uses shareQuantSpecSingleInputToSingleOutputAndSetOpQuantAnnoAttr (correct for linear transformation)

423-434: LGTM! Good refactoring to use shared helper.

The Index pattern has been simplified to use shareQuantSpecSingleInputToSingleOutputAndSetOpQuantAnnoAttr, making it more maintainable while preserving the correct semantics (index operations preserve quantization specs).


436-447: LGTM! Slice pattern correctly shares quantization specs.

The Slice pattern implementation correctly uses shareQuantSpecSingleInputToSingleOutputAndSetOpQuantAnnoAttr, which is semantically appropriate since slicing operations preserve the quantization characteristics of their input.


470-472: LGTM! Good refactoring to use shared helper.

Using cloneQuantizationSpecType improves maintainability by delegating to a shared helper instead of duplicating the cloning logic inline.


857-874: Excellent! Critical bugs from previous reviews are now fixed.

The Embedding pattern rewrite correctly addresses both issues flagged in earlier reviews:

  1. Input spec annotation: Both branches (lines 857-865) now correctly add the input spec to annotation_.inputs
  2. Output spec annotation: Both branches (lines 867-874) now correctly add the output spec to annotation_.outputs (not inputs)

The logic is now symmetric and correct for both the "generate spec" and "use existing spec" paths.


876-890: LGTM! Weight handling is correct and well-structured.

The weight lookup, validation, and quantization spec generation follow the established pattern used in other ops (e.g., RMSNorm). Error handling with MLLM_RETURN_FALSE_IF_NOT is appropriate.


963-965: LGTM! Pattern registrations are correct.

The three new patterns (neg, slice, sigmoid) are properly registered with appropriate names and priority 0, consistent with other patterns in the pass.



@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 17

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

Fix all issues with AI Agents 🤖
In @examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp:
- Around line 32-38: Remove the dead if/else by computing a single prefix from
m->getModuleName() and then assign scale_name and zp_name once; e.g., build
prefix = empty ? "" : moduleName + "." using m->getModuleName(), then set
scale_name = prefix + qdq_name_in_pytorch + ".fake_quant.scale" and zp_name =
prefix + qdq_name_in_pytorch + ".fake_quant.zero_point" (replace the duplicated
if/else that currently references scale_name, zp_name, m->getModuleName(), and
qdq_name_in_pytorch).

In @mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp:
- Around line 390-393: The quantization spec uses an out-of-range max value:
change the call to ir::linalg::QuantizationSpecAsymPerTensor::create(...) used
when constructing weight_spec_attr so that the quant_max argument is 65535
(uint16 max) instead of 65536; update the value passed to
QuantizationSpecAsymPerTensor::create in the weight_spec_attr initialization
(leave the FIXME about hardcoded dtype for later).
- Around line 865-872: The output tensor handling block in
LLMQuantRecipePass.cpp incorrectly appends an existing quant_recipe to
annotation_attr->annotation_.inputs instead of outputs; locate the branch that
checks o_0->getAttr("quant_recipe") (the o_0/output handling near
genSimpleQuantizationSpecAttr and annotation_attr) and change the emplace_back
call to push the spec onto annotation_attr->annotation_.outputs (using the same
cast to ir::linalg::LinalgIRQuantizatonSpecAttr and ->spec_) so existing output
specs are recorded under outputs instead of inputs.
- Around line 770-771: The QuantizationSpecAsymPerTensor creation uses an
off-by-one max for uint16; update the call that constructs out_quant_spec
(ir::linalg::QuantizationSpecAsymPerTensor::create) to use 65535 instead of
65536 for quant_max so the per-tensor asymmetric uint16 range is correct (match
the fix used in the RMSNorm pattern).

In @mllm/backends/qnn/aot/passes/PTQPass.cpp:
- Line 135: The function name __recursiveSolveNormalImpl uses a reserved
double-underscore prefix; rename it to recursiveSolveNormalImpl (or another
single-underscore-free identifier) and update its definition and all call sites
in this file to the new name, including any forward declarations or references
that invoke __recursiveSolveNormalImpl so linkage and compilation remain
consistent.

In @mllm/core/aops/SigmoidOp.cpp:
- Around line 27-33: SigmoidOp::reshape currently accesses inputs[0] without
checking inputs.size(); add validation at the start of reshape to ensure inputs
is non-empty (e.g., if inputs.empty() throw or return an error), and also
validate that outputs is empty or sized appropriately before emplacing; after
the check, proceed to use inputs[0] as before to construct outputs (either
in-place or empty tensor), so you avoid out-of-bounds access when inputs is
empty.
- Around line 16-21: SigmoidOp::trace currently assumes inputs and outputs are
non-empty; add the same validation used in reshape() to check that inputs.size()
> 0 and outputs.size() > 0 at the start of SigmoidOp::trace and handle the
failure path (e.g., throw an exception or return early) before calling
ir::tensor::wrapTensors2TensorIR and ir_ctx->create<ir::linalg::SigmoidOp>(...),
so you never pass empty vectors into the IR wrapping/creation logic.
- Around line 23-25: The SigmoidOp::forward currently calls NYI and will crash
if the base aops::SigmoidOp is ever used; replace the NYI with a proper
implementation or a clear runtime error and/or documentation. If a generic
fallback is desired, implement element-wise sigmoid using inputs[0] ->
outputs[0] (apply 1/(1+exp(-x)) across the tensor elements) or delegate to
available element-wise utilities; otherwise change the NYI to throw a
descriptive exception (e.g., "SigmoidOp::forward must be implemented by
backend-specific subclass like CPUSigmoidOp") and add a comment indicating the
base class is abstract and not to be instantiated directly. Ensure you reference
SigmoidOp::forward and the inputs/outputs vectors when making the change.

In @mllm/core/OpTypes.hpp:
- Around line 98-99: The optype2Str function is missing a switch case for the
new enum value kSigmoid, causing it to fall through to "Unknown"; update the
switch in optype2Str to add a case for kSigmoid that returns the appropriate
string (e.g., "kSigmoid") so logging/diagnostics report the correct op type;
locate the switch handling OpType enum values in optype2Str and insert the new
case alongside the other cases for consistency.

In @mllm/core/Tensor.cpp:
- Around line 33-36: The current Tensor::operator delete and delete_() call
impl_.reset() before accessing impl_->attachedViews(), causing a use-after-free;
fix by first obtaining a copy or reference of the attached views (e.g., auto
views = ((Tensor*)ptr)->impl_->attachedViews() or check impl_ and iterate its
attachedViews()), iterate that collection to reset each view->second, and only
after cleaning up the attached views call impl_.reset(); also add a null check
on ptr and on impl_ before accessing to be safe.
- Around line 78-91: The switch in Tensor::constant mis-casts
kUInt16PerTensorAsy to int16_t causing wrong values for unsigned 16-bit; update
the kUInt16PerTensorAsy case to write through rhs_tensor.ptr<uint16_t>() and
cast x to uint16_t (or use static_cast<uint16_t>(x)), ensuring the
dtype-specific pointer type matches the unsigned 16-bit DataTypes enum; verify
similar patterns for kInt16PerTensorSym remain int16_t.

In @mllm/core/Tensor.hpp:
- Around line 470-476: The public method name __unsafeSetDType uses a
double-underscore prefix reserved by C++; rename it to a non-reserved identifier
(e.g., unsafeSetDType) across the Tensor API: update the declaration in class
Tensor (replace __unsafeSetDType with unsafeSetDType), adjust all
implementations, usages, and any related tests or headers to the new name, and
ensure symbols in documentation and bindings are updated to avoid the
reserved-identifier conflict.

In @pymllm/backends/qualcomm/transformers/.gitignore:
- Line 1: The .gitignore contains an unclear placeholder entry
"static_one_more_thing.py"; either remove this leftover if it’s not needed, or
replace it with a descriptive name (e.g., "static_qnn_debug.py") and add an
inline comment in the same .gitignore explaining its purpose and why it should
be ignored so future maintainers understand why "static_one_more_thing.py" (or
the new name) is excluded.

In @pymllm/backends/qualcomm/transformers/core/qdq.py:
- Around line 18-19: The code sets self.dtype = torch.qint32 which FakeQuantize
does not support; update the simulation dtype to a supported type (e.g., set
self.dtype = torch.qint8 for symmetric or torch.quint8 for asymmetric
quantization) and adjust the MinMaxObserver initialization used by FakeQuantize
(e.g., observer dtype/quant_min/quant_max or dtype parameter) to match the
chosen qint8/quint8 type so dtype validation passes at runtime; locate and
change the self.dtype assignment in qdq.py and any places that construct
MinMaxObserver or pass dtype to FakeQuantize to use the chosen 8-bit qtype
consistently.
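
As an illustration of the suggestion above, a minimal sketch of an 8-bit asymmetric FakeQuantize setup (the exact observer and ranges here are assumptions, not the project's final configuration):

import torch
from torch.ao.quantization import FakeQuantize, MinMaxObserver

# Assumed 8-bit asymmetric activation fake-quant; quant_min/quant_max must
# match the chosen dtype (quint8 -> [0, 255]) or dtype validation fails.
act_fake_quant = FakeQuantize(
    observer=MinMaxObserver,
    quant_min=0,
    quant_max=255,
    dtype=torch.quint8,
    qscheme=torch.per_tensor_affine,
)

x = torch.randn(4, 8)
y = act_fake_quant(x)  # observes min/max, then applies fake quant-dequant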

In @pymllm/backends/qualcomm/transformers/core/qlinear.py:
- Around line 169-180: DoubleQuantizer.forward currently leaves the frozen-path
reconstruction unimplemented: when self.is_frozen is True and
self.w_recon_cached is None you must reconstruct the full-precision weights from
the stored quantized buffers (e.g., weight_q, scale_1, scale_2) instead of
falling back to quantize_dequantize(w). Implement logic in
DoubleQuantizer.forward to read the stored buffers (self.weight_q, self.scale_1,
self.scale_2), apply the inverse quantization/scale math used by this quantizer
to produce the reconstructed tensor, assign it to self.w_recon_cached, and
return it; ensure you preserve original dtype/shape and handle missing buffers
by raising a clear error rather than silently calling quantize_dequantize; keep
the existing behavior for the non-frozen path (return
self.quantize_dequantize(w)). A hedged reconstruction sketch is included after this file's items below.
- Around line 193-207: The convert_to_deploy method must ensure the quantizer is
frozen before accessing its buffers: mirror the pattern used in
QLinearW8A16_PerChannelSym.convert_to_deploy by checking the quantizer's frozen
state (or calling its freeze() method) and validating that
self.weight_quant.weight_q is not None before deleting/registering buffers; if
weight_q is None raise a clear RuntimeError explaining the quantizer must be
frozen, otherwise proceed to register_buffer("weight", ...), scale1 and scale2
and then delete self.weight_quant and set deploy_mode.
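
A hedged sketch of the frozen-path reconstruction described in the 169-180 item above. The buffer names (weight_q, scale_1, scale_2) come from the review; the dequantization math (effective scale = scale_1 * scale_2) is an assumption about the two-level scheme and must be matched to the quantizer's actual forward math:

def forward(self, w):
    if self.is_frozen:
        if self.w_recon_cached is None:
            # Buffers are assumed to be populated by freeze(); fail loudly otherwise.
            if self.weight_q is None or self.scale_1 is None or self.scale_2 is None:
                raise RuntimeError(
                    "DoubleQuantizer is frozen but quantized buffers are missing"
                )
            # Assumed two-level dequant: block-level scale times channel-level scale
            # (broadcasting must match the quantizer's actual block layout).
            self.w_recon_cached = self.weight_q.to(w.dtype) * (self.scale_1 * self.scale_2)
        return self.w_recon_cached
    return self.quantize_dequantize(w)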

In @pymllm/backends/qualcomm/transformers/core/rms_norm.py:
- Around line 122-123: The extra_repr method in the RMSNorm class references a
nonexistent attribute self.variance_epsilon causing an AttributeError; change
the representation to use the existing attribute self.eps (or create an alias
self.variance_epsilon = self.eps in __init__), e.g., update extra_repr to return
f"{tuple(self.weight.shape)}, eps={self.eps}" so it uses the defined attribute.

In @pymllm/backends/qualcomm/transformers/core/test_qlinear.py:
- Around line 41-48: The test constructs QLinearLPBQ with unsupported kwargs
which triggers a TypeError; remove the invalid parameters
already_quantized_weight and already_quantized_activation from the
QLinearLPBQ(...) call in the test (keep only in_features, out_features, bias,
block_size) or if those flags are required update QLinearLPBQ.__init__ to accept
and store them; locate the test instantiation of QLinearLPBQ in test_qlinear.py
and adjust the constructor call or update QLinearLPBQ.__init__ accordingly.

In @pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py:
- Around line 583-596: The code assumes input_ids is always set when computing
seq_len and causal_mask, but input_ids can be None when inputs_embeds is
provided; update the logic in the causal mask block (the code around computing
seq_len and causal_mask) to derive seq_len from position_ids if available,
otherwise from inputs_embeds.shape[1], and only fall back to input_ids.shape[1]
if needed; also obtain the device from whichever tensor you use (position_ids,
inputs_embeds, or input_ids) and keep the existing dtype and unsqueeze behavior
so causal_mask creation remains unchanged. A short sketch of this fallback order follows after this file's items below.
- Around line 558-566: The code uses assert to check mllm_qualcomm_max_length
when both mllm_max_sin_embedding and mllm_max_cos_embedding are None; replace
the assert with an explicit exception (e.g., raise ValueError) that includes a
clear message mentioning mllm_qualcomm_max_length so the check cannot be
bypassed with -O; keep the surrounding logic that reads kwargs and constructs
max_position_ids (using position_ids.dtype and position_ids.device) but change
the assert line to a guarded if that raises the exception when
mllm_qualcomm_max_length is None.
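
A short sketch of the fallback order suggested in the causal-mask item above (around lines 583-596); the variable names mirror the Hugging Face-style forward signature and are assumptions about the surrounding code:

# Derive seq_len and device without assuming input_ids is set.
if position_ids is not None:
    seq_len, device = position_ids.shape[-1], position_ids.device
elif inputs_embeds is not None:
    seq_len, device = inputs_embeds.shape[1], inputs_embeds.device
else:
    seq_len, device = input_ids.shape[1], input_ids.device
# causal_mask construction (same dtype and unsqueeze as before) then uses seq_len and device.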

In @pymllm/backends/qualcomm/transformers/qwen3/runner.py:
- Around line 125-132: Replace the mixed-language comment above the
MsDataset.load call: remove the Chinese "几十G" and rewrite in English to explain
that streaming=True is used to avoid downloading the full dataset (which is tens
of gigabytes) and to document why "modelscope/wikitext" with
subset_name="wikitext-103-v1" and split="train" was chosen for calibration;
update the comment to explicitly state calibration assumptions (e.g., using
English Wikipedia-like text, streaming processes examples on the fly, not
shuffled) and ensure the comment sits immediately above the MsDataset.load
invocation and references the dataset and streaming behavior.

In @pymllm/backends/qualcomm/transformers/qwen3/train.py:
- Around line 1-7: The argparse definition for the output directory is missing a
required flag; update the ArgumentParser.add_argument call for "--output_dir"
(and any other add_argument entries in the same blocks that are used later
without defaults) to include required=True (or provide a sensible default) so
the script fails early instead of raising a runtime error when output_dir is
absent; look for the argparse.ArgumentParser and add_argument("--output_dir",
...) call in train.py and change it to required=True (and mirror this fix for
the other add_argument calls referenced in the same sections).

In @requirements-qnn-aot.txt:
- Around line 1-4: Replace the vulnerable transformers pin
"transformers==4.57.3" with a patched release (e.g., the nearest fixed 4.x
release) or with "transformers==5.0.0rc1" after you validate compatibility;
update the requirements entry for "transformers==4.57.3" accordingly, run the
project's test suite and CI to confirm nothing breaks, and regenerate any
lockfiles or dependency artifacts to ensure the updated package is used.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
pymllm/utils/mllm_convertor.py (1)

113-113: Update the message to reflect the new handling logic.

The message "No pipeline specified" may be misleading since the code now handles the case when no pipeline is specified (using the "_raw" pipeline at lines 69-86). Consider updating this message or clarifying what conditions trigger this else branch.

mllm/backends/qnn/aot/passes/AOTPipeline.cpp (1)

33-36: Empty pipeline returned for non-llm_recipe configs.

When the config doesn't contain quant_recipe.llm_recipe = true, the function logs a warning but returns an empty vector. This means calling code receives no passes, which may silently fail. Consider whether this should be an error or if there should be a fallback pipeline.

🟡 Minor comments (7)
pymllm/backends/qualcomm/transformers/.gitignore-1-1 (1)

1-1: Clarify the purpose of this .gitignore entry—the name appears to be a placeholder.

The entry static_one_more_thing.py has an atypical name that suggests it may be a debug or temporary file. If this is intentional, consider renaming it to something descriptive (e.g., static_qnn_debug.py) and adding a comment to .gitignore explaining its purpose. If it's a leftover from development, remove it.

mllm/core/Tensor.cpp-78-91 (1)

78-91: Type mismatch for kUInt16PerTensorAsy.

Line 87 casts to int16_t* for an unsigned type (kUInt16PerTensorAsy), which can cause incorrect values for values exceeding INT16_MAX.

🔎 Proposed fix
-    case kUInt16PerTensorAsy: *(rhs_tensor.ptr<int16_t>()) = x; break;
+    case kUInt16PerTensorAsy: *(rhs_tensor.ptr<uint16_t>()) = static_cast<uint16_t>(x); break;
mllm/core/Tensor.hpp-470-476 (1)

470-476: Reserved identifier in public API.

The method __unsafeSetDType uses a double-underscore prefix which is reserved by the C++ standard for implementation use. This could cause conflicts with compiler internals. Consider renaming to unsafeSetDType or similar.

🔎 Proposed fix
-  [[nodiscard]] Tensor __unsafeSetDType(DataTypes dt) const;
+  [[nodiscard]] Tensor unsafeSetDType(DataTypes dtype) const;
pymllm/backends/qualcomm/transformers/qwen3/train.py-1-7 (1)

1-7: Add required flag for output_dir argument.

The output_dir parameter is used without a default value but is not marked as required. This will cause a runtime error when the argument is not provided.

🔎 Proposed fix
     parser.add_argument(
         "--output_dir",
         type=str,
+        required=True,
         help="Directory to save the quantized model",
     )

Also applies to: 8-36, 52-54

pymllm/backends/qualcomm/transformers/qwen3/runner.py-125-132 (1)

125-132: Fix mixed-language comment and document calibration assumptions.

The comment contains Chinese characters ("几十G") which should be English-only per coding guidelines. Also, consider documenting the dataset choice and streaming behavior.

🔎 Proposed fix
-        # 2. Load Wikipedia dataset (English version example)
-        # Use streaming=True to download and process on the fly, without downloading the full几十G dataset
+        # 2. Load Wikipedia dataset (English WikiText-103)
+        # Uses ModelScope dataset loader; streaming avoids downloading the full dataset (~GB scale)
         dataset = MsDataset.load(
examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp-32-38 (1)

32-38: Remove dead code: identical if/else branches.

Both the if and else branches perform the same assignments. The if (m->getModuleName().empty()) check at line 32 is redundant since lines 33-34 and 36-37 are identical.

🔎 Proposed fix
 Tensor QDQ(nn::Module* m, Tensor in, const std::string& qdq_name_in_pytorch) {
-  std::string scale_name = m->getModuleName() + "." + qdq_name_in_pytorch + ".fake_quant.scale";
-  std::string zp_name = m->getModuleName() + "." + qdq_name_in_pytorch + ".fake_quant.zero_point";
-
-  if (m->getModuleName().empty()) {
-    scale_name = qdq_name_in_pytorch + ".fake_quant.scale";
-    zp_name = qdq_name_in_pytorch + ".fake_quant.zero_point";
-  } else {
-    scale_name = m->getModuleName() + "." + qdq_name_in_pytorch + ".fake_quant.scale";
-    zp_name = m->getModuleName() + "." + qdq_name_in_pytorch + ".fake_quant.zero_point";
-  }
+  std::string prefix = m->getModuleName().empty() ? "" : m->getModuleName() + ".";
+  std::string scale_name = prefix + qdq_name_in_pytorch + ".fake_quant.scale";
+  std::string zp_name = prefix + qdq_name_in_pytorch + ".fake_quant.zero_point";

Committable suggestion skipped: line range outside the PR's diff.

pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py-558-566 (1)

558-566: Replace assert with explicit exception for production code.

Using assert for input validation can be disabled with -O flag. Use an explicit exception for more robust error handling.

🔎 Proposed fix
         if self.mllm_max_sin_embedding is None and self.mllm_max_cos_embedding is None:
             mllm_qualcomm_max_length = kwargs.get("mllm_qualcomm_max_length", None)
-            assert mllm_qualcomm_max_length is not None
+            if mllm_qualcomm_max_length is None:
+                raise ValueError("mllm_qualcomm_max_length must be provided when initializing embeddings")
             max_position_ids = torch.arange(
                 0,
                 mllm_qualcomm_max_length,
🧹 Nitpick comments (30)
pymllm/convertor/mllm_type_mapping.py (1)

94-94: Consider relocating for better organization.

The torch.uint16 mapping is placed after the quantized type mappings section, but it's not a quantized type. For improved code organization and maintainability, consider moving this entry to line 89 (after torch.uint8: 129) to group unsigned integer types together.

🔎 Proposed reorganization
            torch.int64: 132,  # kInt64
            torch.uint8: 129,  # kUInt8
+           torch.uint16: 130,  # kUInt16
            torch.bool: 129,  # kUInt8 (Boolean type in PyTorch is usually represented as uint8)
            # Quantized type mappings
            torch.qint8: 16,  # kInt8
            torch.quint8: 129,  # kUInt8
            torch.qint32: 18,  # kInt32
-           torch.uint16: 130,  # kUInt16
        }
    )
pymllm/quantize/pipeline.py (1)

23-25: Add docstring to clarify the purpose of the raw pipeline.

The build_raw_pipeline() function is identical to build_cast2fp32_pipeline() (lines 18-20), both returning an empty QuantizeSolver with no passes. While they may serve different semantic purposes based on how they're used in the convertor (with different cast_left_2_fp32 values), adding a docstring would clarify the intent and use case for the "raw" pipeline.

💡 Suggested docstring
 def build_raw_pipeline() -> QuantizeSolver:
+    """Create a raw quantization pipeline with no passes registered.
+    
+    This pipeline performs no transformations on parameters and is used
+    when no configuration or custom pipeline is specified.
+    
+    Returns:
+        QuantizeSolver: An empty solver with no registered passes.
+    """
     ret = QuantizeSolver()
     return ret
pymllm/utils/mllm_convertor.py (1)

69-86: Consider refactoring to reduce code duplication.

This block is nearly identical to the branches at lines 51-68 and 87-111, with only minor differences in pipeline selection and the cast_left_2_fp32 parameter value. Extracting the common quantization logic into a helper function would improve maintainability and reduce duplication.

💡 Refactoring suggestion

Consider extracting common logic:

def _perform_streaming_quantization(
    cfg, pipeline_name, params, output_path, model_name, cast_left_2_fp32, verbose
):
    """Perform streaming quantization with the specified pipeline."""
    pipeline: QuantizeSolver = BUILTIN_QUANTIZE_PIPELINE[pipeline_name]()
    old_param_size = len(params)
    new_param_size = pipeline.stream_quantize_params_size(cfg, params)
    print(f"Params Num: Before: {old_param_size}, After: {new_param_size}")
    pipeline.stream_quantize(
        cfg,
        params,
        writer=ModelFileV2(
            output_path,
            model_name,
            "Streaming",
            max_params_descriptor_buffer_num=new_param_size,
        ),
        cast_left_2_fp32=cast_left_2_fp32,
        verbose=verbose,
    )

Then simplify the branches:

if args.cfg_path is None and args.pipeline is not None and args.format == "v2":
    _perform_streaming_quantization(
        None, args.pipeline, params, args.output_path, 
        args.model_name, True, args.verbose
    )
elif args.cfg_path is None and args.pipeline is None and args.format == "v2":
    _perform_streaming_quantization(
        None, "_raw", params, args.output_path, 
        args.model_name, False, args.verbose
    )
mllm/core/TensorViewImpl.hpp (2)

92-92: Consider refactoring to provide controlled access instead of exposing internal map.

Returning a non-const reference to the internal attached_views_ map breaks encapsulation and allows external code to modify internal state without validation or synchronization.

🔎 Suggested refactor with controlled access

Instead of exposing the raw map, consider providing controlled accessors:

// Add a view
inline void attachView(const std::string& name, bool exclude_from_hash, const TensorViewImpl::ptr_t& view) {
  attached_views_[name] = {exclude_from_hash, view};
}

// Get a view (returned by value; yields nullptr if absent, avoiding a
// dangling reference to a temporary shared_ptr)
inline TensorViewImpl::ptr_t getAttachedView(const std::string& name) const {
  auto it = attached_views_.find(name);
  return (it != attached_views_.end()) ? it->second.second : nullptr;
}

// Check if view exists
inline bool hasAttachedView(const std::string& name) const {
  return attached_views_.find(name) != attached_views_.end();
}

// Get exclude_from_hash flag
inline bool getViewExcludeFlag(const std::string& name) const {
  auto it = attached_views_.find(name);
  return (it != attached_views_.end()) ? it->second.first : false;
}

// If mutable access is truly needed, provide const access by default
inline const std::unordered_map<std::string, std::pair<bool, TensorViewImpl::ptr_t>>& attachedViews() const {
  return attached_views_;
}

This approach maintains encapsulation while still allowing necessary access patterns.


101-102: Document the semantics of the bool flag in the pair.

The comment explains the bool is for "judge if this tensor should be considered in hashing," but the semantics could be clearer. Consider documenting:

  • What does true vs false mean?
  • When should views be excluded from hashing?
  • How does this affect tensor equality/comparison?
mllm/core/Tensor.cpp (2)

295-314: Consider passing rhs by const reference.

The addConstant, subConstant, and mulConstant methods take Tensor rhs by value, incurring a copy. If the copy is not required for ownership transfer, passing by const Tensor& would be more efficient.


508-511: Reserved identifier prefix and const-correctness concern.

The __unsafeSetDType name uses a reserved identifier prefix (__). The method is marked const but mutates internal state, which is intentionally unsafe. Consider:

  1. Renaming to avoid reserved prefix (e.g., unsafeSetDType_)
  2. Adding a comment explaining the safety requirements for callers
pymllm/backends/qualcomm/transformers/core/qlinear.py (1)

27-30: Redundant null check.

Line 29 checks self.weight_quant is not None but this is already guaranteed by the outer condition on line 25.

🔎 Proposed fix
         if self.weight_quant is not None:
             # Compatible with official FakeQuantize module
-            if (
-                isinstance(self.weight_quant, FakeQuantize)
-                and self.weight_quant is not None
-            ):
+            if isinstance(self.weight_quant, FakeQuantize):
                 _ = self.weight_quant(self.weight)
pymllm/backends/qualcomm/transformers/core/test_qlinear.py (1)

80-89: Consider using assertions for test framework integration.

The test returns True/False instead of using assertions. For integration with pytest or unittest, consider using assert statements or framework-specific assertions that provide better error reporting.

🔎 Proposed fix
-    if relative_error < tolerance:
-        print(f"\n✓ TEST PASSED: Relative error {relative_error:.6e} < {tolerance}")
-        return True
-    else:
-        print(f"\n✗ TEST FAILED: Relative error {relative_error:.6e} >= {tolerance}")
-        return False
+    assert relative_error < tolerance, (
+        f"Relative error {relative_error:.6e} >= tolerance {tolerance}"
+    )
+    print(f"\n✓ TEST PASSED: Relative error {relative_error:.6e} < {tolerance}")
mllm/compile/ir/NodeRTTIClassOfImpl.hpp (1)

229-230: SigmoidOp RTTI macro follows established pattern.

The new RTTI_RK_OP_LINALGIROP_SIGMOIDOP_IMPL macro is consistent with other operation macros in this auto-generated file. While the static analysis tool suggests using a constexpr template function instead of macros, this change should be applied at the code generator level (likely mllm/compile/ir/rtti_kind_gen.py) to affect all generated macros consistently.

If you'd like to modernize the RTTI implementation to use constexpr templates, consider updating the code generator script to emit constexpr template functions for all operations rather than function-like macros.

mllm/backends/qnn/aot/passes/AOTCompileContext.hpp (1)

39-39: Consider explicit member initialization in the constructor.

While params_ (smart pointer) defaults to nullptr and config_ (nlohmann::json) has a default constructor, explicit initialization improves clarity and aligns with modern C++ best practices.

🔎 Proposed fix
 private:
   // Private constructor
-  AOTCompileContext() = default;
+  AOTCompileContext() : env_(nullptr), config_(), params_(nullptr) {}
mllm/backends/qnn/aot/passes/PTQPass.hpp (1)

17-27: Consider explicitly deleting copy and move operations.

Pass objects are typically not copyable since they operate on IR graphs with side effects. Consider adding deleted copy/move constructors and assignment operators to make the intent explicit and prevent accidental copying.

🔎 Proposed fix
 class PTQPass final : public ir::Pass {
  public:
   PTQPass() = default;

   ~PTQPass() override = default;

+  // Delete copy and move operations
+  PTQPass(const PTQPass&) = delete;
+  PTQPass& operator=(const PTQPass&) = delete;
+  PTQPass(PTQPass&&) = delete;
+  PTQPass& operator=(PTQPass&&) = delete;
+
   // Run the PTQ pass on the given operation
   // Expected input: ModuleOp containing the computation graph
   // Output: Modified IR with PTQ transformations applied
   uint8_t run(const ir::node_ptr_t& op) override;
 };
pymllm/backends/qualcomm/transformers/qwen3/train.py (1)

39-50: Consider adding error handling for the quantization workflow.

The script lacks error handling for critical operations like model loading, calibration, and conversion. Consider wrapping the workflow in try-except blocks to provide meaningful error messages if any step fails.

🔎 Proposed enhancement
     args = parser.parse_args()

+    try:
-    m = Qwen3Quantizer(args.model_path, mllm_qualcomm_max_length=args.max_length)
-    m.calibrate(num_samples=args.num_samples, max_seq_length=args.max_length)
-    # m.compile()
-    m.infer(args.infer_text)
-
-    # !!!
-    # Things below is for deploy. We will turn all fp32 weights and some buffers(rope) to quantized dtype.
-    # !!!
-    m.model.lm_head.weight = torch.nn.Parameter(
-        m.model.model.embed_tokens.weight.clone()
-    )
-    m.convert()
+        m = Qwen3Quantizer(args.model_path, mllm_qualcomm_max_length=args.max_length)
+        m.calibrate(num_samples=args.num_samples, max_seq_length=args.max_length)
+        # m.compile()
+        m.infer(args.infer_text)
+
+        # !!!
+        # Things below is for deploy. We will turn all fp32 weights and some buffers(rope) to quantized dtype.
+        # !!!
+        m.model.lm_head.weight = torch.nn.Parameter(
+            m.model.model.embed_tokens.weight.clone()
+        )
+        m.convert()
+    except Exception as e:
+        print(f"Error during quantization workflow: {e}")
+        raise
examples/qwen3_qnn_aot/compile.cpp (1)

39-40: Consider validating params before use.

The params pointer is used in multiple places (lines 67-71 for attaching quantization metadata, line 82 for pipeline creation), but there's no validation that mllm::load() succeeded. If the model file is missing or corrupt, params could be null or invalid, leading to crashes when calling pull().

🔎 Suggested validation
  auto params = mllm::load(model_path.get(), mllm::ModelFileVersion::kV2);
+ if (!params) {
+   MLLM_ERROR_EXIT(mllm::ExitCode::kCoreError, "Failed to load model parameters from {}", model_path.get());
+   return -1;
+ }
  model.load(params);
mllm/backends/cpu/ops/SigmoidOp.cpp (2)

4-4: Unused include.

The <cstring> header appears unused in this file. Consider removing it.

-#include <cstring>

12-14: Consider adding input validation.

The function assumes inputs has at least one element and outputs has a valid pre-allocated tensor. For robustness, consider adding assertions or checks.

🔎 Suggested validation
 void CPUSigmoidOp::forward(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) {
+  MLLM_ASSERT(!inputs.empty() && !outputs.empty(), "SigmoidOp requires at least one input and output tensor");
   const auto& X = inputs[0];
   auto& Y = outputs[0];
mllm/core/DataTypes.cpp (1)

109-109: Inconsistent error message format.

The lanesOfType function now includes the dtype value in the error message (line 55: "Unknown data type {}" with (int32_t)dtype), but bytesOfType (line 109) and nameOfType (line 163) still use the old format without the value. Consider updating these for consistency.

🔎 Suggested fix
// In bytesOfType:
-    default: NYI("Unknown data type");
+    default: NYI("Unknown data type {}", (int32_t)dtype);

// In nameOfType:
-    default: NYI("Unknown data type");
+    default: NYI("Unknown data type {}", (int32_t)dtype);

Also applies to: 163-163

mllm/backends/cpu/kernels/x86/sigmoid.cpp (2)

19-20: Potential type mismatch between vector_size and loop variables.

hn::Lanes(d) returns size_t, but aligned_len, len, and loop counters are int. This could cause issues on large tensors where len exceeds INT_MAX, or subtle bugs from signed/unsigned comparisons.

🔎 Suggested fix
-void sigmoid_fp32(const mllm_fp32_t* __restrict X, mllm_fp32_t* __restrict Y, int len, int thread_count) {
+void sigmoid_fp32(const mllm_fp32_t* __restrict X, mllm_fp32_t* __restrict Y, size_t len, int thread_count) {
   using D = hn::ScalableTag<float>;
   const D d;
-  const auto vector_size = hn::Lanes(d);
-  const int aligned_len = len - (len % vector_size);
+  const size_t vector_size = hn::Lanes(d);
+  const size_t aligned_len = len - (len % vector_size);

Note: This would require updating the header declaration as well.


31-31: Consider adding <cmath> include for std::exp.

The scalar fallback uses std::exp() but the file doesn't explicitly include <cmath>. While it may work via transitive includes from Highway, explicit includes are safer and more portable.

 #include "mllm/backends/cpu/kernels/x86/sigmoid.hpp"
 #include "mllm/core/Parallel.hpp"
+#include <cmath>

Also applies to: 41-41

mllm/backends/cpu/kernels/arm/sigmoid.cpp (1)

13-86: Consider extracting common vectorized loop logic to reduce duplication.

The multi-threaded and single-threaded paths share nearly identical vectorized processing logic (lines 17-50 vs 53-84). While functionally correct, this duplication increases maintenance burden.

🔎 Suggested approach

Extract the tail-handling loop (8-wide, 4-wide, scalar) into a helper function that can be reused by both paths:

static inline void sigmoid_fp32_tail(const mllm_fp32_t* X, mllm_fp32_t* Y, int start, int len) {
  int i = start;
  for (; i <= len - 8; i += 8) {
    // 8-wide processing...
  }
  for (; i <= len - 4; i += 4) {
    // 4-wide processing...
  }
  for (; i < len; i++) {
    Y[i] = 1.0f / (1.0f + std::exp(-X[i]));
  }
}
pymllm/backends/qualcomm/transformers/qwen3/runner.py (3)

19-21: Simplify multiple isinstance checks using tuple.

Multiple isinstance calls can be combined for readability.

🔎 Proposed fix
 def freeze_qwen3_linear_weight(m):
-    if isinstance(m, QLinearLPBQ) or isinstance(m, QLinearW8A16_PerChannelSym):
+    if isinstance(m, (QLinearLPBQ, QLinearW8A16_PerChannelSym)):
         m.freeze_weight()

Similarly for convert_weight at lines 35-40:

 def convert_weight(m):
-    if (
-        isinstance(m, QLinearLPBQ)
-        or isinstance(m, QLinearW8A16_PerChannelSym)
-        or isinstance(m, QRMSNorm)
-    ):
+    if isinstance(m, (QLinearLPBQ, QLinearW8A16_PerChannelSym, QRMSNorm)):
         m.convert_to_deploy()

94-99: Extract magic number 151668 to a named constant.

The token ID for </think> should be a named constant for clarity and maintainability. As per coding guidelines, use named constants instead of magic numbers.

🔎 Proposed fix
+# Token ID for </think> end marker
+THINK_END_TOKEN_ID = 151668
+
 class Qwen3Quantizer:
     ...
     def infer(self, prompt: str):
         ...
         # parsing thinking content
         try:
             # rindex finding 151668 (</think>)
-            index = len(output_ids) - output_ids[::-1].index(151668)
+            index = len(output_ids) - output_ids[::-1].index(THINK_END_TOKEN_ID)
         except ValueError:
             index = 0

143-145: Consider making the minimum text length configurable.

The hardcoded 1024 character minimum may not be suitable for all calibration scenarios. Consider making this a parameter or named constant.

-    def calibrate(self, num_samples=64, max_seq_length=512):
+    def calibrate(self, num_samples=64, max_seq_length=512, min_text_length=1024):
         ...
-                if len(entry["text"].strip()) < 1024:
+                if len(entry["text"].strip()) < min_text_length:
                     continue
mllm/backends/qnn/aot/passes/PTQPass.cpp (3)

173-174: Remove unused pf parameter from recursiveSolveNormal.

The pf parameter is declared but never used in the function body. This was likely carried over from recursiveSolveWeights but isn't needed here.

🔎 Proposed fix
-void recursiveSolveNormal(const std::shared_ptr<ir::IRContext>& ir_ctx, const ir::graph::SubGraphOp::ptr_t& call_op,
-                          const ParameterFile::ptr_t& pf) {
+void recursiveSolveNormal(const std::shared_ptr<ir::IRContext>& ir_ctx, const ir::graph::SubGraphOp::ptr_t& call_op) {

Update the recursive call at line 189:

-      recursiveSolveNormal(w.getContext(), w.getContext()->lookupSymbolTable(ns)->cast_<ir::graph::SubGraphOp>(), pf);
+      recursiveSolveNormal(w.getContext(), w.getContext()->lookupSymbolTable(ns)->cast_<ir::graph::SubGraphOp>());

And the call at lines 221-223:

   recursiveSolveNormal(writer.getContext(),
-                       getCtx()->lookupSymbolTable(call_main_graph_op->getSymbolAttr()->str())->cast_<ir::graph::SubGraphOp>(),
-                       pf);
+                       getCtx()->lookupSymbolTable(call_main_graph_op->getSymbolAttr()->str())->cast_<ir::graph::SubGraphOp>());

24-30: Potential integer type mismatch in loop.

The loop index i is int but numel() returns size_t. For very large tensors, this could cause issues. Consider using size_t for the loop variable.

🔎 Proposed fix
 template<typename T>
 void checkTypeLimits(Tensor in, int quant_min, int quant_max) {  // NOLINT
   auto numel = in.numel();
-  for (int i = 0; i < numel; ++i) {
+  for (size_t i = 0; i < numel; ++i) {
     MLLM_RT_ASSERT(*(in.ptr<T>() + i) >= quant_min);
     MLLM_RT_ASSERT(*(in.ptr<T>() + i) <= quant_max);
   }
 }

48-50: Address the FIXME comment regarding int8 vs uint8.

The FIXME indicates uncertainty about whether QNN expects int8 or uint8. This should be resolved before release to ensure correct quantization behavior.

Would you like me to help investigate QNN's expected data types for quantized weights, or open an issue to track this?

examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp (1)

217-226: Complex nested QDQ wrapping in RoPE application.

The RoPE application involves deeply nested QDQ calls. While functionally correct, consider extracting this into a helper function for readability:

// Lines 219-226 could become:
query_states = applyRoPEWithQDQ(this, query_states, cos, sin, "q_rope");
key_states = applyRoPEWithQDQ(this, key_states, cos, sin, "k_rope");
pymllm/backends/qualcomm/transformers/core/rms_norm.py (1)

97-99: Consider using logging module instead of print.

Using print for deployment/quantization status messages bypasses log level controls. Consider using Python's logging module for consistency with production logging practices. As per coding guidelines, appropriate logging should be used for significant events.

🔎 Proposed fix

Add import at the top of the file:

import logging
logger = logging.getLogger(__name__)

Then replace the print statements:

-        print(
-            f"Class: {class_name}, Instance: {instance_class_name}, Deploy Mode Activated. 'weight' is now {self.weight.dtype} buffer. zp is {zero_point}"
-        )
+        logger.info(
+            f"Class: {class_name}, Instance: {instance_class_name}, Deploy Mode Activated. 'weight' is now {self.weight.dtype} buffer. zp is {zero_point}"
+        )

Similarly for lines 114-116.

pymllm/backends/qualcomm/transformers/qwen3/modeling_qwen3.py (2)

279-297: Consider caching constant tensors to avoid repeated allocation.

The torch.ones(1, ...) tensors for scaling and mask fill values are created on every forward pass. For better efficiency, consider registering these as buffers during initialization.

🔎 Example approach

In __init__:

self.register_buffer("_neg_20", torch.tensor(-20.0, dtype=torch.bfloat16))

In forward:

self.neg_20_qdq(self._neg_20.to(device=value_states.device))

This is a minor optimization since PTQ calibration prioritizes accuracy over speed.


733-740: Consider sorting __all__ for consistency.

Static analysis indicates __all__ is unsorted. Alphabetical sorting improves maintainability.

🔎 Proposed fix (isort-style)
 __all__ = [
     "Qwen3ForCausalLM",
     "Qwen3ForQuestionAnswering",
+    "Qwen3ForSequenceClassification",
+    "Qwen3ForTokenClassification",
+    "Qwen3Model",
     "Qwen3PreTrainedModel",
-    "Qwen3Model",
-    "Qwen3ForSequenceClassification",
-    "Qwen3ForTokenClassification",
 ]

@UbiquitousLearning UbiquitousLearning (Owner) left a comment

LGTM

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

Fix all issues with AI Agents 🤖
In @mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp:
- Around line 857-863: The input quantization spec generated for i_0 is not
appended to annotation_attr->annotation_.inputs when i_0 lacks a quant_recipe;
currently only the else branch appends the spec. Modify the first branch that
creates i_0_spec via genSimpleQuantizationSpecAttr(writer.getContext(),
i_0->cast_<ir::tensor::TensorValue>()) to also push that spec into
annotation_attr->annotation_.inputs (the same way the else branch does, using
i_0_spec->spec_ or the appropriate cast to
ir::linalg::LinalgIRQuantizatonSpecAttr), so both branches add the input spec to
the annotation.

In @mllm/backends/qnn/aot/passes/PTQPass.cpp:
- Line 135: The function name _recursiveSolveNormalImpl uses a reserved
identifier pattern; rename it to recursiveSolveNormalImpl (remove the leading
underscore) and update all call sites that invoke _recursiveSolveNormalImpl
(e.g., the calls at the former lines 183-184) to the new name so declarations,
definitions, and invocations match.
🧹 Nitpick comments (3)
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp (2)

390-393: Previous off-by-one error resolved.

The uint16 maximum value is now correctly set to 65536 - 1 (i.e., 65535). The FIXME comment appropriately acknowledges the hardcoded dtype limitation for future work.

Optionally, consider using the literal 65535 for improved clarity:

🔎 Optional clarity improvement
  auto weight_spec_attr = writer.create<ir::linalg::LinalgIRQuantizatonSpecAttr>(
-      ir::linalg::QuantizationSpecAsymPerTensor::create(0, 65536 - 1, kUInt16, kFloat32, kInt32, Tensor::nil(), Tensor::nil()));
+      ir::linalg::QuantizationSpecAsymPerTensor::create(0, 65535, kUInt16, kFloat32, kInt32, Tensor::nil(), Tensor::nil()));

770-771: Previous off-by-one error resolved.

The uint16 maximum value is now correctly set to 65536 - 1 (i.e., 65535).

Optionally, consider using the literal 65535 for improved clarity:

🔎 Optional clarity improvement
-        auto out_quant_spec = ir::linalg::QuantizationSpecAsymPerTensor::create(0, 65536 - 1, kUInt16, kFloat32, kInt32,
+        auto out_quant_spec = ir::linalg::QuantizationSpecAsymPerTensor::create(0, 65535, kUInt16, kFloat32, kInt32,
                                                                                 Tensor::nil(), Tensor::nil());
mllm/backends/qnn/aot/passes/PTQPass.cpp (1)

49-50: Consider using named constants for quantization limits.

The hardcoded values (-8, 7, 0, 16) represent quantization bounds but lack semantic meaning. Named constants would improve readability and maintainability.

🔎 Example: Define constants at file or namespace scope
namespace {
constexpr int32_t kInt4Min = -8;
constexpr int32_t kInt4Max = 7;
constexpr int32_t kUInt4Min = 0;
constexpr int32_t kUInt4Max = 16;  // Note: actual uint4 max is 15, verify if 16 is intentional

// Then use:
checkTypeLimits<int8_t>(weight, kInt4Min, kInt4Max);
checkTypeLimits<uint8_t>(scale1, kUInt4Min, kUInt4Max);
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 82900b1 and 54d9927.

📒 Files selected for processing (4)
  • mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp
  • mllm/backends/qnn/aot/passes/PTQPass.cpp
  • mllm/core/OpTypes.hpp
  • pymllm/backends/qualcomm/transformers/core/rms_norm.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • mllm/core/OpTypes.hpp
🧰 Additional context used
📓 Path-based instructions (5)
{mllm,mllm-cli,pymllm}/**/*

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*: Files must not contain C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, or DEL 0x7F. Horizontal tab (0x09) and line feed (0x0A) are explicitly allowed.
All files must be encoded in UTF-8 without BOM.
Any violation of character set (Rule 1) or encoding (Rule 2) requirements must cause the review to fail.
No line may end with trailing whitespace.
Use Unix line endings (LF).
File and directory names must consist only of printable Unicode characters, excluding C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, and DEL 0x7F.
Only use acceptable file extensions: .c, .cc, .cpp, .cxx, .h, .hh, .hpp, .py, .pyi, .sh, .txt, .md, .yml, .yaml, .json, .toml.
Optional license headers, if present, must comply with character set rules (no C0/C1 control codes except tab and line feed).

Files:

  • mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp
  • mllm/backends/qnn/aot/passes/PTQPass.cpp
  • pymllm/backends/qualcomm/transformers/core/rms_norm.py
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}: TODO and FIXME comments must be written as 'TODO:' or 'FIXME:' followed by UTF-8 text that adheres to character set rules.
Encourage consistent coding style and patterns with the existing codebase.
Ensure code is portable across supported platforms (e.g., Linux, Windows) unless explicitly platform-specific.

Files:

  • mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp
  • mllm/backends/qnn/aot/passes/PTQPass.cpp
  • pymllm/backends/qualcomm/transformers/core/rms_norm.py
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}: Prioritize production-ready code quality by evaluating time and space complexity of algorithms and data structures, and suggest more efficient alternatives for operations with high complexity (e.g., O(n^2) or worse) when feasible.
Avoid unnecessary object creation in loops or hot paths.
Check for proper error handling and resource cleanup (e.g., using try-finally, context managers, or RAII).
Ensure functions that can fail return appropriate error codes or raise exceptions.
Validate inputs for public APIs and critical internal functions.
Add comments for complex algorithms or non-obvious logic.
Identify potential security issues (e.g., buffer overflows, injection risks, insecure temporary files) and recommend using secure alternatives (e.g., parameterized queries, secure random generators).
Suggest adding unit tests for untested complex logic or edge cases.
Ensure code is testable by avoiding global state and using dependency injection.
Flag overly complex functions (e.g., high cyclomatic complexity) and suggest breaking them down.
Use named constants instead of magic numbers.
Add appropriate logging (e.g., debug, info, warning, error) for significant events and errors, avoiding sensitive data exposure.

Files:

  • mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp
  • mllm/backends/qnn/aot/passes/PTQPass.cpp
  • pymllm/backends/qualcomm/transformers/core/rms_norm.py
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}: Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.
Adhere to language-specific best practices and idioms (e.g., PEP 8 for Python, Google C++ Style Guide for C++).

Files:

  • mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp
  • mllm/backends/qnn/aot/passes/PTQPass.cpp
  • pymllm/backends/qualcomm/transformers/core/rms_norm.py
{mllm,mllm-cli,pymllm}/**/*.{sh,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

If a file starts with a shebang, it must be the first line and contain only the interpreter path and optional single argument, encoded in UTF-8.

Files:

  • pymllm/backends/qualcomm/transformers/core/rms_norm.py
🧬 Code graph analysis (3)
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp (4)
mllm/compile/ir/linalg/Attribute.hpp (14)
  • spec (100-105)
  • spec (130-135)
  • spec (160-165)
  • spec (192-197)
  • spec (226-231)
  • spec (260-265)
  • spec (297-302)
  • create (77-77)
  • node (27-27)
  • node (27-27)
  • node (327-327)
  • node (327-327)
  • node (348-348)
  • node (348-348)
mllm/backends/qnn/aot/visitor/Elewise.cpp (2)
  • isMatch (13-15)
  • isMatch (13-13)
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.hpp (19)
  • op (40-40)
  • op (54-54)
  • op (66-66)
  • op (80-80)
  • op (92-92)
  • op (106-106)
  • op (120-120)
  • op (132-132)
  • op (144-144)
  • op (156-156)
  • op (170-170)
  • op (184-184)
  • op (198-198)
  • op (212-212)
  • op (226-226)
  • op (238-238)
  • shareQuantSpecSingleInputToSingleOutputAndSetOpQuantAnnoAttr (27-28)
  • genSimpleQuantizationSpecAttr (24-25)
  • cloneQuantizationSpecType (32-33)
mllm/compile/ir/builtin/Attribute.hpp (10)
  • node (24-24)
  • node (24-24)
  • node (44-44)
  • node (44-44)
  • node (64-64)
  • node (64-64)
  • node (84-84)
  • node (84-84)
  • node (104-104)
  • node (104-104)
mllm/backends/qnn/aot/passes/PTQPass.cpp (3)
mllm/compile/ir/linalg/Attribute.hpp (14)
  • quant_min (87-98)
  • quant_min (87-88)
  • quant_min (116-128)
  • quant_min (116-117)
  • quant_min (146-158)
  • quant_min (146-147)
  • quant_min (177-190)
  • quant_min (177-178)
  • quant_min (210-224)
  • quant_min (210-211)
  • quant_min (244-258)
  • quant_min (244-245)
  • quant_min (279-295)
  • quant_min (279-281)
mllm/backends/qnn/aot/passes/PTQPass.hpp (1)
  • op (26-26)
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp (2)
  • run (983-1013)
  • run (983-983)
pymllm/backends/qualcomm/transformers/core/rms_norm.py (1)
pymllm/backends/qualcomm/transformers/core/qdq.py (4)
  • forward (50-53)
  • disable_observer (60-62)
  • disable_fakequant (68-70)
  • extra_repr (72-74)
🪛 Clang (14.0.6)
mllm/backends/qnn/aot/passes/PTQPass.cpp

[error] 32-32: 3 adjacent parameters of 'solveLinearWeight' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 32-32: parameter name 'pf' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 33-33: parameter name 'op' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 64-64: 3 adjacent parameters of 'solveRMSNormWeight' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 64-64: parameter name 'pf' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 65-65: parameter name 'op' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 100-100: 3 adjacent parameters of 'solveEmbeddingWeight' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 100-100: parameter name 'pf' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 101-101: parameter name 'op' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 120-120: 3 adjacent parameters of 'recursiveSolveWeights' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 121-121: parameter name 'pf' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 135-135: parameter name 'v' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 137-137: variable name 'tv' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 173-173: 3 adjacent parameters of 'recursiveSolveNormal' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 174-174: parameter name 'pf' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-android
  • GitHub Check: build-x86
  • GitHub Check: build-macos
🔇 Additional comments (21)
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp (9)

41-46: LGTM! Enhanced logging for pattern processing.

The nested loop correctly matches patterns to their names for improved observability during quantization pass execution.


97-101: LGTM! Correct asymmetric uint16 quantization support.

The implementation correctly uses the uint16 maximum value (65535) and properly includes zero_point parameters for asymmetric per-tensor quantization.
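
For readers who want the arithmetic behind such a spec: with an asymmetric per-tensor uint16 scheme, scale and zero_point are derived from the observed float range. A minimal, self-contained sketch (plain Python, not the pass's actual API; function and variable names are illustrative):

import torch

def asym_per_tensor_qparams(x: torch.Tensor, quant_min: int = 0, quant_max: int = 65535):
    # Derive scale / zero_point for an asymmetric per-tensor uint16 spec.
    x_min = min(x.min().item(), 0.0)   # keep 0.0 inside the range so real zero stays exactly representable
    x_max = max(x.max().item(), 0.0)
    scale = max((x_max - x_min) / (quant_max - quant_min), 1e-12)  # guard against a degenerate range
    zero_point = int(round(quant_min - x_min / scale))
    zero_point = min(max(zero_point, quant_min), quant_max)
    return scale, zero_point

x = torch.randn(128) * 3.0
scale, zero_point = asym_per_tensor_qparams(x)
q = torch.clamp(torch.round(x / scale) + zero_point, 0, 65535)   # quantize
x_hat = (q - zero_point) * scale                                  # dequantize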


256-267: LGTM! Correct sigmoid quantization pattern.

The pattern appropriately uses noSharingSingleInAndSingleOutQuantAnnoAttr since sigmoid transforms value ranges and should not share input/output quantization specs.
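
A quick numeric illustration of why the specs should not be shared: sigmoid squashes any input range into (0, 1), so a step size calibrated on the input leaves most of the output grid unused. Sketch in plain Python, independent of the pass itself:

import torch

x = torch.linspace(-10.0, 10.0, steps=1001)   # a plausible pre-activation range
y = torch.sigmoid(x)

# Reusing the input's asymmetric uint16 step size for the output:
in_scale = (x.max() - x.min()) / 65535.0
levels_used = torch.round((y.max() - y.min()) / in_scale).item()
print(levels_used)   # roughly 3277 of 65536 levels, so most of the output grid is wasted

# A dedicated output spec spreads all 65536 levels over (0, 1) instead:
out_scale = (y.max() - y.min()) / 65535.0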


268-279: LGTM! Correct negation quantization pattern.

The pattern correctly shares quantization specs between input and output, which is appropriate for negation operations that preserve the value range.


425-434: LGTM! Improved index pattern with helper reuse.

The refactored implementation correctly ensures the input has a quantization recipe before delegating to the shared helper function, improving code consistency.


436-447: LGTM! Correct slice quantization pattern.

The pattern appropriately shares quantization specs between input and output, which is correct for slice operations that preserve value ranges.


470-472: LGTM! Proper quantization spec cloning for constants.

The use of cloneQuantizationSpecType correctly creates a new quantization spec with the same type as the first input, which is appropriate for constant operands.


874-886: LGTM! Correct embedding weight quantization handling.

The implementation properly looks up the weight tensor, generates an appropriate quantization spec, and correctly adds it to the annotation's weights map.


961-963: LGTM! Correct pattern registration.

The three new patterns (Neg, Slice, Sigmoid) are properly registered with descriptive names and consistent priority values.

pymllm/backends/qualcomm/transformers/core/rms_norm.py (4)

32-44: LGTM!

The forward method correctly implements RMS normalization with numerical stability (float32 casting) and properly applies fake quantization to the weight parameter.
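
As a reference for the pattern being praised here, a minimal sketch of an RMSNorm forward with float32 accumulation and a fake-quantized weight (class and attribute names are illustrative, not the module's exact code):

import torch
from torch.ao.quantization import FakeQuantize, MinMaxObserver

class RMSNormSketch(torch.nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(hidden_size))
        self.eps = eps
        self.weight_fake_quant = FakeQuantize(
            observer=MinMaxObserver.with_args(qscheme=torch.per_tensor_affine),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        input_dtype = x.dtype
        x = x.to(torch.float32)                       # accumulate in float32 for numerical stability
        variance = x.pow(2).mean(-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        w = self.weight_fake_quant(self.weight)       # fake-quantize the weight during calibration
        return (w * x).to(input_dtype)

y = RMSNormSketch(64)(torch.randn(2, 8, 64))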


101-116: LGTM!

The freeze_weight method correctly implements the PTQ calibration pattern: it observes the weight, calculates quantization parameters, locks them, and provides useful logging.
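
That observe, calculate, lock sequence can be sketched with torch's FakeQuantize directly (names and the surrounding module are illustrative; the actual freeze_weight implementation may differ in detail):

import torch
from torch.ao.quantization import FakeQuantize, MinMaxObserver

weight = torch.randn(1024)
weight_fake_quant = FakeQuantize(
    observer=MinMaxObserver.with_args(qscheme=torch.per_tensor_affine),
)

# 1. Observe: run the weight through the fake-quant module once so the observer records min/max.
weight_fake_quant(weight)

# 2. Calculate quantization parameters from the observed statistics.
scale, zero_point = weight_fake_quant.calculate_qparams()

# 3. Lock: stop updating statistics so later forward calls reuse the frozen parameters.
weight_fake_quant.disable_observer()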


118-120: LGTM!

The disable_quant method correctly delegates to disable_fakequant() to turn off quantization noise.


122-123: LGTM!

The extra_repr method now correctly references self.eps, which resolves the issue from the previous review.

mllm/backends/qnn/aot/passes/PTQPass.cpp (8)

23-30: LGTM - Helper function logic is correct.

The bounds-checking logic correctly validates quantization limits for each tensor element.
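
Conceptually, the check amounts to asserting that every element lies inside the target type's [quant_min, quant_max] window. A minimal Python equivalent of that idea (not the C++ helper itself):

import torch

def check_type_limits(t: torch.Tensor, quant_min: int, quant_max: int) -> None:
    # Raise if any element falls outside the allowed quantized range.
    if bool((t < quant_min).any()) or bool((t > quant_max).any()):
        raise ValueError(f"tensor values out of range [{quant_min}, {quant_max}]")

check_type_limits(torch.tensor([-8, 0, 7], dtype=torch.int8), -8, 7)   # int4 values stored in an int8 buffer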


64-98: LGTM - RMSNorm weight solving logic is correct.

The function properly handles both raw and asymmetric per-tensor quantization, with thorough validation of scale and zero_point properties.
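
The scale/zero_point validation described here boils down to a couple of invariants; a minimal Python sketch of those checks (the exact checks in the C++ solver may differ):

import math

def validate_asym_qparams(scale: float, zero_point: int, quant_min: int, quant_max: int) -> None:
    # Scale must be a positive, finite number.
    if not math.isfinite(scale) or scale <= 0.0:
        raise ValueError(f"invalid scale: {scale}")
    # Zero point must be representable in the target integer range.
    if not (quant_min <= zero_point <= quant_max):
        raise ValueError(f"zero_point {zero_point} outside [{quant_min}, {quant_max}]")

validate_asym_qparams(scale=3.05e-4, zero_point=32768, quant_min=0, quant_max=65535)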


100-118: LGTM - Embedding weight solver correctly handles raw quantization.

The function appropriately marks raw embeddings as solved and uses NYI for unsupported quantization types.


120-133: LGTM - Graph traversal and weight solving dispatch is correct.

The function properly walks the IR graph, dispatches to type-specific solvers, and handles nested subgraphs recursively.


136-171: LGTM - Tensor quantization solving logic is correct.

The function properly handles asymmetric per-tensor, symmetric per-tensor, and raw quantization specs, with appropriate early returns when required attached views are missing.


173-193: LGTM - Normal tensor solving traversal is correct.

The function properly walks LinalgIROps and their input/output tensors, delegating to the solver implementation and handling nested subgraphs.


197-226: LGTM - Pass entry point follows expected structure.

The run method correctly retrieves the parameter file, locates the main graph, and executes the two-stage solving process (weights, then normal tensors). The structure is consistent with similar passes in the codebase.


228-228: LGTM - Factory function is correct.

Comment on lines +48 to +50
// FIXME weight maybe error, Check qnn eats int8 or uint8. Here weight using int8 to store int4.
checkTypeLimits<int8_t>(weight, -8, 7); // Int4
checkTypeLimits<uint8_t>(scale1, 0, 16); // UInt4
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, locate and examine the PTQPass.cpp file
find . -name "PTQPass.cpp" -type f

Repository: UbiquitousLearning/mllm

Length of output: 109


🏁 Script executed:

# Check file size and read PTQPass.cpp
wc -l ./mllm/backends/qnn/aot/passes/PTQPass.cpp

Repository: UbiquitousLearning/mllm

Length of output: 113


🏁 Script executed:

# Read PTQPass.cpp with focus on the relevant section
cat -n ./mllm/backends/qnn/aot/passes/PTQPass.cpp | head -70

Repository: UbiquitousLearning/mllm

Length of output: 3133


🏁 Script executed:

# Search for LPBQ-related code and tests
rg -i "lpbq|int4|uint4" --type cpp --type py -B 2 -A 2 mllm/backends/qnn/ | head -100

Repository: UbiquitousLearning/mllm

Length of output: 5795


🏁 Script executed:

# Search for weight usage in QNN wrapper API
rg -B 5 -A 5 "scale_level_0_int|weight" mllm/backends/qnn/aot/QnnWrappersAPI.cpp | head -150

Repository: UbiquitousLearning/mllm

Length of output: 5734


🏁 Script executed:

# Search for how weights are prepared/handled in LPBQ quantization
rg -B 3 -A 3 "\.weight\"|prepare.*weight" mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp | head -100

Repository: UbiquitousLearning/mllm

Length of output: 1193


🏁 Script executed:

# Look for the LPBQ weight quantization spec creation with detailed context
rg -B 10 -A 10 "create\(-8, 7" mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp

Repository: UbiquitousLearning/mllm

Length of output: 1136


🏁 Script executed:

# Search for tests related to LPBQ quantization
find . -path "*/test*" -name "*.cpp" -o -name "*.py" | xargs rg -l "LPBQ|int4.*weight" | head -5

Repository: UbiquitousLearning/mllm

Length of output: 242


🏁 Script executed:

# Check the QLinear implementation to understand weight storage
cat -n ./pymllm/backends/qualcomm/transformers/core/qlinear.py | head -120

Repository: UbiquitousLearning/mllm

Length of output: 5156


🏁 Script executed:

# Continue reading QLinear.py to find LPBQ weight handling
cat -n ./pymllm/backends/qualcomm/transformers/core/qlinear.py | sed -n '115,250p'

Repository: UbiquitousLearning/mllm

Length of output: 4110


🏁 Script executed:

# Check if there's any special reason for using 16 as upper bound (maybe loose validation)
rg -B 3 -A 3 "0, 16" mllm/backends/qnn/aot/

Repository: UbiquitousLearning/mllm

Length of output: 660


🏁 Script executed:

# Double-check the Python quantization range for scale1 one more time
rg -B 2 -A 2 "clamp.*15" ./pymllm/backends/qualcomm/transformers/core/qlinear.py

Repository: UbiquitousLearning/mllm

Length of output: 223


Remove incorrect FIXME and fix uint4 validation range.

The FIXME comment is unfounded—int8_t is the correct choice for storing signed int4 quantized weights, as confirmed by the quantization spec creating weights with range -8 to 7. However, line 50 has an off-by-one error: uint4 range is 0–15, not 0–16. Additionally, extract the magic numbers (-8, 7, 0, 15) into named constants per coding guidelines.

Fix
-      // FIXME weight maybe error, Check qnn eats int8 or uint8. Here weight using int8 to store int4.
       checkTypeLimits<int8_t>(weight, -8, 7);   // Int4
-      checkTypeLimits<uint8_t>(scale1, 0, 16);  // UInt4
+      checkTypeLimits<uint8_t>(scale1, 0, 15);  // UInt4
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// FIXME weight maybe error, Check qnn eats int8 or uint8. Here weight using int8 to store int4.
checkTypeLimits<int8_t>(weight, -8, 7); // Int4
checkTypeLimits<uint8_t>(scale1, 0, 16); // UInt4
checkTypeLimits<int8_t>(weight, -8, 7); // Int4
checkTypeLimits<uint8_t>(scale1, 0, 15); // UInt4

});
}

void _recursiveSolveNormalImpl(const ir::Val::ptr_t& v) {
Contributor


⚠️ Potential issue | 🔴 Critical

Rename function to avoid reserved identifier.

The function name _recursiveSolveNormalImpl begins with an underscore; identifiers that start with an underscore are reserved for the implementation at global namespace scope in C++, so this naming risks conflicts with implementation-defined identifiers.

🔎 Proposed fix
-void _recursiveSolveNormalImpl(const ir::Val::ptr_t& v) {
+void recursiveSolveNormalImpl(const ir::Val::ptr_t& val) {

Also update the call sites at lines 183-184:

-      for (auto iii : inputs) { _recursiveSolveNormalImpl(iii->cast_<ir::Val>()); }
-      for (auto ooo : outputs) { _recursiveSolveNormalImpl(ooo->cast_<ir::Val>()); }
+      for (auto iii : inputs) { recursiveSolveNormalImpl(iii->cast_<ir::Val>()); }
+      for (auto ooo : outputs) { recursiveSolveNormalImpl(ooo->cast_<ir::Val>()); }

Note: This issue was flagged in a previous review but remains unresolved in the current code.

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Clang (14.0.6)

[error] 135-135: parameter name 'v' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)

🤖 Prompt for AI Agents
In @mllm/backends/qnn/aot/passes/PTQPass.cpp around line 135, The function name
_recursiveSolveNormalImpl uses a reserved identifier pattern; rename it to
recursiveSolveNormalImpl (remove the leading underscore) and update all call
sites that invoke _recursiveSolveNormalImpl (e.g., the calls at the former lines
183-184) to the new name so declarations, definitions, and invocations match.

Comment on lines +21 to +30
# Quantization configuration for Weight
self.weight_fake_quant = FakeQuantize(
observer=MinMaxObserver.with_args(
qscheme=torch.per_tensor_affine, dtype=torch.qint32
),
quant_min=0,
quant_max=2 ** (quant_bits) - 1,
dtype=torch.qint32,
qscheme=torch.per_tensor_affine,
)
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's read the rms_norm.py file to see the actual implementation
cd /repo && find . -name "rms_norm.py" -path "*/qualcomm/*" | head -5

Repository: UbiquitousLearning/mllm

Length of output: 122


🏁 Script executed:

# Search for the file in common locations
fd "rms_norm.py" . --type f 2>/dev/null | grep -i qualcomm

Repository: UbiquitousLearning/mllm

Length of output: 123


🏁 Script executed:

# Broader search
fd "rms_norm" . --type f 2>/dev/null

Repository: UbiquitousLearning/mllm

Length of output: 123


🏁 Script executed:

cat -n ./pymllm/backends/qualcomm/transformers/core/rms_norm.py

Repository: UbiquitousLearning/mllm

Length of output: 5318


🌐 Web query:

PyTorch torch.quint8 torch.quint16 quantization dtype support

💡 Result:

Short answer: PyTorch's built-in quantized dtypes are torch.quint8, torch.qint8 and torch.qint32. torch.quint16 is not provided as a native quantized dtype (and is not supported by torch.quantize_per_tensor / the core quantization APIs). New lower-bit/int dtypes (uint1–uint8, int1–int8) and other numeric types are being discussed/added in torch/torch.ao, but that is separate from the classic quantized-dtype API. [1][2][3]

Sources:

  • PyTorch quantization API (lists supported quantized dtypes: quint8, qint8, qint32). [1]
  • torch.quantize_per_tensor docs (dtype parameter accepted values: torch.quint8, torch.qint8, torch.qint32). [2]
  • torch.ao / developer discussion on adding broader integer dtypes and roadmap (notes about uint/int variants and that quantized-tensor support is evolving). [3]

Fix dtype and quantization range mismatch between training and deployment.

The code uses an unsigned quantization range [0, 2^quant_bits - 1] with dtype=torch.qint32 during training, but converts weights to signed integer dtypes (int8, int16, int32) during deployment. This creates a critical mismatch:

  • 8-bit case: Quantizes to unsigned range [0, 255], but converts to torch.int8 (signed range [-128, 127]), causing overflow for values 128–255.
  • 16-bit case: Quantizes to unsigned range [0, 65535], but converts to torch.int16 (signed range [-32768, 32767]), causing overflow for values 32768–65535.

PyTorch's quantization API only supports torch.quint8, torch.qint8, and torch.qint32—not torch.quint16.

Solution: Make the training range and the deployment dtype agree. For 8-bit, either keep the unsigned [0, 255] range with torch.quint8 and deploy to an unsigned buffer, or switch to torch.qint8 with a signed range [-128, 127] to match the int8 cast. For 16-bit, use qint32 during training with an explicitly signed range, or find an alternative approach, since PyTorch does not provide quint16.
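
For the 8-bit case, a FakeQuantize configuration whose range agrees with the deployment dtype could look like the sketch below. Both variants are illustrative; which one to keep depends on whether the QNN backend expects signed or unsigned weight storage:

import torch
from torch.ao.quantization import FakeQuantize, MinMaxObserver

# Signed 8-bit: the [-128, 127] range matches a later cast to torch.int8.
signed_fq = FakeQuantize(
    observer=MinMaxObserver.with_args(
        qscheme=torch.per_tensor_affine, dtype=torch.qint8
    ),
    quant_min=-128,
    quant_max=127,
    dtype=torch.qint8,
    qscheme=torch.per_tensor_affine,
)

# Unsigned 8-bit: keep the [0, 255] range and deploy to an unsigned buffer instead.
unsigned_fq = FakeQuantize(
    observer=MinMaxObserver.with_args(
        qscheme=torch.per_tensor_affine, dtype=torch.quint8
    ),
    quant_min=0,
    quant_max=255,
    dtype=torch.quint8,
    qscheme=torch.per_tensor_affine,
)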

Comment on lines +61 to +76
# 2. Calculate integer values
# w_int = round(w / s + zp)
w_int = torch.round(self.weight / scale + zero_point).clamp(
quant_min, quant_max
)

# 3. Set target integer type
if self.quant_bits <= 8:
target_dtype = torch.int8
elif self.quant_bits <= 16:
target_dtype = torch.int16
else:
target_dtype = torch.int32

w_int = w_int.to(target_dtype)

Contributor


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

cat -n pymllm/backends/qualcomm/transformers/core/rms_norm.py | head -100

Repository: UbiquitousLearning/mllm

Length of output: 4106


🏁 Script executed:

# Search for quant_min and quant_max definitions in the file
rg -n "quant_min|quant_max" pymllm/backends/qualcomm/transformers/core/rms_norm.py -B2 -A2

Repository: UbiquitousLearning/mllm

Length of output: 744


🏁 Script executed:

# Search for the __init__ method to see the quantization scheme initialization
rg -n "__init__|quant_min|quant_max" pymllm/backends/qualcomm/transformers/core/rms_norm.py -B1 -A1

Repository: UbiquitousLearning/mllm

Length of output: 633


🏁 Script executed:

# Look for QNN backend quantization examples or documentation
fd -e py -e md -e txt . pymllm/backends/qualcomm/ | head -20

Repository: UbiquitousLearning/mllm

Length of output: 690


🏁 Script executed:

# Check if there are other quantization classes that might show the intended pattern
rg -n "quant_min|quant_max|int8|int16" pymllm/backends/qualcomm/transformers/core/ -A3 -B3

Repository: UbiquitousLearning/mllm

Length of output: 8119


🏁 Script executed:

# Look at the QDQ (Quantize-Dequantize) module
cat -n pymllm/backends/qualcomm/transformers/core/qdq.py

Repository: UbiquitousLearning/mllm

Length of output: 3446


🏁 Script executed:

# Look at the QLinear module
cat -n pymllm/backends/qualcomm/transformers/core/qlinear.py | head -150

Repository: UbiquitousLearning/mllm

Length of output: 6582


🏁 Script executed:

# Search for tests related to QRMSNorm
fd -e py . pymllm/ | xargs grep -l "QRMSNorm" 2>/dev/null

Repository: UbiquitousLearning/mllm

Length of output: 237


🏁 Script executed:

# Look at the README for quantization context
cat pymllm/backends/qualcomm/transformers/README.md

Repository: UbiquitousLearning/mllm

Length of output: 263


Critical: Signed/unsigned dtype mismatch will corrupt quantized weights.

The quantization uses an unsigned range [0, 2^quant_bits - 1] (asymmetric scheme), but convert_to_deploy casts to signed integer types:

  • 8-bit: w_int clamped to [0, 255] but cast to int8 with range [-128, 127] → values above 127 will overflow
  • 16-bit: w_int clamped to [0, 65535] but cast to int16 with range [-32768, 32767] → values above 32767 will overflow

Fix by either using unsigned dtypes (if the QNN backend supports them) or switching to symmetric quantization with signed ranges:

         # 3. Set target integer type
         if self.quant_bits <= 8:
-            target_dtype = torch.int8
+            target_dtype = torch.uint8
         elif self.quant_bits <= 16:
-            target_dtype = torch.int16
+            target_dtype = torch.uint16
         else:
             target_dtype = torch.int32

Or in __init__, change to symmetric quantization matching other quantization modules in the codebase:

-            quant_min=0,
-            quant_max=2 ** (quant_bits) - 1,
+            quant_min=-(2 ** (quant_bits - 1)),
+            quant_max=2 ** (quant_bits - 1) - 1,

Committable suggestion skipped: line range outside the PR's diff.

