
feat(qualcomm): Qnn aot runner #603

Merged
chenghuaWang merged 8 commits into UbiquitousLearning:main from oreomaker:qnn-aot
Jan 17, 2026

Conversation

@oreomaker
Collaborator

@oreomaker oreomaker commented Jan 14, 2026

Please check Guidelines for Contributing.

Summary by CodeRabbit

  • New Features

    • QNN AOT runtime for autoregressive token generation with streaming output and KV-cache-backed attention
    • New example runner demonstrating streaming token output and CLI options
    • QNN backend: load/create/save context support (configurable context file)
  • Refactor

    • Simplified QNN graph input handling and removed verbose debug scaffolding to reduce runtime noise and complexity

✏️ Tip: You can customize this high-level summary in your review settings.

…alization

- Add loadContext and createContext methods to QNNBackend for flexible
  context management
- Modify initQnnBackend to accept optional context path parameter
- Update QNNRuntime::retrieveContext to accept context binary path
  as parameter
- Remove hardcoded context creation logic from constructor
- Add filesystem include for context file existence checking
- Refactor tensor wrapper allocation logic to use setDataContainer
- Remove debug logging and verification code
- Simplify input tensor handling during graph execution
- Add MLLM_BUILD_QNN_BACKEND condition to enable QNN AOT example
- Create new aot_run.cpp with Qwen3 AOT runner implementation
- Implement KVCacheManager for managing KV cache operations
- Add PromptProcessor for handling prompt prefill operations
- Create QnnAOTModule as base module for AOT runtime
- Implement QnnAOTRuntime with complete generation pipeline
- Update CMakeLists.txt to include new AOT runtime sources
@coderabbitai
Contributor

coderabbitai Bot commented Jan 14, 2026

📝 Walkthrough

Adds a Qualcomm QNN AOT runtime for Qwen3: KV-cache manager, prompt processor, token generator, AOT module and Runner; exposes context create/load APIs in QNN backend; updates build and initQnnBackend to accept a context path; adds example runner executable and related CMake changes.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Build / Examples**<br>examples/CMakeLists.txt, examples/qwen3_qnn_aot/CMakeLists.txt, examples/qwen3_qnn_aot/* | Conditional inclusion of qwen3_qnn_aot based on MLLM_BUILD_QNN_BACKEND or MLLM_QUALCOMM_QNN_AOT_ON_X86_ENABLE; adds mllm-qwen3-aot-runner and makes mllm-qwen3-aot-c conditional. |
| **Example runner**<br>examples/qwen3_qnn_aot/aot_run.cpp | New CLI demo executable that loads tokenizer/model, initializes the AOT Runner, and streams generated tokens. |
| **QNN backend build sources**<br>mllm/backends/qnn/CMakeLists.txt | Adds recursive AOT runtime sources (aot_rt) to the QNN backend build. |
| **QNN backend core**<br>mllm/backends/qnn/QNNBackend.hpp, mllm/backends/qnn/QNNBackend.cpp, mllm/backends/qnn/QNNRuntime* | New public APIs: createContext(), loadContext(path); QNNRuntime::retrieveContext now takes a binary path; graphExecute input handling simplified (removed copying/padding) and debug scaffolding reduced. Attention: context lifecycle and retrieveContext call sites. |
| **QNNTensor utilities**<br>mllm/backends/qnn/QNNUtils.hpp, mllm/backends/qnn/QNNUtils.cpp | Exposes QNNTensorWrapper::isAlloc() and __setDataContainer(...); alloc() no longer early-returns when already allocated (warns instead). Review allocation correctness and invariants. |
| **Backend registration / API change**<br>mllm/backends/qnn/Register.cpp, mllm/mllm.hpp | initQnnBackend now accepts context_path (default "qnn_context.bin"); Register loads or creates the context based on filesystem existence. Update callers. |
| **AOT runtime — KV cache**<br>mllm/backends/qnn/aot_rt/KVCacheManager.hpp, mllm/backends/qnn/aot_rt/KVCacheManager.cpp | New templated KVCacheManager managing per-layer K/V buffers, attention mask init/update, rearrange/update logic; explicit instantiations for uint16_t/uint8_t. Large; review memory/alloc patterns. |
| **AOT runtime — Prompt processing**<br>mllm/backends/qnn/aot_rt/PromptProcessor.hpp, mllm/backends/qnn/aot_rt/PromptProcessor.cpp | New templated PromptProcessor for chunked prefill, IO init/prep, and forward orchestration using the KV cache. Review tensor binding and iteration semantics. |
| **AOT runtime — Module abstraction**<br>mllm/backends/qnn/aot_rt/QnnAOTModule.hpp, mllm/backends/qnn/aot_rt/QnnAOTModule.cpp | New QnnAOTModule deriving Module and ARGeneration stub; provides forward and an output tensor setter. Check the NYI ARGeneration path. |
| **AOT runtime — Token generation**<br>mllm/backends/qnn/aot_rt/TokenGenerator.hpp, mllm/backends/qnn/aot_rt/TokenGenerator.cpp | New templated TokenGenerator implementing autoregressive generation, EOS handling, KV updates, logits accumulation; explicit instantiations. Review sampling and EOS checks. |
| **AOT runtime — Runner interface**<br>mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp, mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp | New RunnerConfig and Runner class with load() and generate(...) implementing the high-level flow wiring tokenizer, KV cache, PromptProcessor, and TokenGenerator. |
| **AOT tools / wrappers**<br>mllm/backends/qnn/aot/QnnWrappersAPI.cpp, mllm/backends/qnn/aot/visitor/RMSNorm.cpp | Implemented QnnAOTEnv::saveContext (file write) and added a runtime fake bias/quant spec for RMSNorm nodes. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Runner
    participant KVCacheMgr as KVCacheManager<T>
    participant PromptProc as PromptProcessor<T>
    participant TokenGen as TokenGenerator<T>
    participant QNNBackend

    Client->>Runner: load()
    Runner->>QNNBackend: get backend / allocator
    Runner->>KVCacheMgr: initCache(allocator, ar_len)
    Runner->>PromptProc: construct + init_io()
    Runner->>TokenGen: construct + init_io()

    Client->>Runner: generate(prompt_tokens, seq_len, callback)
    Runner->>PromptProc: prefill(prompt_tokens, start=0)
    PromptProc->>KVCacheMgr: rearrangeCache / initAttentionMask
    PromptProc->>PromptProc: module forward (QnnAOTModule)
    PromptProc->>KVCacheMgr: updateCache(...)
    PromptProc-->>Runner: next_token

    loop autoregressive
        Runner->>TokenGen: generate(tokens, pos, seq_len, callback)
        TokenGen->>KVCacheMgr: prepare_io(...) / initAttentionMask
        TokenGen->>TokenGen: module forward
        TokenGen->>KVCacheMgr: updateCache(...)
        TokenGen-->>Client: token_callback(token_str)
    end
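For orientation, a hedged client-side sketch of this flow in C++ (namespace, constructor, and generate() signature are inferred from the diagram and the QnnAOTRuntime.hpp summary below; paths and the callback body are placeholders):

#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

int main() {
  // RunnerConfig and Runner come from mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp;
  // the bare names are used here for brevity.
  RunnerConfig cfg;              // model-specific fields left at their defaults
  Runner runner(cfg);            // explicit constructor per the review notes
  if (!runner.load()) return 1;  // wires backend, KV cache, PromptProcessor, TokenGenerator

  std::vector<uint64_t> prompt_tokens = {/* tokenized prompt */};
  runner.generate(prompt_tokens, /*seq_len=*/128,
                  [](const std::string& tok) { std::fputs(tok.c_str(), stdout); });
  return 0;
}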

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly Related PRs

Suggested Reviewers

  • liang1232018
  • chenghuaWang
  • yirongjie

Poem

🐰 I nibble bytes and hop through code so bright,
KV caches tucked, I keep the model light,
Tokens tumble out, a playful, streaming art —
Hooray for AOT, a hopping-start! 🥕✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description only contains the repository's contribution template reminder and a link to the contributing guidelines, but provides no actual description of the changes, objectives, or implementation details of the PR. | Add a comprehensive description explaining: (1) the purpose of the QNN AOT runner, (2) key components added (KVCacheManager, PromptProcessor, TokenGenerator, etc.), (3) why these changes are needed, and (4) any testing performed. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 1.82%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (1 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title 'feat(qualcomm): Qnn aot runner' clearly describes the main change: adding a QNN AOT runner for the Qualcomm backend, which aligns with the substantial new files and functionality introduced in the changeset. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
mllm/backends/qnn/QNNBackend.cpp (3)

333-343: Missing file open check before reading.

The code opens the file and immediately calls tellg() without verifying the file opened successfully. If the file doesn't exist or can't be read, tellg() returns -1, and subsequent operations will fail or exhibit undefined behavior.

🐛 Proposed fix
 bool QNNRuntime::retrieveContext(const std::string& contextBinaryPath, Qnn_ContextHandle_t& context,
                                  std::vector<std::shared_ptr<QNNModel>>& qnnModels, QnnContext_Config_t** contextConfig) {
   // Read the binary from qnn_context.bin and get the size in byte
   std::ifstream file(contextBinaryPath, std::ios::binary | std::ios::ate);
+  if (!file.is_open()) {
+    MLLM_ERROR("Failed to open context binary file: {}", contextBinaryPath);
+    return false;
+  }
   std::streamsize size = file.tellg();
+  if (size <= 0) {
+    MLLM_ERROR("Context binary file is empty or unreadable: {}", contextBinaryPath);
+    return false;
+  }
   file.seekg(0, std::ios::beg);

371-374: Unused variable writtenSize.

The variable writtenSize is declared but never used. Either remove it or use it to validate the context creation result.

🐛 Proposed fix
   // Create context from binary
-  Qnn_ContextBinarySize_t writtenSize = 0;
-  qnnInterface.contextCreateFromBinary(backendHandle, deviceHandle, (const QnnContext_Config_t**)contextConfig,
-                                       binaryBuffer.get(), size, &context, profileHandle);
+  if (QNN_CONTEXT_NO_ERROR != qnnInterface.contextCreateFromBinary(
+      backendHandle, deviceHandle, (const QnnContext_Config_t**)contextConfig,
+      binaryBuffer.get(), size, &context, profileHandle)) {
+    MLLM_ERROR("Failed to create context from binary: {}", contextBinaryPath);
+    return false;
+  }

417-417: Log message contains hardcoded filename instead of actual path.

The log message says "qnn_context.bin" but should reference the actual contextBinaryPath parameter for accurate logging.

🐛 Proposed fix
-  MLLM_INFO("QNN context retrieved from qnn_context.bin with {} QNNModels", graphNum);
+  MLLM_INFO("QNN context retrieved from {} with {} QNNModels", contextBinaryPath, graphNum);
🤖 Fix all issues with AI agents
In `@examples/qwen3_qnn_aot/aot_run.cpp`:
- Around line 22-27: Move the help flag check (help.isSet() /
Argparse::printHelp()) to before any backend initialization so the app exits
without calling mllm::initQnnBackend when the user only requested help; also fix
the initQnnBackend call to pass the QNN context binary (or omit the argument to
use the default "qnn_context.bin") instead of model_path.get() (e.g., replace
model_path.get() with the correct context path or no argument in the
mllm::initQnnBackend(...) call).

In `@mllm/backends/qnn/aot_rt/KVCacheManager.cpp`:
- Around line 35-38: After calling allocator->alloc(k_storage_in) ensure you
check that k_storage_in->ptr_ is non-null before calling memset; if allocation
failed, log or return an error (or throw) and avoid dereferencing the pointer.
Apply the same null-check pattern for every allocator->alloc(...) use in this
file (e.g., any other Storage instances created in KVCacheManager methods) so
you never call memset or access ptr_ when allocation failed.
- Around line 270-273: The loop building true_indices can leave elements
uninitialized if selected has fewer than n_update true values; change the logic
to first count actual trues (or use push_back) and either resize true_indices to
that actual count or validate that actual_count == n_update and handle the
mismatch (log/throw/return error) before using true_indices; update the code
that consumes true_indices (same scope where true_indices, selected, and
n_update are used) to rely on the validated/adjusted size to avoid reading
default zeros.

In `@mllm/backends/qnn/aot_rt/PromptProcessor.cpp`:
- Around line 145-146: The call to kv_manager_->initAttentionMask is using the
wrong tensor index (input_tensors_[3]) which points at a KV cache; change it to
use the attention-mask tensor created in init_io() (input_tensors_[2]) or,
better, replace the hardcoded index with the dedicated attn_mask member if one
exists; update the call to pass the correct ptr (e.g.,
input_tensors_[2].ptr<uint16_t>() or the attn_mask member) while keeping the
same arguments (attention_map, config_.ar_len, start_pos) so initAttentionMask
receives the actual attention mask instead of a past_key buffer.
- Around line 162-163: The attention-mask update call in PromptProcessor.cpp is
using the wrong tensor index: change the call to
kv_manager_->updateAttentionMask(...) to use input_tensors_[2].ptr<uint16_t>()
instead of input_tensors_[3].ptr<uint16_t>(); locate the updateAttentionMask
invocation in PromptProcessor.cpp (the kv_manager_->updateAttentionMask(...)
line) and replace the index so the attention mask source uses input_tensors_[2],
leaving other arguments (config_.ar_len, current_pos, n_update,
config_.sliding_window) unchanged.

In `@mllm/backends/qnn/aot_rt/QnnAOTModule.hpp`:
- Around line 26-33: The private member backend_path_ in class QnnAOTModule is
unused and should be removed; edit QnnAOTModule.hpp to delete the backend_path_
declaration from the private section (and remove any related comments or TODOs
referring to backend_path_), then rebuild/compile to ensure no missing
references — if any references exist elsewhere (e.g., in QnnAOTModule methods or
tests), remove or refactor those usages accordingly.

In `@mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp`:
- Around line 82-97: The call to tokenizer_->detokenize(next_token) in
Runner::generate can dereference a null tokenizer_; add a null check before
using tokenizer_ (e.g., if (token_callback) { if (tokenizer_) { wstr =
tokenizer_->detokenize(next_token); } else { wstr = L""; } ... }) and ensure you
still invoke token_callback with a sensible fallback (empty string or numeric
token string) when tokenizer_ is null; update the code paths around tokenizer_,
detokenize, and token_callback accordingly so no null access occurs.

In `@mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp`:
- Around line 1-12: The header QnnAOTRuntime.hpp is missing an include guard;
add a header guard or a single-line `#pragma` once at the very top of
QnnAOTRuntime.hpp (before the existing includes) and, if using the traditional
guard, wrap the entire file with a unique macro (e.g.,
QNN_AOT_RT_QNNAOTRUNTIME_HPP) and close it with an `#endif` (with a comment) at
the end of the file to prevent multiple inclusion.

In `@mllm/backends/qnn/aot_rt/TokenGenerator.cpp`:
- Around line 107-108: The generate method currently ignores the dump_logits
parameter and never fills token_all_logits_, so either implement logits
collection when dump_logits is true or remove the unused API; to fix, update
TokenGenerator<T>::generate to, when dump_logits is true, capture per-step/logit
outputs from the model/inference path and append them into the member
token_all_logits_ (ensure it is cleared at start of generation), and also expose
those logits via get_all_logits(); alternatively, if logits dumping is not
needed, remove the dump_logits parameter from generate and delete
token_all_logits_ and get_all_logits() to avoid dead code — locate
TokenGenerator<T>::generate, the token_all_logits_ member and get_all_logits()
to apply the change.
- Around line 22-24: The input_ids tensor is created with Tensor::empty(...,
kInt64, ...) but prepare_io reads/writes it as int32_t*, causing a dtype
mismatch; update the creation or the access so both use the same type—preferably
change Tensor::empty call that constructs input_ids to use kInt32 to match
PromptProcessor.cpp (and adjust any Tensor::empty calls creating input_ids) OR
if you must keep 64-bit, change prepare_io to use int64_t* when accessing
input_ids; ensure the symbol names input_ids, Tensor::empty and prepare_io are
updated consistently so allocation dtype and pointer casts match.

In `@mllm/backends/qnn/QNNBackend.cpp`:
- Around line 552-553: The wrapper reuse causes stale inputs because
wrapper->isAlloc() prevents updating dataContainer_; change the flow so the
wrapper's data container is refreshed each graphExecute: either always call
wrapper->__setDataContainer(runtime_input) before allocation/usage (i.e., do not
gate it on wrapper->isAlloc()), or add/invoke a reset method that clears
isAlloc_ and dataContainer_ on reuse (implement resetWrapper/resetAllocation
that zeroes dataContainer_ and sets isAlloc_ = false and call it at start of
graphExecute). Ensure __setDataContainer(), isAlloc_ and dataContainer_ are the
referenced symbols updated so allocation uses the current runtime_input.

In `@mllm/backends/qnn/QNNBackend.hpp`:
- Around line 53-54: Add clear docstrings for the public methods retrieveContext
and createContext in QNNBackend.hpp: for each method describe its purpose, list
and explain every parameter (e.g., contextBinaryPath, context, qnnModels,
contextConfig for retrieveContext and relevant params for createContext), state
the return value semantics (what a true/false or success/failure indicates) and
document possible errors/exceptions or failure conditions and side effects
(e.g., ownership of pointers, when contextConfig may be modified, thread-safety
expectations). Place these comments immediately above the function declarations
so they become part of the public API docs.
🧹 Nitpick comments (16)
mllm/backends/qnn/QNNBackend.hpp (1)

90-92: Add documentation for new public APIs.

Per coding guidelines, public APIs should have clear docstrings or comments explaining purpose, parameters, returns, and errors. Consider adding brief documentation:

📝 Suggested documentation
+  /**
+   * @brief Loads a QNN context from a binary file.
+   * @param contextPath Path to the context binary file.
+   * @return true if context was loaded successfully, false otherwise.
+   */
   bool loadContext(const std::string& contextPath);
+
+  /**
+   * @brief Creates a new QNN context.
+   * @return true if context was created successfully, false otherwise.
+   */
   bool createContext();
mllm/backends/qnn/QNNUtils.hpp (1)

208-213: Consider renaming __setDataContainer to avoid reserved naming convention.

In C++, identifiers containing a double underscore (__) anywhere are reserved for the implementation. While this works in practice, a single underscore suffix or a different naming pattern would be more idiomatic.

The implementation logic is correct - the assertion guards against overwriting an existing container, and isAlloc_ is properly updated only for non-nil tensors.

♻️ Suggested rename
-  void __setDataContainer(const Tensor& tensor) {
+  void setDataContainerUnchecked(const Tensor& tensor) {
     MLLM_RT_ASSERT(dataContainer_.isNil())
     dataContainer_ = tensor;
     if (!tensor.isNil()) { isAlloc_ = true; }
   }
examples/qwen3_qnn_aot/aot_run.cpp (3)

38-39: Consider making context_len configurable or using a named constant.

The hardcoded value 1024 for context_len is a magic number. Consider either:

  1. Adding a CLI argument for it (like --ar_len)
  2. Using a named constant with documentation
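
For option 2, a minimal sketch (constant name and usage site assumed):

// Maximum sequence length baked into the AOT-compiled graphs; must match the
// value the graphs were compiled with.
constexpr int32_t kDefaultContextLen = 1024;
// ...
config.context_len = kDefaultContextLen;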

55-58: Type conversion from int64_t to uint64_t may lose sign information.

While token IDs are typically non-negative, the explicit cast from int64_t to uint64_t could mask unexpected negative values. Consider adding a debug assertion or using static_cast with a comment explaining the assumption.

♻️ Suggested improvement
   std::vector<uint64_t> prompt_tokens;
   auto sequence = input_tensor["sequence"];
   int64_t* ptr = sequence.ptr<int64_t>();
-  for (int i = 0; i < sequence.shape()[1]; ++i) { prompt_tokens.push_back((uint64_t)ptr[i]); }
+  for (int i = 0; i < sequence.shape()[1]; ++i) {
+    MLLM_RT_ASSERT(ptr[i] >= 0);  // Token IDs should be non-negative
+    prompt_tokens.push_back(static_cast<uint64_t>(ptr[i]));
+  }

43-48: Add validation for empty prompt input.

If the user enters an empty string or just whitespace, the code proceeds without validation. Consider adding a check to handle this gracefully, especially since convertMessage behavior with empty input may be undefined.

🔧 Suggested validation
   std::string prompt_text;
   fmt::print("💬 Prompt text (or 'exit/quit'): ");
   std::getline(std::cin, prompt_text);

+  if (prompt_text.empty() || prompt_text == "exit" || prompt_text == "quit") {
+    return 0;
+  }
+
   auto input_tensor = tokenizer.convertMessage({.prompt = prompt_text});
mllm/backends/qnn/aot_rt/QnnAOTModule.hpp (1)

12-23: Class design looks reasonable; add documentation for public API.

The dual inheritance pattern from mllm::nn::Module and models::ARGeneration provides flexibility for both module-based and autoregressive generation interfaces. The NYI stub for the ARGeneration forward is appropriate.

Per coding guidelines, consider adding class-level and method documentation explaining the purpose and usage of QnnAOTModule.
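
One possible shape for that documentation, worded from the walkthrough summary above (illustrative only):

/**
 * @brief Thin nn::Module wrapper around a precompiled QNN graph.
 *
 * forward() performs no computation itself: it returns the output tensors
 * previously bound via setOutputTensors(), since execution happens inside the
 * QNN context. The models::ARGeneration forward path is currently an NYI stub.
 */
class QnnAOTModule : public mllm::nn::Module, public models::ARGeneration { /* ... */ };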

mllm/backends/qnn/aot_rt/QnnAOTModule.cpp (1)

12-15: Mark unused parameters to suppress compiler warnings.

The inputs and args parameters are unused. This is likely intentional since output_tensors_ is populated via setOutputTensors() before forward() is called, but the unused parameters may trigger compiler warnings.

♻️ Suggested fix
 std::vector<mllm::Tensor> QnnAOTModule::forward(const std::vector<mllm::Tensor>& inputs,
                                                 const std::vector<mllm::AnyValue>& args) {
+  (void)inputs;
+  (void)args;
   return output_tensors_;
 }

Alternatively, use [[maybe_unused]] attribute in the header declaration.

examples/qwen3_qnn_aot/CMakeLists.txt (1)

8-10: Consider adding a build guard for the runner target.

The mllm-qwen3-aot-runner target links against MllmQNNBackend, but is defined unconditionally. According to the AI summary, the parent examples/CMakeLists.txt guards the entire subdirectory inclusion with MLLM_BUILD_QNN_BACKEND OR MLLM_QUALCOMM_QNN_AOT_ON_X86_ENABLE, so this should be safe. However, for clarity and consistency with the compile target, consider adding an explicit guard or a comment explaining the build conditions.

mllm/backends/qnn/aot_rt/PromptProcessor.hpp (1)

18-29: Consider unifying integer types in Config struct.

The Config struct mixes int32_t (context_len, ar_len, vocab_size, head_dim, sliding_window) with int64_t (num_heads, num_layers). Unless there's a specific reason for this (e.g., matching QNN API requirements), consider using consistent types to avoid implicit conversions.

mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp (1)

16-25: Consider documenting model-specific defaults.

The RunnerConfig defaults (e.g., vocab_size = 151936, num_layers = 28, num_heads = 12) appear to be Qwen3-specific. Consider adding a comment noting these are Qwen3 defaults, or extracting them to a model-specific configuration to improve reusability.

mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp (1)

67-70: Hardcoded EOS token IDs should be configurable.

The EOS token IDs 151643 and 151645 are hardcoded with a TODO comment. These are Qwen3-specific values. Consider adding them to RunnerConfig to improve configurability and reusability across different models.

Do you want me to propose a modification to add EOS IDs to RunnerConfig?
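
One possible shape for that change, assuming the existing RunnerConfig layout (the two IDs are the currently hardcoded Qwen3 values):

struct RunnerConfig {
  // ... existing fields (vocab_size, num_layers, num_heads, ...) ...

  // EOS token IDs checked during generation; Qwen3 defaults, override per model.
  std::vector<uint64_t> eos_ids = {151643, 151645};
};

Runner::load() would then pass config_.eos_ids to the TokenGenerator instead of the hardcoded literals.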

mllm/backends/qnn/aot_rt/PromptProcessor.cpp (1)

1-3: Minor: Leading blank line.

Line 1 is blank before the copyright header. This is a minor stylistic inconsistency.

mllm/backends/qnn/aot_rt/TokenGenerator.hpp (1)

42-47: Consider documenting ownership semantics for raw pointer members.

The class uses raw pointers for tokenizer_ and kv_manager_ (non-owning) and unique_ptr for module_ and eos_ids_ (owning). This is a valid design for dependency injection, but adding brief comments clarifying that the caller retains ownership of tokenizer_ and kv_manager_ would improve maintainability.
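
A sketch of how those comments could read (class body abbreviated; only the member names come from the PR, and the tokenizer type name is illustrative):

template <typename T>
class TokenGenerator {
  // ...
 private:
  // Non-owning: the caller (Runner) retains ownership and must keep these
  // alive for the lifetime of the TokenGenerator.
  Tokenizer* tokenizer_ = nullptr;
  KVCacheManager<T>* kv_manager_ = nullptr;

  // Owning: destroyed together with the TokenGenerator.
  std::unique_ptr<QnnAOTModule> module_;
  std::unique_ptr<std::vector<uint64_t>> eos_ids_;
};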

mllm/backends/qnn/aot_rt/KVCacheManager.hpp (2)

52-53: Add [[nodiscard]] attribute to accessor methods for consistency.

The static analysis correctly notes that getKCache() and getVCache() should be marked [[nodiscard]] for consistency with getTotalCacheSizeInBytes() (line 54). Discarding the return value of these accessors is almost certainly a bug.

♻️ Proposed fix
-  const std::vector<KVCache<T>>& getKCache() const { return k_cache_; }
-  const std::vector<KVCache<T>>& getVCache() const { return v_cache_; }
+  [[nodiscard]] const std::vector<KVCache<T>>& getKCache() const { return k_cache_; }
+  [[nodiscard]] const std::vector<KVCache<T>>& getVCache() const { return v_cache_; }

22-29: Consider consistent integer types in KVCacheConfig.

The struct mixes int32_t and int64_t types for similar dimension values. While not incorrect, this inconsistency may lead to implicit conversions. If there's no external API requirement, consider using a consistent type.

mllm/backends/qnn/aot_rt/KVCacheManager.cpp (1)

70-73: Using exit(1) for error handling is non-recoverable and not production-friendly.

Multiple error conditions use exit(1) which terminates the entire process without cleanup. Consider throwing exceptions or returning error codes to allow callers to handle errors gracefully.

♻️ Proposed alternative using exceptions
   if (attention_map.size() > ar_len) {
-    MLLM_ERROR("The size of attention_map ({}) doesn't match with ar_len ({})", attention_map.size(), ar_len);
-    exit(1);
+    throw std::invalid_argument(
+        fmt::format("The size of attention_map ({}) doesn't match with ar_len ({})", attention_map.size(), ar_len));
   }

Also applies to: 106-109, 241-245

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0eb79bb and fdb27ea.

📒 Files selected for processing (26)
  • examples/CMakeLists.txt
  • examples/qwen3_qnn_aot/CMakeLists.txt
  • examples/qwen3_qnn_aot/aot_run.cpp
  • mllm/backends/qnn/CMakeLists.txt
  • mllm/backends/qnn/QNNBackend.cpp
  • mllm/backends/qnn/QNNBackend.hpp
  • mllm/backends/qnn/QNNUtils.cpp
  • mllm/backends/qnn/QNNUtils.hpp
  • mllm/backends/qnn/Register.cpp
  • mllm/backends/qnn/aot_rt/KVCacheManager.cpp
  • mllm/backends/qnn/aot_rt/KVCacheManager.hpp
  • mllm/backends/qnn/aot_rt/PromptProcessor.cpp
  • mllm/backends/qnn/aot_rt/PromptProcessor.hpp
  • mllm/backends/qnn/aot_rt/QnnAOTModule.cpp
  • mllm/backends/qnn/aot_rt/QnnAOTModule.hpp
  • mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp
  • mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp
  • mllm/backends/qnn/aot_rt/TokenGenerator.cpp
  • mllm/backends/qnn/aot_rt/TokenGenerator.hpp
  • mllm/backends/qnn/aot_rt/utils/MaskGen.cpp
  • mllm/backends/qnn/aot_rt/utils/MaskGen.hpp
  • mllm/backends/qnn/aot_rt/utils/PositionIdGen.cpp
  • mllm/backends/qnn/aot_rt/utils/PositionIdGen.hpp
  • mllm/backends/qnn/aot_rt/utils/RoPEGen.cpp
  • mllm/backends/qnn/aot_rt/utils/RoPEGen.hpp
  • mllm/mllm.hpp
🧰 Additional context used
📓 Path-based instructions (5)
{mllm,mllm-cli,pymllm}/**/*

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*: Files must not contain C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, or DEL 0x7F. Horizontal tab (0x09) and line feed (0x0A) are explicitly allowed.
All files must be encoded in UTF-8 without BOM.
Any violation of character set (Rule 1) or encoding (Rule 2) requirements must cause the review to fail.
No line may end with trailing whitespace.
Use Unix line endings (LF).
File and directory names must consist only of printable Unicode characters, excluding C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, and DEL 0x7F.
Only use acceptable file extensions: .c, .cc, .cpp, .cxx, .h, .hh, .hpp, .py, .pyi, .sh, .txt, .md, .yml, .yaml, .json, .toml.
Optional license headers, if present, must comply with character set rules (no C0/C1 control codes except tab and line feed).

Files:

  • mllm/backends/qnn/CMakeLists.txt
  • mllm/backends/qnn/QNNUtils.hpp
  • mllm/mllm.hpp
  • mllm/backends/qnn/aot_rt/PromptProcessor.cpp
  • mllm/backends/qnn/aot_rt/TokenGenerator.hpp
  • mllm/backends/qnn/aot_rt/PromptProcessor.hpp
  • mllm/backends/qnn/QNNUtils.cpp
  • mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp
  • mllm/backends/qnn/QNNBackend.cpp
  • mllm/backends/qnn/aot_rt/KVCacheManager.hpp
  • mllm/backends/qnn/aot_rt/QnnAOTModule.cpp
  • mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp
  • mllm/backends/qnn/QNNBackend.hpp
  • mllm/backends/qnn/Register.cpp
  • mllm/backends/qnn/aot_rt/KVCacheManager.cpp
  • mllm/backends/qnn/aot_rt/QnnAOTModule.hpp
  • mllm/backends/qnn/aot_rt/TokenGenerator.cpp
{mllm,mllm-cli,pymllm}/**/*.{toml,json,txt,md}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

When suggesting new dependencies, evaluate their maturity, licensing, and maintenance status. Prefer well-established and actively maintained libraries.

Files:

  • mllm/backends/qnn/CMakeLists.txt
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}: TODO and FIXME comments must be written as 'TODO:' or 'FIXME:' followed by UTF-8 text that adheres to character set rules.
Encourage consistent coding style and patterns with the existing codebase.
Ensure code is portable across supported platforms (e.g., Linux, Windows) unless explicitly platform-specific.

Files:

  • mllm/backends/qnn/QNNUtils.hpp
  • mllm/mllm.hpp
  • mllm/backends/qnn/aot_rt/PromptProcessor.cpp
  • mllm/backends/qnn/aot_rt/TokenGenerator.hpp
  • mllm/backends/qnn/aot_rt/PromptProcessor.hpp
  • mllm/backends/qnn/QNNUtils.cpp
  • mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp
  • mllm/backends/qnn/QNNBackend.cpp
  • mllm/backends/qnn/aot_rt/KVCacheManager.hpp
  • mllm/backends/qnn/aot_rt/QnnAOTModule.cpp
  • mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp
  • mllm/backends/qnn/QNNBackend.hpp
  • mllm/backends/qnn/Register.cpp
  • mllm/backends/qnn/aot_rt/KVCacheManager.cpp
  • mllm/backends/qnn/aot_rt/QnnAOTModule.hpp
  • mllm/backends/qnn/aot_rt/TokenGenerator.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}: Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.
Adhere to language-specific best practices and idioms (e.g., PEP 8 for Python, Google C++ Style Guide for C++).

Files:

  • mllm/backends/qnn/QNNUtils.hpp
  • mllm/mllm.hpp
  • mllm/backends/qnn/aot_rt/PromptProcessor.cpp
  • mllm/backends/qnn/aot_rt/TokenGenerator.hpp
  • mllm/backends/qnn/aot_rt/PromptProcessor.hpp
  • mllm/backends/qnn/QNNUtils.cpp
  • mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp
  • mllm/backends/qnn/QNNBackend.cpp
  • mllm/backends/qnn/aot_rt/KVCacheManager.hpp
  • mllm/backends/qnn/aot_rt/QnnAOTModule.cpp
  • mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp
  • mllm/backends/qnn/QNNBackend.hpp
  • mllm/backends/qnn/Register.cpp
  • mllm/backends/qnn/aot_rt/KVCacheManager.cpp
  • mllm/backends/qnn/aot_rt/QnnAOTModule.hpp
  • mllm/backends/qnn/aot_rt/TokenGenerator.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}: Prioritize production-ready code quality by evaluating time and space complexity of algorithms and data structures, and suggest more efficient alternatives for operations with high complexity (e.g., O(n^2) or worse) when feasible.
Avoid unnecessary object creation in loops or hot paths.
Check for proper error handling and resource cleanup (e.g., using try-finally, context managers, or RAII).
Ensure functions that can fail return appropriate error codes or raise exceptions.
Validate inputs for public APIs and critical internal functions.
Add comments for complex algorithms or non-obvious logic.
Identify potential security issues (e.g., buffer overflows, injection risks, insecure temporary files) and recommend using secure alternatives (e.g., parameterized queries, secure random generators).
Suggest adding unit tests for untested complex logic or edge cases.
Ensure code is testable by avoiding global state and using dependency injection.
Flag overly complex functions (e.g., high cyclomatic complexity) and suggest breaking them down.
Use named constants instead of magic numbers.
Add appropriate logging (e.g., debug, info, warning, error) for significant events and errors, avoiding sensitive data exposure.

Files:

  • mllm/backends/qnn/aot_rt/PromptProcessor.cpp
  • mllm/backends/qnn/QNNUtils.cpp
  • mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp
  • mllm/backends/qnn/QNNBackend.cpp
  • mllm/backends/qnn/aot_rt/QnnAOTModule.cpp
  • mllm/backends/qnn/Register.cpp
  • mllm/backends/qnn/aot_rt/KVCacheManager.cpp
  • mllm/backends/qnn/aot_rt/TokenGenerator.cpp
🧬 Code graph analysis (12)
examples/qwen3_qnn_aot/aot_run.cpp (4)
mllm/utils/Argparse.hpp (1)
  • Argparse (135-135)
mllm/backends/qnn/aot_rt/KVCacheManager.hpp (1)
  • ar_len (50-50)
mllm/backends/qnn/aot_rt/PromptProcessor.hpp (2)
  • prompt_tokens (39-39)
  • prompt_tokens (42-42)
mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp (1)
  • prompt_tokens (33-34)
mllm/backends/qnn/aot_rt/TokenGenerator.hpp (1)
mllm/backends/qnn/aot_rt/TokenGenerator.cpp (1)
  • TokenGenerator (9-15)
mllm/backends/qnn/QNNUtils.cpp (1)
mllm/backends/qnn/QNNUtils.hpp (1)
  • isAlloc_ (208-208)
mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp (3)
mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp (3)
  • Runner (29-29)
  • Runner (30-30)
  • prompt_tokens (33-34)
mllm/backends/qnn/aot_rt/TokenGenerator.cpp (2)
  • generate (107-150)
  • generate (107-108)
mllm/backends/qnn/aot_rt/PromptProcessor.hpp (2)
  • prompt_tokens (39-39)
  • prompt_tokens (42-42)
mllm/backends/qnn/QNNBackend.cpp (2)
mllm/backends/qnn/QNNBackend.hpp (10)
  • contextBinaryPath (53-54)
  • context (52-52)
  • contextPath (90-90)
  • graphName (96-96)
  • graphName (98-102)
  • graphName (104-104)
  • graphName (106-106)
  • graphName (109-110)
  • graphName (112-113)
  • graphName (116-116)
mllm/backends/qnn/QNNModel.hpp (2)
  • context (39-40)
  • context (43-45)
mllm/backends/qnn/aot_rt/KVCacheManager.hpp (1)
mllm/backends/qnn/aot_rt/KVCacheManager.cpp (1)
  • KVCacheManager (12-20)
mllm/backends/qnn/aot_rt/QnnAOTModule.cpp (1)
mllm/backends/qnn/aot_rt/QnnAOTModule.hpp (2)
  • QnnAOTModule (14-14)
  • inputs (16-16)
mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp (2)
mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp (1)
  • Runner (12-13)
mllm/backends/qnn/aot_rt/PromptProcessor.hpp (2)
  • prompt_tokens (39-39)
  • prompt_tokens (42-42)
mllm/backends/qnn/QNNBackend.hpp (1)
mllm/backends/qnn/QNNModel.hpp (2)
  • context (39-40)
  • context (43-45)
mllm/backends/qnn/Register.cpp (3)
mllm/mllm.hpp (1)
  • isQnnAvailable (198-198)
mllm/mllm.cpp (2)
  • isQnnAvailable (93-98)
  • isQnnAvailable (93-93)
mllm/backends/qnn/aot/QnnWrappersAPI.hpp (1)
  • instance (171-174)
mllm/backends/qnn/aot_rt/KVCacheManager.cpp (1)
mllm/backends/qnn/aot_rt/KVCacheManager.hpp (15)
  • KVCacheManager (34-34)
  • KVCacheManager (35-35)
  • k_cache_ (52-52)
  • v_cache_ (53-53)
  • allocator (37-37)
  • ar_len (50-50)
  • attention_mask (40-40)
  • attention_mask (42-43)
  • attention_mask (45-45)
  • attention_mask (47-48)
  • ar_len_dst (38-38)
  • k_cache (57-57)
  • k_cache (59-59)
  • v_cache (58-58)
  • v_cache (60-60)
mllm/backends/qnn/aot_rt/TokenGenerator.cpp (3)
mllm/backends/qnn/aot_rt/TokenGenerator.hpp (4)
  • TokenGenerator (30-31)
  • TokenGenerator (33-33)
  • TokenGenerator (33-33)
  • tokens (39-40)
mllm/backends/qnn/aot_rt/PromptProcessor.cpp (4)
  • init_io (27-93)
  • init_io (27-27)
  • prepare_io (96-130)
  • prepare_io (96-96)
mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp (2)
  • generate (82-102)
  • generate (82-83)
🪛 Clang (14.0.6)
examples/qwen3_qnn_aot/aot_run.cpp

[error] 1-1: 'iostream' file not found

(clang-diagnostic-error)

mllm/backends/qnn/aot_rt/PromptProcessor.cpp

[error] 5-5: 'mllm/backends/qnn/aot_rt/PromptProcessor.hpp' file not found

(clang-diagnostic-error)

mllm/backends/qnn/aot_rt/TokenGenerator.hpp

[error] 3-3: 'mllm/backends/qnn/aot_rt/QnnAOTModule.hpp' file not found

(clang-diagnostic-error)

mllm/backends/qnn/aot_rt/PromptProcessor.hpp

[error] 6-6: 'mllm/backends/qnn/aot_rt/QnnAOTModule.hpp' file not found

(clang-diagnostic-error)

mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp

[error] 4-4: 'mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp' file not found

(clang-diagnostic-error)

mllm/backends/qnn/aot_rt/KVCacheManager.hpp

[error] 6-6: 'cstdint' file not found

(clang-diagnostic-error)


[error] 52-52: function 'getKCache' should be marked [[nodiscard]]

(modernize-use-nodiscard,-warnings-as-errors)


[error] 53-53: function 'getVCache' should be marked [[nodiscard]]

(modernize-use-nodiscard,-warnings-as-errors)

mllm/backends/qnn/aot_rt/QnnAOTModule.cpp

[error] 1-1: 'mllm/backends/qnn/aot_rt/QnnAOTModule.hpp' file not found

(clang-diagnostic-error)

mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp

[error] 4-4: 'mllm/backends/qnn/aot_rt/QnnAOTModule.hpp' file not found

(clang-diagnostic-error)

mllm/backends/qnn/aot_rt/KVCacheManager.cpp

[error] 4-4: 'mllm/backends/qnn/aot_rt/KVCacheManager.hpp' file not found

(clang-diagnostic-error)

mllm/backends/qnn/aot_rt/QnnAOTModule.hpp

[error] 3-3: 'mllm/models/ARGeneration.hpp' file not found

(clang-diagnostic-error)

mllm/backends/qnn/aot_rt/TokenGenerator.cpp

[error] 1-1: 'mllm/backends/qnn/aot_rt/TokenGenerator.hpp' file not found

(clang-diagnostic-error)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-android
  • GitHub Check: build-macos
  • GitHub Check: build-x86
🔇 Additional comments (16)
mllm/backends/qnn/QNNUtils.cpp (1)

484-496: No issue found — registerQnnTensorToSharedBuffer is explicitly designed to be idempotent.

The implementation checks if a pointer is already registered and handles it gracefully: if the pointer exists in ptrToFdAndMemHandleMap_, it simply updates the tensor's memory handle and returns (lines 62–68 of QNNAllocator.cpp). Only on first call does it perform the actual QNN memory registration. Multiple calls to alloc() on the same tensor are safe and will not cause double-registration or resource leaks.

mllm/backends/qnn/aot_rt/QnnAOTModule.cpp (1)

9-10: LGTM!

Constructor correctly initializes the base class with graph_name and stores both member variables.

mllm/backends/qnn/QNNBackend.cpp (1)

421-438: LGTM!

The new createContext and loadContext methods are well-structured. They properly initialize the QNN allocator after context creation/loading and loadContext correctly populates the qnnModelIndexMap_ for graph lookup.

examples/CMakeLists.txt (1)

20-22: LGTM!

The updated condition correctly enables the qwen3_qnn_aot example when either cross-compiling AOT on x86 or when building with QNN backend support on target.

mllm/mllm.hpp (1)

200-200: LGTM!

The optional context_path parameter with a default value maintains backward compatibility while enabling callers to specify custom context binary paths. This aligns well with the new loadContext API in QNNBackend.

mllm/backends/qnn/CMakeLists.txt (1)

24-28: The unconditional inclusion of aot_rt sources is correct.

The aot_rt sources must be compiled unconditionally because mllm-qwen3-aot-runner (examples/qwen3_qnn_aot/CMakeLists.txt, line 8) is built unconditionally and directly includes QnnAOTRuntime.hpp. While aot sources are conditional on MLLM_QUALCOMM_QNN_AOT_ON_X86_ENABLE (for AOT compilation), aot_rt sources provide the runtime components needed by the unconditionally-built runner executable.

Likely an incorrect or invalid review comment.

mllm/backends/qnn/Register.cpp (1)

23-27: Context handling logic looks correct.

The conditional load/create pattern is appropriate. However, consider whether the file existence check could race with another process modifying the file system between the check and the actual load operation (TOCTOU). In practice, for model context files this is likely acceptable, but worth noting.

One minor concern: std::filesystem requires C++17. Verify this is consistently available across all supported build configurations.
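
For reference, a minimal sketch of the load-or-create pattern (wrapper function hypothetical; loadContext() and createContext() are the new QNNBackend APIs from this PR):

#include <filesystem>
#include <string>

void ensureContext(QNNBackend& backend, const std::string& context_path) {
  // Best-effort existence check; a concurrent writer can still race us (TOCTOU),
  // so loadContext() must also fail gracefully on a missing or corrupt file.
  if (std::filesystem::exists(context_path) && backend.loadContext(context_path)) {
    return;  // reused the cached context binary
  }
  backend.createContext();  // no usable binary: build a fresh context
}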

mllm/backends/qnn/aot_rt/PromptProcessor.hpp (1)

31-51: Class structure is well-designed.

The use of a raw pointer for kv_manager_ correctly indicates non-owning reference semantics, while module_ uses unique_ptr for owned resources. The public API is clean with appropriate documentation for prefill().

mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp (1)

27-43: Runner class design is appropriate.

The explicit constructor prevents implicit conversions. The use of raw pointer for tokenizer_ correctly signals non-owning semantics, while unique_ptr is used for owned components. The load() method returning bool allows proper error handling.

mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp (1)

15-80: Load method implementation is well-structured.

The load() method properly:

  • Validates backend availability and returns false on failure
  • Uses unique_ptr for RAII-based resource management
  • Correctly chains initialization of components

One note: if initCache() or the init_io() calls on lines 75-77 can fail, consider adding error handling and returning false accordingly.

mllm/backends/qnn/aot_rt/TokenGenerator.cpp (2)

8-15: LGTM: Constructor properly initializes members and creates module.

The constructor correctly uses move semantics for eos_ids and config_, and properly initializes the QNN AOT module.


46-57: Verify KV cache tensor dimension ordering.

The input KV cache tensors use shape {1, num_heads, context_len, head_dim} for both K and V. However, in PromptProcessor.cpp, K uses {1, num_heads, head_dim, context_len - ar_len} (transposed layout) while V uses {1, num_heads, context_len - ar_len, head_dim}.

Please verify that this dimension ordering is intentional for the decode phase vs. prefill phase, as it affects how KVCacheManager::rearrangeKey/rearrangeValue operates.

mllm/backends/qnn/aot_rt/PromptProcessor.cpp (2)

48-67: LGTM: KV cache tensor setup with manual memory management.

The tensor shapes follow standard transformer patterns (K transposed for attention matmul). The manual memory management correctly ties tensors to externally managed KVCacheManager buffers.


95-130: LGTM: prepare_io correctly handles chunked input preparation.

The method properly handles input IDs with padding, position IDs, and attention mask initialization for the current chunk.

mllm/backends/qnn/aot_rt/TokenGenerator.hpp (1)

18-28: Config struct is well-defined with sensible defaults.

The Config struct captures all necessary model parameters. The kv_dtype default of kUInt8 aligns with the explicit instantiation for uint8_t.

mllm/backends/qnn/aot_rt/KVCacheManager.cpp (1)

188-237: LGTM: Cache rearrangement correctly handles overlapping memory regions.

The use of std::memmove instead of std::memcpy is correct since source and destination buffers overlap. The different iteration patterns for key vs. value correctly account for their different memory layouts (key is transposed).

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment thread examples/qwen3_qnn_aot/aot_run.cpp
Comment on lines +35 to +38
auto k_storage_in = std::make_shared<mllm::Storage>();
k_storage_in->size_ = cache_in_bytes;
allocator->alloc(k_storage_in);
memset(k_storage_in->ptr_, 0, cache_in_bytes);
Contributor

⚠️ Potential issue | 🟠 Major

Missing null check after allocation.

The code calls allocator->alloc(k_storage_in) but doesn't verify that allocation succeeded before calling memset on k_storage_in->ptr_. If the allocator fails, this will cause a null pointer dereference.

🐛 Proposed fix
     auto k_storage_in = std::make_shared<mllm::Storage>();
     k_storage_in->size_ = cache_in_bytes;
     allocator->alloc(k_storage_in);
+    if (!k_storage_in->ptr_) {
+      MLLM_ERROR("Failed to allocate k_storage_in for layer {}", layer);
+      throw std::runtime_error("KV cache allocation failed");
+    }
     memset(k_storage_in->ptr_, 0, cache_in_bytes);

Apply similar checks after each allocator->alloc() call.

🤖 Prompt for AI Agents
In `@mllm/backends/qnn/aot_rt/KVCacheManager.cpp` around lines 35 - 38, After
calling allocator->alloc(k_storage_in) ensure you check that k_storage_in->ptr_
is non-null before calling memset; if allocation failed, log or return an error
(or throw) and avoid dereferencing the pointer. Apply the same null-check
pattern for every allocator->alloc(...) use in this file (e.g., any other
Storage instances created in KVCacheManager methods) so you never call memset or
access ptr_ when allocation failed.

Comment on lines +270 to +273
std::vector<int32_t> true_indices(n_update);
for (int i = 0, j = 0; i < selected.size() && j < n_update; ++i) {
if (selected[i]) { true_indices[j++] = i; }
}
Contributor

⚠️ Potential issue | 🟠 Major

Potential use of uninitialized indices if selected has fewer true values than n_update.

The vector true_indices is initialized with size n_update, but if selected contains fewer than n_update true values, some elements remain at their default value (0). This could cause incorrect cache updates by reading from index 0 instead of valid indices.

🐛 Proposed fix: Validate index count
     std::vector<int32_t> true_indices(n_update);
+    int32_t found_count = 0;
     for (int i = 0, j = 0; i < selected.size() && j < n_update; ++i) {
-      if (selected[i]) { true_indices[j++] = i; }
+      if (selected[i]) { 
+        true_indices[j++] = i;
+        found_count++;
+      }
     }
+    if (found_count != n_update) {
+      MLLM_ERROR("Selected vector has {} true values, expected {}", found_count, n_update);
+      return;  // or throw
+    }
🤖 Prompt for AI Agents
In `@mllm/backends/qnn/aot_rt/KVCacheManager.cpp` around lines 270 - 273, The loop
building true_indices can leave elements uninitialized if selected has fewer
than n_update true values; change the logic to first count actual trues (or use
push_back) and either resize true_indices to that actual count or validate that
actual_count == n_update and handle the mismatch (log/throw/return error) before
using true_indices; update the code that consumes true_indices (same scope where
true_indices, selected, and n_update are used) to rely on the validated/adjusted
size to avoid reading default zeros.

Comment thread mllm/backends/qnn/aot_rt/PromptProcessor.cpp
Comment thread mllm/backends/qnn/aot_rt/PromptProcessor.cpp
Comment on lines +1 to +12
// Copyright (c) MLLM Team.
// Licensed under the MIT License.

#include "mllm/backends/qnn/aot_rt/QnnAOTModule.hpp"
#include "mllm/backends/qnn/aot_rt/KVCacheManager.hpp"
#include "mllm/backends/qnn/aot_rt/PromptProcessor.hpp"
#include "mllm/backends/qnn/aot_rt/TokenGenerator.hpp"
#include "mllm/preprocessor/tokenizers/AutoTokenizer.hpp"
#include <string>
#include <memory>
#include <vector>
#include <functional>
Contributor

⚠️ Potential issue | 🟠 Major

Missing include guard or #pragma once.

This header file lacks an include guard (#ifndef/#define/#endif) or #pragma once directive. This will cause compilation errors if the header is included multiple times in the same translation unit.

🐛 Proposed fix
 // Copyright (c) MLLM Team.
 // Licensed under the MIT License.
 
+#pragma once
+
 `#include` "mllm/backends/qnn/aot_rt/QnnAOTModule.hpp"
 `#include` "mllm/backends/qnn/aot_rt/KVCacheManager.hpp"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Copyright (c) MLLM Team.
// Licensed under the MIT License.

#pragma once

#include "mllm/backends/qnn/aot_rt/QnnAOTModule.hpp"
#include "mllm/backends/qnn/aot_rt/KVCacheManager.hpp"
#include "mllm/backends/qnn/aot_rt/PromptProcessor.hpp"
#include "mllm/backends/qnn/aot_rt/TokenGenerator.hpp"
#include "mllm/preprocessor/tokenizers/AutoTokenizer.hpp"
#include <string>
#include <memory>
#include <vector>
#include <functional>
🧰 Tools
🪛 Clang (14.0.6)

[error] 4-4: 'mllm/backends/qnn/aot_rt/QnnAOTModule.hpp' file not found

(clang-diagnostic-error)

🤖 Prompt for AI Agents
In `@mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp` around lines 1 - 12, The header
QnnAOTRuntime.hpp is missing an include guard; add a header guard or a
single-line `#pragma` once at the very top of QnnAOTRuntime.hpp (before the
existing includes) and, if using the traditional guard, wrap the entire file
with a unique macro (e.g., QNN_AOT_RT_QNNAOTRUNTIME_HPP) and close it with an
`#endif` (with a comment) at the end of the file to prevent multiple inclusion.

Comment on lines +22 to +24
auto input_ids = Tensor::empty({1, 1, 1, 1}, kInt64, kQNN).alloc();
input_ids.setName("input_ids");
input_tensors_.push_back(input_ids);
Contributor

⚠️ Potential issue | 🔴 Critical

Type mismatch: Tensor dtype kInt64 but accessed as int32_t*.

The input_ids tensor is created with kInt64 dtype (Line 22), but in prepare_io (Line 94), it's accessed as int32_t*. This mismatch will cause incorrect data writes. Note that PromptProcessor.cpp uses kInt32 for its input_ids tensor.

🐛 Proposed fix

Either change the tensor dtype to kInt32:

   // 1. Input IDs
-  auto input_ids = Tensor::empty({1, 1, 1, 1}, kInt64, kQNN).alloc();
+  auto input_ids = Tensor::empty({1, 1, 1, 1}, kInt32, kQNN).alloc();

Or update prepare_io to use int64_t*:

   // 1. Input IDs
-  int32_t* input_ids_ptr = input_tensors_[0].ptr<int32_t>();
-  input_ids_ptr[0] = (int32_t)cur_token;
+  int64_t* input_ids_ptr = input_tensors_[0].ptr<int64_t>();
+  input_ids_ptr[0] = (int64_t)cur_token;

Also applies to: 94-95

🤖 Prompt for AI Agents
In `@mllm/backends/qnn/aot_rt/TokenGenerator.cpp` around lines 22 - 24, The
input_ids tensor is created with Tensor::empty(..., kInt64, ...) but prepare_io
reads/writes it as int32_t*, causing a dtype mismatch; update the creation or
the access so both use the same type—preferably change Tensor::empty call that
constructs input_ids to use kInt32 to match PromptProcessor.cpp (and adjust any
Tensor::empty calls creating input_ids) OR if you must keep 64-bit, change
prepare_io to use int64_t* when accessing input_ids; ensure the symbol names
input_ids, Tensor::empty and prepare_io are updated consistently so allocation
dtype and pointer casts match.

Comment on lines +107 to +108
int64_t TokenGenerator<T>::generate(std::vector<uint64_t>& tokens, int64_t start_pos, int32_t seq_len,
const std::function<void(const std::string&)>& token_callback, bool dump_logits) {
Contributor

⚠️ Potential issue | 🟡 Minor

Unused dump_logits parameter and token_all_logits_ never populated.

The dump_logits parameter is declared but never used in the function body. Additionally, get_all_logits() returns token_all_logits_ which is never populated, so it will always return an empty vector.

If logits dumping is intended functionality, would you like me to generate the implementation that collects logits when dump_logits is true?

Also applies to: 129-133, 87-89
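
If the dump path is kept, a sketch of the collection branch inside the generation loop (assumptions: the module's last output tensor holds float logits of length vocab_size, and token_all_logits_ is a std::vector<std::vector<float>> cleared at the start of generate()):

if (dump_logits) {
  const float* logits = output_tensors_.back().ptr<float>();  // assumed output layout
  token_all_logits_.emplace_back(logits, logits + config_.vocab_size);
}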

🤖 Prompt for AI Agents
In `@mllm/backends/qnn/aot_rt/TokenGenerator.cpp` around lines 107 - 108, The
generate method currently ignores the dump_logits parameter and never fills
token_all_logits_, so either implement logits collection when dump_logits is
true or remove the unused API; to fix, update TokenGenerator<T>::generate to,
when dump_logits is true, capture per-step/logit outputs from the
model/inference path and append them into the member token_all_logits_ (ensure
it is cleared at start of generation), and also expose those logits via
get_all_logits(); alternatively, if logits dumping is not needed, remove the
dump_logits parameter from generate and delete token_all_logits_ and
get_all_logits() to avoid dead code — locate TokenGenerator<T>::generate, the
token_all_logits_ member and get_all_logits() to apply the change.

Comment on lines +552 to +553
// input wrapper is empty, set wrapper's dataContainer(mllm::Tensor)
if (!wrapper->isAlloc()) { wrapper->__setDataContainer(runtime_input); }
Contributor

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for usage patterns of __setDataContainer and isAlloc
rg -n -C3 '__setDataContainer|isAlloc' --type=cpp

Repository: UbiquitousLearning/mllm

Length of output: 3427


🏁 Script executed:

#!/bin/bash
# Get broader context around line 552-553 in QNNBackend.cpp
sed -n '520,580p' mllm/backends/qnn/QNNBackend.cpp

echo "---"
# Search for wrapper creation/reuse patterns
rg -n 'wrapper\s*=' mllm/backends/qnn/QNNBackend.cpp | head -20

Repository: UbiquitousLearning/mllm

Length of output: 2855


🏁 Script executed:

#!/bin/bash
# Search for wrapper lifecycle - how are input wrappers stored and reused?
rg -n 'getGraphInputTensorWrappers' mllm/backends/qnn/ --type=cpp

echo "---"
# Search for reset/cleanup methods in QNNTensorWrapper
rg -n 'reset|clear|cleanup' mllm/backends/qnn/QNNUtils.hpp --type=hpp

Repository: UbiquitousLearning/mllm

Length of output: 827


🏁 Script executed:

#!/bin/bash
# Search for reset/cleanup methods in QNNTensorWrapper without type specification
rg -n 'reset|clear|cleanup' mllm/backends/qnn/QNNUtils.hpp

echo "---"
# Look at the full QNNTensorWrapper class definition
sed -n '180,230p' mllm/backends/qnn/QNNUtils.hpp

Repository: UbiquitousLearning/mllm

Length of output: 2348


🏁 Script executed:

#!/bin/bash
# Check how wrappers are created and stored in the model
rg -n 'inputTensorWrappers_' mllm/backends/qnn/QNNModel.cpp -A2 -B2 | head -40

echo "---"
# Check if there's any mechanism to reset or recreate wrappers between executions
rg -n 'graphExecute' mllm/backends/qnn/QNNBackend.cpp | head -5

Repository: UbiquitousLearning/mllm

Length of output: 1164


🏁 Script executed:

#!/bin/bash
# Understand the alloc() method and whether it updates data
sed -n '484,496p' mllm/backends/qnn/QNNUtils.cpp

echo "---"
# Check the create method to understand initial wrapper state
rg -n 'QNNTensorWrapper::create' mllm/backends/qnn/QNNUtils.cpp -A15 | head -40

Repository: UbiquitousLearning/mllm

Length of output: 2599


🏁 Script executed:

#!/bin/bash
# Check the lifecycle of the wrapper between multiple executions
# Does graphExecute clear or reset wrappers?
rg -n 'clear|reset|new ' mllm/backends/qnn/QNNBackend.cpp | grep -i wrapper

echo "---"
# Check how the input wrapper is actually used after alloc() in graphExecute
sed -n '553,572p' mllm/backends/qnn/QNNBackend.cpp

Repository: UbiquitousLearning/mllm

Length of output: 1138


🏁 Script executed:

#!/bin/bash
# Final verification: Check if there's any mechanism to reset isAlloc_ or update data between executions
rg -n 'isAlloc_\s*=' mllm/backends/qnn/ --type=cpp

echo "---"
# Check comment or documentation about wrapper reuse
rg -n 'reuse|lifecycle|between execution' mllm/backends/qnn/QNNUtils.hpp

echo "---"
# Look for any destructor or cleanup in QNNModel
rg -n '~QNNModel|reset.*wrapper' mllm/backends/qnn/QNNModel.hpp

Repository: UbiquitousLearning/mllm

Length of output: 366


🏁 Script executed:

#!/bin/bash
# Final critical check: verify what happens on second execution
# Look at the actual behavior flow
echo "=== First Execution Flow ==="
echo "1. !wrapper->isAlloc() is TRUE (initial state: isAlloc_ = false)"
echo "2. Calls __setDataContainer(runtime_input) - sets dataContainer_ to runtime_input"
echo "3. Calls wrapper->alloc() - sets isAlloc_ = true"
echo ""
echo "=== Second Execution Flow ==="
echo "1. !wrapper->isAlloc() is FALSE (already allocated)"
echo "2. SKIPS __setDataContainer() - wrapper->dataContainer_ NOT updated"
echo "3. Calls wrapper->alloc() again - but dataContainer_ still has old data"
echo ""

# Verify the key insight: once isAlloc_ is true, it never becomes false
rg -n 'isAlloc_\s*=\s*false' mllm/backends/qnn/

Repository: UbiquitousLearning/mllm

Length of output: 575


Fix stale data bug when wrappers are reused across multiple graph executions.

The current logic skips data updates when wrapper->isAlloc() is true, but once a wrapper is allocated (first execution), isAlloc_ is never reset. On subsequent graphExecute calls with the same wrapper objects, the condition if (!wrapper->isAlloc()) evaluates to false, preventing __setDataContainer() from updating the wrapper's data. This causes stale input data from the previous execution to be reused.

To fix this, consider either:

  1. Creating fresh wrappers for each execution, or
  2. Resetting/updating the wrapper's data unconditionally before allocation (e.g., check if input data changed rather than allocation status), or
  3. Adding a reset mechanism to clear isAlloc_ and dataContainer_ between executions.
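
A minimal sketch of options 2 and 3 with stand-in types (the real QNNTensorWrapper lives in mllm/backends/qnn/QNNUtils.hpp; the names here only mirror the symbols referenced above):

#include <utility>
#include <vector>

struct Tensor { std::vector<float> data; };  // stand-in for mllm::Tensor

class TensorWrapperSketch {
 public:
  bool isAlloc() const { return isAlloc_; }
  void setDataContainer(Tensor t) { dataContainer_ = std::move(t); }
  void alloc() { isAlloc_ = true; }  // the real alloc() binds Qnn_Tensor_t memory
  // Option 3: explicit reset so a reused wrapper starts from a clean state.
  void reset() {
    dataContainer_ = Tensor{};
    isAlloc_ = false;
  }

 private:
  Tensor dataContainer_;
  bool isAlloc_ = false;
};

// Option 2: refresh the container unconditionally instead of gating on the
// allocation flag, so every graphExecute call sees the current input.
void bindInput(TensorWrapperSketch& wrapper, Tensor runtime_input) {
  wrapper.setDataContainer(std::move(runtime_input));  // always refresh
  if (!wrapper.isAlloc()) wrapper.alloc();
}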
🤖 Prompt for AI Agents
In `@mllm/backends/qnn/QNNBackend.cpp` around lines 552-553: wrapper reuse
causes stale inputs because the wrapper->isAlloc() gate prevents
dataContainer_ from being updated. Change the flow so the wrapper's data
container is refreshed on each graphExecute: either always call
wrapper->__setDataContainer(runtime_input) before allocation/usage (i.e., do
not gate it on wrapper->isAlloc()), or add and invoke a reset method that
clears isAlloc_ and dataContainer_ on reuse (implement a
resetWrapper/resetAllocation that zeroes dataContainer_, sets isAlloc_ =
false, and is called at the start of graphExecute). Ensure
__setDataContainer(), isAlloc_, and dataContainer_ are the symbols updated so
allocation uses the current runtime_input.

Comment on lines +53 to +54
bool retrieveContext(const std::string& contextBinaryPath, Qnn_ContextHandle_t& context,
std::vector<std::shared_ptr<QNNModel>>& qnnModels, QnnContext_Config_t** contextConfig = nullptr);

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, find and read the header file to confirm the signature change
find . -name "QNNBackend.hpp" -type f

Repository: UbiquitousLearning/mllm

Length of output: 101


🏁 Script executed:

# Search for all usages of retrieveContext in the codebase
rg -n "retrieveContext" --type=cpp --type=h -C 3

Repository: UbiquitousLearning/mllm

Length of output: 1879


🏁 Script executed:

# Search more comprehensively for all retrieveContext usages, including in tests and other files
rg -n "retrieveContext" --type=cpp --type=h --type=cc

Repository: UbiquitousLearning/mllm

Length of output: 93


🏁 Script executed:

# Read the header file to check for docstrings and API documentation
cat -n mllm/backends/qnn/QNNBackend.hpp | head -80

Repository: UbiquitousLearning/mllm

Length of output: 3618


🏁 Script executed:

# Also search in Python files in case there are bindings
rg -n "retrieveContext" --type=py

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

# Search more comprehensively across all source files
rg -n "retrieveContext" --type=cpp

Repository: UbiquitousLearning/mllm

Length of output: 464


🏁 Script executed:

# Check if there are any other file types that might use this API
find . -type f \( -name "*.cpp" -o -name "*.h" -o -name "*.hpp" \) | wc -l

Repository: UbiquitousLearning/mllm

Length of output: 71


🏁 Script executed:

# Look at the complete context around the method declaration to check for docstrings
sed -n '40,60p' mllm/backends/qnn/QNNBackend.hpp

Repository: UbiquitousLearning/mllm

Length of output: 874


🏁 Script executed:

# Check if createContext has docstrings (line 52)
sed -n '45,56p' mllm/backends/qnn/QNNBackend.hpp

Repository: UbiquitousLearning/mllm

Length of output: 686


🏁 Script executed:

# Check the complete implementation to understand the context better
sed -n '333,360p' mllm/backends/qnn/QNNBackend.cpp

Repository: UbiquitousLearning/mllm

Length of output: 1343


🏁 Script executed:

# Verify no other files in the project might be calling this method
find . -type f \( -name "*.cpp" -o -name "*.h" -o -name "*.hpp" \) -exec grep -l "QNNRuntime\|QNNBackend" {} \; | head -20

Repository: UbiquitousLearning/mllm

Length of output: 665


🏁 Script executed:

# Check if there are any examples or test files that might use the old API
find . -type f -name "*test*" -o -name "*example*" | grep -i qnn

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

# Double-check for any remaining usages of the old signature or other related calls
rg -n "retrieveContext\|loadContext\|createContext" mllm/backends/qnn/QNNBackend.cpp -A 2

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

# Check if the coding guideline about docstrings is being violated
# Look at similar public methods in other backends to see the pattern
find . -type f -name "*.hpp" -path "*/backends/*" -exec grep -l "bool.*(" {} \; | head -5 | xargs head -100

Repository: UbiquitousLearning/mllm

Length of output: 9600


🏁 Script executed:

# Verify the specific lines mentioned in the review (53-54) and surrounding context for any documentation
sed -n '48,58p' mllm/backends/qnn/QNNBackend.hpp

Repository: UbiquitousLearning/mllm

Length of output: 662


🏁 Script executed:

# Verify no trailing whitespace or other formatting issues in the header file
sed -n '52,55p' mllm/backends/qnn/QNNBackend.hpp | cat -A

Repository: UbiquitousLearning/mllm

Length of output: 383


🏁 Script executed:

# Check if there are any docstring patterns used elsewhere in QNN backend files
rg -n "///|//!" mllm/backends/qnn/QNNBackend.hpp | head -20

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

# Check Base Backend header for documentation patterns
rg -n "bool.*\(.*\);" mllm/backends/base/Backend.hpp -B 2 | head -30

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

# Check the entire QNNRuntime class declaration to see if there are any docstrings at all
sed -n '41,100p' mllm/backends/qnn/QNNBackend.hpp

Repository: UbiquitousLearning/mllm

Length of output: 2621


🏁 Script executed:

# Look at the implementation to see if docstrings are in the .cpp file instead
sed -n '325,340p' mllm/backends/qnn/QNNBackend.cpp

Repository: UbiquitousLearning/mllm

Length of output: 795


🏁 Script executed:

# Check how many methods in QNNRuntime lack documentation
grep -n "bool\|void" mllm/backends/qnn/QNNBackend.hpp | grep -A 1 "QNNRuntime" -m 20

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

# Check if these methods are newly added or modified (look at git context if available)
git log --oneline -1 mllm/backends/qnn/QNNBackend.hpp 2>/dev/null || echo "Git history not available"

Repository: UbiquitousLearning/mllm

Length of output: 125


🏁 Script executed:

# Verify the coding guideline compliance by checking if similar public methods in this file have docstrings
sed -n '85,130p' mllm/backends/qnn/QNNBackend.hpp

Repository: UbiquitousLearning/mllm

Length of output: 2112


Add docstrings to public API methods.

The retrieveContext and createContext methods (lines 52-54) lack docstrings. Per coding guidelines, public APIs must have clear docstrings explaining purpose, parameters, returns, and errors. The signature change itself is correct and the caller at QNNBackend.cpp:429 has been properly updated.

🤖 Prompt for AI Agents
In `@mllm/backends/qnn/QNNBackend.hpp` around lines 53-54: add clear
docstrings for the public methods retrieveContext and createContext. For each
method, describe its purpose, list and explain every parameter (e.g.,
contextBinaryPath, context, qnnModels, and contextConfig for retrieveContext,
and the relevant parameters for createContext), state the return-value
semantics (what true/false indicates), and document failure conditions and
side effects (e.g., ownership of pointers, when contextConfig may be
modified, thread-safety expectations). Place these comments immediately above
the function declarations so they become part of the public API docs.
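
For concreteness, a sketch of what such a docstring could look like for retrieveContext, using the signature quoted above (the parameter semantics are inferred from context and would need checking against the implementation):

/// Restores a QNN context from a serialized context binary on disk and
/// rebuilds the graphs it contains.
///
/// @param contextBinaryPath Path to a previously saved context binary.
/// @param context           Out-parameter receiving the restored context handle.
/// @param qnnModels         Out-parameter populated with one QNNModel per graph
///                          retrieved from the binary.
/// @param contextConfig     Optional context configuration; may be nullptr.
/// @return true on success; false if the file cannot be read or a QNN API
///         call fails.
bool retrieveContext(const std::string& contextBinaryPath, Qnn_ContextHandle_t& context,
                     std::vector<std::shared_ptr<QNNModel>>& qnnModels,
                     QnnContext_Config_t** contextConfig = nullptr);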

@chenghuaWang chenghuaWang changed the title Qnn aot runner feat(qualcomm): Qnn aot runner Jan 15, 2026
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@mllm/backends/qnn/aot/QnnWrappersAPI.cpp`:
- Around lines 483-515: in QnnAOTEnv::saveContext, treat any binary size
  mismatch as a fatal error. After calling
  qnn_htp_func_symbols_.qnn_interface_.contextGetBinary and obtaining
  binarySize and writtenSize, check whether writtenSize != binarySize and, if
  so, log an error with both sizes via MLLM_ERROR and return before opening
  or writing the file; the file write and success message should only happen
  when writtenSize == binarySize.
🧹 Nitpick comments (4)
mllm/backends/qnn/aot/visitor/RMSNorm.cpp (3)

46-49: Improve comment clarity and professionalism.

The comment "nn module seems to be inconsistent with document (AMAZING!)" is informal. Consider providing a more descriptive explanation of why this fake bias is necessary and reference any relevant documentation or issue tracker.

📝 Suggested improvement
-  // fake bias, nn module seems to be inconsistent with document (AMAZING!)
+  // QNN RmsNorm op requires a bias input per API, though the weight-only variant
+  // is documented. Provide a zero-filled bias tensor to satisfy the interface.

51-54: Document or use named constants for quantization parameters.

The QuantizationSpecSymPerTensor::create(0, 0, kInt32, kFloat32, ...) uses magic numbers without explanation. Consider using named constants or adding a brief comment explaining what the first two 0 parameters represent (e.g., zero point, scale offset, etc.) for maintainability.

📝 Suggested improvement
   // fake bias quant recipe
-  auto quant_spec = mllm::ir::linalg::QuantizationSpecSymPerTensor::create(0, 0, kInt32, kFloat32, Tensor::ones({1}));
+  // Symmetric per-tensor quantization: zero_point=0, encoding=0 (unused for bias)
+  constexpr int kBiasZeroPoint = 0;
+  constexpr int kBiasEncoding = 0;
+  auto quant_spec = mllm::ir::linalg::QuantizationSpecSymPerTensor::create(
+      kBiasZeroPoint, kBiasEncoding, kInt32, kFloat32, Tensor::ones({1}));

17-79: Consider adding unit tests for the fake bias handling.

The rewrite function now includes non-trivial logic for creating fake bias tensors and quantization recipes. Given the complexity and the workaround nature of this code (as indicated by the comment), unit tests covering edge cases (e.g., different weight shapes, dtypes) would help prevent regressions.

Would you like me to help outline test cases for this pattern, or open an issue to track adding test coverage?
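
For example, a possible shape for such tests, assuming a GoogleTest setup and leaving the IR wiring as TODOs (the rewrite entry point and builder API are not part of this diff):

#include <gtest/gtest.h>

// Sketch of cases worth covering for the fake-bias path in the RMSNorm
// rewrite; fixture wiring is deliberately left open.
TEST(RMSNormRewriteSketch, AddsZeroBiasForWeightOnlyNorm) {
  // TODO: build an RMSNorm IR op with a weight input only, run the rewrite,
  // then assert a third (bias) input exists and is zero-filled.
}

TEST(RMSNormRewriteSketch, BiasFollowsWeightShapeAndDtype) {
  // TODO: vary weight shapes and dtypes; assert the generated bias and its
  // quantization recipe stay consistent.
}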

examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp (1)

28-38: Minor: redundant reassignment in else branch.

Lines 36-37 reassign scale_name and zp_name to the same values already set at lines 29-30. Either drop the else branch or, as in the diff below, declare the variables once and assign them in each branch.

Suggested simplification
 Tensor QDQ(nn::Module* m, Tensor in, const std::string& qdq_name_in_pytorch) {
-  std::string scale_name = m->getModuleName() + "." + qdq_name_in_pytorch + ".fake_quant.scale";
-  std::string zp_name = m->getModuleName() + "." + qdq_name_in_pytorch + ".fake_quant.zero_point";
-
+  std::string scale_name;
+  std::string zp_name;
   if (m->getModuleName().empty()) {
     scale_name = qdq_name_in_pytorch + ".fake_quant.scale";
     zp_name = qdq_name_in_pytorch + ".fake_quant.zero_point";
   } else {
     scale_name = m->getModuleName() + "." + qdq_name_in_pytorch + ".fake_quant.scale";
     zp_name = m->getModuleName() + "." + qdq_name_in_pytorch + ".fake_quant.zero_point";
   }
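
An even shorter variant computes the optional prefix once (a sketch; it assumes getModuleName() returns an empty string for the root module, as the branch above implies):

  const std::string prefix = m->getModuleName().empty() ? "" : m->getModuleName() + ".";
  const std::string scale_name = prefix + qdq_name_in_pytorch + ".fake_quant.scale";
  const std::string zp_name = prefix + qdq_name_in_pytorch + ".fake_quant.zero_point";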
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fdb27ea and 34895a5.

📒 Files selected for processing (4)
  • examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp
  • mllm/backends/qnn/QNNUtils.cpp
  • mllm/backends/qnn/aot/QnnWrappersAPI.cpp
  • mllm/backends/qnn/aot/visitor/RMSNorm.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
  • mllm/backends/qnn/QNNUtils.cpp
🧰 Additional context used
📓 Path-based instructions (4)
{mllm,mllm-cli,pymllm}/**/*

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*: Files must not contain C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, or DEL 0x7F. Horizontal tab (0x09) and line feed (0x0A) are explicitly allowed.
All files must be encoded in UTF-8 without BOM.
Any violation of character set (Rule 1) or encoding (Rule 2) requirements must cause the review to fail.
No line may end with trailing whitespace.
Use Unix line endings (LF).
File and directory names must consist only of printable Unicode characters, excluding C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, and DEL 0x7F.
Only use acceptable file extensions: .c, .cc, .cpp, .cxx, .h, .hh, .hpp, .py, .pyi, .sh, .txt, .md, .yml, .yaml, .json, .toml.
Optional license headers, if present, must comply with character set rules (no C0/C1 control codes except tab and line feed).

Files:

  • mllm/backends/qnn/aot/visitor/RMSNorm.cpp
  • mllm/backends/qnn/aot/QnnWrappersAPI.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}: TODO and FIXME comments must be written as 'TODO:' or 'FIXME:' followed by UTF-8 text that adheres to character set rules.
Encourage consistent coding style and patterns with the existing codebase.
Ensure code is portable across supported platforms (e.g., Linux, Windows) unless explicitly platform-specific.

Files:

  • mllm/backends/qnn/aot/visitor/RMSNorm.cpp
  • mllm/backends/qnn/aot/QnnWrappersAPI.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}: Prioritize production-ready code quality by evaluating time and space complexity of algorithms and data structures, and suggest more efficient alternatives for operations with high complexity (e.g., O(n^2) or worse) when feasible.
Avoid unnecessary object creation in loops or hot paths.
Check for proper error handling and resource cleanup (e.g., using try-finally, context managers, or RAII).
Ensure functions that can fail return appropriate error codes or raise exceptions.
Validate inputs for public APIs and critical internal functions.
Add comments for complex algorithms or non-obvious logic.
Identify potential security issues (e.g., buffer overflows, injection risks, insecure temporary files) and recommend using secure alternatives (e.g., parameterized queries, secure random generators).
Suggest adding unit tests for untested complex logic or edge cases.
Ensure code is testable by avoiding global state and using dependency injection.
Flag overly complex functions (e.g., high cyclomatic complexity) and suggest breaking them down.
Use named constants instead of magic numbers.
Add appropriate logging (e.g., debug, info, warning, error) for significant events and errors, avoiding sensitive data exposure.

Files:

  • mllm/backends/qnn/aot/visitor/RMSNorm.cpp
  • mllm/backends/qnn/aot/QnnWrappersAPI.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}: Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.
Adhere to language-specific best practices and idioms (e.g., PEP 8 for Python, Google C++ Style Guide for C++).

Files:

  • mllm/backends/qnn/aot/visitor/RMSNorm.cpp
  • mllm/backends/qnn/aot/QnnWrappersAPI.cpp
🧬 Code graph analysis (1)
mllm/backends/qnn/aot/visitor/RMSNorm.cpp (2)
mllm/backends/qnn/aot/visitor/RMSNorm.hpp (1)
  • writer (16-16)
mllm/backends/qnn/aot/QnnWrappersAPI.hpp (3)
  • qnn_context_name (219-219)
  • qnn_context_name (221-222)
  • qnn_context_name (224-225)
🪛 Clang (14.0.6)
mllm/backends/qnn/aot/visitor/RMSNorm.cpp

[error] 4-4: 'mllm/core/DataTypes.hpp' file not found

(clang-diagnostic-error)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-macos
  • GitHub Check: build-android
  • GitHub Check: build-x86
🔇 Additional comments (3)
mllm/backends/qnn/aot/visitor/RMSNorm.cpp (2)

70-74: LGTM on input wiring for bias tensor.

The bias_node is correctly wired as the third input with force_static_weight = true, consistent with the existing weight input handling. The output attachment follows the established pattern.


4-5: No issues found. Both mllm/core/DataTypes.hpp and mllm/core/Tensor.hpp exist in the codebase and are correctly included at the specified lines.

Likely an incorrect or invalid review comment.

examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp (1)

244-256: Precision upgrade from kFloat16 to kFloat32 for KV de-quantization.

The intermediate representation for key_states and value_states before re-quantization to kUInt8PerTensorSym now uses kFloat32 instead of kFloat16. This reduces quantization error accumulation at the cost of temporarily increased memory footprint.

Please confirm this change is intentional and verify whether the precision improvement outweighs any performance overhead on target QNN hardware.
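
To see why the intermediate precision matters, here is a self-contained toy (not mllm code; the scale is made up) that takes off-grid values, simulates an fp16-like intermediate by rounding to a 10-bit mantissa, re-quantizes to symmetric integer codes, and counts how many codes shift relative to the fp32 path:

#include <cmath>
#include <cstdio>

int main() {
  const float scale = 0.0123f;  // assumed per-tensor scale, illustration only
  int flips = 0, total = 0;
  for (int k = 1; k <= 1500; ++k) {
    float x = k * 0.001f;  // stand-in for a computed (off-grid) activation
    // Round x to fp16's mantissa granularity at its magnitude.
    int exp;
    std::frexp(x, &exp);
    float step = std::ldexp(1.0f, exp - 11);  // ~2^-10 relative step
    float x16 = std::round(x / step) * step;
    int q32 = (int)std::lround(x / scale);   // fp32 intermediate path
    int q16 = (int)std::lround(x16 / scale); // fp16-like intermediate path
    flips += (q32 != q16);
    ++total;
  }
  std::printf("re-quantized codes changed by fp16-like rounding: %d / %d\n", flips, total);
}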

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines 483 to 515
void QnnAOTEnv::saveContext(const std::string& name, const std::string& path) {
  // TODO
  if (contexts_.find(name) == contexts_.end()) {
    MLLM_ERROR("QnnAOTEnv::saveContext Context {} not found", name);
    return;
  }
  auto context = contexts_[name];

  uint64_t binarySize = 0;
  uint64_t writtenSize = 0;

  auto status = qnn_htp_func_symbols_.qnn_interface_.contextGetBinarySize(context->qnn_ctx_handle_, &binarySize);
  MLLM_RT_ASSERT_EQ(status, QNN_SUCCESS);

  std::vector<uint8_t> binaryBuffer(binarySize);

  status = qnn_htp_func_symbols_.qnn_interface_.contextGetBinary(
      context->qnn_ctx_handle_, reinterpret_cast<void*>(binaryBuffer.data()), binarySize, &writtenSize);
  MLLM_RT_ASSERT_EQ(status, QNN_SUCCESS);

  if (binarySize < writtenSize) {
    MLLM_ERROR("QNN context binary size mismatch: expected {} bytes, but wrote {} bytes.", binarySize, writtenSize);
  }

  std::ofstream file(path, std::ios::binary);
  if (!file.is_open()) {
    MLLM_ERROR("Failed to open file {} for writing QNN context.", path);
    return;
  }
  file.write(reinterpret_cast<char*>(binaryBuffer.data()), writtenSize);
  file.close();

  MLLM_INFO("QNN context {} saved to {} written {}", name, path, writtenSize);
}

⚠️ Potential issue | 🟠 Major

Treat any context size mismatch as an error before writing.

writtenSize can be less than binarySize, which would save an incomplete/invalid context but still log success. Treat any mismatch as an error and abort the write.

💡 Suggested fix
-  if (binarySize < writtenSize) {
-    MLLM_ERROR("QNN context binary size mismatch: expected {} bytes, but wrote {} bytes.", binarySize, writtenSize);
-  }
+  if (writtenSize != binarySize) {
+    MLLM_ERROR("QNN context binary size mismatch: expected {} bytes, but wrote {} bytes.", binarySize, writtenSize);
+    return;
+  }
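
If a hard failure is preferred over silently returning, the MLLM_RT_ASSERT_EQ pattern already used for the QNN status codes in this function could enforce the invariant instead (whether aborting is acceptable here depends on the caller):

  MLLM_RT_ASSERT_EQ(writtenSize, binarySize);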
🤖 Prompt for AI Agents
In `@mllm/backends/qnn/aot/QnnWrappersAPI.cpp` around lines 483-515: in
QnnAOTEnv::saveContext, treat any binary size mismatch as a fatal error.
After calling qnn_htp_func_symbols_.qnn_interface_.contextGetBinary and
obtaining binarySize and writtenSize, check whether writtenSize != binarySize
and, if so, log an error with both sizes via MLLM_ERROR and return before
opening or writing the file; ensure the file write and success message only
happen when writtenSize == binarySize.

@chenghuaWang chenghuaWang merged commit def35d4 into UbiquitousLearning:main Jan 17, 2026
4 checks passed