feat(qualcomm): Qnn aot runner #603
… and value states
…alization
- Add loadContext and createContext methods to QNNBackend for flexible context management
- Modify initQnnBackend to accept optional context path parameter
- Update QNNRuntime::retrieveContext to accept context binary path as parameter
- Remove hardcoded context creation logic from constructor
- Add filesystem include for context file existence checking
- Refactor tensor wrapper allocation logic to use setDataContainer
- Remove debug logging and verification code
- Simplify input tensor handling during graph execution
- Add MLLM_BUILD_QNN_BACKEND condition to enable QNN AOT example
- Create new aot_run.cpp with Qwen3 AOT runner implementation
- Implement KVCacheManager for managing KV cache operations
- Add PromptProcessor for handling prompt prefill operations
- Create QnnAOTModule as base module for AOT runtime
- Implement QnnAOTRuntime with complete generation pipeline
- Update CMakeLists.txt to include new AOT runtime sources
📝 Walkthrough

Adds a Qualcomm QNN AOT runtime for Qwen3: KV-cache manager, prompt processor, token generator, AOT module, and Runner; exposes context create/load APIs in the QNN backend; updates the build and initQnnBackend to accept a context path; adds an example runner executable and related CMake changes.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
  participant Client
  participant Runner
  participant KVCacheMgr as KVCacheManager<T>
  participant PromptProc as PromptProcessor<T>
  participant TokenGen as TokenGenerator<T>
  participant QNNBackend
  Client->>Runner: load()
  Runner->>QNNBackend: get backend / allocator
  Runner->>KVCacheMgr: initCache(allocator, ar_len)
  Runner->>PromptProc: construct + init_io()
  Runner->>TokenGen: construct + init_io()
  Client->>Runner: generate(prompt_tokens, seq_len, callback)
  Runner->>PromptProc: prefill(prompt_tokens, start=0)
  PromptProc->>KVCacheMgr: rearrangeCache / initAttentionMask
  PromptProc->>PromptProc: module forward (QnnAOTModule)
  PromptProc->>KVCacheMgr: updateCache(...)
  PromptProc-->>Runner: next_token
  loop autoregressive
    Runner->>TokenGen: generate(tokens, pos, seq_len, callback)
    TokenGen->>KVCacheMgr: prepare_io(...) / initAttentionMask
    TokenGen->>TokenGen: module forward
    TokenGen->>KVCacheMgr: updateCache(...)
    TokenGen-->>Client: token_callback(token_str)
  end
```
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
mllm/backends/qnn/QNNBackend.cpp (3)
333-343: Missing file open check before reading.

The code opens the file and immediately calls `tellg()` without verifying the file opened successfully. If the file doesn't exist or can't be read, `tellg()` returns -1, and subsequent operations will fail or exhibit undefined behavior.

🐛 Proposed fix
```diff
 bool QNNRuntime::retrieveContext(const std::string& contextBinaryPath, Qnn_ContextHandle_t& context,
                                  std::vector<std::shared_ptr<QNNModel>>& qnnModels,
                                  QnnContext_Config_t** contextConfig) {
   // Read the binary from qnn_context.bin and get the size in byte
   std::ifstream file(contextBinaryPath, std::ios::binary | std::ios::ate);
+  if (!file.is_open()) {
+    MLLM_ERROR("Failed to open context binary file: {}", contextBinaryPath);
+    return false;
+  }
   std::streamsize size = file.tellg();
+  if (size <= 0) {
+    MLLM_ERROR("Context binary file is empty or unreadable: {}", contextBinaryPath);
+    return false;
+  }
   file.seekg(0, std::ios::beg);
```
371-374: Unused variable `writtenSize`.

The variable `writtenSize` is declared but never used. Either remove it or use it to validate the context creation result.

🐛 Proposed fix
```diff
   // Create context from binary
-  Qnn_ContextBinarySize_t writtenSize = 0;
-  qnnInterface.contextCreateFromBinary(backendHandle, deviceHandle, (const QnnContext_Config_t**)contextConfig,
-                                       binaryBuffer.get(), size, &context, profileHandle);
+  if (QNN_CONTEXT_NO_ERROR != qnnInterface.contextCreateFromBinary(
+          backendHandle, deviceHandle, (const QnnContext_Config_t**)contextConfig,
+          binaryBuffer.get(), size, &context, profileHandle)) {
+    MLLM_ERROR("Failed to create context from binary: {}", contextBinaryPath);
+    return false;
+  }
```
417-417: Log message contains hardcoded filename instead of actual path.

The log message says "qnn_context.bin" but should reference the actual `contextBinaryPath` parameter for accurate logging.

🐛 Proposed fix
```diff
-  MLLM_INFO("QNN context retrieved from qnn_context.bin with {} QNNModels", graphNum);
+  MLLM_INFO("QNN context retrieved from {} with {} QNNModels", contextBinaryPath, graphNum);
```
🤖 Fix all issues with AI agents
In `@examples/qwen3_qnn_aot/aot_run.cpp`:
- Around line 22-27: Move the help flag check (help.isSet() /
Argparse::printHelp()) to before any backend initialization so the app exits
without calling mllm::initQnnBackend when the user only requested help; also fix
the initQnnBackend call to pass the QNN context binary (or omit the argument to
use the default "qnn_context.bin") instead of model_path.get() (e.g., replace
model_path.get() with the correct context path or no argument in the
mllm::initQnnBackend(...) call).
In `@mllm/backends/qnn/aot_rt/KVCacheManager.cpp`:
- Around line 35-38: After calling allocator->alloc(k_storage_in) ensure you
check that k_storage_in->ptr_ is non-null before calling memset; if allocation
failed, log or return an error (or throw) and avoid dereferencing the pointer.
Apply the same null-check pattern for every allocator->alloc(...) use in this
file (e.g., any other Storage instances created in KVCacheManager methods) so
you never call memset or access ptr_ when allocation failed.
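A minimal sketch of that null-check pattern; the `Storage` struct and `allocZeroed` helper below are illustrative stand-ins, not the project's actual allocator API:

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for the storage buffers KVCacheManager allocates.
struct Storage {
  void* ptr_ = nullptr;
  size_t size_ = 0;
};

// Allocate and zero a buffer, checking the pointer before memset instead of
// dereferencing a potentially null allocation result.
bool allocZeroed(Storage& s, size_t bytes) {
  s.ptr_ = std::malloc(bytes);
  if (s.ptr_ == nullptr) {
    return false;  // real code would log the failure and propagate an error
  }
  s.size_ = bytes;
  std::memset(s.ptr_, 0, bytes);
  return true;
}
```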
- Around line 270-273: The loop building true_indices can leave elements
uninitialized if selected has fewer than n_update true values; change the logic
to first count actual trues (or use push_back) and either resize true_indices to
that actual count or validate that actual_count == n_update and handle the
mismatch (log/throw/return error) before using true_indices; update the code
that consumes true_indices (same scope where true_indices, selected, and
n_update are used) to rely on the validated/adjusted size to avoid reading
default zeros.
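A sketch of the push_back-based variant of this fix (plain std types, illustrative rather than the project's actual code): the result's size always equals the real number of trues, so a mismatch against the expected `n_update` becomes detectable instead of reading default zeros.

```cpp
#include <cstdint>
#include <vector>

// Collect indices of true entries; the returned vector's size is exactly the
// number of trues, which the caller can validate against n_update.
std::vector<int32_t> collectTrueIndices(const std::vector<bool>& selected) {
  std::vector<int32_t> true_indices;
  for (int32_t i = 0; i < static_cast<int32_t>(selected.size()); ++i) {
    if (selected[i]) { true_indices.push_back(i); }
  }
  return true_indices;
}
```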
In `@mllm/backends/qnn/aot_rt/PromptProcessor.cpp`:
- Around line 145-146: The call to kv_manager_->initAttentionMask is using the
wrong tensor index (input_tensors_[3]) which points at a KV cache; change it to
use the attention-mask tensor created in init_io() (input_tensors_[2]) or,
better, replace the hardcoded index with the dedicated attn_mask member if one
exists; update the call to pass the correct ptr (e.g.,
input_tensors_[2].ptr<uint16_t>() or the attn_mask member) while keeping the
same arguments (attention_map, config_.ar_len, start_pos) so initAttentionMask
receives the actual attention mask instead of a past_key buffer.
- Around line 162-163: The attention-mask update call in PromptProcessor.cpp is
using the wrong tensor index: change the call to
kv_manager_->updateAttentionMask(...) to use input_tensors_[2].ptr<uint16_t>()
instead of input_tensors_[3].ptr<uint16_t>(); locate the updateAttentionMask
invocation in PromptProcessor.cpp (the kv_manager_->updateAttentionMask(...)
line) and replace the index so the attention mask source uses input_tensors_[2],
leaving other arguments (config_.ar_len, current_pos, n_update,
config_.sliding_window) unchanged.
In `@mllm/backends/qnn/aot_rt/QnnAOTModule.hpp`:
- Around line 26-33: The private member backend_path_ in class QnnAOTModule is
unused and should be removed; edit QnnAOTModule.hpp to delete the backend_path_
declaration from the private section (and remove any related comments or TODOs
referring to backend_path_), then rebuild/compile to ensure no missing
references — if any references exist elsewhere (e.g., in QnnAOTModule methods or
tests), remove or refactor those usages accordingly.
In `@mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp`:
- Around line 82-97: The call to tokenizer_->detokenize(next_token) in
Runner::generate can dereference a null tokenizer_; add a null check before
using tokenizer_ (e.g., if (token_callback) { if (tokenizer_) { wstr =
tokenizer_->detokenize(next_token); } else { wstr = L""; } ... }) and ensure you
still invoke token_callback with a sensible fallback (empty string or numeric
token string) when tokenizer_ is null; update the code paths around tokenizer_,
detokenize, and token_callback accordingly so no null access occurs.
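The guarded callback path can be sketched as follows; `Detokenizer` and `emitToken` are hypothetical stand-ins for `tokenizer_->detokenize` and the `token_callback` plumbing in `Runner::generate`, not the project's actual signatures:

```cpp
#include <functional>
#include <string>

// Illustrative stand-in for the tokenizer.
struct Detokenizer {
  std::string detokenize(long token) const { return "<tok:" + std::to_string(token) + ">"; }
};

// Never dereference a null tokenizer; fall back to the numeric token id so
// the callback still fires with something sensible.
void emitToken(long next_token, const Detokenizer* tokenizer,
               const std::function<void(const std::string&)>& token_callback) {
  if (!token_callback) { return; }
  if (tokenizer != nullptr) {
    token_callback(tokenizer->detokenize(next_token));
  } else {
    token_callback(std::to_string(next_token));
  }
}
```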
In `@mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp`:
- Around line 1-12: The header QnnAOTRuntime.hpp is missing an include guard;
add a header guard or a single-line `#pragma once` at the very top of
QnnAOTRuntime.hpp (before the existing includes) and, if using the traditional
guard, wrap the entire file with a unique macro (e.g.,
QNN_AOT_RT_QNNAOTRUNTIME_HPP) and close it with an `#endif` (with a comment) at
the end of the file to prevent multiple inclusion.
In `@mllm/backends/qnn/aot_rt/TokenGenerator.cpp`:
- Around line 107-108: The generate method currently ignores the dump_logits
parameter and never fills token_all_logits_, so either implement logits
collection when dump_logits is true or remove the unused API; to fix, update
TokenGenerator<T>::generate to, when dump_logits is true, capture per-step/logit
outputs from the model/inference path and append them into the member
token_all_logits_ (ensure it is cleared at start of generation), and also expose
those logits via get_all_logits(); alternatively, if logits dumping is not
needed, remove the dump_logits parameter from generate and delete
token_all_logits_ and get_all_logits() to avoid dead code — locate
TokenGenerator<T>::generate, the token_all_logits_ member and get_all_logits()
to apply the change.
- Around line 22-24: The input_ids tensor is created with Tensor::empty(...,
kInt64, ...) but prepare_io reads/writes it as int32_t*, causing a dtype
mismatch; update the creation or the access so both use the same type—preferably
change Tensor::empty call that constructs input_ids to use kInt32 to match
PromptProcessor.cpp (and adjust any Tensor::empty calls creating input_ids) OR
if you must keep 64-bit, change prepare_io to use int64_t* when accessing
input_ids; ensure the symbol names input_ids, Tensor::empty and prepare_io are
updated consistently so allocation dtype and pointer casts match.
In `@mllm/backends/qnn/QNNBackend.cpp`:
- Around line 552-553: The wrapper reuse causes stale inputs because
wrapper->isAlloc() prevents updating dataContainer_; change the flow so the
wrapper's data container is refreshed each graphExecute: either always call
wrapper->__setDataContainer(runtime_input) before allocation/usage (i.e., do not
gate it on wrapper->isAlloc()), or add/invoke a reset method that clears
isAlloc_ and dataContainer_ on reuse (implement resetWrapper/resetAllocation
that zeroes dataContainer_ and sets isAlloc_ = false and call it at start of
graphExecute). Ensure __setDataContainer(), isAlloc_ and dataContainer_ are the
referenced symbols updated so allocation uses the current runtime_input.
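A sketch of the reset-on-reuse shape; the member names mirror the review's symbols, but the types are illustrative stand-ins (a string stands in for the wrapped tensor payload):

```cpp
#include <string>

struct TensorWrapper {
  std::string dataContainer_;  // stands in for the wrapped Tensor
  bool isAlloc_ = false;

  void setDataContainer(const std::string& t) {
    dataContainer_ = t;
    isAlloc_ = !t.empty();
  }
  // Clears the stale container so the next execution re-binds fresh input.
  void reset() {
    dataContainer_.clear();
    isAlloc_ = false;
  }
};

// graphExecute analog: always reset, then bind the current runtime input.
std::string execute(TensorWrapper& w, const std::string& runtime_input) {
  w.reset();
  w.setDataContainer(runtime_input);
  return w.dataContainer_;
}
```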
In `@mllm/backends/qnn/QNNBackend.hpp`:
- Around line 53-54: Add clear docstrings for the public methods retrieveContext
and createContext in QNNBackend.hpp: for each method describe its purpose, list
and explain every parameter (e.g., contextBinaryPath, context, qnnModels,
contextConfig for retrieveContext and relevant params for createContext), state
the return value semantics (what a true/false or success/failure indicates) and
document possible errors/exceptions or failure conditions and side effects
(e.g., ownership of pointers, when contextConfig may be modified, thread-safety
expectations). Place these comments immediately above the function declarations
so they become part of the public API docs.
🧹 Nitpick comments (16)
mllm/backends/qnn/QNNBackend.hpp (1)
90-92: Add documentation for new public APIs.

Per coding guidelines, public APIs should have clear docstrings or comments explaining purpose, parameters, returns, and errors. Consider adding brief documentation:
📝 Suggested documentation
```diff
+  /**
+   * @brief Loads a QNN context from a binary file.
+   * @param contextPath Path to the context binary file.
+   * @return true if context was loaded successfully, false otherwise.
+   */
   bool loadContext(const std::string& contextPath);
+
+  /**
+   * @brief Creates a new QNN context.
+   * @return true if context was created successfully, false otherwise.
+   */
   bool createContext();
```

mllm/backends/qnn/QNNUtils.hpp (1)
208-213: Consider renaming `__setDataContainer` to avoid the reserved naming convention.

The double underscore prefix (`__`) is typically reserved for compiler/implementation use in C++. While this works, a single underscore suffix or a different naming pattern would be more idiomatic.

The implementation logic is correct: the assertion guards against overwriting an existing container, and `isAlloc_` is properly updated only for non-nil tensors.

♻️ Suggested rename
```diff
-  void __setDataContainer(const Tensor& tensor) {
+  void setDataContainerUnchecked(const Tensor& tensor) {
     MLLM_RT_ASSERT(dataContainer_.isNil())
     dataContainer_ = tensor;
     if (!tensor.isNil()) { isAlloc_ = true; }
   }
```

examples/qwen3_qnn_aot/aot_run.cpp (3)
38-39: Consider making `context_len` configurable or using a named constant.

The hardcoded value `1024` for `context_len` is a magic number. Consider either:

- Adding a CLI argument for it (like `--ar_len`)
- Using a named constant with documentation
55-58: Type conversion from `int64_t` to `uint64_t` may lose sign information.

While token IDs are typically non-negative, the explicit cast from `int64_t*` to `uint64_t` could mask unexpected negative values. Consider adding a debug assertion or using `static_cast` with a comment explaining the assumption.

♻️ Suggested improvement
```diff
   std::vector<uint64_t> prompt_tokens;
   auto sequence = input_tensor["sequence"];
   int64_t* ptr = sequence.ptr<int64_t>();
-  for (int i = 0; i < sequence.shape()[1]; ++i) { prompt_tokens.push_back((uint64_t)ptr[i]); }
+  for (int i = 0; i < sequence.shape()[1]; ++i) {
+    MLLM_RT_ASSERT(ptr[i] >= 0);  // Token IDs should be non-negative
+    prompt_tokens.push_back(static_cast<uint64_t>(ptr[i]));
+  }
```
43-48: Add validation for empty prompt input.

If the user enters an empty string or just whitespace, the code proceeds without validation. Consider adding a check to handle this gracefully, especially since `convertMessage` behavior with empty input may be undefined.

🔧 Suggested validation
```diff
   std::string prompt_text;
   fmt::print("💬 Prompt text (or 'exit/quit'): ");
   std::getline(std::cin, prompt_text);
+  if (prompt_text.empty() || prompt_text == "exit" || prompt_text == "quit") {
+    return 0;
+  }
+
   auto input_tensor = tokenizer.convertMessage({.prompt = prompt_text});
```

mllm/backends/qnn/aot_rt/QnnAOTModule.hpp (1)
12-23: Class design looks reasonable; add documentation for the public API.

The dual inheritance pattern from `mllm::nn::Module` and `models::ARGeneration` provides flexibility for both module-based and autoregressive generation interfaces. The NYI stub for the ARGeneration forward is appropriate.

Per coding guidelines, consider adding class-level and method documentation explaining the purpose and usage of `QnnAOTModule`.

mllm/backends/qnn/aot_rt/QnnAOTModule.cpp (1)
12-15: Mark unused parameters to suppress compiler warnings.

The `inputs` and `args` parameters are unused. This is likely intentional since `output_tensors_` is populated via `setOutputTensors()` before `forward()` is called, but the unused parameters may trigger compiler warnings.

♻️ Suggested fix
```diff
 std::vector<mllm::Tensor> QnnAOTModule::forward(const std::vector<mllm::Tensor>& inputs,
                                                 const std::vector<mllm::AnyValue>& args) {
+  (void)inputs;
+  (void)args;
   return output_tensors_;
 }
```

Alternatively, use the `[[maybe_unused]]` attribute in the header declaration.

examples/qwen3_qnn_aot/CMakeLists.txt (1)
8-10: Consider adding a build guard for the runner target.

The `mllm-qwen3-aot-runner` target links against `MllmQNNBackend`, but is defined unconditionally. According to the AI summary, the parent `examples/CMakeLists.txt` guards the entire subdirectory inclusion with `MLLM_BUILD_QNN_BACKEND OR MLLM_QUALCOMM_QNN_AOT_ON_X86_ENABLE`, so this should be safe. However, for clarity and consistency with the compile target, consider adding an explicit guard or a comment explaining the build conditions.

mllm/backends/qnn/aot_rt/PromptProcessor.hpp (1)
18-29: Consider unifying integer types in the Config struct.

The Config struct mixes `int32_t` (`context_len`, `ar_len`, `vocab_size`, `head_dim`, `sliding_window`) with `int64_t` (`num_heads`, `num_layers`). Unless there's a specific reason for this (e.g., matching QNN API requirements), consider using consistent types to avoid implicit conversions.

mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp (1)
16-25: Consider documenting model-specific defaults.

The `RunnerConfig` defaults (e.g., `vocab_size = 151936`, `num_layers = 28`, `num_heads = 12`) appear to be Qwen3-specific. Consider adding a comment noting these are Qwen3 defaults, or extracting them to a model-specific configuration to improve reusability.

mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp (1)
67-70: Hardcoded EOS token IDs should be configurable.

The EOS token IDs `151643` and `151645` are hardcoded with a TODO comment. These are Qwen3-specific values. Consider adding them to `RunnerConfig` to improve configurability and reusability across different models.

Do you want me to propose a modification to add EOS IDs to `RunnerConfig`?

mllm/backends/qnn/aot_rt/PromptProcessor.cpp (1)
1-3: Minor: leading blank line.

Line 1 is blank before the copyright header. This is a minor stylistic inconsistency.
mllm/backends/qnn/aot_rt/TokenGenerator.hpp (1)
42-47: Consider documenting ownership semantics for raw pointer members.

The class uses raw pointers for `tokenizer_` and `kv_manager_` (non-owning) and `unique_ptr` for `module_` and `eos_ids_` (owning). This is a valid design for dependency injection, but adding brief comments clarifying that the caller retains ownership of `tokenizer_` and `kv_manager_` would improve maintainability.

mllm/backends/qnn/aot_rt/KVCacheManager.hpp (2)
52-53: Add the `[[nodiscard]]` attribute to accessor methods for consistency.

The static analysis correctly notes that `getKCache()` and `getVCache()` should be marked `[[nodiscard]]` for consistency with `getTotalCacheSizeInBytes()` (line 54). Discarding the return value of these accessors is almost certainly a bug.

♻️ Proposed fix
```diff
-  const std::vector<KVCache<T>>& getKCache() const { return k_cache_; }
-  const std::vector<KVCache<T>>& getVCache() const { return v_cache_; }
+  [[nodiscard]] const std::vector<KVCache<T>>& getKCache() const { return k_cache_; }
+  [[nodiscard]] const std::vector<KVCache<T>>& getVCache() const { return v_cache_; }
```
22-29: Consider consistent integer types in KVCacheConfig.

The struct mixes `int32_t` and `int64_t` types for similar dimension values. While not incorrect, this inconsistency may lead to implicit conversions. If there's no external API requirement, consider using a consistent type.

mllm/backends/qnn/aot_rt/KVCacheManager.cpp (1)
70-73: Using `exit(1)` for error handling is non-recoverable and not production-friendly.

Multiple error conditions use `exit(1)`, which terminates the entire process without cleanup. Consider throwing exceptions or returning error codes to allow callers to handle errors gracefully.

♻️ Proposed alternative using exceptions
```diff
   if (attention_map.size() > ar_len) {
-    MLLM_ERROR("The size of attention_map ({}) doesn't match with ar_len ({})", attention_map.size(), ar_len);
-    exit(1);
+    throw std::invalid_argument(
+        fmt::format("The size of attention_map ({}) doesn't match with ar_len ({})", attention_map.size(), ar_len));
   }
```

Also applies to: 106-109, 241-245
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (26)
- examples/CMakeLists.txt
- examples/qwen3_qnn_aot/CMakeLists.txt
- examples/qwen3_qnn_aot/aot_run.cpp
- mllm/backends/qnn/CMakeLists.txt
- mllm/backends/qnn/QNNBackend.cpp
- mllm/backends/qnn/QNNBackend.hpp
- mllm/backends/qnn/QNNUtils.cpp
- mllm/backends/qnn/QNNUtils.hpp
- mllm/backends/qnn/Register.cpp
- mllm/backends/qnn/aot_rt/KVCacheManager.cpp
- mllm/backends/qnn/aot_rt/KVCacheManager.hpp
- mllm/backends/qnn/aot_rt/PromptProcessor.cpp
- mllm/backends/qnn/aot_rt/PromptProcessor.hpp
- mllm/backends/qnn/aot_rt/QnnAOTModule.cpp
- mllm/backends/qnn/aot_rt/QnnAOTModule.hpp
- mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp
- mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp
- mllm/backends/qnn/aot_rt/TokenGenerator.cpp
- mllm/backends/qnn/aot_rt/TokenGenerator.hpp
- mllm/backends/qnn/aot_rt/utils/MaskGen.cpp
- mllm/backends/qnn/aot_rt/utils/MaskGen.hpp
- mllm/backends/qnn/aot_rt/utils/PositionIdGen.cpp
- mllm/backends/qnn/aot_rt/utils/PositionIdGen.hpp
- mllm/backends/qnn/aot_rt/utils/RoPEGen.cpp
- mllm/backends/qnn/aot_rt/utils/RoPEGen.hpp
- mllm/mllm.hpp
🧰 Additional context used
📓 Path-based instructions (5)
{mllm,mllm-cli,pymllm}/**/*
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*: Files must not contain C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, or DEL 0x7F. Horizontal tab (0x09) and line feed (0x0A) are explicitly allowed.
All files must be encoded in UTF-8 without BOM.
Any violation of character set (Rule 1) or encoding (Rule 2) requirements must cause the review to fail.
No line may end with trailing whitespace.
Use Unix line endings (LF).
File and directory names must consist only of printable Unicode characters, excluding C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, and DEL 0x7F.
Only use acceptable file extensions: .c, .cc, .cpp, .cxx, .h, .hh, .hpp, .py, .pyi, .sh, .txt, .md, .yml, .yaml, .json, .toml.
Optional license headers, if present, must comply with character set rules (no C0/C1 control codes except tab and line feed).
Files:
- mllm/backends/qnn/CMakeLists.txt
- mllm/backends/qnn/QNNUtils.hpp
- mllm/mllm.hpp
- mllm/backends/qnn/aot_rt/PromptProcessor.cpp
- mllm/backends/qnn/aot_rt/TokenGenerator.hpp
- mllm/backends/qnn/aot_rt/PromptProcessor.hpp
- mllm/backends/qnn/QNNUtils.cpp
- mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp
- mllm/backends/qnn/QNNBackend.cpp
- mllm/backends/qnn/aot_rt/KVCacheManager.hpp
- mllm/backends/qnn/aot_rt/QnnAOTModule.cpp
- mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp
- mllm/backends/qnn/QNNBackend.hpp
- mllm/backends/qnn/Register.cpp
- mllm/backends/qnn/aot_rt/KVCacheManager.cpp
- mllm/backends/qnn/aot_rt/QnnAOTModule.hpp
- mllm/backends/qnn/aot_rt/TokenGenerator.cpp
{mllm,mllm-cli,pymllm}/**/*.{toml,json,txt,md}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
When suggesting new dependencies, evaluate their maturity, licensing, and maintenance status. Prefer well-established and actively maintained libraries.
Files:
mllm/backends/qnn/CMakeLists.txt
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}: TODO and FIXME comments must be written as 'TODO:' or 'FIXME:' followed by UTF-8 text that adheres to character set rules.
Encourage consistent coding style and patterns with the existing codebase.
Ensure code is portable across supported platforms (e.g., Linux, Windows) unless explicitly platform-specific.
Files:
- mllm/backends/qnn/QNNUtils.hpp
- mllm/mllm.hpp
- mllm/backends/qnn/aot_rt/PromptProcessor.cpp
- mllm/backends/qnn/aot_rt/TokenGenerator.hpp
- mllm/backends/qnn/aot_rt/PromptProcessor.hpp
- mllm/backends/qnn/QNNUtils.cpp
- mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp
- mllm/backends/qnn/QNNBackend.cpp
- mllm/backends/qnn/aot_rt/KVCacheManager.hpp
- mllm/backends/qnn/aot_rt/QnnAOTModule.cpp
- mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp
- mllm/backends/qnn/QNNBackend.hpp
- mllm/backends/qnn/Register.cpp
- mllm/backends/qnn/aot_rt/KVCacheManager.cpp
- mllm/backends/qnn/aot_rt/QnnAOTModule.hpp
- mllm/backends/qnn/aot_rt/TokenGenerator.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}: Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.
Adhere to language-specific best practices and idioms (e.g., PEP 8 for Python, Google C++ Style Guide for C++).
Files:
- mllm/backends/qnn/QNNUtils.hpp
- mllm/mllm.hpp
- mllm/backends/qnn/aot_rt/PromptProcessor.cpp
- mllm/backends/qnn/aot_rt/TokenGenerator.hpp
- mllm/backends/qnn/aot_rt/PromptProcessor.hpp
- mllm/backends/qnn/QNNUtils.cpp
- mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp
- mllm/backends/qnn/QNNBackend.cpp
- mllm/backends/qnn/aot_rt/KVCacheManager.hpp
- mllm/backends/qnn/aot_rt/QnnAOTModule.cpp
- mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp
- mllm/backends/qnn/QNNBackend.hpp
- mllm/backends/qnn/Register.cpp
- mllm/backends/qnn/aot_rt/KVCacheManager.cpp
- mllm/backends/qnn/aot_rt/QnnAOTModule.hpp
- mllm/backends/qnn/aot_rt/TokenGenerator.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}: Prioritize production-ready code quality by evaluating time and space complexity of algorithms and data structures, and suggest more efficient alternatives for operations with high complexity (e.g., O(n^2) or worse) when feasible.
Avoid unnecessary object creation in loops or hot paths.
Check for proper error handling and resource cleanup (e.g., using try-finally, context managers, or RAII).
Ensure functions that can fail return appropriate error codes or raise exceptions.
Validate inputs for public APIs and critical internal functions.
Add comments for complex algorithms or non-obvious logic.
Identify potential security issues (e.g., buffer overflows, injection risks, insecure temporary files) and recommend using secure alternatives (e.g., parameterized queries, secure random generators).
Suggest adding unit tests for untested complex logic or edge cases.
Ensure code is testable by avoiding global state and using dependency injection.
Flag overly complex functions (e.g., high cyclomatic complexity) and suggest breaking them down.
Use named constants instead of magic numbers.
Add appropriate logging (e.g., debug, info, warning, error) for significant events and errors, avoiding sensitive data exposure.
Files:
- mllm/backends/qnn/aot_rt/PromptProcessor.cpp
- mllm/backends/qnn/QNNUtils.cpp
- mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp
- mllm/backends/qnn/QNNBackend.cpp
- mllm/backends/qnn/aot_rt/QnnAOTModule.cpp
- mllm/backends/qnn/Register.cpp
- mllm/backends/qnn/aot_rt/KVCacheManager.cpp
- mllm/backends/qnn/aot_rt/TokenGenerator.cpp
🧬 Code graph analysis (12)
examples/qwen3_qnn_aot/aot_run.cpp (4)
mllm/utils/Argparse.hpp (1)
Argparse (135-135)
mllm/backends/qnn/aot_rt/KVCacheManager.hpp (1)
ar_len (50-50)
mllm/backends/qnn/aot_rt/PromptProcessor.hpp (2)
prompt_tokens (39-39), prompt_tokens (42-42)
mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp (1)
prompt_tokens (33-34)
mllm/backends/qnn/aot_rt/TokenGenerator.hpp (1)
mllm/backends/qnn/aot_rt/TokenGenerator.cpp (1)
TokenGenerator(9-15)
mllm/backends/qnn/QNNUtils.cpp (1)
mllm/backends/qnn/QNNUtils.hpp (1)
isAlloc_(208-208)
mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp (3)
mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp (3)
Runner (29-29), Runner (30-30), prompt_tokens (33-34)
mllm/backends/qnn/aot_rt/TokenGenerator.cpp (2)
generate (107-150), generate (107-108)
mllm/backends/qnn/aot_rt/PromptProcessor.hpp (2)
prompt_tokens (39-39), prompt_tokens (42-42)
mllm/backends/qnn/QNNBackend.cpp (2)
mllm/backends/qnn/QNNBackend.hpp (10)
contextBinaryPath (53-54), context (52-52), contextPath (90-90), graphName (96-96), graphName (98-102), graphName (104-104), graphName (106-106), graphName (109-110), graphName (112-113), graphName (116-116)
mllm/backends/qnn/QNNModel.hpp (2)
context (39-40), context (43-45)
mllm/backends/qnn/aot_rt/KVCacheManager.hpp (1)
mllm/backends/qnn/aot_rt/KVCacheManager.cpp (1)
KVCacheManager(12-20)
mllm/backends/qnn/aot_rt/QnnAOTModule.cpp (1)
mllm/backends/qnn/aot_rt/QnnAOTModule.hpp (2)
QnnAOTModule (14-14), inputs (16-16)
mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp (2)
mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp (1)
Runner (12-13)
mllm/backends/qnn/aot_rt/PromptProcessor.hpp (2)
prompt_tokens (39-39), prompt_tokens (42-42)
mllm/backends/qnn/QNNBackend.hpp (1)
mllm/backends/qnn/QNNModel.hpp (2)
context (39-40), context (43-45)
mllm/backends/qnn/Register.cpp (3)
mllm/mllm.hpp (1)
isQnnAvailable (198-198)
mllm/mllm.cpp (2)
isQnnAvailable (93-98), isQnnAvailable (93-93)
mllm/backends/qnn/aot/QnnWrappersAPI.hpp (1)
instance(171-174)
mllm/backends/qnn/aot_rt/KVCacheManager.cpp (1)
mllm/backends/qnn/aot_rt/KVCacheManager.hpp (15)
KVCacheManager (34-34), KVCacheManager (35-35), k_cache_ (52-52), v_cache_ (53-53), allocator (37-37), ar_len (50-50), attention_mask (40-40), attention_mask (42-43), attention_mask (45-45), attention_mask (47-48), ar_len_dst (38-38), k_cache (57-57), k_cache (59-59), v_cache (58-58), v_cache (60-60)
mllm/backends/qnn/aot_rt/TokenGenerator.cpp (3)
mllm/backends/qnn/aot_rt/TokenGenerator.hpp (4)
TokenGenerator (30-31), TokenGenerator (33-33), TokenGenerator (33-33), tokens (39-40)
mllm/backends/qnn/aot_rt/PromptProcessor.cpp (4)
init_io (27-93), init_io (27-27), prepare_io (96-130), prepare_io (96-96)
mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp (2)
generate (82-102), generate (82-83)
🪛 Clang (14.0.6)
examples/qwen3_qnn_aot/aot_run.cpp
[error] 1-1: 'iostream' file not found
(clang-diagnostic-error)
mllm/backends/qnn/aot_rt/PromptProcessor.cpp
[error] 5-5: 'mllm/backends/qnn/aot_rt/PromptProcessor.hpp' file not found
(clang-diagnostic-error)
mllm/backends/qnn/aot_rt/TokenGenerator.hpp
[error] 3-3: 'mllm/backends/qnn/aot_rt/QnnAOTModule.hpp' file not found
(clang-diagnostic-error)
mllm/backends/qnn/aot_rt/PromptProcessor.hpp
[error] 6-6: 'mllm/backends/qnn/aot_rt/QnnAOTModule.hpp' file not found
(clang-diagnostic-error)
mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp
[error] 4-4: 'mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp' file not found
(clang-diagnostic-error)
mllm/backends/qnn/aot_rt/KVCacheManager.hpp
[error] 6-6: 'cstdint' file not found
(clang-diagnostic-error)
[error] 52-52: function 'getKCache' should be marked [[nodiscard]]
(modernize-use-nodiscard,-warnings-as-errors)
[error] 53-53: function 'getVCache' should be marked [[nodiscard]]
(modernize-use-nodiscard,-warnings-as-errors)
mllm/backends/qnn/aot_rt/QnnAOTModule.cpp
[error] 1-1: 'mllm/backends/qnn/aot_rt/QnnAOTModule.hpp' file not found
(clang-diagnostic-error)
mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp
[error] 4-4: 'mllm/backends/qnn/aot_rt/QnnAOTModule.hpp' file not found
(clang-diagnostic-error)
mllm/backends/qnn/aot_rt/KVCacheManager.cpp
[error] 4-4: 'mllm/backends/qnn/aot_rt/KVCacheManager.hpp' file not found
(clang-diagnostic-error)
mllm/backends/qnn/aot_rt/QnnAOTModule.hpp
[error] 3-3: 'mllm/models/ARGeneration.hpp' file not found
(clang-diagnostic-error)
mllm/backends/qnn/aot_rt/TokenGenerator.cpp
[error] 1-1: 'mllm/backends/qnn/aot_rt/TokenGenerator.hpp' file not found
(clang-diagnostic-error)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-android
- GitHub Check: build-macos
- GitHub Check: build-x86
🔇 Additional comments (16)
mllm/backends/qnn/QNNUtils.cpp (1)
484-496: No issue found — `registerQnnTensorToSharedBuffer` is explicitly designed to be idempotent.

The implementation checks if a pointer is already registered and handles it gracefully: if the pointer exists in `ptrToFdAndMemHandleMap_`, it simply updates the tensor's memory handle and returns (lines 62–68 of QNNAllocator.cpp). Only on the first call does it perform the actual QNN memory registration. Multiple calls to `alloc()` on the same tensor are safe and will not cause double-registration or resource leaks.

mllm/backends/qnn/aot_rt/QnnAOTModule.cpp (1)
9-10: LGTM! Constructor correctly initializes the base class with `graph_name` and stores both member variables.
mllm/backends/qnn/QNNBackend.cpp (1)
421-438: LGTM! The new `createContext` and `loadContext` methods are well-structured. They properly initialize the QNN allocator after context creation/loading, and `loadContext` correctly populates `qnnModelIndexMap_` for graph lookup.
examples/CMakeLists.txt (1)
20-22: LGTM! The updated condition correctly enables the `qwen3_qnn_aot` example when either cross-compiling AOT on x86 or when building with QNN backend support on target.
mllm/mllm.hpp (1)
200-200: LGTM! The optional `context_path` parameter with a default value maintains backward compatibility while enabling callers to specify custom context binary paths. This aligns well with the new `loadContext` API in `QNNBackend`.
mllm/backends/qnn/CMakeLists.txt (1)
24-28: The unconditional inclusion of aot_rt sources is correct. The `aot_rt` sources must be compiled unconditionally because `mllm-qwen3-aot-runner` (examples/qwen3_qnn_aot/CMakeLists.txt, line 8) is built unconditionally and directly includes `QnnAOTRuntime.hpp`. While `aot` sources are conditional on `MLLM_QUALCOMM_QNN_AOT_ON_X86_ENABLE` (for AOT compilation), `aot_rt` sources provide the runtime components needed by the unconditionally-built runner executable.
Likely an incorrect or invalid review comment.
mllm/backends/qnn/Register.cpp (1)
23-27: Context handling logic looks correct. The conditional load/create pattern is appropriate. However, consider whether the file existence check could race with another process modifying the file system between the check and the actual load operation (TOCTOU). In practice, for model context files this is likely acceptable, but worth noting.
One minor concern: `std::filesystem` requires C++17. Verify this is consistently available across all supported build configurations.
mllm/backends/qnn/aot_rt/PromptProcessor.hpp (1)
31-51: Class structure is well-designed. The use of a raw pointer for `kv_manager_` correctly indicates non-owning reference semantics, while `module_` uses `unique_ptr` for owned resources. The public API is clean with appropriate documentation for `prefill()`.
mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp (1)
27-43: Runner class design is appropriate. The explicit constructor prevents implicit conversions. The use of a raw pointer for `tokenizer_` correctly signals non-owning semantics, while `unique_ptr` is used for owned components. The `load()` method returning `bool` allows proper error handling.
mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp (1)
15-80: Load method implementation is well-structured. The `load()` method properly:
- Validates backend availability and returns `false` on failure
- Uses `unique_ptr` for RAII-based resource management
- Correctly chains initialization of components
One note: if `initCache()` or `init_io()` on lines 75-77 can fail, consider adding error handling and returning `false` accordingly.
mllm/backends/qnn/aot_rt/TokenGenerator.cpp (2)
8-15: LGTM: Constructor properly initializes members and creates module. The constructor correctly uses move semantics for `eos_ids` and `config_`, and properly initializes the QNN AOT module.
46-57: Verify KV cache tensor dimension ordering. The input KV cache tensors use shape `{1, num_heads, context_len, head_dim}` for both K and V. However, in PromptProcessor.cpp, K uses `{1, num_heads, head_dim, context_len - ar_len}` (transposed layout) while V uses `{1, num_heads, context_len - ar_len, head_dim}`. Please verify that this dimension ordering is intentional for the decode phase vs. the prefill phase, as it affects how `KVCacheManager::rearrangeKey`/`rearrangeValue` operate.
mllm/backends/qnn/aot_rt/PromptProcessor.cpp (2)
48-67: LGTM: KV cache tensor setup with manual memory management. The tensor shapes follow standard transformer patterns (K transposed for attention matmul). The manual memory management correctly ties tensors to externally managed KVCacheManager buffers.
95-130: LGTM: `prepare_io` correctly handles chunked input preparation. The method properly handles input IDs with padding, position IDs, and attention mask initialization for the current chunk.
mllm/backends/qnn/aot_rt/TokenGenerator.hpp (1)
mllm/backends/qnn/aot_rt/TokenGenerator.hpp (1)
18-28: Config struct is well-defined with sensible defaults.The
Configstruct captures all necessary model parameters. Thekv_dtypedefault ofkUInt8aligns with the explicit instantiation foruint8_t.mllm/backends/qnn/aot_rt/KVCacheManager.cpp (1)
188-237: LGTM: Cache rearrangement correctly handles overlapping memory regions. The use of `std::memmove` instead of `std::memcpy` is correct since source and destination buffers overlap. The different iteration patterns for key vs. value correctly account for their different memory layouts (key is transposed).
auto k_storage_in = std::make_shared<mllm::Storage>();
k_storage_in->size_ = cache_in_bytes;
allocator->alloc(k_storage_in);
memset(k_storage_in->ptr_, 0, cache_in_bytes);
Missing null check after allocation.
The code calls allocator->alloc(k_storage_in) but doesn't verify that allocation succeeded before calling memset on k_storage_in->ptr_. If the allocator fails, this will cause a null pointer dereference.
🐛 Proposed fix
auto k_storage_in = std::make_shared<mllm::Storage>();
k_storage_in->size_ = cache_in_bytes;
allocator->alloc(k_storage_in);
+ if (!k_storage_in->ptr_) {
+ MLLM_ERROR("Failed to allocate k_storage_in for layer {}", layer);
+ throw std::runtime_error("KV cache allocation failed");
+ }
 memset(k_storage_in->ptr_, 0, cache_in_bytes);
Apply similar checks after each `allocator->alloc()` call.
🤖 Prompt for AI Agents
In `@mllm/backends/qnn/aot_rt/KVCacheManager.cpp` around lines 35 - 38, After
calling allocator->alloc(k_storage_in) ensure you check that k_storage_in->ptr_
is non-null before calling memset; if allocation failed, log or return an error
(or throw) and avoid dereferencing the pointer. Apply the same null-check
pattern for every allocator->alloc(...) use in this file (e.g., any other
Storage instances created in KVCacheManager methods) so you never call memset or
access ptr_ when allocation failed.
std::vector<int32_t> true_indices(n_update);
for (int i = 0, j = 0; i < selected.size() && j < n_update; ++i) {
  if (selected[i]) { true_indices[j++] = i; }
}
Potential use of uninitialized indices if selected has fewer true values than n_update.
The vector true_indices is initialized with size n_update, but if selected contains fewer than n_update true values, some elements remain at their default value (0). This could cause incorrect cache updates by reading from index 0 instead of valid indices.
🐛 Proposed fix: Validate index count
std::vector<int32_t> true_indices(n_update);
+ int32_t found_count = 0;
for (int i = 0, j = 0; i < selected.size() && j < n_update; ++i) {
- if (selected[i]) { true_indices[j++] = i; }
+ if (selected[i]) {
+ true_indices[j++] = i;
+ found_count++;
+ }
}
+ if (found_count != n_update) {
+ MLLM_ERROR("Selected vector has {} true values, expected {}", found_count, n_update);
+ return; // or throw
+  }
🤖 Prompt for AI Agents
In `@mllm/backends/qnn/aot_rt/KVCacheManager.cpp` around lines 270 - 273, The loop
building true_indices can leave elements uninitialized if selected has fewer
than n_update true values; change the logic to first count actual trues (or use
push_back) and either resize true_indices to that actual count or validate that
actual_count == n_update and handle the mismatch (log/throw/return error) before
using true_indices; update the code that consumes true_indices (same scope where
true_indices, selected, and n_update are used) to rely on the validated/adjusted
size to avoid reading default zeros.
// Copyright (c) MLLM Team.
// Licensed under the MIT License.

#include "mllm/backends/qnn/aot_rt/QnnAOTModule.hpp"
#include "mllm/backends/qnn/aot_rt/KVCacheManager.hpp"
#include "mllm/backends/qnn/aot_rt/PromptProcessor.hpp"
#include "mllm/backends/qnn/aot_rt/TokenGenerator.hpp"
#include "mllm/preprocessor/tokenizers/AutoTokenizer.hpp"
#include <string>
#include <memory>
#include <vector>
#include <functional>
Missing include guard or #pragma once.
This header file lacks an include guard (#ifndef/#define/#endif) or #pragma once directive. This will cause compilation errors if the header is included multiple times in the same translation unit.
🐛 Proposed fix
// Copyright (c) MLLM Team.
// Licensed under the MIT License.
+#pragma once
+
+#include "mllm/backends/qnn/aot_rt/QnnAOTModule.hpp"
+#include "mllm/backends/qnn/aot_rt/KVCacheManager.hpp"
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// Copyright (c) MLLM Team.
// Licensed under the MIT License.

#pragma once

#include "mllm/backends/qnn/aot_rt/QnnAOTModule.hpp"
#include "mllm/backends/qnn/aot_rt/KVCacheManager.hpp"
#include "mllm/backends/qnn/aot_rt/PromptProcessor.hpp"
#include "mllm/backends/qnn/aot_rt/TokenGenerator.hpp"
#include "mllm/preprocessor/tokenizers/AutoTokenizer.hpp"
#include <string>
#include <memory>
#include <vector>
#include <functional>
🧰 Tools
🪛 Clang (14.0.6)
[error] 4-4: 'mllm/backends/qnn/aot_rt/QnnAOTModule.hpp' file not found
(clang-diagnostic-error)
🤖 Prompt for AI Agents
In `@mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp` around lines 1 - 12, The header
QnnAOTRuntime.hpp is missing an include guard; add a header guard or a
single-line `#pragma` once at the very top of QnnAOTRuntime.hpp (before the
existing includes) and, if using the traditional guard, wrap the entire file
with a unique macro (e.g., QNN_AOT_RT_QNNAOTRUNTIME_HPP) and close it with an
`#endif` (with a comment) at the end of the file to prevent multiple inclusion.
auto input_ids = Tensor::empty({1, 1, 1, 1}, kInt64, kQNN).alloc();
input_ids.setName("input_ids");
input_tensors_.push_back(input_ids);
Type mismatch: Tensor dtype kInt64 but accessed as int32_t*.
The input_ids tensor is created with kInt64 dtype (Line 22), but in prepare_io (Line 94), it's accessed as int32_t*. This mismatch will cause incorrect data writes. Note that PromptProcessor.cpp uses kInt32 for its input_ids tensor.
🐛 Proposed fix
Either change the tensor dtype to kInt32:
// 1. Input IDs
- auto input_ids = Tensor::empty({1, 1, 1, 1}, kInt64, kQNN).alloc();
+ auto input_ids = Tensor::empty({1, 1, 1, 1}, kInt32, kQNN).alloc();
Or update `prepare_io` to use `int64_t*`:
// 1. Input IDs
- int32_t* input_ids_ptr = input_tensors_[0].ptr<int32_t>();
- input_ids_ptr[0] = (int32_t)cur_token;
+ int64_t* input_ids_ptr = input_tensors_[0].ptr<int64_t>();
+ input_ids_ptr[0] = (int64_t)cur_token;
Also applies to: 94-95
🤖 Prompt for AI Agents
In `@mllm/backends/qnn/aot_rt/TokenGenerator.cpp` around lines 22 - 24, The
input_ids tensor is created with Tensor::empty(..., kInt64, ...) but prepare_io
reads/writes it as int32_t*, causing a dtype mismatch; update the creation or
the access so both use the same type—preferably change Tensor::empty call that
constructs input_ids to use kInt32 to match PromptProcessor.cpp (and adjust any
Tensor::empty calls creating input_ids) OR if you must keep 64-bit, change
prepare_io to use int64_t* when accessing input_ids; ensure the symbol names
input_ids, Tensor::empty and prepare_io are updated consistently so allocation
dtype and pointer casts match.
int64_t TokenGenerator<T>::generate(std::vector<uint64_t>& tokens, int64_t start_pos, int32_t seq_len,
                                    const std::function<void(const std::string&)>& token_callback, bool dump_logits) {
Unused dump_logits parameter and token_all_logits_ never populated.
The dump_logits parameter is declared but never used in the function body. Additionally, get_all_logits() returns token_all_logits_ which is never populated, so it will always return an empty vector.
If logits dumping is intended functionality, would you like me to generate the implementation that collects logits when dump_logits is true?
Also applies to: 129-133, 87-89
🤖 Prompt for AI Agents
In `@mllm/backends/qnn/aot_rt/TokenGenerator.cpp` around lines 107 - 108, The
generate method currently ignores the dump_logits parameter and never fills
token_all_logits_, so either implement logits collection when dump_logits is
true or remove the unused API; to fix, update TokenGenerator<T>::generate to,
when dump_logits is true, capture per-step/logit outputs from the
model/inference path and append them into the member token_all_logits_ (ensure
it is cleared at start of generation), and also expose those logits via
get_all_logits(); alternatively, if logits dumping is not needed, remove the
dump_logits parameter from generate and delete token_all_logits_ and
get_all_logits() to avoid dead code — locate TokenGenerator<T>::generate, the
token_all_logits_ member and get_all_logits() to apply the change.
// input wrapper is empty, set wrapper's dataContainer(mllm::Tensor)
if (!wrapper->isAlloc()) { wrapper->__setDataContainer(runtime_input); }
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for usage patterns of __setDataContainer and isAlloc
rg -n -C3 '__setDataContainer|isAlloc' --type=cpp
🏁 Script executed:
#!/bin/bash
# Get broader context around line 552-553 in QNNBackend.cpp
sed -n '520,580p' mllm/backends/qnn/QNNBackend.cpp
echo "---"
# Search for wrapper creation/reuse patterns
rg -n 'wrapper\s*=' mllm/backends/qnn/QNNBackend.cpp | head -20
🏁 Script executed:
#!/bin/bash
# Search for wrapper lifecycle - how are input wrappers stored and reused?
rg -n 'getGraphInputTensorWrappers' mllm/backends/qnn/ --type=cpp
echo "---"
# Search for reset/cleanup methods in QNNTensorWrapper
rg -n 'reset|clear|cleanup' mllm/backends/qnn/QNNUtils.hpp --type=hpp
🏁 Script executed:
#!/bin/bash
# Search for reset/cleanup methods in QNNTensorWrapper without type specification
rg -n 'reset|clear|cleanup' mllm/backends/qnn/QNNUtils.hpp
echo "---"
# Look at the full QNNTensorWrapper class definition
sed -n '180,230p' mllm/backends/qnn/QNNUtils.hpp
🏁 Script executed:
#!/bin/bash
# Check how wrappers are created and stored in the model
rg -n 'inputTensorWrappers_' mllm/backends/qnn/QNNModel.cpp -A2 -B2 | head -40
echo "---"
# Check if there's any mechanism to reset or recreate wrappers between executions
rg -n 'graphExecute' mllm/backends/qnn/QNNBackend.cpp | head -5
🏁 Script executed:
#!/bin/bash
# Understand the alloc() method and whether it updates data
sed -n '484,496p' mllm/backends/qnn/QNNUtils.cpp
echo "---"
# Check the create method to understand initial wrapper state
rg -n 'QNNTensorWrapper::create' mllm/backends/qnn/QNNUtils.cpp -A15 | head -40
🏁 Script executed:
#!/bin/bash
# Check the lifecycle of the wrapper between multiple executions
# Does graphExecute clear or reset wrappers?
rg -n 'clear|reset|new ' mllm/backends/qnn/QNNBackend.cpp | grep -i wrapper
echo "---"
# Check how the input wrapper is actually used after alloc() in graphExecute
sed -n '553,572p' mllm/backends/qnn/QNNBackend.cpp
🏁 Script executed:
#!/bin/bash
# Final verification: Check if there's any mechanism to reset isAlloc_ or update data between executions
rg -n 'isAlloc_\s*=' mllm/backends/qnn/ --type=cpp
echo "---"
# Check comment or documentation about wrapper reuse
rg -n 'reuse|lifecycle|between execution' mllm/backends/qnn/QNNUtils.hpp
echo "---"
# Look for any destructor or cleanup in QNNModel
rg -n '~QNNModel|reset.*wrapper' mllm/backends/qnn/QNNModel.hpp
🏁 Script executed:
#!/bin/bash
# Final critical check: verify what happens on second execution
# Look at the actual behavior flow
echo "=== First Execution Flow ==="
echo "1. !wrapper->isAlloc() is TRUE (initial state: isAlloc_ = false)"
echo "2. Calls __setDataContainer(runtime_input) - sets dataContainer_ to runtime_input"
echo "3. Calls wrapper->alloc() - sets isAlloc_ = true"
echo ""
echo "=== Second Execution Flow ==="
echo "1. !wrapper->isAlloc() is FALSE (already allocated)"
echo "2. SKIPS __setDataContainer() - wrapper->dataContainer_ NOT updated"
echo "3. Calls wrapper->alloc() again - but dataContainer_ still has old data"
echo ""
# Verify the key insight: once isAlloc_ is true, it never becomes false
rg -n 'isAlloc_\s*=\s*false' mllm/backends/qnn/
Fix stale data bug when wrappers are reused across multiple graph executions.
The current logic skips data updates when wrapper->isAlloc() is true, but once a wrapper is allocated (first execution), isAlloc_ is never reset. On subsequent graphExecute calls with the same wrapper objects, the condition if (!wrapper->isAlloc()) evaluates to false, preventing __setDataContainer() from updating the wrapper's data. This causes stale input data from the previous execution to be reused.
To fix this, consider either:
- Creating fresh wrappers for each execution, or
- Resetting/updating the wrapper's data unconditionally before allocation (e.g., check if input data changed rather than allocation status), or
- Adding a reset mechanism to clear `isAlloc_` and `dataContainer_` between executions.
🤖 Prompt for AI Agents
In `@mllm/backends/qnn/QNNBackend.cpp` around lines 552 - 553, The wrapper reuse
causes stale inputs because wrapper->isAlloc() prevents updating dataContainer_;
change the flow so the wrapper's data container is refreshed each graphExecute:
either always call wrapper->__setDataContainer(runtime_input) before
allocation/usage (i.e., do not gate it on wrapper->isAlloc()), or add/invoke a
reset method that clears isAlloc_ and dataContainer_ on reuse (implement
resetWrapper/resetAllocation that zeroes dataContainer_ and sets isAlloc_ =
false and call it at start of graphExecute). Ensure __setDataContainer(),
isAlloc_ and dataContainer_ are the referenced symbols updated so allocation
uses the current runtime_input.
bool retrieveContext(const std::string& contextBinaryPath, Qnn_ContextHandle_t& context,
                     std::vector<std::shared_ptr<QNNModel>>& qnnModels, QnnContext_Config_t** contextConfig = nullptr);
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, find and read the header file to confirm the signature change
find . -name "QNNBackend.hpp" -type f
🏁 Script executed:
# Search for all usages of retrieveContext in the codebase
rg -n "retrieveContext" --type=cpp --type=h -C 3
🏁 Script executed:
# Search more comprehensively for all retrieveContext usages, including in tests and other files
rg -n "retrieveContext" --type=cpp --type=h --type=cc
🏁 Script executed:
# Read the header file to check for docstrings and API documentation
cat -n mllm/backends/qnn/QNNBackend.hpp | head -80
🏁 Script executed:
# Also search in Python files in case there are bindings
rg -n "retrieveContext" --type=py
🏁 Script executed:
# Search more comprehensively across all source files
rg -n "retrieveContext" --type=cpp
🏁 Script executed:
# Check if there are any other file types that might use this API
find . -type f \( -name "*.cpp" -o -name "*.h" -o -name "*.hpp" \) | wc -l
🏁 Script executed:
# Look at the complete context around the method declaration to check for docstrings
sed -n '40,60p' mllm/backends/qnn/QNNBackend.hpp
🏁 Script executed:
# Check if createContext has docstrings (line 52)
sed -n '45,56p' mllm/backends/qnn/QNNBackend.hpp
🏁 Script executed:
# Check the complete implementation to understand the context better
sed -n '333,360p' mllm/backends/qnn/QNNBackend.cpp
🏁 Script executed:
# Verify no other files in the project might be calling this method
find . -type f \( -name "*.cpp" -o -name "*.h" -o -name "*.hpp" \) -exec grep -l "QNNRuntime\|QNNBackend" {} \; | head -20
🏁 Script executed:
# Check if there are any examples or test files that might use the old API
find . -type f -name "*test*" -o -name "*example*" | grep -i qnn
🏁 Script executed:
# Double-check for any remaining usages of the old signature or other related calls
rg -n "retrieveContext\|loadContext\|createContext" mllm/backends/qnn/QNNBackend.cpp -A 2
🏁 Script executed:
# Check if the coding guideline about docstrings is being violated
# Look at similar public methods in other backends to see the pattern
find . -type f -name "*.hpp" -path "*/backends/*" -exec grep -l "bool.*(" {} \; | head -5 | xargs head -100
🏁 Script executed:
# Verify the specific lines mentioned in the review (53-54) and surrounding context for any documentation
sed -n '48,58p' mllm/backends/qnn/QNNBackend.hpp
🏁 Script executed:
# Verify no trailing whitespace or other formatting issues in the header file
sed -n '52,55p' mllm/backends/qnn/QNNBackend.hpp | cat -A
🏁 Script executed:
# Check if there are any docstring patterns used elsewhere in QNN backend files
rg -n "///|//!" mllm/backends/qnn/QNNBackend.hpp | head -20
🏁 Script executed:
# Check Base Backend header for documentation patterns
rg -n "bool.*\(.*\);" mllm/backends/base/Backend.hpp -B 2 | head -30
🏁 Script executed:
# Check the entire QNNRuntime class declaration to see if there are any docstrings at all
sed -n '41,100p' mllm/backends/qnn/QNNBackend.hpp
🏁 Script executed:
# Look at the implementation to see if docstrings are in the .cpp file instead
sed -n '325,340p' mllm/backends/qnn/QNNBackend.cpp
🏁 Script executed:
# Check how many methods in QNNRuntime lack documentation
grep -n "bool\|void" mllm/backends/qnn/QNNBackend.hpp | grep -A 1 "QNNRuntime" -m 20
🏁 Script executed:
# Check if these methods are newly added or modified (look at git context if available)
git log --oneline -1 mllm/backends/qnn/QNNBackend.hpp 2>/dev/null || echo "Git history not available"
🏁 Script executed:
# Verify the coding guideline compliance by checking if similar public methods in this file have docstrings
sed -n '85,130p' mllm/backends/qnn/QNNBackend.hpp
Add docstrings to public API methods.
The `retrieveContext` and `createContext` methods (lines 52-54) lack docstrings. Per coding guidelines, public APIs must have clear docstrings explaining purpose, parameters, returns, and errors. The signature change itself is correct, and the caller at QNNBackend.cpp:429 has been properly updated.
🤖 Prompt for AI Agents
In `@mllm/backends/qnn/QNNBackend.hpp` around lines 53 - 54, Add clear docstrings
for the public methods retrieveContext and createContext in QNNBackend.hpp: for
each method describe its purpose, list and explain every parameter (e.g.,
contextBinaryPath, context, qnnModels, contextConfig for retrieveContext and
relevant params for createContext), state the return value semantics (what a
true/false or success/failure indicates) and document possible errors/exceptions
or failure conditions and side effects (e.g., ownership of pointers, when
contextConfig may be modified, thread-safety expectations). Place these comments
immediately above the function declarations so they become part of the public
API docs.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@mllm/backends/qnn/aot/QnnWrappersAPI.cpp`:
- Around line 483-515: In QnnAOTEnv::saveContext, treat any binary size mismatch
as a fatal error: after calling contextGetBinary and obtaining binarySize and
writtenSize (in QnnAOTEnv::saveContext and using
qnn_htp_func_symbols_.qnn_interface_.contextGetBinary), check if writtenSize !=
binarySize (or writtenSize > binarySize) and if so log an error with both sizes
via MLLM_ERROR and return/abort before opening or writing the file; ensure the
success message and file write only happen when writtenSize == binarySize.
🧹 Nitpick comments (4)
mllm/backends/qnn/aot/visitor/RMSNorm.cpp (3)
46-49: Improve comment clarity and professionalism. The comment "nn module seems to be inconsistent with document (AMAZING!)" is informal. Consider providing a more descriptive explanation of why this fake bias is necessary and reference any relevant documentation or issue tracker.
📝 Suggested improvement
- // fake bias, nn module seems to be inconsistent with document (AMAZING!)
+ // QNN RmsNorm op requires a bias input per API, though the weight-only variant
+ // is documented. Provide a zero-filled bias tensor to satisfy the interface.
51-54: Document or use named constants for quantization parameters. The `QuantizationSpecSymPerTensor::create(0, 0, kInt32, kFloat32, ...)` call uses magic numbers without explanation. Consider using named constants or adding a brief comment explaining what the first two `0` parameters represent (e.g., zero point, scale offset, etc.) for maintainability.
📝 Suggested improvement
  // fake bias quant recipe
- auto quant_spec = mllm::ir::linalg::QuantizationSpecSymPerTensor::create(0, 0, kInt32, kFloat32, Tensor::ones({1}));
+ // Symmetric per-tensor quantization: zero_point=0, encoding=0 (unused for bias)
+ constexpr int kBiasZeroPoint = 0;
+ constexpr int kBiasEncoding = 0;
+ auto quant_spec = mllm::ir::linalg::QuantizationSpecSymPerTensor::create(
+     kBiasZeroPoint, kBiasEncoding, kInt32, kFloat32, Tensor::ones({1}));
17-79: Consider adding unit tests for the fake bias handling. The `rewrite` function now includes non-trivial logic for creating fake bias tensors and quantization recipes. Given the complexity and the workaround nature of this code (as indicated by the comment), unit tests covering edge cases (e.g., different weight shapes, dtypes) would help prevent regressions.
Would you like me to help outline test cases for this pattern, or open an issue to track adding test coverage?
examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp (1)
28-38: Minor: redundant reassignment in else branch. Lines 36-37 reassign `scale_name` and `zp_name` to the same values already set at lines 29-30. The else branch can be removed entirely.
Suggested simplification
 Tensor QDQ(nn::Module* m, Tensor in, const std::string& qdq_name_in_pytorch) {
-  std::string scale_name = m->getModuleName() + "." + qdq_name_in_pytorch + ".fake_quant.scale";
-  std::string zp_name = m->getModuleName() + "." + qdq_name_in_pytorch + ".fake_quant.zero_point";
-
+  std::string scale_name;
+  std::string zp_name;
   if (m->getModuleName().empty()) {
     scale_name = qdq_name_in_pytorch + ".fake_quant.scale";
     zp_name = qdq_name_in_pytorch + ".fake_quant.zero_point";
   } else {
     scale_name = m->getModuleName() + "." + qdq_name_in_pytorch + ".fake_quant.scale";
     zp_name = m->getModuleName() + "." + qdq_name_in_pytorch + ".fake_quant.zero_point";
   }
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp
mllm/backends/qnn/QNNUtils.cpp
mllm/backends/qnn/aot/QnnWrappersAPI.cpp
mllm/backends/qnn/aot/visitor/RMSNorm.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
- mllm/backends/qnn/QNNUtils.cpp
🧰 Additional context used
📓 Path-based instructions (4)
{mllm,mllm-cli,pymllm}/**/*
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*: Files must not contain C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, or DEL 0x7F. Horizontal tab (0x09) and line feed (0x0A) are explicitly allowed.
All files must be encoded in UTF-8 without BOM.
Any violation of character set (Rule 1) or encoding (Rule 2) requirements must cause the review to fail.
No line may end with trailing whitespace.
Use Unix line endings (LF).
File and directory names must consist only of printable Unicode characters, excluding C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, and DEL 0x7F.
Only use acceptable file extensions: .c, .cc, .cpp, .cxx, .h, .hh, .hpp, .py, .pyi, .sh, .txt, .md, .yml, .yaml, .json, .toml.
Optional license headers, if present, must comply with character set rules (no C0/C1 control codes except tab and line feed).
Files:
mllm/backends/qnn/aot/visitor/RMSNorm.cpp
mllm/backends/qnn/aot/QnnWrappersAPI.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}: TODO and FIXME comments must be written as 'TODO:' or 'FIXME:' followed by UTF-8 text that adheres to character set rules.
Encourage consistent coding style and patterns with the existing codebase.
Ensure code is portable across supported platforms (e.g., Linux, Windows) unless explicitly platform-specific.
Files:
mllm/backends/qnn/aot/visitor/RMSNorm.cpp
mllm/backends/qnn/aot/QnnWrappersAPI.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}: Prioritize production-ready code quality by evaluating time and space complexity of algorithms and data structures, and suggest more efficient alternatives for operations with high complexity (e.g., O(n^2) or worse) when feasible.
Avoid unnecessary object creation in loops or hot paths.
Check for proper error handling and resource cleanup (e.g., using try-finally, context managers, or RAII).
Ensure functions that can fail return appropriate error codes or raise exceptions.
Validate inputs for public APIs and critical internal functions.
Add comments for complex algorithms or non-obvious logic.
Identify potential security issues (e.g., buffer overflows, injection risks, insecure temporary files) and recommend using secure alternatives (e.g., parameterized queries, secure random generators).
Suggest adding unit tests for untested complex logic or edge cases.
Ensure code is testable by avoiding global state and using dependency injection.
Flag overly complex functions (e.g., high cyclomatic complexity) and suggest breaking them down.
Use named constants instead of magic numbers.
Add appropriate logging (e.g., debug, info, warning, error) for significant events and errors, avoiding sensitive data exposure.
Files:
mllm/backends/qnn/aot/visitor/RMSNorm.cpp
mllm/backends/qnn/aot/QnnWrappersAPI.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}: Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.
Adhere to language-specific best practices and idioms (e.g., PEP 8 for Python, Google C++ Style Guide for C++).
Files:
mllm/backends/qnn/aot/visitor/RMSNorm.cpp
mllm/backends/qnn/aot/QnnWrappersAPI.cpp
🧬 Code graph analysis (1)
mllm/backends/qnn/aot/visitor/RMSNorm.cpp (2)
mllm/backends/qnn/aot/visitor/RMSNorm.hpp (1)
writer (16-16)
mllm/backends/qnn/aot/QnnWrappersAPI.hpp (3)
qnn_context_name (219-219), qnn_context_name (221-222), qnn_context_name (224-225)
🪛 Clang (14.0.6)
mllm/backends/qnn/aot/visitor/RMSNorm.cpp
[error] 4-4: 'mllm/core/DataTypes.hpp' file not found
(clang-diagnostic-error)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-macos
- GitHub Check: build-android
- GitHub Check: build-x86
🔇 Additional comments (3)
mllm/backends/qnn/aot/visitor/RMSNorm.cpp (2)
70-74: LGTM on input wiring for bias tensor. The `bias_node` is correctly wired as the third input with `force_static_weight = true`, consistent with the existing weight input handling. The output attachment follows the established pattern.
4-5: No issues found. Both `mllm/core/DataTypes.hpp` and `mllm/core/Tensor.hpp` exist in the codebase and are correctly included at the specified lines. Likely an incorrect or invalid review comment.
examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp (1)
244-256: Precision upgrade from kFloat16 to kFloat32 for KV de-quantization. The intermediate representation for key_states and value_states before re-quantization to `kUInt8PerTensorSym` now uses `kFloat32` instead of `kFloat16`. This reduces quantization error accumulation at the cost of a temporarily increased memory footprint. Please confirm this change is intentional and verify whether the precision improvement outweighs any performance overhead on the target QNN hardware.
```cpp
void QnnAOTEnv::saveContext(const std::string& name, const std::string& path) {
  // TODO
  if (contexts_.find(name) == contexts_.end()) {
    MLLM_ERROR("QnnAOTEnv::saveContext Context {} not found", name);
    return;
  }
  auto context = contexts_[name];

  uint64_t binarySize = 0;
  uint64_t writtenSize = 0;

  auto status = qnn_htp_func_symbols_.qnn_interface_.contextGetBinarySize(context->qnn_ctx_handle_, &binarySize);
  MLLM_RT_ASSERT_EQ(status, QNN_SUCCESS);

  std::vector<uint8_t> binaryBuffer(binarySize);

  status = qnn_htp_func_symbols_.qnn_interface_.contextGetBinary(
      context->qnn_ctx_handle_, reinterpret_cast<void*>(binaryBuffer.data()), binarySize, &writtenSize);
  MLLM_RT_ASSERT_EQ(status, QNN_SUCCESS);

  if (binarySize < writtenSize) {
    MLLM_ERROR("QNN context binary size mismatch: expected {} bytes, but wrote {} bytes.", binarySize, writtenSize);
  }

  std::ofstream file(path, std::ios::binary);
  if (!file.is_open()) {
    MLLM_ERROR("Failed to open file {} for writing QNN context.", path);
    return;
  }
  file.write(reinterpret_cast<char*>(binaryBuffer.data()), writtenSize);
  file.close();

  MLLM_INFO("QNN context {} saved to {} written {}", name, path, writtenSize);
}
```
Treat any context size mismatch as an error before writing.
`writtenSize` can be less than `binarySize`, which would save an incomplete/invalid context but still log success. Treat any mismatch as an error and abort the write.
💡 Suggested fix

```diff
-  if (binarySize < writtenSize) {
-    MLLM_ERROR("QNN context binary size mismatch: expected {} bytes, but wrote {} bytes.", binarySize, writtenSize);
-  }
+  if (writtenSize != binarySize) {
+    MLLM_ERROR("QNN context binary size mismatch: expected {} bytes, but wrote {} bytes.", binarySize, writtenSize);
+    return;
+  }
```

🤖 Prompt for AI Agents
In `@mllm/backends/qnn/aot/QnnWrappersAPI.cpp` around lines 483 - 515, in
`QnnAOTEnv::saveContext`, treat any binary size mismatch as a fatal error: after
calling `qnn_htp_func_symbols_.qnn_interface_.contextGetBinary` and obtaining
`binarySize` and `writtenSize`, check whether `writtenSize != binarySize` and,
if so, log an error with both sizes via `MLLM_ERROR` and return before opening
or writing the file; the success message and file write must only happen when
`writtenSize == binarySize`.
Please check Guidelines for Contributing.
Summary by CodeRabbit
New Features
Refactor