feat: qwen2 cpu model and connection with npu prefill #555
chenghuaWang merged 5 commits into UbiquitousLearning:main
Conversation
feat(MiniCPM4, MiniCPM-o, AvgPool1dOp): Add support for MiniCPM4 model, MiniCPM-o's audio modality inference capability, and AvgPool1dOp
Walkthrough

Replaces the macro entry point with a standard main; adds a mixed NPU-prefill + CPU-decode flow in the Qwen NPU example; introduces shared StaticCache plumbing across KVCache/ops/backends; adds CPU-side Qwen model headers for decoding; and refactors the QNN allocator/utils and Perfetto tracing categories.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant App as Application
    participant Example as examples/qwen_npu
    participant QNN as QNN Prefill
    participant StaticCache as Shared StaticCache
    participant CPU as CPU Decode
    participant Output as Console
    App->>Example: launch with use_npu_prefill flag
    Example->>QNN: load QNN model, run prefill(input tokens)
    QNN->>StaticCache: updateKVCache(layer, K, V)
    StaticCache-->>QNN: confirm storage
    alt use_npu_prefill enabled
        Example->>CPU: load CPU decode model + attach StaticCache
        loop per-step decode
            CPU->>StaticCache: read KV for attention
            CPU->>CPU: compute logits -> select token
            CPU->>StaticCache: updateKVCache(layer, K, V)
            CPU->>Output: print token
        end
    else pure CPU decode
        Example->>CPU: run CPU prefill+decode loop
    end
    Output-->>App: final EOS / completion
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Actionable comments posted: 5
🧹 Nitpick comments (11)
mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (2)
168-169: Consider precomputing the attention scale factor.

The `1.f / std::sqrt(static_cast<float>(head_dim_))` factor is computed on every forward pass. Since `head_dim_` is fixed after construction, consider storing it as a member variable initialized in the constructor.

```diff
 // In constructor after setting head_dim_:
+attn_scale_ = 1.f / std::sqrt(static_cast<float>(head_dim_));

 // In forward:
-auto attn_weights = nn::functional::matmul(query_states, key_states, /*trans_a=*/false, /*trans_b=*/true)
-                    * (1.f / std::sqrt(static_cast<float>(head_dim_)));
+auto attn_weights = nn::functional::matmul(query_states, key_states, /*trans_a=*/false, /*trans_b=*/true)
+                    * attn_scale_;
```
211-212: Consider making `self_attn_` and `mlp_` private.

Public member variables reduce encapsulation. The direct access at line 257 (`b.self_attn_.set_layer_idx(...)`) could be replaced with a `set_layer_idx` method on `QwenDecoderCPU` that forwards to `self_attn_`.

```diff
+  void set_layer_idx(int layer_idx) { self_attn_.set_layer_idx(layer_idx); }
+
+ private:
   QwenAttentionCPU self_attn_;
   QwenMLPCPU mlp_;
-
-  nn::KVCache& getKVCache() { return self_attn_.getKVCache(); }
```

mllm/nn/layers/KVCache.cpp (1)
34-43: Inconsistent cast strategy: `dynamic_cast` vs `static_pointer_cast`.

The new methods use `dynamic_cast` while the existing methods (lines 21, 24, 27, 31) use `static_pointer_cast`. This inconsistency is worth noting.

The `dynamic_cast` approach is safer for the new static cache methods, since not all backends may support static cache sharing, but consider whether the silent failure in `setStaticCache` (line 36) could hide misconfiguration issues. For debugging purposes, you might want to log a warning when the cast fails.

```diff
 void KVCache::setStaticCache(const nn::StaticCache* cache) {
   auto* kv_op = dynamic_cast<aops::KVCacheOp*>(impl()->getInstancedOp().get());
-  if (kv_op != nullptr) { kv_op->setStaticCache(cache); }
+  if (kv_op != nullptr) {
+    kv_op->setStaticCache(cache);
+  } else {
+    MLLM_DEBUG("setStaticCache: KVCacheOp cast failed, static cache not set");
+  }
 }
```

mllm/backends/qnn/QNNAllocator.cpp (1)
19-22: Consider replacing the variadic macro with a constexpr template function.

Static analysis flags `QNN_ALLOCATOR_VERBOSE` as a variadic macro. While functional, modern C++ prefers `constexpr` variadic template functions for type safety. This is a minor style improvement that can be deferred.

Example alternative:

```cpp
template <typename... Args>
constexpr void QNN_ALLOCATOR_VERBOSE_FN(Args&&... args) {
  if constexpr (kVerboseQnnAllocatorLogs) { MLLM_INFO(std::forward<Args>(args)...); }
}
```

mllm/core/aops/KVCacheOp.hpp (1)
50-52: Address the "Hack" comment and add `[[nodiscard]]`.

The `// Hack` comment indicates this is a temporary solution. Consider:
- Creating a tracking issue to properly design the static cache interface
- Adding a brief explanation of why this is considered a hack and what the ideal solution would be
Also, `getStaticCache()` should be marked `[[nodiscard]]` per static analysis.

```diff
-  // Hack
-  virtual void setStaticCache(const mllm::nn::StaticCache* cache) {}
-  virtual const mllm::nn::StaticCache* getStaticCache() const { return nullptr; }
+  // TODO: Temporary interface for KV cache sharing between CPU and QNN backends.
+  // Consider a cleaner abstraction (e.g., a CacheProvider interface) when stabilized.
+  virtual void setStaticCache(const mllm::nn::StaticCache* /*cache*/) {}
+  [[nodiscard]] virtual const mllm::nn::StaticCache* getStaticCache() const { return nullptr; }
```

mllm/backends/cpu/ops/KVCacheOp.cpp (2)
20-28: Consider extracting helper methods to reduce code duplication.

The `cache_to_use` and `layer_idx_to_use` pattern is repeated in `reshape()`, `forward()`, and `getCurrentSeqCnt()`. Consider extracting helper methods:

```cpp
 private:
  nn::StaticCache* getActiveCache() const {
    return shared_cache_ ? shared_cache_ : const_cast<nn::StaticCache*>(&cache_);
  }

  int32_t getLayerIdx() const {
    return shared_cache_ ? options_.layer_idx : 0;
  }
```

Then use these helpers consistently across all methods. This reduces the risk of inconsistency if the logic needs to change.
61-77: Add a comment explaining the asymmetry between `setCurrentSeqCnt` and `getCurrentSeqCnt`.

`setCurrentSeqCnt` sets the sequence count globally (for all layers) when using the shared cache, while `getCurrentSeqCnt` retrieves the count for this specific layer. This asymmetry appears intentional for the decode flow but could confuse future maintainers.

```diff
 void CPUKVCacheOp::setCurrentSeqCnt(int32_t seq) {
   if (shared_cache_) {
+    // Set globally for all layers in shared cache - typically called once
+    // after prefill to sync sequence position across all decoder blocks
     shared_cache_->setCurrentSeqCnt(seq);
   } else {
     cache_.setCurrentSeqCnt(seq);
   }
 }
```

examples/qwen_npu/main.cpp (4)
29-36: Consider making `use_npu_prefill` configurable.

The flag is hardcoded to `true`, making the pure CPU path (lines 116-161) unreachable in production. Consider using an environment variable or command-line argument for flexibility:

```diff
-  // Mixed inference: QNN prefill + CPU decode
-  const bool use_npu_prefill = true;
+  // Mixed inference: QNN prefill + CPU decode (configurable via env var)
+  const bool use_npu_prefill = std::getenv("MLLM_USE_NPU_PREFILL") != nullptr;
```
139-146: Consider extracting repeated argmax logic into a helper function.

The greedy token selection (argmax) is duplicated three times across the file. Extracting it would improve maintainability:

```cpp
// Helper function for greedy decoding
inline int argmax(const float* data, int64_t vocab_size) {
  int best_id = 0;
  float max_val = data[0];
  for (int64_t i = 1; i < vocab_size; ++i) {
    if (data[i] > max_val) {
      max_val = data[i];
      best_id = static_cast<int>(i);
    }
  }
  return best_id;
}
```

Similarly, the prefill output handling (lines 133-160 and 193-220) could be unified.
Also applies to: 199-206, 252-259
262-263: Potential output interleaving between `MLLM_INFO` and `std::wcout`.

Mixing narrow-character logging (`MLLM_INFO` likely uses stdout/stderr) with wide-character output (`std::wcout`) on the same conceptual line may cause interleaving or buffering issues. Consider flushing or using a consistent output mechanism:

```diff
-  MLLM_INFO("[Decode step {:3d}] time: {:.2f}ms, token: ", step, step_time_ms);
-  std::wcout << next_token_str << std::flush;
+  std::cout << fmt::format("[Decode step {:3d}] time: {:.2f}ms, token: ", step, step_time_ms);
+  std::wcout << next_token_str << std::endl;
```

Or consider consolidating all output to one stream.
275-281: Consider using a named constant for the milliseconds-to-seconds conversion.

The value `1000.0` appears multiple times for time unit conversion. A named constant would improve readability:

```diff
+  constexpr double kMsPerSecond = 1000.0;
+
   if (decode_count > 0) {
     double avg_time_ms = total_decode_time_ms / decode_count;
     MLLM_INFO(
         "Decode completed: {} tokens, total time: {:.2f}ms ({:.2f}s), avg time: {:.2f}ms/token, throughput: {:.2f} tokens/s",
-        decode_count, total_decode_time_ms, total_decode_time_ms / 1000.0, avg_time_ms,
-        decode_count / (total_decode_time_ms / 1000.0));
+        decode_count, total_decode_time_ms, total_decode_time_ms / kMsPerSecond, avg_time_ms,
+        decode_count / (total_decode_time_ms / kMsPerSecond));
   }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
- examples/qwen_npu/main.cpp (1 hunks)
- mllm/backends/cpu/ops/KVCacheOp.cpp (2 hunks)
- mllm/backends/cpu/ops/KVCacheOp.hpp (1 hunks)
- mllm/backends/qnn/QNNAllocator.cpp (18 hunks)
- mllm/backends/qnn/QNNAllocator.hpp (4 hunks)
- mllm/backends/qnn/QNNBackend.cpp (9 hunks)
- mllm/backends/qnn/QNNDispatcher.cpp (1 hunks)
- mllm/backends/qnn/QNNUtils.cpp (2 hunks)
- mllm/backends/qnn/QNNUtils.hpp (1 hunks)
- mllm/core/aops/KVCacheOp.hpp (2 hunks)
- mllm/engine/Perf.hpp (1 hunks)
- mllm/models/qwen_npu/modeling_qwen_npu.hpp (1 hunks)
- mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (1 hunks)
- mllm/nn/layers/KVCache.cpp (1 hunks)
- mllm/nn/layers/KVCache.hpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
{mllm,mllm-cli,pymllm}/**/*
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*: Files must not contain C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, or DEL 0x7F. Horizontal tab (0x09) and line feed (0x0A) are explicitly allowed.
All files must be encoded in UTF-8 without BOM.
Any violation of character set (Rule 1) or encoding (Rule 2) requirements must cause the review to fail.
No line may end with trailing whitespace.
Use Unix line endings (LF).
File and directory names must consist only of printable Unicode characters, excluding C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, and DEL 0x7F.
Only use acceptable file extensions: .c, .cc, .cpp, .cxx, .h, .hh, .hpp, .py, .pyi, .sh, .txt, .md, .yml, .yaml, .json, .toml.
Optional license headers, if present, must comply with character set rules (no C0/C1 control codes except tab and line feed).
Files:
mllm/backends/qnn/QNNUtils.hpp, mllm/nn/layers/KVCache.cpp, mllm/core/aops/KVCacheOp.hpp, mllm/engine/Perf.hpp, mllm/backends/cpu/ops/KVCacheOp.hpp, mllm/backends/cpu/ops/KVCacheOp.cpp, mllm/nn/layers/KVCache.hpp, mllm/backends/qnn/QNNDispatcher.cpp, mllm/backends/qnn/QNNUtils.cpp, mllm/backends/qnn/QNNAllocator.cpp, mllm/models/qwen_npu/modeling_qwen_npu.hpp, mllm/backends/qnn/QNNAllocator.hpp, mllm/backends/qnn/QNNBackend.cpp, mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}: TODO and FIXME comments must be written as 'TODO:' or 'FIXME:' followed by UTF-8 text that adheres to character set rules.
Encourage consistent coding style and patterns with the existing codebase.
Ensure code is portable across supported platforms (e.g., Linux, Windows) unless explicitly platform-specific.
Files:
mllm/backends/qnn/QNNUtils.hpp, mllm/nn/layers/KVCache.cpp, mllm/core/aops/KVCacheOp.hpp, mllm/engine/Perf.hpp, mllm/backends/cpu/ops/KVCacheOp.hpp, mllm/backends/cpu/ops/KVCacheOp.cpp, mllm/nn/layers/KVCache.hpp, mllm/backends/qnn/QNNDispatcher.cpp, mllm/backends/qnn/QNNUtils.cpp, mllm/backends/qnn/QNNAllocator.cpp, mllm/models/qwen_npu/modeling_qwen_npu.hpp, mllm/backends/qnn/QNNAllocator.hpp, mllm/backends/qnn/QNNBackend.cpp, mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}: Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.
Adhere to language-specific best practices and idioms (e.g., PEP 8 for Python, Google C++ Style Guide for C++).
Files:
mllm/backends/qnn/QNNUtils.hpp, mllm/nn/layers/KVCache.cpp, mllm/core/aops/KVCacheOp.hpp, mllm/engine/Perf.hpp, mllm/backends/cpu/ops/KVCacheOp.hpp, mllm/backends/cpu/ops/KVCacheOp.cpp, mllm/nn/layers/KVCache.hpp, mllm/backends/qnn/QNNDispatcher.cpp, mllm/backends/qnn/QNNUtils.cpp, mllm/backends/qnn/QNNAllocator.cpp, mllm/models/qwen_npu/modeling_qwen_npu.hpp, mllm/backends/qnn/QNNAllocator.hpp, mllm/backends/qnn/QNNBackend.cpp, mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}: Prioritize production-ready code quality by evaluating time and space complexity of algorithms and data structures, and suggest more efficient alternatives for operations with high complexity (e.g., O(n^2) or worse) when feasible.
Avoid unnecessary object creation in loops or hot paths.
Check for proper error handling and resource cleanup (e.g., using try-finally, context managers, or RAII).
Ensure functions that can fail return appropriate error codes or raise exceptions.
Validate inputs for public APIs and critical internal functions.
Add comments for complex algorithms or non-obvious logic.
Identify potential security issues (e.g., buffer overflows, injection risks, insecure temporary files) and recommend using secure alternatives (e.g., parameterized queries, secure random generators).
Suggest adding unit tests for untested complex logic or edge cases.
Ensure code is testable by avoiding global state and using dependency injection.
Flag overly complex functions (e.g., high cyclomatic complexity) and suggest breaking them down.
Use named constants instead of magic numbers.
Add appropriate logging (e.g., debug, info, warning, error) for significant events and errors, avoiding sensitive data exposure.
Files:
mllm/nn/layers/KVCache.cpp, mllm/backends/cpu/ops/KVCacheOp.cpp, mllm/backends/qnn/QNNDispatcher.cpp, mllm/backends/qnn/QNNUtils.cpp, mllm/backends/qnn/QNNAllocator.cpp, mllm/backends/qnn/QNNBackend.cpp
🧬 Code graph analysis (8)
mllm/nn/layers/KVCache.cpp (3)
- mllm/backends/cpu/ops/KVCacheOp.hpp (2): cache (26-26), cache (26-26)
- mllm/core/aops/KVCacheOp.hpp (2): cache (51-51), cache (51-51)
- mllm/nn/layers/KVCache.hpp (1): cache (30-30)

mllm/core/aops/KVCacheOp.hpp (2)
- mllm/backends/cpu/ops/KVCacheOp.hpp (2): cache (26-26), cache (26-26)
- mllm/nn/layers/KVCache.hpp (1): cache (30-30)

mllm/backends/cpu/ops/KVCacheOp.hpp (2)
- mllm/core/aops/KVCacheOp.hpp (2): cache (51-51), cache (51-51)
- mllm/nn/layers/KVCache.hpp (1): cache (30-30)

examples/qwen_npu/main.cpp (6)
- mllm/utils/Argparse.hpp (1): Argparse (135-135)
- mllm/engine/Perf.hpp (3): start (52-52), stop (56-56), saveReport (58-58)
- mllm/models/qwen_npu/tokenization_qwen.hpp (2): QwenTokenizer (153-162), QwenTokenizer (153-153)
- mllm/models/qwen_npu/configuration_qwen_npu.hpp (3): QwenNPUConfig (11-11), QwenNPUConfig (13-34), QwenNPUConfig (13-13)
- mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (5): QwenForCausalLMCPU (288-288), QwenForCausalLMCPU (290-296), QwenForCausalLMCPU (290-290), layer_idx (135-138), layer_idx (135-135)
- mllm/nn/layers/KVCache.hpp (1): layer_idx (20-20)

mllm/nn/layers/KVCache.hpp (3)
- mllm/backends/cpu/ops/KVCacheOp.hpp (2): cache (26-26), cache (26-26)
- mllm/core/aops/KVCacheOp.hpp (2): cache (51-51), cache (51-51)
- mllm/nn/lmcache/StaticCache.hpp (6): nodiscard (16-16), nodiscard (22-22), nodiscard (50-50), nodiscard (58-58), nodiscard (60-60), nodiscard (103-103)

mllm/backends/qnn/QNNAllocator.cpp (1)
- mllm/backends/qnn/QNNAllocator.hpp (21): ptr (52-55, 52-52, 56-56, 56-56, 74-74, 84-84, 85-85, 136-136, 154-154), storage (43-43, 45-45, 45-45, 47-47, 49-49, 49-49, 58-63, 58-58, 65-65, 65-65, 72-72), tensor_id (146-147)

mllm/models/qwen_npu/modeling_qwen_npu.hpp (1)
- mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (1): decode_blocks_ (278-278)

mllm/backends/qnn/QNNBackend.cpp (2)
- mllm/backends/qnn/QNNBackend.hpp (7): graphName (93-93), graphName (95-99), graphName (101-101), graphName (103-103), graphName (106-107), graphName (109-110), graphName (113-113)
- mllm/backends/qnn/QNNModel.hpp (2): expectedOrder (80-80), expectedOrder (80-80)
🪛 Clang (14.0.6)
mllm/core/aops/KVCacheOp.hpp
[error] 52-52: function 'getStaticCache' should be marked [[nodiscard]]
(modernize-use-nodiscard,-warnings-as-errors)
mllm/backends/cpu/ops/KVCacheOp.hpp
[error] 27-27: function 'getStaticCache' should be marked [[nodiscard]]
(modernize-use-nodiscard,-warnings-as-errors)
examples/qwen_npu/main.cpp
[error] 38-38: variable 'config_path' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 39-39: variable 'npu_model_path' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 40-40: variable 'cpu_decode_model_path' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 44-44: variable 'file_version' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 50-50: variable 'prefill_seq_len' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 108-108: variable 'past' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 109-109: variable 'args' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 112-112: variable 'enable_timing' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 121-121: variable 'prefill_args' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 137-137: variable 'vocab_size' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 138-138: variable 'logits_data' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 140-140: variable 'max_logit' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 181-181: variable 'prefill_args' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 197-197: variable 'vocab_size' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 198-198: variable 'logits_data' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 200-200: variable 'max_logit' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 235-235: variable 'step_time_ms' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 240-240: variable 'logits' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 241-241: variable 'logits_f' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 249-249: variable 'vocab_size' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 250-250: variable 'data' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 253-253: variable 'max_logit' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 279-279: 1000.0 is a magic number; consider replacing it with a named constant
(cppcoreguidelines-avoid-magic-numbers,readability-magic-numbers,-warnings-as-errors)
[error] 280-280: 1000.0 is a magic number; consider replacing it with a named constant
(cppcoreguidelines-avoid-magic-numbers,readability-magic-numbers,-warnings-as-errors)
mllm/backends/qnn/QNNAllocator.cpp
[error] 19-19: variadic macro 'QNN_ALLOCATOR_VERBOSE' used; consider using a 'constexpr' variadic template function
(cppcoreguidelines-macro-usage,-warnings-as-errors)
mllm/backends/qnn/QNNAllocator.hpp
[error] 120-120: constructor does not initialize these fields: tensor_name
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp
[error] 3-3: 'cmath' file not found
(clang-diagnostic-error)
[error] 14-14: do not use unnamed namespaces in header files
(google-build-namespaces,-warnings-as-errors)
[error] 19-19: 2 adjacent parameters of 'makeRoPEInvFreq' of convertible types are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 28-28: 3 adjacent parameters of 'makeRotaryPosEmbedding' of convertible types are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 82-82: constructor does not initialize these fields: gate_proj_, up_proj_, down_proj_, silu_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 86-86: constructor does not initialize these fields: gate_proj_, up_proj_, down_proj_, silu_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 86-86: 2 adjacent parameters of 'QwenMLPCPU' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 94-94: variable name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 96-96: variable name 'y' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 110-110: constructor does not initialize these fields: q_proj_, k_proj_, v_proj_, o_proj_, q_rope_, k_rope_, mask_, softmax_, kv_cache_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 114-114: constructor does not initialize these fields: q_proj_, k_proj_, v_proj_, o_proj_, q_rope_, k_rope_, mask_, softmax_, kv_cache_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 114-114: 2 adjacent parameters of 'QwenAttentionCPU' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 146-146: variable 'B' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 146-146: variable name 'B' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 146-146: invalid case style for variable 'B'
(readability-identifier-naming,-warnings-as-errors)
[error] 147-147: variable 'S' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 147-147: variable name 'S' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 147-147: invalid case style for variable 'S'
(readability-identifier-naming,-warnings-as-errors)
[error] 209-209: constructor does not initialize these fields: input_layernorm_, post_attention_layernorm_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 211-211: member variable 'self_attn_' has public visibility
(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)
[error] 212-212: member variable 'mlp_' has public visibility
(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)
[error] 218-218: constructor does not initialize these fields: input_layernorm_, post_attention_layernorm_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 218-218: 2 adjacent parameters of 'QwenDecoderCPU' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 250-250: constructor does not initialize these fields: decode_blocks_, norm_, embedding_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 254-254: constructor does not initialize these fields: decode_blocks_, norm_, embedding_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 254-254: 2 adjacent parameters of 'QwenTextCPU' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 268-268: variable name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 286-286: constructor does not initialize these fields: cfg_, lm_head_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 290-290: constructor does not initialize these fields: cfg_, lm_head_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 290-290: 2 adjacent parameters of 'QwenForCausalLMCPU' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 298-298: 2 adjacent parameters of 'forward' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 300-300: variable 'B' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 300-300: variable name 'B' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 300-300: invalid case style for variable 'B'
(readability-identifier-naming,-warnings-as-errors)
[error] 301-301: variable 'S' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 301-301: variable name 'S' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 301-301: invalid case style for variable 'S'
(readability-identifier-naming,-warnings-as-errors)
[error] 305-305: variable 'position_ids' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 316-316: loop variable name 'b' is too short, expected at least 2 characters
(readability-identifier-length,-warnings-as-errors)
[error] 317-317: loop variable name 's' is too short, expected at least 2 characters
(readability-identifier-length,-warnings-as-errors)
🔇 Additional comments (17)
mllm/backends/qnn/QNNUtils.hpp (1)
268-268: Trailing newline addition aligns with best practices.

The change properly adds a final newline to the file, which adheres to POSIX and C++ style guidelines (including Google's C++ Style Guide). No functional changes are introduced.
mllm/engine/Perf.hpp (1)
20-21: LGTM!

The new `mllm.qnn.execute` Perfetto category is correctly added and aligns with the trace event usage in `QNNDispatcher.cpp`. The macro syntax is properly formatted.

mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (3)
82-107: LGTM!

The MLP block correctly implements the gated activation pattern (gate → SiLU → element-wise multiply with up projection → down projection). The module registration pattern is consistent with the codebase.
250-284: LGTM!

The text model implementation correctly registers the RoPE inverse frequency as a buffer, iterates through decoder blocks with layer index assignment, and applies final normalization. The structured binding with `enumerate` is clean.
303-329: Unfortunately, I'm unable to clone and execute shell commands against the repository due to infrastructure limitations. However, I can analyze the concern based on the provided code context and general programming principles.Unable to verify caller validation of
seq_lenparameter; recommend manual inspection of callers and addition of defensive guard.The concern about
real_seqpotentially being zero or negative is theoretically valid:
- Line 303 assigns `real_seq` either from `args.at("seq_len")` or defaults to `S`
- Line 329 uses `real_seq - 1` as a slice index, which would be -1 if `real_seq == 0`
- Negative indices in tensor slicing typically cause undefined behavior or assertion failures
What cannot be verified without repository access:
- Whether callers guarantee `seq_len >= 1`
- Whether `S` (sequence length) is always >= 1
- If there are existing guards elsewhere in the call chain
- The exact slicing semantics of the `Tensor` class for negative indices

The suggested assertion guard remains valid defensive programming:
```diff
 auto real_seq = args.count("seq_len") ? args.at("seq_len").get<int>() : S;
+MLLM_ASSERT(real_seq >= 1, "seq_len must be at least 1");
```

mllm/backends/qnn/QNNDispatcher.cpp (1)
62-67: LGTM!

The trace event name now correctly matches the Perfetto category `"mllm.qnn.execute"` defined in `Perf.hpp`. The trailing dot removal fixes the naming inconsistency.

mllm/backends/qnn/QNNAllocator.hpp (1)
120-129: LGTM!

The addition of `hasLastRegistrationInfo_` provides explicit validity tracking for the `LastRegistrationInfo` cache, which is cleaner than relying on checking individual fields for defaults. The struct members have appropriate default initializers. The static analysis warning about `tensor_name` is a false positive, since `std::string` default-constructs to an empty string.

mllm/nn/layers/KVCache.hpp (2)
8-8: LGTM!

The include for `StaticCache.hpp` is necessary for the new interface methods.

30-32: LGTM!

The `setStaticCache` and `getStaticCache` methods provide a clean interface for KV cache sharing between CPU and NPU execution phases. The `[[nodiscard]]` attribute on the getter and const correctness are appropriate. This aligns with the corresponding virtual methods in `aops::KVCacheOp.hpp`.

mllm/backends/qnn/QNNUtils.cpp (1)
437-461: LGTM!

The `resetAlloc()` simplification is clean. The `initFromQnnTensor` implementation correctly uses a shallow copy followed by pointer rebinding to internal storage (`name_` and `dimensions_`), which is a safe pattern that avoids unnecessary allocations while maintaining ownership of the critical fields.

mllm/backends/qnn/QNNBackend.cpp (2)
533-536: LGTM on formatting changes.

The formatting adjustments to log messages and code layout improve readability without changing functionality.
209-214: I'm unable to clone the repository to perform the verification. However, based on the code snippet provided in the review comment, I can analyze what's visible:

The review comment identifies a legitimate concern: the minor version check is commented out on line 210, which does relax the QNN interface selection criteria. Without access to the repository, I cannot verify:
- Whether there are compensating version checks elsewhere in the code
- The rationale or documentation for this change
- Whether this is intentional and documented in commit messages
- How this impacts downstream API usage
Given these limitations, here's the rewritten review comment:
Verify the intentional relaxation of QNN API minor version check and confirm compensating safeguards.
The minor version check is now commented out (line 210), allowing the backend to accept any QNN interface provider with a matching major version, regardless of minor version. This could allow using an older QNN runtime that lacks APIs the code depends on.
Before merging, confirm:
- Whether compensating version checks exist elsewhere in the codebase
- If this change is intentional and documented (commit message, design doc, or code comment)
- Whether minimum supported QNN version requirements are documented
- If runtime API compatibility is validated through alternative mechanisms
mllm/backends/cpu/ops/KVCacheOp.cpp (1)
53-59: Verify: `clearCache()` clears the entire shared cache when using shared mode.

When `shared_cache_` is set, `clearCache()` clears the entire shared cache, affecting all layers that share it. Is this the intended behavior?

If individual layers should only clear their own portion, consider:

```diff
 void CPUKVCacheOp::clearCache() {
   if (shared_cache_) {
-    shared_cache_->clearCache();
+    // Only clear this layer's cache, not the entire shared cache
+    shared_cache_->clearLayerCache(options_.layer_idx);
   } else {
     cache_.clearCache();
   }
 }
```

Otherwise, add a comment clarifying that clearing is intentionally global when using a shared cache.
mllm/models/qwen_npu/modeling_qwen_npu.hpp (2)
440-441: LGTM on the `decode_blocks()` accessor.

This accessor aligns with the same pattern in `modeling_qwen_npu_cpu.hpp` (line 277), enabling consistent access to decoder blocks for KV cache sharing in the mixed NPU prefill + CPU decode workflow.
446-451: LGTM onsetKVCacheSeqLenwith improved documentation.The rename from
setKVCacheSeqCnttosetKVCacheSeqLenuses clearer terminology, and the added comments explain the purpose well—synchronizing logical sequence length across layers when reusing a fixed-size KV buffer.examples/qwen_npu/main.cpp (2)
23-27: LGTM on entry point structure and Perfetto integration.

The transition from macro-based entry to standard `main()` is clean. Perfetto instrumentation is properly guarded with `#ifdef` and follows the correct start/stop/save pattern.

Also applies to: 282-288
178-178: Verify `clearKVCache()` behavior with a shared StaticCache. (The automated verification scripts for this item, which searched for the `clearKVCache` implementation and its call site in `main.cpp`, did not complete.)
```cpp
if (!use_npu_prefill) {
  MLLM_INFO("Starting CPU prefill...");
  auto prefill_start = std::chrono::high_resolution_clock::now();

  prefill_seq_len = static_cast<int32_t>(raw_input_tokens.shape()[1]);
  mllm::models::ARGenerationArgs prefill_args;
  prefill_args["seq_len"] = prefill_seq_len;
  auto prefill_output = cpu_decode_model.forward(past, prefill_args);

  auto prefill_end = std::chrono::high_resolution_clock::now();
  auto prefill_duration = std::chrono::duration_cast<std::chrono::milliseconds>(prefill_end - prefill_start);
  MLLM_INFO("CPU prefill completed: seq_len={}, time={:.2f}s", prefill_seq_len, prefill_duration.count() / 1000.0);
```
Missing KV cache setup for pure CPU inference path.
When use_npu_prefill is false, the CPU prefill path executes, but the KV cache for the CPU model is never configured. The shared_kv_cache remains nullptr and cpu_decode_model's KVCache layers are never linked to a StaticCache (that setup only happens in the use_npu_prefill block at lines 83-91).
This will likely cause incorrect decode behavior or runtime errors when the decode loop tries to use uninitialized KV state.
+ // Setup KV cache for CPU-only inference
+ if (!use_npu_prefill) {
+ shared_kv_cache = std::make_unique<mllm::nn::StaticCache>(
+ cpu_cfg.max_cache_length, cpu_cfg.num_hidden_layers, cpu_cfg.num_attention_heads, cpu_cfg.num_key_value_heads,
+ cpu_cfg.head_dim, mllm::kFloat32, mllm::kFloat32, mllm::kCPU, false);
+
+ auto& cpu_decode_blocks = cpu_decode_model.text_model().decode_blocks();
+ for (int32_t layer_idx = 0; layer_idx < cpu_cfg.num_hidden_layers; ++layer_idx) {
+ auto& cpu_block = cpu_decode_blocks.list()[layer_idx];
+ auto& cpu_kv_cache = cpu_block.getKVCache();
+ cpu_kv_cache.setStaticCache(shared_kv_cache.get());
+ cpu_kv_cache.setLayerIndex(layer_idx);
+ }
+ MLLM_INFO("Configured {} CPU KVCache layers for pure CPU inference", cpu_cfg.num_hidden_layers);
+ }
+
// Prefill phase
if (!use_npu_prefill) {Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Clang (14.0.6)
[error] 121-121: variable 'prefill_args' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
🤖 Prompt for AI Agents
In examples/qwen_npu/main.cpp around lines 116 to 127, the CPU-only prefill path
doesn't initialize or attach a KV cache: shared_kv_cache stays nullptr and
cpu_decode_model's KVCache layers are never linked to a StaticCache (the setup
is only done in the use_npu_prefill branch around lines 83-91). Fix by mirroring
the NPU branch: allocate a StaticCache (or equivalent cache object) for the CPU
path, set shared_kv_cache to it, and attach/link all cpu_decode_model KVCache
layers to that StaticCache (same initialization and linking logic used in lines
83-91) before calling cpu_decode_model.forward so the decode loop has a valid KV
state.
Actionable comments posted: 0
♻️ Duplicate comments (1)
examples/qwen_npu/main.cpp (1)
113-124: Missing KV cache setup for pure CPU inference path.

When `use_npu_prefill` is `false`, the shared KV cache is never initialized and the CPU decode model's KVCache layers are never linked to a StaticCache. The decode loop will use uninitialized KV state, causing incorrect behavior or crashes.

Add KV cache initialization before the CPU prefill block:

```diff
+  // Setup KV cache for CPU-only inference
+  if (!use_npu_prefill) {
+    shared_kv_cache = std::make_unique<mllm::nn::StaticCache>(
+        cpu_cfg.max_cache_length, cpu_cfg.num_hidden_layers, cpu_cfg.num_attention_heads, cpu_cfg.num_key_value_heads,
+        cpu_cfg.head_dim, mllm::kFloat32, mllm::kFloat32, mllm::kCPU, false);
+
+    auto& cpu_decode_blocks = cpu_decode_model.text_model().decode_blocks();
+    for (int32_t layer_idx = 0; layer_idx < cpu_cfg.num_hidden_layers; ++layer_idx) {
+      auto& cpu_block = cpu_decode_blocks.list()[layer_idx];
+      auto& cpu_kv_cache = cpu_block.getKVCache();
+      cpu_kv_cache.setStaticCache(shared_kv_cache.get());
+      cpu_kv_cache.setLayerIndex(layer_idx);
+    }
+    MLLM_INFO("Configured {} CPU KVCache layers for pure CPU inference", cpu_cfg.num_hidden_layers);
+  }
+
   // Prefill phase
   if (!use_npu_prefill) {
```
🧹 Nitpick comments (5)
examples/qwen_npu/main.cpp (2)
38-55: Consider making model paths configurable via command-line arguments.

The `argc` and `argv` parameters are available but unused. For better usability, consider using `Argparse` (already imported via `mllm::Argparse`) to make the model paths configurable instead of hardcoding them.
126-157: Consider extracting duplicated logits processing logic.

The prefill output processing (lines 126-157 for CPU, lines 186-217 for NPU) is nearly identical. Consider extracting this into a helper function to reduce duplication:

```cpp
auto processFirstToken(const mllm::Tensor& prefill_logits, const mllm::Tensor& raw_input_tokens, int prefill_seq_len, mllm::models::qwen_npu::QwenTokenizer& tokenizer) -> std::pair<mllm::Tensor, mllm::Tensor>;
```

mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (2)
209-214: Consider making `self_attn_` and `mlp_` private with accessors.

The public member variables expose implementation details. While this enables direct access from `main.cpp` for KV cache configuration, consider using accessor methods:

```diff
+  QwenAttentionCPU& self_attn() { return self_attn_; }
+
+ private:
   QwenAttentionCPU self_attn_;
   QwenMLPCPU mlp_;
```

Then update `main.cpp` to use `block.self_attn().getKVCache()` instead of `block.getKVCache()`.
1-13: Consider adding documentation for public API classes.

Per coding guidelines, public APIs should have clear docstrings explaining purpose, parameters, and usage. Consider adding brief documentation for the main classes (`QwenMLPCPU`, `QwenAttentionCPU`, `QwenDecoderCPU`, `QwenTextCPU`, `QwenForCausalLMCPU`) to explain their roles in the CPU decode pipeline.

```cpp
/// @brief CPU-side Qwen2 model for decode phase in mixed NPU/CPU inference.
///
/// This module provides a CPU implementation of the Qwen2 architecture that can
/// share KV cache state with the NPU prefill model for hybrid inference.
class QwenForCausalLMCPU : public nn::Module, public ARGeneration { ... };
```

mllm/nn/lmcache/StaticCache.hpp (1)
24-27: Confirm `Tensor()` vs `Tensor::nil()` semantics and clean up commented-out code.

Switching the default `updateKVCache` implementation to return `{Tensor(), Tensor()}` instead of `{Tensor::nil(), Tensor::nil()}` is fine only if a default-constructed `Tensor` is semantically equivalent to the previous nil sentinel for all existing call sites (e.g. checks like "is empty / invalid cache entry"). If any code explicitly distinguishes `nil` from a default `Tensor`, this may subtly change behavior.

Also, the old return statement is now just commented-out dead code; it should be removed to avoid confusion.

Consider:

- Verifying that `Tensor::nil()` and `Tensor()` are interchangeable in all usages that consume `updateKVCache`'s result.
- Removing the commented-out line once the behavioral equivalence is confirmed.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- examples/qwen_npu/main.cpp (1 hunks)
- mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (1 hunks)
- mllm/nn/lmcache/StaticCache.hpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
{mllm,mllm-cli,pymllm}/**/*
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*: Files must not contain C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, or DEL 0x7F. Horizontal tab (0x09) and line feed (0x0A) are explicitly allowed.
All files must be encoded in UTF-8 without BOM.
Any violation of character set (Rule 1) or encoding (Rule 2) requirements must cause the review to fail.
No line may end with trailing whitespace.
Use Unix line endings (LF).
File and directory names must consist only of printable Unicode characters, excluding C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, and DEL 0x7F.
Only use acceptable file extensions: .c, .cc, .cpp, .cxx, .h, .hh, .hpp, .py, .pyi, .sh, .txt, .md, .yml, .yaml, .json, .toml.
Optional license headers, if present, must comply with character set rules (no C0/C1 control codes except tab and line feed).
Files:
- mllm/nn/lmcache/StaticCache.hpp
- mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}: TODO and FIXME comments must be written as 'TODO:' or 'FIXME:' followed by UTF-8 text that adheres to character set rules.
Encourage consistent coding style and patterns with the existing codebase.
Ensure code is portable across supported platforms (e.g., Linux, Windows) unless explicitly platform-specific.
Files:
- mllm/nn/lmcache/StaticCache.hpp
- mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}: Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.
Adhere to language-specific best practices and idioms (e.g., PEP 8 for Python, Google C++ Style Guide for C++).
Files:
- mllm/nn/lmcache/StaticCache.hpp
- mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp
🧬 Code graph analysis (3)
examples/qwen_npu/main.cpp (9)
mllm/utils/Argparse.hpp (1)
Argparse(135-135)mllm/engine/Perf.hpp (3)
start(52-52)stop(56-56)saveReport(58-58)mllm/models/qwen_npu/tokenization_qwen.hpp (2)
QwenTokenizer(153-162)QwenTokenizer(153-153)mllm/models/qwen_npu/configuration_qwen_npu.hpp (3)
QwenNPUConfig(11-11)QwenNPUConfig(13-34)QwenNPUConfig(13-13)mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (5)
QwenForCausalLMCPU(288-288)QwenForCausalLMCPU(290-296)QwenForCausalLMCPU(290-290)layer_idx(135-138)layer_idx(135-135)mllm/nn/lmcache/StaticCache.hpp (13)
layer_idx(16-16)layer_idx(24-27)layer_idx(24-24)layer_idx(45-47)layer_idx(49-49)layer_idx(53-53)layer_idx(55-55)layer_idx(57-57)layer_idx(59-59)layer_idx(61-61)layer_idx(98-98)layer_idx(100-102)layer_idx(106-106)mllm/core/aops/KVCacheOp.hpp (1)
layer_idx(38-38)mllm/nn/Module.hpp (2)
args(142-158)args(142-142)mllm/models/ARGeneration.cpp (1)
step(64-64)
mllm/nn/lmcache/StaticCache.hpp (2)
mllm/core/Tensor.cpp (1)
Tensor(68-68)mllm/core/Tensor.hpp (2)
Tensor(63-63)Tensor(166-166)
mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (2)
mllm/models/qwen_npu/modeling_qwen_npu.hpp (6)
seq_len(449-451)seq_len(449-449)decode_blocks_(440-440)block(442-444)makeRotaryPosEmbedding(26-73)makeRotaryPosEmbedding(26-27)mllm/nn/layers/KVCache.hpp (1)
layer_idx(20-20)
🪛 Clang (14.0.6)
examples/qwen_npu/main.cpp
[error] 38-38: variable 'config_path' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 39-39: variable 'npu_model_path' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 40-40: variable 'cpu_decode_model_path' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 44-44: variable 'file_version' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 50-50: variable 'prefill_seq_len' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 105-105: variable 'eos_token_id' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 108-108: variable 'past' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 109-109: variable 'args' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 118-118: variable 'prefill_args' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 134-134: variable 'vocab_size' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 135-135: variable 'logits_data' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 137-137: variable 'max_logit' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 178-178: variable 'prefill_args' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 194-194: variable 'vocab_size' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 195-195: variable 'logits_data' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 197-197: variable 'max_logit' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 232-232: variable 'step_time_ms' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 237-237: variable 'logits' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 238-238: variable 'logits_f' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 246-246: variable 'vocab_size' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 247-247: variable 'data' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 249-249: variable 'next_id' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 250-250: variable 'max_logit' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 276-276: 1000.0 is a magic number; consider replacing it with a named constant
(cppcoreguidelines-avoid-magic-numbers,readability-magic-numbers,-warnings-as-errors)
[error] 277-277: 1000.0 is a magic number; consider replacing it with a named constant
(cppcoreguidelines-avoid-magic-numbers,readability-magic-numbers,-warnings-as-errors)
mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp
[error] 3-3: 'cmath' file not found
(clang-diagnostic-error)
[error] 19-19: 2 adjacent parameters of 'makeRoPEInvFreq' of convertible types are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 28-28: 3 adjacent parameters of 'makeRotaryPosEmbedding' of convertible types are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 82-82: constructor does not initialize these fields: gate_proj_, up_proj_, down_proj_, silu_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 86-86: constructor does not initialize these fields: gate_proj_, up_proj_, down_proj_, silu_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 86-86: 2 adjacent parameters of 'QwenMLPCPU' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 94-94: variable name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 96-96: variable name 'y' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 110-110: constructor does not initialize these fields: q_proj_, k_proj_, v_proj_, o_proj_, q_rope_, k_rope_, mask_, softmax_, kv_cache_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 114-114: constructor does not initialize these fields: q_proj_, k_proj_, v_proj_, o_proj_, q_rope_, k_rope_, mask_, softmax_, kv_cache_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 114-114: 2 adjacent parameters of 'QwenAttentionCPU' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 146-146: variable 'B' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 146-146: variable name 'B' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 146-146: invalid case style for variable 'B'
(readability-identifier-naming,-warnings-as-errors)
[error] 147-147: variable 'S' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 147-147: variable name 'S' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 147-147: invalid case style for variable 'S'
(readability-identifier-naming,-warnings-as-errors)
[error] 209-209: constructor does not initialize these fields: input_layernorm_, post_attention_layernorm_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 211-211: member variable 'self_attn_' has public visibility
(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)
[error] 212-212: member variable 'mlp_' has public visibility
(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)
[error] 218-218: constructor does not initialize these fields: input_layernorm_, post_attention_layernorm_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 218-218: 2 adjacent parameters of 'QwenDecoderCPU' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 250-250: constructor does not initialize these fields: decode_blocks_, norm_, embedding_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 254-254: constructor does not initialize these fields: decode_blocks_, norm_, embedding_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 254-254: 2 adjacent parameters of 'QwenTextCPU' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 268-268: variable name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 286-286: constructor does not initialize these fields: cfg_, lm_head_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 290-290: constructor does not initialize these fields: cfg_, lm_head_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 290-290: 2 adjacent parameters of 'QwenForCausalLMCPU' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 298-298: 2 adjacent parameters of 'forward' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 300-300: variable 'B' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 300-300: variable name 'B' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 300-300: invalid case style for variable 'B'
(readability-identifier-naming,-warnings-as-errors)
[error] 301-301: variable 'S' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 301-301: variable name 'S' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 301-301: invalid case style for variable 'S'
(readability-identifier-naming,-warnings-as-errors)
[error] 305-305: variable 'position_ids' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 316-316: loop variable name 'b' is too short, expected at least 2 characters
(readability-identifier-length,-warnings-as-errors)
[error] 317-317: loop variable name 's' is too short, expected at least 2 characters
(readability-identifier-length,-warnings-as-errors)
🔇 Additional comments (8)
examples/qwen_npu/main.cpp (2)
23-27: Entry point and initialization look good.

Clean transition from macro-based entry to standard `main()`. Perfetto tracing is properly conditionally compiled.

227-270: Decode loop implementation looks good.

The decode loop correctly handles per-step timing, greedy token selection, EOS termination, and live output. The throughput statistics at the end provide useful benchmarking information.
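The per-step selection described above can be sketched as follows (a minimal reimplementation for illustration; function names and the logits layout are assumptions, not mllm's API): greedy argmax over each logits row, stopping when the EOS token is selected.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Greedy selection: index of the largest logit.
inline int64_t greedyPick(const std::vector<float>& logits) {
  return static_cast<int64_t>(
      std::max_element(logits.begin(), logits.end()) - logits.begin());
}

// Decode loop skeleton: one logits row per step, EOS terminates generation
// without emitting the EOS token itself.
inline std::vector<int64_t> decodeGreedy(
    const std::vector<std::vector<float>>& per_step_logits, int64_t eos_id) {
  std::vector<int64_t> out;
  for (const auto& logits : per_step_logits) {
    int64_t next = greedyPick(logits);
    if (next == eos_id) break;  // EOS termination
    out.push_back(next);        // in the real loop: detokenize and print live
  }
  return out;
}
```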
mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (5)
14-26: Named namespace properly addresses the previous review concern.

The utility functions are now in a named `mllm::models::qwen_npu::utils` namespace instead of an unnamed namespace, which resolves the ODR violation concern from the previous review.

19-26: RoPE inverse frequency calculation is correct.

The implementation correctly computes `1.0 / (rope_theta ^ (2i / dim))` for rotary position embeddings.
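Spelled out, the cited formula is `inv_freq[i] = 1 / rope_theta^(2i/dim)` for `i = 0 .. dim/2 - 1`. A standalone reimplementation for reference (not the mllm code under review):

```cpp
#include <cmath>
#include <vector>

// RoPE inverse frequencies: one per pair of embedding dimensions.
inline std::vector<double> makeRoPEInvFreqSketch(int dim, double rope_theta) {
  std::vector<double> inv_freq;
  inv_freq.reserve(dim / 2);
  for (int i = 0; i < dim / 2; ++i) {
    // inv_freq[i] = 1 / theta^(2i/dim); i = 0 gives 1.0, later entries decay.
    inv_freq.push_back(1.0 / std::pow(rope_theta, (2.0 * i) / dim));
  }
  return inv_freq;
}
```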
82-107: QwenMLPCPU correctly implements a SwiGLU-style MLP.

The gate-up-down projection pattern with SiLU activation follows the Qwen2 architecture correctly.
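The gate-up-down pattern amounts to `y = down(silu(gate(x)) * up(x))`. A scalar sketch of the data flow (weights here are placeholder scalars, not the model's projection matrices):

```cpp
#include <cmath>

// SiLU activation: x * sigmoid(x).
inline double silu(double x) { return x / (1.0 + std::exp(-x)); }

// Scalar stand-in for the SwiGLU-style MLP: gate and up projections run in
// parallel, their elementwise product feeds the down projection.
inline double swigluMLPSketch(double x, double w_gate, double w_up, double w_down) {
  double gated = silu(x * w_gate);  // gate projection followed by SiLU
  double up    = x * w_up;          // up projection
  return (gated * up) * w_down;     // elementwise product, then down projection
}
```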
140-185: Attention forward implementation is correct and follows HuggingFace Qwen2.

The attention correctly:
- Projects to Q/K/V and reshapes to multi-head format
- Applies RoPE to Q and K
- Uses KV cache for cached key/value states
- Applies causal masking and scaled dot-product attention
- Computes softmax in float32 for numerical stability (matching HF)
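The masking, scaling, and float32-softmax steps in that list can be sketched for a single head (a toy reference implementation, not mllm's kernels; shapes here are `[seq][head_dim]`):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

using Mat = std::vector<std::vector<float>>;

// Attention weight matrix for one head: scaled Q·K^T scores, causal mask
// (position i attends only to j <= i), softmax computed in float32.
inline Mat causalAttentionWeights(const Mat& q, const Mat& k, int head_dim) {
  const size_t S = q.size();
  const float scale = 1.0f / std::sqrt(static_cast<float>(head_dim));
  Mat w(S, std::vector<float>(S, 0.0f));  // masked entries stay 0
  for (size_t i = 0; i < S; ++i) {
    std::vector<float> scores(i + 1);
    float maxv = -1e30f;
    for (size_t j = 0; j <= i; ++j) {
      float s = 0.0f;
      for (size_t d = 0; d < q[i].size(); ++d) s += q[i][d] * k[j][d];
      scores[j] = s * scale;
      maxv = std::max(maxv, scores[j]);
    }
    float denom = 0.0f;  // max-subtracted softmax for numerical stability
    for (size_t j = 0; j <= i; ++j) denom += std::exp(scores[j] - maxv);
    for (size_t j = 0; j <= i; ++j) w[i][j] = std::exp(scores[j] - maxv) / denom;
  }
  return w;
}
```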
298-337: Forward implementation correctly handles both prefill and decode phases.

The position ID management properly handles:

- Initial prefill: generates sequential position IDs
- Decode: increments from the last position in previous output
- The `real_seq` parameter allows clipping for proper last-token logit extraction

mllm/nn/lmcache/StaticCache.hpp (1)
13-28: Verify the reported missing virtual destructor in `AbstractStaticCache`.

(The automated verification scripts for this item, which attempted to locate `AbstractStaticCache`, find classes deriving from it, check for polymorphic deletion through base pointers, and validate file encoding and formatting for `mllm/nn/lmcache/StaticCache.hpp`, did not complete.)
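For context on why a virtual destructor matters here: deleting a derived object through a base pointer is undefined behavior unless the base class destructor is virtual. A minimal illustration with hypothetical names (not mllm's actual classes):

```cpp
#include <memory>

// Base with a virtual destructor: base-pointer delete is now well-defined
// and runs the derived destructor.
struct AbstractCacheBase {
  virtual ~AbstractCacheBase() = default;
};

struct DerivedCache : AbstractCacheBase {
  bool* destroyed_flag;  // lets a caller observe that the dtor actually ran
  explicit DerivedCache(bool* flag) : destroyed_flag(flag) {}
  ~DerivedCache() override { *destroyed_flag = true; }
};
```

Without `virtual` on the base destructor, destroying a `std::unique_ptr<AbstractCacheBase>` holding a `DerivedCache` would be undefined behavior, which is the risk the review item asks to rule out.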
Summary by CodeRabbit
New Features
Bug Fixes
✏️ Tip: You can customize this high-level summary in your review settings.