feat: qwen2 cpu model and connection with npu prefill #555
chenghuaWang merged 5 commits into UbiquitousLearning:main
Conversation
feat(MiniCPM4, MiniCPM-o, AvgPool1dOp): Add support for MiniCPM4 model, MiniCPM-o's audio modality inference capability, and AvgPool1dOp
Walkthrough

Replaces the macro entry point with a standard main; adds a mixed NPU-prefill + CPU-decode flow in the Qwen NPU example; introduces shared StaticCache plumbing across KVCache/ops/backends; adds CPU-side Qwen model headers for decoding; and refactors the QNN allocator/utils and Perfetto tracing categories.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant App as Application
    participant Example as examples/qwen_npu
    participant QNN as QNN Prefill
    participant StaticCache as Shared StaticCache
    participant CPU as CPU Decode
    participant Output as Console
    App->>Example: launch with use_npu_prefill flag
    Example->>QNN: load QNN model, run prefill(input tokens)
    QNN->>StaticCache: updateKVCache(layer, K, V)
    StaticCache-->>QNN: confirm storage
    alt use_npu_prefill enabled
        Example->>CPU: load CPU decode model + attach StaticCache
        loop per-step decode
            CPU->>StaticCache: read KV for attention
            CPU->>CPU: compute logits -> select token
            CPU->>StaticCache: updateKVCache(layer, K, V)
            CPU->>Output: print token
        end
    else pure CPU decode
        Example->>CPU: run CPU prefill+decode loop
    end
    Output-->>App: final EOS / completion
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Actionable comments posted: 5
🧹 Nitpick comments (11)
mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (2)
168-169: Consider precomputing the attention scale factor.

The `1.f / std::sqrt(static_cast<float>(head_dim_))` factor is computed on every forward pass. Since `head_dim_` is fixed after construction, consider storing it as a member variable initialized in the constructor.

```diff
 // In constructor after setting head_dim_:
+attn_scale_ = 1.f / std::sqrt(static_cast<float>(head_dim_));

 // In forward:
-auto attn_weights = nn::functional::matmul(query_states, key_states, /*trans_a=*/false, /*trans_b=*/true)
-                    * (1.f / std::sqrt(static_cast<float>(head_dim_)));
+auto attn_weights = nn::functional::matmul(query_states, key_states, /*trans_a=*/false, /*trans_b=*/true)
+                    * attn_scale_;
```
211-212: Consider making `self_attn_` and `mlp_` private.

Public member variables reduce encapsulation. The direct access at line 257 (`b.self_attn_.set_layer_idx(...)`) could be replaced with a `set_layer_idx` method on `QwenDecoderCPU` that forwards to `self_attn_`.

```diff
+  void set_layer_idx(int layer_idx) { self_attn_.set_layer_idx(layer_idx); }
+
+ private:
   QwenAttentionCPU self_attn_;
   QwenMLPCPU mlp_;
-
-  nn::KVCache& getKVCache() { return self_attn_.getKVCache(); }
```

mllm/nn/layers/KVCache.cpp (1)
34-43: Inconsistent cast strategy: `dynamic_cast` vs `static_pointer_cast`.

The new methods use `dynamic_cast` while the existing methods (lines 21, 24, 27, 31) use `static_pointer_cast`. This inconsistency is worth noting.

The `dynamic_cast` approach is safer for the new static cache methods, since not all backends may support static cache sharing, but consider whether the silent failure in `setStaticCache` (line 36) could hide misconfiguration issues. For debugging purposes, you might want to log a warning when the cast fails.

```diff
 void KVCache::setStaticCache(const nn::StaticCache* cache) {
   auto* kv_op = dynamic_cast<aops::KVCacheOp*>(impl()->getInstancedOp().get());
-  if (kv_op != nullptr) { kv_op->setStaticCache(cache); }
+  if (kv_op != nullptr) {
+    kv_op->setStaticCache(cache);
+  } else {
+    MLLM_DEBUG("setStaticCache: KVCacheOp cast failed, static cache not set");
+  }
 }
```

mllm/backends/qnn/QNNAllocator.cpp (1)
19-22: Consider replacing the variadic macro with a constexpr template function.

Static analysis flags `QNN_ALLOCATOR_VERBOSE` as a variadic macro. While functional, modern C++ prefers `constexpr` variadic template functions for type safety. This is a minor style improvement that can be deferred.

Example alternative:

```cpp
template <typename... Args>
constexpr void QNN_ALLOCATOR_VERBOSE_FN(Args&&... args) {
  if constexpr (kVerboseQnnAllocatorLogs) { MLLM_INFO(std::forward<Args>(args)...); }
}
```

mllm/core/aops/KVCacheOp.hpp (1)
50-52: Address the "Hack" comment and add `[[nodiscard]]`.

The `// Hack` comment indicates this is a temporary solution. Consider:
- Creating a tracking issue to properly design the static cache interface
- Adding a brief explanation of why this is considered a hack and what the ideal solution would be
Also, `getStaticCache()` should be marked `[[nodiscard]]` per static analysis.

```diff
-  // Hack
-  virtual void setStaticCache(const mllm::nn::StaticCache* cache) {}
-  virtual const mllm::nn::StaticCache* getStaticCache() const { return nullptr; }
+  // TODO: Temporary interface for KV cache sharing between CPU and QNN backends.
+  // Consider a cleaner abstraction (e.g., a CacheProvider interface) when stabilized.
+  virtual void setStaticCache(const mllm::nn::StaticCache* /*cache*/) {}
+  [[nodiscard]] virtual const mllm::nn::StaticCache* getStaticCache() const { return nullptr; }
```

mllm/backends/cpu/ops/KVCacheOp.cpp (2)
20-28: Consider extracting helper methods to reduce code duplication.

The `cache_to_use` and `layer_idx_to_use` pattern is repeated in `reshape()`, `forward()`, and `getCurrentSeqCnt()`. Consider extracting helper methods:

```cpp
 private:
  nn::StaticCache* getActiveCache() const {
    return shared_cache_ ? shared_cache_ : const_cast<nn::StaticCache*>(&cache_);
  }

  int32_t getLayerIdx() const {
    return shared_cache_ ? options_.layer_idx : 0;
  }
```

Then use these helpers consistently across all methods. This reduces the risk of inconsistency if the logic needs to change.
61-77: Add a comment explaining the asymmetry between `setCurrentSeqCnt` and `getCurrentSeqCnt`.

`setCurrentSeqCnt` sets the sequence count globally (for all layers) when using the shared cache, while `getCurrentSeqCnt` retrieves the count for this specific layer. This asymmetry appears intentional for the decode flow but could confuse future maintainers.

```diff
 void CPUKVCacheOp::setCurrentSeqCnt(int32_t seq) {
   if (shared_cache_) {
+    // Set globally for all layers in shared cache - typically called once
+    // after prefill to sync sequence position across all decoder blocks
     shared_cache_->setCurrentSeqCnt(seq);
   } else {
     cache_.setCurrentSeqCnt(seq);
   }
 }
```

examples/qwen_npu/main.cpp (4)
29-36: Consider making `use_npu_prefill` configurable.

The flag is hardcoded to `true`, making the pure CPU path (lines 116-161) unreachable in production. Consider using an environment variable or command-line argument for flexibility:

```diff
-  // Mixed inference: QNN prefill + CPU decode
-  const bool use_npu_prefill = true;
+  // Mixed inference: QNN prefill + CPU decode (configurable via env var)
+  const bool use_npu_prefill = std::getenv("MLLM_USE_NPU_PREFILL") != nullptr;
```
139-146: Consider extracting repeated argmax logic into a helper function.

The greedy token selection (argmax) is duplicated three times across the file. Extracting it would improve maintainability:

```cpp
// Helper function for greedy decoding
inline int argmax(const float* data, int64_t vocab_size) {
  int best_id = 0;
  float max_val = data[0];
  for (int64_t i = 1; i < vocab_size; ++i) {
    if (data[i] > max_val) {
      max_val = data[i];
      best_id = static_cast<int>(i);
    }
  }
  return best_id;
}
```

Similarly, the prefill output handling (lines 133-160 and 193-220) could be unified.
Also applies to: 199-206, 252-259
262-263: Potential output interleaving between `MLLM_INFO` and `std::wcout`.

Mixing narrow-character logging (`MLLM_INFO` likely uses stdout/stderr) with wide-character output (`std::wcout`) on the same conceptual line may cause interleaving or buffering issues. Consider flushing or using a consistent output mechanism:

```diff
-  MLLM_INFO("[Decode step {:3d}] time: {:.2f}ms, token: ", step, step_time_ms);
-  std::wcout << next_token_str << std::flush;
+  std::cout << fmt::format("[Decode step {:3d}] time: {:.2f}ms, token: ", step, step_time_ms);
+  std::wcout << next_token_str << std::endl;
```

Or consider consolidating all output to one stream.
275-281: Consider using a named constant for the milliseconds-to-seconds conversion.

The value `1000.0` appears multiple times for time unit conversion. A named constant would improve readability:

```diff
+  constexpr double kMsPerSecond = 1000.0;
+
   if (decode_count > 0) {
     double avg_time_ms = total_decode_time_ms / decode_count;
     MLLM_INFO(
         "Decode completed: {} tokens, total time: {:.2f}ms ({:.2f}s), avg time: {:.2f}ms/token, throughput: {:.2f} tokens/s",
-        decode_count, total_decode_time_ms, total_decode_time_ms / 1000.0, avg_time_ms,
-        decode_count / (total_decode_time_ms / 1000.0));
+        decode_count, total_decode_time_ms, total_decode_time_ms / kMsPerSecond, avg_time_ms,
+        decode_count / (total_decode_time_ms / kMsPerSecond));
   }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
- examples/qwen_npu/main.cpp (1 hunks)
- mllm/backends/cpu/ops/KVCacheOp.cpp (2 hunks)
- mllm/backends/cpu/ops/KVCacheOp.hpp (1 hunks)
- mllm/backends/qnn/QNNAllocator.cpp (18 hunks)
- mllm/backends/qnn/QNNAllocator.hpp (4 hunks)
- mllm/backends/qnn/QNNBackend.cpp (9 hunks)
- mllm/backends/qnn/QNNDispatcher.cpp (1 hunks)
- mllm/backends/qnn/QNNUtils.cpp (2 hunks)
- mllm/backends/qnn/QNNUtils.hpp (1 hunks)
- mllm/core/aops/KVCacheOp.hpp (2 hunks)
- mllm/engine/Perf.hpp (1 hunks)
- mllm/models/qwen_npu/modeling_qwen_npu.hpp (1 hunks)
- mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (1 hunks)
- mllm/nn/layers/KVCache.cpp (1 hunks)
- mllm/nn/layers/KVCache.hpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
{mllm,mllm-cli,pymllm}/**/*
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*: Files must not contain C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, or DEL 0x7F. Horizontal tab (0x09) and line feed (0x0A) are explicitly allowed.
All files must be encoded in UTF-8 without BOM.
Any violation of character set (Rule 1) or encoding (Rule 2) requirements must cause the review to fail.
No line may end with trailing whitespace.
Use Unix line endings (LF).
File and directory names must consist only of printable Unicode characters, excluding C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, and DEL 0x7F.
Only use acceptable file extensions: .c, .cc, .cpp, .cxx, .h, .hh, .hpp, .py, .pyi, .sh, .txt, .md, .yml, .yaml, .json, .toml.
Optional license headers, if present, must comply with character set rules (no C0/C1 control codes except tab and line feed).
Files:
mllm/backends/qnn/QNNUtils.hpp, mllm/nn/layers/KVCache.cpp, mllm/core/aops/KVCacheOp.hpp, mllm/engine/Perf.hpp, mllm/backends/cpu/ops/KVCacheOp.hpp, mllm/backends/cpu/ops/KVCacheOp.cpp, mllm/nn/layers/KVCache.hpp, mllm/backends/qnn/QNNDispatcher.cpp, mllm/backends/qnn/QNNUtils.cpp, mllm/backends/qnn/QNNAllocator.cpp, mllm/models/qwen_npu/modeling_qwen_npu.hpp, mllm/backends/qnn/QNNAllocator.hpp, mllm/backends/qnn/QNNBackend.cpp, mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}: TODO and FIXME comments must be written as 'TODO:' or 'FIXME:' followed by UTF-8 text that adheres to character set rules.
Encourage consistent coding style and patterns with the existing codebase.
Ensure code is portable across supported platforms (e.g., Linux, Windows) unless explicitly platform-specific.
Files:
mllm/backends/qnn/QNNUtils.hpp, mllm/nn/layers/KVCache.cpp, mllm/core/aops/KVCacheOp.hpp, mllm/engine/Perf.hpp, mllm/backends/cpu/ops/KVCacheOp.hpp, mllm/backends/cpu/ops/KVCacheOp.cpp, mllm/nn/layers/KVCache.hpp, mllm/backends/qnn/QNNDispatcher.cpp, mllm/backends/qnn/QNNUtils.cpp, mllm/backends/qnn/QNNAllocator.cpp, mllm/models/qwen_npu/modeling_qwen_npu.hpp, mllm/backends/qnn/QNNAllocator.hpp, mllm/backends/qnn/QNNBackend.cpp, mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}: Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.
Adhere to language-specific best practices and idioms (e.g., PEP 8 for Python, Google C++ Style Guide for C++).
Files:
mllm/backends/qnn/QNNUtils.hpp, mllm/nn/layers/KVCache.cpp, mllm/core/aops/KVCacheOp.hpp, mllm/engine/Perf.hpp, mllm/backends/cpu/ops/KVCacheOp.hpp, mllm/backends/cpu/ops/KVCacheOp.cpp, mllm/nn/layers/KVCache.hpp, mllm/backends/qnn/QNNDispatcher.cpp, mllm/backends/qnn/QNNUtils.cpp, mllm/backends/qnn/QNNAllocator.cpp, mllm/models/qwen_npu/modeling_qwen_npu.hpp, mllm/backends/qnn/QNNAllocator.hpp, mllm/backends/qnn/QNNBackend.cpp, mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}: Prioritize production-ready code quality by evaluating time and space complexity of algorithms and data structures, and suggest more efficient alternatives for operations with high complexity (e.g., O(n^2) or worse) when feasible.
Avoid unnecessary object creation in loops or hot paths.
Check for proper error handling and resource cleanup (e.g., using try-finally, context managers, or RAII).
Ensure functions that can fail return appropriate error codes or raise exceptions.
Validate inputs for public APIs and critical internal functions.
Add comments for complex algorithms or non-obvious logic.
Identify potential security issues (e.g., buffer overflows, injection risks, insecure temporary files) and recommend using secure alternatives (e.g., parameterized queries, secure random generators).
Suggest adding unit tests for untested complex logic or edge cases.
Ensure code is testable by avoiding global state and using dependency injection.
Flag overly complex functions (e.g., high cyclomatic complexity) and suggest breaking them down.
Use named constants instead of magic numbers.
Add appropriate logging (e.g., debug, info, warning, error) for significant events and errors, avoiding sensitive data exposure.
Files:
mllm/nn/layers/KVCache.cpp, mllm/backends/cpu/ops/KVCacheOp.cpp, mllm/backends/qnn/QNNDispatcher.cpp, mllm/backends/qnn/QNNUtils.cpp, mllm/backends/qnn/QNNAllocator.cpp, mllm/backends/qnn/QNNBackend.cpp
🧬 Code graph analysis (8)
mllm/nn/layers/KVCache.cpp (3)
- mllm/backends/cpu/ops/KVCacheOp.hpp (2): cache (26-26), cache (26-26)
- mllm/core/aops/KVCacheOp.hpp (2): cache (51-51), cache (51-51)
- mllm/nn/layers/KVCache.hpp (1): cache (30-30)

mllm/core/aops/KVCacheOp.hpp (2)
- mllm/backends/cpu/ops/KVCacheOp.hpp (2): cache (26-26), cache (26-26)
- mllm/nn/layers/KVCache.hpp (1): cache (30-30)

mllm/backends/cpu/ops/KVCacheOp.hpp (2)
- mllm/core/aops/KVCacheOp.hpp (2): cache (51-51), cache (51-51)
- mllm/nn/layers/KVCache.hpp (1): cache (30-30)

examples/qwen_npu/main.cpp (6)
- mllm/utils/Argparse.hpp (1): Argparse (135-135)
- mllm/engine/Perf.hpp (3): start (52-52), stop (56-56), saveReport (58-58)
- mllm/models/qwen_npu/tokenization_qwen.hpp (2): QwenTokenizer (153-162), QwenTokenizer (153-153)
- mllm/models/qwen_npu/configuration_qwen_npu.hpp (3): QwenNPUConfig (11-11), QwenNPUConfig (13-34), QwenNPUConfig (13-13)
- mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (5): QwenForCausalLMCPU (288-288), QwenForCausalLMCPU (290-296), QwenForCausalLMCPU (290-290), layer_idx (135-138), layer_idx (135-135)
- mllm/nn/layers/KVCache.hpp (1): layer_idx (20-20)

mllm/nn/layers/KVCache.hpp (3)
- mllm/backends/cpu/ops/KVCacheOp.hpp (2): cache (26-26), cache (26-26)
- mllm/core/aops/KVCacheOp.hpp (2): cache (51-51), cache (51-51)
- mllm/nn/lmcache/StaticCache.hpp (6): nodiscard (16-16), nodiscard (22-22), nodiscard (50-50), nodiscard (58-58), nodiscard (60-60), nodiscard (103-103)

mllm/backends/qnn/QNNAllocator.cpp (1)
- mllm/backends/qnn/QNNAllocator.hpp (21): ptr (52-55, 52-52, 56-56, 56-56, 74-74, 84-84, 85-85, 136-136, 154-154), storage (43-43, 45-45, 45-45, 47-47, 49-49, 49-49, 58-63, 58-58, 65-65, 65-65, 72-72), tensor_id (146-147)

mllm/models/qwen_npu/modeling_qwen_npu.hpp (1)
- mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (1): decode_blocks_ (278-278)

mllm/backends/qnn/QNNBackend.cpp (2)
- mllm/backends/qnn/QNNBackend.hpp (7): graphName (93-93), graphName (95-99), graphName (101-101), graphName (103-103), graphName (106-107), graphName (109-110), graphName (113-113)
- mllm/backends/qnn/QNNModel.hpp (2): expectedOrder (80-80), expectedOrder (80-80)
🪛 Clang (14.0.6)
mllm/core/aops/KVCacheOp.hpp
[error] 52-52: function 'getStaticCache' should be marked [[nodiscard]]
(modernize-use-nodiscard,-warnings-as-errors)
mllm/backends/cpu/ops/KVCacheOp.hpp
[error] 27-27: function 'getStaticCache' should be marked [[nodiscard]]
(modernize-use-nodiscard,-warnings-as-errors)
examples/qwen_npu/main.cpp
[error] 38-38: variable 'config_path' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 39-39: variable 'npu_model_path' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 40-40: variable 'cpu_decode_model_path' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 44-44: variable 'file_version' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 50-50: variable 'prefill_seq_len' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 108-108: variable 'past' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 109-109: variable 'args' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 112-112: variable 'enable_timing' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 121-121: variable 'prefill_args' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 137-137: variable 'vocab_size' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 138-138: variable 'logits_data' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 140-140: variable 'max_logit' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 181-181: variable 'prefill_args' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 197-197: variable 'vocab_size' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 198-198: variable 'logits_data' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 200-200: variable 'max_logit' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 235-235: variable 'step_time_ms' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 240-240: variable 'logits' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 241-241: variable 'logits_f' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 249-249: variable 'vocab_size' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 250-250: variable 'data' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 253-253: variable 'max_logit' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 279-279: 1000.0 is a magic number; consider replacing it with a named constant
(cppcoreguidelines-avoid-magic-numbers,readability-magic-numbers,-warnings-as-errors)
[error] 280-280: 1000.0 is a magic number; consider replacing it with a named constant
(cppcoreguidelines-avoid-magic-numbers,readability-magic-numbers,-warnings-as-errors)
mllm/backends/qnn/QNNAllocator.cpp
[error] 19-19: variadic macro 'QNN_ALLOCATOR_VERBOSE' used; consider using a 'constexpr' variadic template function
(cppcoreguidelines-macro-usage,-warnings-as-errors)
mllm/backends/qnn/QNNAllocator.hpp
[error] 120-120: constructor does not initialize these fields: tensor_name
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp
[error] 3-3: 'cmath' file not found
(clang-diagnostic-error)
[error] 14-14: do not use unnamed namespaces in header files
(google-build-namespaces,-warnings-as-errors)
[error] 19-19: 2 adjacent parameters of 'makeRoPEInvFreq' of convertible types are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 28-28: 3 adjacent parameters of 'makeRotaryPosEmbedding' of convertible types are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 82-82: constructor does not initialize these fields: gate_proj_, up_proj_, down_proj_, silu_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 86-86: constructor does not initialize these fields: gate_proj_, up_proj_, down_proj_, silu_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 86-86: 2 adjacent parameters of 'QwenMLPCPU' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 94-94: variable name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 96-96: variable name 'y' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 110-110: constructor does not initialize these fields: q_proj_, k_proj_, v_proj_, o_proj_, q_rope_, k_rope_, mask_, softmax_, kv_cache_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 114-114: constructor does not initialize these fields: q_proj_, k_proj_, v_proj_, o_proj_, q_rope_, k_rope_, mask_, softmax_, kv_cache_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 114-114: 2 adjacent parameters of 'QwenAttentionCPU' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 146-146: variable 'B' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 146-146: variable name 'B' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 146-146: invalid case style for variable 'B'
(readability-identifier-naming,-warnings-as-errors)
[error] 147-147: variable 'S' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 147-147: variable name 'S' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 147-147: invalid case style for variable 'S'
(readability-identifier-naming,-warnings-as-errors)
[error] 209-209: constructor does not initialize these fields: input_layernorm_, post_attention_layernorm_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 211-211: member variable 'self_attn_' has public visibility
(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)
[error] 212-212: member variable 'mlp_' has public visibility
(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)
[error] 218-218: constructor does not initialize these fields: input_layernorm_, post_attention_layernorm_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 218-218: 2 adjacent parameters of 'QwenDecoderCPU' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 250-250: constructor does not initialize these fields: decode_blocks_, norm_, embedding_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 254-254: constructor does not initialize these fields: decode_blocks_, norm_, embedding_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 254-254: 2 adjacent parameters of 'QwenTextCPU' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 268-268: variable name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 286-286: constructor does not initialize these fields: cfg_, lm_head_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 290-290: constructor does not initialize these fields: cfg_, lm_head_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 290-290: 2 adjacent parameters of 'QwenForCausalLMCPU' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 298-298: 2 adjacent parameters of 'forward' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 300-300: variable 'B' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 300-300: variable name 'B' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 300-300: invalid case style for variable 'B'
(readability-identifier-naming,-warnings-as-errors)
[error] 301-301: variable 'S' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 301-301: variable name 'S' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 301-301: invalid case style for variable 'S'
(readability-identifier-naming,-warnings-as-errors)
[error] 305-305: variable 'position_ids' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 316-316: loop variable name 'b' is too short, expected at least 2 characters
(readability-identifier-length,-warnings-as-errors)
[error] 317-317: loop variable name 's' is too short, expected at least 2 characters
(readability-identifier-length,-warnings-as-errors)
🔇 Additional comments (17)
mllm/backends/qnn/QNNUtils.hpp (1)
268-268: Trailing newline addition aligns with best practices.

The change properly adds a final newline to the file, which adheres to POSIX and C++ style guidelines (including Google's C++ Style Guide). No functional changes are introduced.
mllm/engine/Perf.hpp (1)
20-21: LGTM!

The new `mllm.qnn.execute` Perfetto category is correctly added and aligns with the trace event usage in `QNNDispatcher.cpp`. The macro syntax is properly formatted.

mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (3)
82-107: LGTM!

The MLP block correctly implements the gated activation pattern (gate → SiLU → element-wise multiply with up projection → down projection). The module registration pattern is consistent with the codebase.
250-284: LGTM!

The text model implementation correctly registers the RoPE inverse frequency as a buffer, iterates through decoder blocks with layer index assignment, and applies final normalization. The structured binding with `enumerate` is clean.
303-329: Unfortunately, I'm unable to clone and execute shell commands against the repository due to infrastructure limitations. However, I can analyze the concern based on the provided code context and general programming principles.Unable to verify caller validation of
seq_lenparameter; recommend manual inspection of callers and addition of defensive guard.The concern about
real_seqpotentially being zero or negative is theoretically valid:
- Line 303 assigns `real_seq` either from `args.at("seq_len")` or defaults to `S`
- Line 329 uses `real_seq - 1` as a slice index, which would be -1 if `real_seq == 0`
- Negative indices in tensor slicing typically cause undefined behavior or assertion failures
What cannot be verified without repository access:
- Whether callers guarantee `seq_len >= 1`
- Whether `S` (sequence length) is always >= 1
- If there are existing guards elsewhere in the call chain
- The exact slicing semantics of the `Tensor` class for negative indices

The suggested assertion guard remains valid defensive programming:
```diff
 auto real_seq = args.count("seq_len") ? args.at("seq_len").get<int>() : S;
+MLLM_ASSERT(real_seq >= 1, "seq_len must be at least 1");
```

mllm/backends/qnn/QNNDispatcher.cpp (1)
62-67: LGTM!

The trace event name now correctly matches the Perfetto category `"mllm.qnn.execute"` defined in `Perf.hpp`. The trailing dot removal fixes the naming inconsistency.

mllm/backends/qnn/QNNAllocator.hpp (1)
120-129: LGTM!

The addition of `hasLastRegistrationInfo_` provides explicit validity tracking for the `LastRegistrationInfo` cache, which is cleaner than relying on checking individual fields for defaults. The struct members have appropriate default initializers. The static analysis warning about `tensor_name` is a false positive, since `std::string` default-constructs to an empty string.

mllm/nn/layers/KVCache.hpp (2)
8-8: LGTM!

The include for `StaticCache.hpp` is necessary for the new interface methods.

30-32: LGTM!

The `setStaticCache` and `getStaticCache` methods provide a clean interface for KV cache sharing between CPU and NPU execution phases. The `[[nodiscard]]` attribute on the getter and const correctness are appropriate. This aligns with the corresponding virtual methods in `aops::KVCacheOp.hpp`.

mllm/backends/qnn/QNNUtils.cpp (1)
437-461: LGTM!

The `resetAlloc()` simplification is clean. The `initFromQnnTensor` implementation correctly uses a shallow copy followed by pointer rebinding to internal storage (`name_` and `dimensions_`), which is a safe pattern that avoids unnecessary allocations while maintaining ownership of the critical fields.

mllm/backends/qnn/QNNBackend.cpp (2)
533-536: LGTM on formatting changes.

The formatting adjustments to log messages and code layout improve readability without changing functionality.
209-214: I'm unable to clone the repository to perform the verification. However, based on the code snippet provided in the review comment, I can analyze what's visible:

The review comment identifies a legitimate concern: the minor version check is commented out on line 210, which does relax the QNN interface selection criteria. Without access to the repository, I cannot verify:
- Whether there are compensating version checks elsewhere in the code
- The rationale or documentation for this change
- Whether this is intentional and documented in commit messages
- How this impacts downstream API usage
Given these limitations, here's the rewritten review comment:
Verify the intentional relaxation of QNN API minor version check and confirm compensating safeguards.
The minor version check is now commented out (line 210), allowing the backend to accept any QNN interface provider with a matching major version, regardless of minor version. This could allow using an older QNN runtime that lacks APIs the code depends on.
Before merging, confirm:
- Whether compensating version checks exist elsewhere in the codebase
- If this change is intentional and documented (commit message, design doc, or code comment)
- Whether minimum supported QNN version requirements are documented
- If runtime API compatibility is validated through alternative mechanisms
mllm/backends/cpu/ops/KVCacheOp.cpp (1)
53-59: Verify: `clearCache()` clears the entire shared cache when using shared mode.

When `shared_cache_` is set, `clearCache()` clears the entire shared cache, affecting all layers that share it. Is this the intended behavior?

If individual layers should only clear their own portion, consider:

```diff
 void CPUKVCacheOp::clearCache() {
   if (shared_cache_) {
-    shared_cache_->clearCache();
+    // Only clear this layer's cache, not the entire shared cache
+    shared_cache_->clearLayerCache(options_.layer_idx);
   } else {
     cache_.clearCache();
   }
 }
```

Otherwise, add a comment clarifying that clearing is intentionally global when using a shared cache.
mllm/models/qwen_npu/modeling_qwen_npu.hpp (2)
440-441: LGTM on the `decode_blocks()` accessor.

This accessor aligns with the same pattern in `modeling_qwen_npu_cpu.hpp` (line 277), enabling consistent access to decoder blocks for KV cache sharing in the mixed NPU prefill + CPU decode workflow.
446-451: LGTM onsetKVCacheSeqLenwith improved documentation.The rename from
setKVCacheSeqCnttosetKVCacheSeqLenuses clearer terminology, and the added comments explain the purpose well—synchronizing logical sequence length across layers when reusing a fixed-size KV buffer.examples/qwen_npu/main.cpp (2)
23-27: LGTM on entry point structure and Perfetto integration.

The transition from macro-based entry to standard `main()` is clean. Perfetto instrumentation is properly guarded with `#ifdef` and follows the correct start/stop/save pattern.

Also applies to: 282-288
178-178: Verify `clearKVCache()` behavior with a shared StaticCache. (The automated verification scripts for this item, which searched for the `clearKVCache` implementation and its call site in `main.cpp`, did not complete.)
```cpp
if (!use_npu_prefill) {
  MLLM_INFO("Starting CPU prefill...");
  auto prefill_start = std::chrono::high_resolution_clock::now();

  prefill_seq_len = static_cast<int32_t>(raw_input_tokens.shape()[1]);
  mllm::models::ARGenerationArgs prefill_args;
  prefill_args["seq_len"] = prefill_seq_len;
  auto prefill_output = cpu_decode_model.forward(past, prefill_args);

  auto prefill_end = std::chrono::high_resolution_clock::now();
  auto prefill_duration = std::chrono::duration_cast<std::chrono::milliseconds>(prefill_end - prefill_start);
  MLLM_INFO("CPU prefill completed: seq_len={}, time={:.2f}s", prefill_seq_len, prefill_duration.count() / 1000.0);
```
Missing KV cache setup for pure CPU inference path.
When use_npu_prefill is false, the CPU prefill path executes, but the KV cache for the CPU model is never configured. The shared_kv_cache remains nullptr and cpu_decode_model's KVCache layers are never linked to a StaticCache (that setup only happens in the use_npu_prefill block at lines 83-91).
This will likely cause incorrect decode behavior or runtime errors when the decode loop tries to use uninitialized KV state.
+ // Setup KV cache for CPU-only inference
+ if (!use_npu_prefill) {
+ shared_kv_cache = std::make_unique<mllm::nn::StaticCache>(
+ cpu_cfg.max_cache_length, cpu_cfg.num_hidden_layers, cpu_cfg.num_attention_heads, cpu_cfg.num_key_value_heads,
+ cpu_cfg.head_dim, mllm::kFloat32, mllm::kFloat32, mllm::kCPU, false);
+
+ auto& cpu_decode_blocks = cpu_decode_model.text_model().decode_blocks();
+ for (int32_t layer_idx = 0; layer_idx < cpu_cfg.num_hidden_layers; ++layer_idx) {
+ auto& cpu_block = cpu_decode_blocks.list()[layer_idx];
+ auto& cpu_kv_cache = cpu_block.getKVCache();
+ cpu_kv_cache.setStaticCache(shared_kv_cache.get());
+ cpu_kv_cache.setLayerIndex(layer_idx);
+ }
+ MLLM_INFO("Configured {} CPU KVCache layers for pure CPU inference", cpu_cfg.num_hidden_layers);
+ }
+
// Prefill phase
if (!use_npu_prefill) {Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Clang (14.0.6)
[error] 121-121: variable 'prefill_args' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
🤖 Prompt for AI Agents
In examples/qwen_npu/main.cpp around lines 116 to 127, the CPU-only prefill path
doesn't initialize or attach a KV cache: shared_kv_cache stays nullptr and
cpu_decode_model's KVCache layers are never linked to a StaticCache (the setup
is only done in the use_npu_prefill branch around lines 83-91). Fix by mirroring
the NPU branch: allocate a StaticCache (or equivalent cache object) for the CPU
path, set shared_kv_cache to it, and attach/link all cpu_decode_model KVCache
layers to that StaticCache (same initialization and linking logic used in lines
83-91) before calling cpu_decode_model.forward so the decode loop has a valid KV
state.
Actionable comments posted: 0
♻️ Duplicate comments (1)
examples/qwen_npu/main.cpp (1)
113-124: Missing KV cache setup for pure CPU inference path.

When `use_npu_prefill` is `false`, the shared KV cache is never initialized and the CPU decode model's KVCache layers are never linked to a StaticCache. The decode loop will use uninitialized KV state, causing incorrect behavior or crashes.

Add KV cache initialization before the CPU prefill block:

```diff
+  // Setup KV cache for CPU-only inference
+  if (!use_npu_prefill) {
+    shared_kv_cache = std::make_unique<mllm::nn::StaticCache>(
+        cpu_cfg.max_cache_length, cpu_cfg.num_hidden_layers, cpu_cfg.num_attention_heads, cpu_cfg.num_key_value_heads,
+        cpu_cfg.head_dim, mllm::kFloat32, mllm::kFloat32, mllm::kCPU, false);
+
+    auto& cpu_decode_blocks = cpu_decode_model.text_model().decode_blocks();
+    for (int32_t layer_idx = 0; layer_idx < cpu_cfg.num_hidden_layers; ++layer_idx) {
+      auto& cpu_block = cpu_decode_blocks.list()[layer_idx];
+      auto& cpu_kv_cache = cpu_block.getKVCache();
+      cpu_kv_cache.setStaticCache(shared_kv_cache.get());
+      cpu_kv_cache.setLayerIndex(layer_idx);
+    }
+    MLLM_INFO("Configured {} CPU KVCache layers for pure CPU inference", cpu_cfg.num_hidden_layers);
+  }
+
   // Prefill phase
   if (!use_npu_prefill) {
```
🧹 Nitpick comments (5)
examples/qwen_npu/main.cpp (2)
38-55: Consider making model paths configurable via command-line arguments.

The `argc` and `argv` parameters are available but unused. For better usability, consider using `Argparse` (already imported via `mllm::Argparse`) to make the model paths configurable instead of hardcoding them.
126-157: Consider extracting duplicated logits processing logic.

The prefill output processing (lines 126-157 for CPU, lines 186-217 for NPU) is nearly identical. Consider extracting this into a helper function to reduce duplication:

```cpp
auto processFirstToken(const mllm::Tensor& prefill_logits, const mllm::Tensor& raw_input_tokens, int prefill_seq_len, mllm::models::qwen_npu::QwenTokenizer& tokenizer) -> std::pair<mllm::Tensor, mllm::Tensor>;
```

mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (2)
209-214: Consider making `self_attn_` and `mlp_` private with accessors.

The public member variables expose implementation details. While this enables direct access from `main.cpp` for KV cache configuration, consider using accessor methods:

```diff
+  QwenAttentionCPU& self_attn() { return self_attn_; }
+
+ private:
   QwenAttentionCPU self_attn_;
   QwenMLPCPU mlp_;
```

Then update `main.cpp` to use `block.self_attn().getKVCache()` instead of `block.getKVCache()`.
1-13: Consider adding documentation for public API classes.

Per coding guidelines, public APIs should have clear docstrings explaining purpose, parameters, and usage. Consider adding brief documentation for the main classes (`QwenMLPCPU`, `QwenAttentionCPU`, `QwenDecoderCPU`, `QwenTextCPU`, `QwenForCausalLMCPU`) to explain their roles in the CPU decode pipeline.

```cpp
/// @brief CPU-side Qwen2 model for decode phase in mixed NPU/CPU inference.
///
/// This module provides a CPU implementation of the Qwen2 architecture that can
/// share KV cache state with the NPU prefill model for hybrid inference.
class QwenForCausalLMCPU : public nn::Module, public ARGeneration { ... };
```

mllm/nn/lmcache/StaticCache.hpp (1)
24-27: Confirm `Tensor()` vs `Tensor::nil()` semantics and clean up commented-out code.

Switching the default `updateKVCache` implementation to return `{Tensor(), Tensor()}` instead of `{Tensor::nil(), Tensor::nil()}` is fine only if a default-constructed `Tensor` is semantically equivalent to the previous nil sentinel for all existing call sites (e.g. checks like "is empty / invalid cache entry"). If any code explicitly distinguishes `nil` from a default `Tensor`, this may subtly change behavior.

Also, the old return statement is now just commented-out dead code; it should be removed to avoid confusion.

Consider:

- Verifying that `Tensor::nil()` and `Tensor()` are interchangeable in all usages that consume `updateKVCache`'s result.
- Removing the commented-out line once the behavioral equivalence is confirmed.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- examples/qwen_npu/main.cpp (1 hunks)
- mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (1 hunks)
- mllm/nn/lmcache/StaticCache.hpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
{mllm,mllm-cli,pymllm}/**/*
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*: Files must not contain C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, or DEL 0x7F. Horizontal tab (0x09) and line feed (0x0A) are explicitly allowed.
All files must be encoded in UTF-8 without BOM.
Any violation of character set (Rule 1) or encoding (Rule 2) requirements must cause the review to fail.
No line may end with trailing whitespace.
Use Unix line endings (LF).
File and directory names must consist only of printable Unicode characters, excluding C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, and DEL 0x7F.
Only use acceptable file extensions: .c, .cc, .cpp, .cxx, .h, .hh, .hpp, .py, .pyi, .sh, .txt, .md, .yml, .yaml, .json, .toml.
Optional license headers, if present, must comply with character set rules (no C0/C1 control codes except tab and line feed).
Files:
- mllm/nn/lmcache/StaticCache.hpp
- mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}: TODO and FIXME comments must be written as 'TODO:' or 'FIXME:' followed by UTF-8 text that adheres to character set rules.
Encourage consistent coding style and patterns with the existing codebase.
Ensure code is portable across supported platforms (e.g., Linux, Windows) unless explicitly platform-specific.
Files:
- mllm/nn/lmcache/StaticCache.hpp
- mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}: Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.
Adhere to language-specific best practices and idioms (e.g., PEP 8 for Python, Google C++ Style Guide for C++).
Files:
- mllm/nn/lmcache/StaticCache.hpp
- mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp
🧬 Code graph analysis (3)
examples/qwen_npu/main.cpp (9)
mllm/utils/Argparse.hpp (1)
Argparse(135-135)mllm/engine/Perf.hpp (3)
start(52-52)stop(56-56)saveReport(58-58)mllm/models/qwen_npu/tokenization_qwen.hpp (2)
QwenTokenizer(153-162)QwenTokenizer(153-153)mllm/models/qwen_npu/configuration_qwen_npu.hpp (3)
QwenNPUConfig(11-11)QwenNPUConfig(13-34)QwenNPUConfig(13-13)mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (5)
QwenForCausalLMCPU(288-288)QwenForCausalLMCPU(290-296)QwenForCausalLMCPU(290-290)layer_idx(135-138)layer_idx(135-135)mllm/nn/lmcache/StaticCache.hpp (13)
layer_idx(16-16)layer_idx(24-27)layer_idx(24-24)layer_idx(45-47)layer_idx(49-49)layer_idx(53-53)layer_idx(55-55)layer_idx(57-57)layer_idx(59-59)layer_idx(61-61)layer_idx(98-98)layer_idx(100-102)layer_idx(106-106)mllm/core/aops/KVCacheOp.hpp (1)
layer_idx(38-38)mllm/nn/Module.hpp (2)
args(142-158)args(142-142)mllm/models/ARGeneration.cpp (1)
step(64-64)
mllm/nn/lmcache/StaticCache.hpp (2)
mllm/core/Tensor.cpp (1)
Tensor(68-68)mllm/core/Tensor.hpp (2)
Tensor(63-63)Tensor(166-166)
mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (2)
mllm/models/qwen_npu/modeling_qwen_npu.hpp (6)
seq_len(449-451)seq_len(449-449)decode_blocks_(440-440)block(442-444)makeRotaryPosEmbedding(26-73)makeRotaryPosEmbedding(26-27)mllm/nn/layers/KVCache.hpp (1)
layer_idx(20-20)
🪛 Clang (14.0.6)
examples/qwen_npu/main.cpp
[error] 38-38: variable 'config_path' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 39-39: variable 'npu_model_path' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 40-40: variable 'cpu_decode_model_path' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 44-44: variable 'file_version' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 50-50: variable 'prefill_seq_len' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 105-105: variable 'eos_token_id' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 108-108: variable 'past' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 109-109: variable 'args' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 118-118: variable 'prefill_args' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 134-134: variable 'vocab_size' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 135-135: variable 'logits_data' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 137-137: variable 'max_logit' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 178-178: variable 'prefill_args' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 194-194: variable 'vocab_size' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 195-195: variable 'logits_data' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 197-197: variable 'max_logit' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 232-232: variable 'step_time_ms' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 237-237: variable 'logits' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 238-238: variable 'logits_f' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 246-246: variable 'vocab_size' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 247-247: variable 'data' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 249-249: variable 'next_id' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 250-250: variable 'max_logit' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 276-276: 1000.0 is a magic number; consider replacing it with a named constant
(cppcoreguidelines-avoid-magic-numbers,readability-magic-numbers,-warnings-as-errors)
[error] 277-277: 1000.0 is a magic number; consider replacing it with a named constant
(cppcoreguidelines-avoid-magic-numbers,readability-magic-numbers,-warnings-as-errors)
mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp
[error] 3-3: 'cmath' file not found
(clang-diagnostic-error)
[error] 19-19: 2 adjacent parameters of 'makeRoPEInvFreq' of convertible types are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 28-28: 3 adjacent parameters of 'makeRotaryPosEmbedding' of convertible types are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 82-82: constructor does not initialize these fields: gate_proj_, up_proj_, down_proj_, silu_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 86-86: constructor does not initialize these fields: gate_proj_, up_proj_, down_proj_, silu_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 86-86: 2 adjacent parameters of 'QwenMLPCPU' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 94-94: variable name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 96-96: variable name 'y' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 110-110: constructor does not initialize these fields: q_proj_, k_proj_, v_proj_, o_proj_, q_rope_, k_rope_, mask_, softmax_, kv_cache_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 114-114: constructor does not initialize these fields: q_proj_, k_proj_, v_proj_, o_proj_, q_rope_, k_rope_, mask_, softmax_, kv_cache_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 114-114: 2 adjacent parameters of 'QwenAttentionCPU' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 146-146: variable 'B' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 146-146: variable name 'B' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 146-146: invalid case style for variable 'B'
(readability-identifier-naming,-warnings-as-errors)
[error] 147-147: variable 'S' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 147-147: variable name 'S' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 147-147: invalid case style for variable 'S'
(readability-identifier-naming,-warnings-as-errors)
[error] 209-209: constructor does not initialize these fields: input_layernorm_, post_attention_layernorm_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 211-211: member variable 'self_attn_' has public visibility
(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)
[error] 212-212: member variable 'mlp_' has public visibility
(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)
[error] 218-218: constructor does not initialize these fields: input_layernorm_, post_attention_layernorm_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 218-218: 2 adjacent parameters of 'QwenDecoderCPU' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 250-250: constructor does not initialize these fields: decode_blocks_, norm_, embedding_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 254-254: constructor does not initialize these fields: decode_blocks_, norm_, embedding_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 254-254: 2 adjacent parameters of 'QwenTextCPU' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 268-268: variable name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 286-286: constructor does not initialize these fields: cfg_, lm_head_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 290-290: constructor does not initialize these fields: cfg_, lm_head_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 290-290: 2 adjacent parameters of 'QwenForCausalLMCPU' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 298-298: 2 adjacent parameters of 'forward' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 300-300: variable 'B' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 300-300: variable name 'B' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 300-300: invalid case style for variable 'B'
(readability-identifier-naming,-warnings-as-errors)
[error] 301-301: variable 'S' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 301-301: variable name 'S' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 301-301: invalid case style for variable 'S'
(readability-identifier-naming,-warnings-as-errors)
[error] 305-305: variable 'position_ids' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 316-316: loop variable name 'b' is too short, expected at least 2 characters
(readability-identifier-length,-warnings-as-errors)
[error] 317-317: loop variable name 's' is too short, expected at least 2 characters
(readability-identifier-length,-warnings-as-errors)
🔇 Additional comments (8)
examples/qwen_npu/main.cpp (2)
23-27: Entry point and initialization look good.

Clean transition from macro-based entry to standard `main()`. Perfetto tracing is properly conditionally compiled.

227-270: Decode loop implementation looks good.

The decode loop correctly handles per-step timing, greedy token selection, EOS termination, and live output. The throughput statistics at the end provide useful benchmarking information.
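The per-step selection described above can be sketched as follows (a minimal reimplementation for illustration; function names and the logits layout are assumptions, not mllm's API): greedy argmax over each logits row, stopping when the EOS token is selected.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Greedy selection: index of the largest logit.
inline int64_t greedyPick(const std::vector<float>& logits) {
  return static_cast<int64_t>(
      std::max_element(logits.begin(), logits.end()) - logits.begin());
}

// Decode loop skeleton: one logits row per step, EOS terminates generation
// without emitting the EOS token itself.
inline std::vector<int64_t> decodeGreedy(
    const std::vector<std::vector<float>>& per_step_logits, int64_t eos_id) {
  std::vector<int64_t> out;
  for (const auto& logits : per_step_logits) {
    int64_t next = greedyPick(logits);
    if (next == eos_id) break;  // EOS termination
    out.push_back(next);        // in the real loop: detokenize and print live
  }
  return out;
}
```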
mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (5)
14-26: Named namespace properly addresses the previous review concern.

The utility functions are now in a named `mllm::models::qwen_npu::utils` namespace instead of an unnamed namespace, which resolves the ODR violation concern from the previous review.

19-26: RoPE inverse frequency calculation is correct.

The implementation correctly computes `1.0 / (rope_theta ^ (2i / dim))` for rotary position embeddings.
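Spelled out, the cited formula is `inv_freq[i] = 1 / rope_theta^(2i/dim)` for `i = 0 .. dim/2 - 1`. A standalone reimplementation for reference (not the mllm code under review):

```cpp
#include <cmath>
#include <vector>

// RoPE inverse frequencies: one per pair of embedding dimensions.
inline std::vector<double> makeRoPEInvFreqSketch(int dim, double rope_theta) {
  std::vector<double> inv_freq;
  inv_freq.reserve(dim / 2);
  for (int i = 0; i < dim / 2; ++i) {
    // inv_freq[i] = 1 / theta^(2i/dim); i = 0 gives 1.0, later entries decay.
    inv_freq.push_back(1.0 / std::pow(rope_theta, (2.0 * i) / dim));
  }
  return inv_freq;
}
```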
82-107: QwenMLPCPU correctly implements a SwiGLU-style MLP.

The gate-up-down projection pattern with SiLU activation follows the Qwen2 architecture correctly.
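The gate-up-down pattern amounts to `y = down(silu(gate(x)) * up(x))`. A scalar sketch of the data flow (weights here are placeholder scalars, not the model's projection matrices):

```cpp
#include <cmath>

// SiLU activation: x * sigmoid(x).
inline double silu(double x) { return x / (1.0 + std::exp(-x)); }

// Scalar stand-in for the SwiGLU-style MLP: gate and up projections run in
// parallel, their elementwise product feeds the down projection.
inline double swigluMLPSketch(double x, double w_gate, double w_up, double w_down) {
  double gated = silu(x * w_gate);  // gate projection followed by SiLU
  double up    = x * w_up;          // up projection
  return (gated * up) * w_down;     // elementwise product, then down projection
}
```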
140-185: Attention forward implementation is correct and follows HuggingFace Qwen2.

The attention correctly:
- Projects to Q/K/V and reshapes to multi-head format
- Applies RoPE to Q and K
- Uses KV cache for cached key/value states
- Applies causal masking and scaled dot-product attention
- Computes softmax in float32 for numerical stability (matching HF)
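The masking, scaling, and float32-softmax steps in that list can be sketched for a single head (a toy reference implementation, not mllm's kernels; shapes here are `[seq][head_dim]`):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

using Mat = std::vector<std::vector<float>>;

// Attention weight matrix for one head: scaled Q·K^T scores, causal mask
// (position i attends only to j <= i), softmax computed in float32.
inline Mat causalAttentionWeights(const Mat& q, const Mat& k, int head_dim) {
  const size_t S = q.size();
  const float scale = 1.0f / std::sqrt(static_cast<float>(head_dim));
  Mat w(S, std::vector<float>(S, 0.0f));  // masked entries stay 0
  for (size_t i = 0; i < S; ++i) {
    std::vector<float> scores(i + 1);
    float maxv = -1e30f;
    for (size_t j = 0; j <= i; ++j) {
      float s = 0.0f;
      for (size_t d = 0; d < q[i].size(); ++d) s += q[i][d] * k[j][d];
      scores[j] = s * scale;
      maxv = std::max(maxv, scores[j]);
    }
    float denom = 0.0f;  // max-subtracted softmax for numerical stability
    for (size_t j = 0; j <= i; ++j) denom += std::exp(scores[j] - maxv);
    for (size_t j = 0; j <= i; ++j) w[i][j] = std::exp(scores[j] - maxv) / denom;
  }
  return w;
}
```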
298-337: Forward implementation correctly handles both prefill and decode phases.

The position ID management properly handles:

- Initial prefill: generates sequential position IDs
- Decode: increments from the last position in previous output
- The `real_seq` parameter allows clipping for proper last-token logit extraction

mllm/nn/lmcache/StaticCache.hpp (1)
13-28: Verify the reported missing virtual destructor in `AbstractStaticCache`.

(The automated verification scripts for this item, which attempted to locate `AbstractStaticCache`, find classes deriving from it, check for polymorphic deletion through base pointers, and validate file encoding and formatting for `mllm/nn/lmcache/StaticCache.hpp`, did not complete.)
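For context on why a virtual destructor matters here: deleting a derived object through a base pointer is undefined behavior unless the base class destructor is virtual. A minimal illustration with hypothetical names (not mllm's actual classes):

```cpp
#include <memory>

// Base with a virtual destructor: base-pointer delete is now well-defined
// and runs the derived destructor.
struct AbstractCacheBase {
  virtual ~AbstractCacheBase() = default;
};

struct DerivedCache : AbstractCacheBase {
  bool* destroyed_flag;  // lets a caller observe that the dtor actually ran
  explicit DerivedCache(bool* flag) : destroyed_flag(flag) {}
  ~DerivedCache() override { *destroyed_flag = true; }
};
```

Without `virtual` on the base destructor, destroying a `std::unique_ptr<AbstractCacheBase>` holding a `DerivedCache` would be undefined behavior, which is the risk the review item asks to rule out.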
Summary by CodeRabbit
New Features
Bug Fixes
✏️ Tip: You can customize this high-level summary in your review settings.