
feat: qwen2 cpu model and connection with npu prefill #555

Merged
chenghuaWang merged 5 commits into UbiquitousLearning:main from Sp0tless:cpu_decode
Nov 29, 2025

Conversation

Contributor

@Sp0tless Sp0tless commented Nov 28, 2025

  • Implement qwen2 CPU model
  • Temporarily make KVCache layer expose inside StaticCache for KV cache sharing
  • Support mixed QNN prefill + CPU decode
  • Fix a missing category in perfetto config

Summary by CodeRabbit

  • New Features

    • Mixed inference workflow: accelerator prefill + CPU decode with shared KV cache
    • CPU-side Qwen NPU model and tokenizer for decoding and live output
    • Per-step performance tracing, cumulative timing and throughput stats
    • Extended tracing category for accelerator execution
  • Bug Fixes

    • Broadened compatibility with accelerator interface providers


oreomaker and others added 3 commits November 25, 2025 15:25
…4 & minicpm o audio encoding

feat(MiniCPM4, MiniCPM-o, AvgPool1dOp): Add support for MiniCPM4 model, MiniCPM-o's audio modality inference capability, and AvgPool1dOp
Contributor

coderabbitai Bot commented Nov 28, 2025

Walkthrough

Replaces macro entry point with standard main; adds mixed NPU-prefill + CPU-decode flow in the Qwen NPU example; introduces shared StaticCache plumbing across KVCache/ops/backends; adds CPU-side Qwen model headers for decoding; and refactors QNN allocator/utils and perfetto tracing categories.

Changes

Cohort / File(s) Change Summary
Example entry point & workflow
examples/qwen_npu/main.cpp
Replaced MLLM_MAIN with int main(...). Added use_npu_prefill flag, two-phase prefill (QNN) + decode (CPU) flow, separate model paths/configs, shared StaticCache wiring, per-step decoding loop, timing instrumentation, and live token output.
KVCache API surface
mllm/core/aops/KVCacheOp.hpp, mllm/nn/layers/KVCache.hpp, mllm/nn/layers/KVCache.cpp
Added setStaticCache(const StaticCache*) and getStaticCache() const to KVCacheOp and KVCache wrapper; forward-declaration for StaticCache and plumbing to delegate static-cache access to operator implementations.
CPU KVCache op implementation
mllm/backends/cpu/ops/KVCacheOp.hpp, mllm/backends/cpu/ops/KVCacheOp.cpp
Added shared_cache_ pointer, setStaticCache()/getStaticCache() overrides, and conditional logic to use shared cache vs internal cache in forward, clear/set/getCurrentSeqCnt with per-call layer-index mapping.
StaticCache base behavior
mllm/nn/lmcache/StaticCache.hpp
Changed default AbstractStaticCache::updateKVCache return to pair of default-constructed Tensor objects (Tensor(), Tensor()) instead of Tensor::nil() values.
KV cache layer additions
mllm/nn/layers/KVCache.cpp, mllm/nn/layers/KVCache.hpp
Added methods on nn::KVCache to set/get associated StaticCache and forward to underlying aops operator via dynamic_cast.
Qwen NPU CPU modeling
mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp
New CPU-side Qwen model header: RoPE helpers, QwenMLPCPU, QwenAttentionCPU (with KV cache), QwenDecoderCPU, QwenTextCPU, and QwenForCausalLMCPU for CPU decode path.
Qwen NPU model API edits
mllm/models/qwen_npu/modeling_qwen_npu.hpp
Removed several const getKVCache() accessors and per-layer seq helpers; added decode_blocks() accessor and setKVCacheSeqLen() on QwenText.
QNN backend & utils
mllm/backends/qnn/QNNBackend.cpp, mllm/backends/qnn/QNNUtils.cpp, mllm/backends/qnn/QNNUtils.hpp, mllm/backends/qnn/QNNDispatcher.cpp, mllm/backends/qnn/QNNAllocator.cpp, mllm/backends/qnn/QNNAllocator.hpp
Relaxed QNN provider version minor check, added hasLastRegistrationInfo_ flag, refactored QNNTensorWrapper::initFromQnnTensor() to use shallow-copy semantics for name/dim pointers, minor formatting and perfetto trace name tweak (removed trailing dot).
Perfetto tracing
mllm/engine/Perf.hpp
Added perfetto::Category("mllm.qnn.execute") to tracing category list.
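The shallow-copy semantics described for QNNTensorWrapper::initFromQnnTensor() (storing the source tensor's name/dim pointers rather than duplicating them) can be pictured with a toy sketch. All type and member names here are illustrative, not the real QNN or mllm API:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical source tensor owning its name string and dimension array.
struct SourceTensor {
  std::string name;
  uint32_t dims[2];
};

// Shallow wrapper: initFrom() copies pointers, not buffers, so the wrapper
// aliases the source's storage and must not outlive it.
struct ShallowWrapper {
  const char* name = nullptr;
  const uint32_t* dims = nullptr;

  void initFrom(const SourceTensor& src) {
    name = src.name.c_str();  // points into src's string storage
    dims = src.dims;          // same dimension array, no deep copy
  }
};
```

The trade-off is the usual one for shallow copies: cheaper initialization, but the wrapper's validity is tied to the source tensor's lifetime.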

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant Example as examples/qwen_npu
    participant QNN as QNN Prefill
    participant StaticCache as Shared StaticCache
    participant CPU as CPU Decode
    participant Output as Console

    App->>Example: launch with use_npu_prefill flag
    Example->>QNN: load QNN model, run prefill(input tokens)
    QNN->>StaticCache: updateKVCache(layer, K, V)
    StaticCache-->>QNN: confirm storage
    alt use_npu_prefill enabled
        Example->>CPU: load CPU decode model + attach StaticCache
        loop per-step decode
            CPU->>StaticCache: read KV for attention
            CPU->>CPU: compute logits -> select token
            CPU->>StaticCache: updateKVCache(layer, K, V)
            CPU->>Output: print token
        end
    else pure CPU decode
        Example->>CPU: run CPU prefill+decode loop
    end
    Output-->>App: final EOS / completion
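The two-phase flow above can be modeled with a minimal, self-contained sketch: a shared cache that the prefill phase fills for every layer, after which the sequence counter is synced once globally and the decode phase appends one token per step. This is a toy stand-in under assumed semantics, not the real StaticCache API:

```cpp
#include <cstdint>
#include <vector>

// Toy stand-in for a shared static KV cache: one K/V list per layer plus a
// single global sequence counter. Names mirror the PR's description only.
struct ToyStaticCache {
  explicit ToyStaticCache(int num_layers) : keys_(num_layers), values_(num_layers) {}

  void updateKVCache(int layer, float k, float v) {
    keys_[layer].push_back(k);
    values_[layer].push_back(v);
  }
  void setCurrentSeqCnt(int32_t seq) { seq_cnt_ = seq; }  // global, all layers
  int32_t getCurrentSeqCnt() const { return seq_cnt_; }

  std::vector<std::vector<float>> keys_, values_;
  int32_t seq_cnt_ = 0;
};

// Phase 1 ("NPU prefill"): write the prompt's K/V for every layer, then sync
// the sequence position once so the decode phase starts at the right offset.
void prefill(ToyStaticCache& cache, int num_layers, int prompt_len) {
  for (int layer = 0; layer < num_layers; ++layer)
    for (int t = 0; t < prompt_len; ++t) cache.updateKVCache(layer, 1.f, 1.f);
  cache.setCurrentSeqCnt(prompt_len);
}

// Phase 2 ("CPU decode"): each step appends one token's K/V per layer and
// advances the shared position by one.
void decodeStep(ToyStaticCache& cache, int num_layers) {
  for (int layer = 0; layer < num_layers; ++layer) cache.updateKVCache(layer, 2.f, 2.f);
  cache.setCurrentSeqCnt(cache.getCurrentSeqCnt() + 1);
}
```

This also illustrates the set/get asymmetry flagged later in the review: setCurrentSeqCnt is called once globally after prefill, while reads happen per step during decode.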

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Focus review on:
    • examples/qwen_npu/main.cpp (two-phase control flow, cache wiring, timing/instrumentation)
    • mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (new CPU model correctness, RoPE and KV cache integration)
    • mllm/backends/cpu/ops/KVCacheOp.{hpp,cpp} (shared vs internal cache indexing and updates)
    • mllm/nn/lmcache/StaticCache.hpp (default return semantics change)
    • mllm/backends/qnn/QNNBackend.cpp (relaxed version check impact)

Possibly related PRs

Suggested reviewers

  • yirongjie
  • oreomaker
  • liang1232018

Poem

🐰 I hopped into code with a twitchy ear,
Prefill on NPU, CPU decoding near.
Shared KV in my pouch, tokens hop along,
Timings tick-tock, and the output sings a song.
Two-phase carrots fetched—now the generation's clear.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 2.63%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
  • Description check ❓ Inconclusive — The description provides a clear, concise list of key changes but lacks the detailed structure recommended by the repository template. Consider expanding it to include motivation, detailed changes, testing information, and any breaking changes, following the contribution guidelines template.
✅ Passed checks (1 passed)
  • Title check ✅ Passed — The title accurately describes the main changes: implementing a Qwen2 CPU model and connecting it with NPU prefill for mixed execution.


Collaborator

@chenghuaWang chenghuaWang left a comment


LGTM

@Sp0tless Sp0tless changed the base branch from v2 to main November 28, 2025 06:45
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (11)
mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (2)

168-169: Consider precomputing the attention scale factor.

The 1.f / std::sqrt(static_cast<float>(head_dim_)) is computed on every forward pass. Since head_dim_ is fixed after construction, consider storing this as a member variable initialized in the constructor.

 // In constructor after setting head_dim_:
+  attn_scale_ = 1.f / std::sqrt(static_cast<float>(head_dim_));

 // In forward:
-  auto attn_weights = nn::functional::matmul(query_states, key_states, /*trans_a=*/false, /*trans_b=*/true)
-                      * (1.f / std::sqrt(static_cast<float>(head_dim_)));
+  auto attn_weights = nn::functional::matmul(query_states, key_states, /*trans_a=*/false, /*trans_b=*/true)
+                      * attn_scale_;

211-212: Consider making self_attn_ and mlp_ private.

Public member variables in a class reduce encapsulation. The direct access at line 257 (b.self_attn_.set_layer_idx(...)) could be replaced with a set_layer_idx method on QwenDecoderCPU that forwards to self_attn_.

+  void set_layer_idx(int layer_idx) { self_attn_.set_layer_idx(layer_idx); }
+
+ private:
   QwenAttentionCPU self_attn_;
   QwenMLPCPU mlp_;
-
-  nn::KVCache& getKVCache() { return self_attn_.getKVCache(); }
mllm/nn/layers/KVCache.cpp (1)

34-43: Inconsistent cast strategy: dynamic_cast vs static_pointer_cast

The new methods use dynamic_cast while existing methods (lines 21, 24, 27, 31) use static_pointer_cast. This inconsistency is worth noting.

The dynamic_cast approach is safer for the new static cache methods since not all backends may support static cache sharing, but consider whether the silent failure in setStaticCache (line 36) could hide misconfiguration issues. For debugging purposes, you might want to add a log warning when the cast fails.

 void KVCache::setStaticCache(const nn::StaticCache* cache) {
   auto* kv_op = dynamic_cast<aops::KVCacheOp*>(impl()->getInstancedOp().get());
-  if (kv_op != nullptr) { kv_op->setStaticCache(cache); }
+  if (kv_op != nullptr) {
+    kv_op->setStaticCache(cache);
+  } else {
+    MLLM_DEBUG("setStaticCache: KVCacheOp cast failed, static cache not set");
+  }
 }
mllm/backends/qnn/QNNAllocator.cpp (1)

19-22: Consider replacing variadic macro with constexpr template function.

Static analysis flags QNN_ALLOCATOR_VERBOSE as a variadic macro. While functional, modern C++ prefers constexpr variadic template functions for type safety. This is a minor style improvement that can be deferred.

Example alternative:

template<typename... Args>
constexpr void QNN_ALLOCATOR_VERBOSE_FN(Args&&... args) {
  if constexpr (kVerboseQnnAllocatorLogs) { MLLM_INFO(std::forward<Args>(args)...); }
}
mllm/core/aops/KVCacheOp.hpp (1)

50-52: Address the "Hack" comment and add [[nodiscard]].

The // Hack comment indicates this is a temporary solution. Consider:

  1. Creating a tracking issue to properly design the static cache interface
  2. Adding a brief explanation of why this is considered a hack and what the ideal solution would be

Also, getStaticCache() should be marked [[nodiscard]] per static analysis.

-  // Hack
-  virtual void setStaticCache(const mllm::nn::StaticCache* cache) {}
-  virtual const mllm::nn::StaticCache* getStaticCache() const { return nullptr; }
+  // TODO: Temporary interface for KV cache sharing between CPU and QNN backends.
+  // Consider a cleaner abstraction (e.g., a CacheProvider interface) when stabilized.
+  virtual void setStaticCache(const mllm::nn::StaticCache* /*cache*/) {}
+  [[nodiscard]] virtual const mllm::nn::StaticCache* getStaticCache() const { return nullptr; }
mllm/backends/cpu/ops/KVCacheOp.cpp (2)

20-28: Consider extracting helper methods to reduce code duplication.

The cache_to_use and layer_idx_to_use pattern is repeated in reshape(), forward(), and getCurrentSeqCnt(). Consider extracting helper methods:

+private:
+  nn::StaticCache* getActiveCache() const {
+    return shared_cache_ ? shared_cache_ : const_cast<nn::StaticCache*>(&cache_);
+  }
+  
+  int32_t getLayerIdx() const {
+    return shared_cache_ ? options_.layer_idx : 0;
+  }

Then use these helpers consistently across all methods. This reduces the risk of inconsistency if the logic needs to change.


61-77: Add a comment explaining the asymmetry between setCurrentSeqCnt and getCurrentSeqCnt.

setCurrentSeqCnt sets the sequence count globally (for all layers) when using shared cache, while getCurrentSeqCnt retrieves the count for this specific layer. This asymmetry appears intentional for the decode flow but could confuse future maintainers.

 void CPUKVCacheOp::setCurrentSeqCnt(int32_t seq) {
   if (shared_cache_) {
+    // Set globally for all layers in shared cache - typically called once
+    // after prefill to sync sequence position across all decoder blocks
     shared_cache_->setCurrentSeqCnt(seq);
   } else {
     cache_.setCurrentSeqCnt(seq);
   }
 }
examples/qwen_npu/main.cpp (4)

29-36: Consider making use_npu_prefill configurable.

The flag is hardcoded to true, making the pure CPU path (lines 116-161) unreachable in production. Consider using an environment variable or command-line argument for flexibility:

-  // Mixed inference: QNN prefill + CPU decode
-  const bool use_npu_prefill = true;
+  // Mixed inference: QNN prefill + CPU decode (configurable via env var)
+  const bool use_npu_prefill = std::getenv("MLLM_USE_NPU_PREFILL") != nullptr;

139-146: Consider extracting repeated argmax logic into a helper function.

The greedy token selection (argmax) is duplicated three times across the file. Extracting it would improve maintainability:

// Helper function for greedy decoding
inline int argmax(const float* data, int64_t vocab_size) {
  int best_id = 0;
  float max_val = data[0];
  for (int64_t i = 1; i < vocab_size; ++i) {
    if (data[i] > max_val) {
      max_val = data[i];
      best_id = static_cast<int>(i);
    }
  }
  return best_id;
}

Similarly, the prefill output handling (lines 133-160 and 193-220) could be unified.

Also applies to: 199-206, 252-259


262-263: Potential output interleaving between MLLM_INFO and std::wcout.

Mixing narrow-character logging (MLLM_INFO likely uses stdout/stderr) with wide-character output (std::wcout) on the same conceptual line may cause interleaving or buffering issues. Consider flushing or using a consistent output mechanism:

-    MLLM_INFO("[Decode step {:3d}] time: {:.2f}ms, token: ", step, step_time_ms);
-    std::wcout << next_token_str << std::flush;
+    std::cout << fmt::format("[Decode step {:3d}] time: {:.2f}ms, token: ", step, step_time_ms);
+    std::wcout << next_token_str << std::endl;

Or consider consolidating all output to one stream.


275-281: Consider using a named constant for milliseconds-to-seconds conversion.

The value 1000.0 appears multiple times for time unit conversion. A named constant would improve readability:

+  constexpr double kMsPerSecond = 1000.0;
+
   if (decode_count > 0) {
     double avg_time_ms = total_decode_time_ms / decode_count;
     MLLM_INFO(
         "Decode completed: {} tokens, total time: {:.2f}ms ({:.2f}s), avg time: {:.2f}ms/token, throughput: {:.2f} tokens/s",
-        decode_count, total_decode_time_ms, total_decode_time_ms / 1000.0, avg_time_ms,
-        decode_count / (total_decode_time_ms / 1000.0));
+        decode_count, total_decode_time_ms, total_decode_time_ms / kMsPerSecond, avg_time_ms,
+        decode_count / (total_decode_time_ms / kMsPerSecond));
   }
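The suggested kMsPerSecond refactor can be exercised in isolation. The sketch below factors the stats computation into a small helper (the names summarize and DecodeStats are hypothetical, not from the PR):

```cpp
#include <cmath>

// Aggregate decode timing into average latency and throughput, using a named
// constant for the milliseconds-to-seconds conversion as the review suggests.
struct DecodeStats {
  double avg_ms;
  double tokens_per_s;
};

constexpr double kMsPerSecond = 1000.0;

DecodeStats summarize(int decode_count, double total_decode_time_ms) {
  DecodeStats s{0.0, 0.0};
  if (decode_count > 0) {  // guard mirrors the example's decode_count > 0 check
    s.avg_ms = total_decode_time_ms / decode_count;
    s.tokens_per_s = decode_count / (total_decode_time_ms / kMsPerSecond);
  }
  return s;
}
```

Extracting the computation also makes the zero-token edge case testable without running a full decode loop.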
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5865176 and e841d0d.

📒 Files selected for processing (15)
  • examples/qwen_npu/main.cpp (1 hunks)
  • mllm/backends/cpu/ops/KVCacheOp.cpp (2 hunks)
  • mllm/backends/cpu/ops/KVCacheOp.hpp (1 hunks)
  • mllm/backends/qnn/QNNAllocator.cpp (18 hunks)
  • mllm/backends/qnn/QNNAllocator.hpp (4 hunks)
  • mllm/backends/qnn/QNNBackend.cpp (9 hunks)
  • mllm/backends/qnn/QNNDispatcher.cpp (1 hunks)
  • mllm/backends/qnn/QNNUtils.cpp (2 hunks)
  • mllm/backends/qnn/QNNUtils.hpp (1 hunks)
  • mllm/core/aops/KVCacheOp.hpp (2 hunks)
  • mllm/engine/Perf.hpp (1 hunks)
  • mllm/models/qwen_npu/modeling_qwen_npu.hpp (1 hunks)
  • mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (1 hunks)
  • mllm/nn/layers/KVCache.cpp (1 hunks)
  • mllm/nn/layers/KVCache.hpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
{mllm,mllm-cli,pymllm}/**/*

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*: Files must not contain C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, or DEL 0x7F. Horizontal tab (0x09) and line feed (0x0A) are explicitly allowed.
All files must be encoded in UTF-8 without BOM.
Any violation of character set (Rule 1) or encoding (Rule 2) requirements must cause the review to fail.
No line may end with trailing whitespace.
Use Unix line endings (LF).
File and directory names must consist only of printable Unicode characters, excluding C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, and DEL 0x7F.
Only use acceptable file extensions: .c, .cc, .cpp, .cxx, .h, .hh, .hpp, .py, .pyi, .sh, .txt, .md, .yml, .yaml, .json, .toml.
Optional license headers, if present, must comply with character set rules (no C0/C1 control codes except tab and line feed).

Files:

  • mllm/backends/qnn/QNNUtils.hpp
  • mllm/nn/layers/KVCache.cpp
  • mllm/core/aops/KVCacheOp.hpp
  • mllm/engine/Perf.hpp
  • mllm/backends/cpu/ops/KVCacheOp.hpp
  • mllm/backends/cpu/ops/KVCacheOp.cpp
  • mllm/nn/layers/KVCache.hpp
  • mllm/backends/qnn/QNNDispatcher.cpp
  • mllm/backends/qnn/QNNUtils.cpp
  • mllm/backends/qnn/QNNAllocator.cpp
  • mllm/models/qwen_npu/modeling_qwen_npu.hpp
  • mllm/backends/qnn/QNNAllocator.hpp
  • mllm/backends/qnn/QNNBackend.cpp
  • mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}: TODO and FIXME comments must be written as 'TODO:' or 'FIXME:' followed by UTF-8 text that adheres to character set rules.
Encourage consistent coding style and patterns with the existing codebase.
Ensure code is portable across supported platforms (e.g., Linux, Windows) unless explicitly platform-specific.

Files:

  • mllm/backends/qnn/QNNUtils.hpp
  • mllm/nn/layers/KVCache.cpp
  • mllm/core/aops/KVCacheOp.hpp
  • mllm/engine/Perf.hpp
  • mllm/backends/cpu/ops/KVCacheOp.hpp
  • mllm/backends/cpu/ops/KVCacheOp.cpp
  • mllm/nn/layers/KVCache.hpp
  • mllm/backends/qnn/QNNDispatcher.cpp
  • mllm/backends/qnn/QNNUtils.cpp
  • mllm/backends/qnn/QNNAllocator.cpp
  • mllm/models/qwen_npu/modeling_qwen_npu.hpp
  • mllm/backends/qnn/QNNAllocator.hpp
  • mllm/backends/qnn/QNNBackend.cpp
  • mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}: Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.
Adhere to language-specific best practices and idioms (e.g., PEP 8 for Python, Google C++ Style Guide for C++).

Files:

  • mllm/backends/qnn/QNNUtils.hpp
  • mllm/nn/layers/KVCache.cpp
  • mllm/core/aops/KVCacheOp.hpp
  • mllm/engine/Perf.hpp
  • mllm/backends/cpu/ops/KVCacheOp.hpp
  • mllm/backends/cpu/ops/KVCacheOp.cpp
  • mllm/nn/layers/KVCache.hpp
  • mllm/backends/qnn/QNNDispatcher.cpp
  • mllm/backends/qnn/QNNUtils.cpp
  • mllm/backends/qnn/QNNAllocator.cpp
  • mllm/models/qwen_npu/modeling_qwen_npu.hpp
  • mllm/backends/qnn/QNNAllocator.hpp
  • mllm/backends/qnn/QNNBackend.cpp
  • mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}: Prioritize production-ready code quality by evaluating time and space complexity of algorithms and data structures, and suggest more efficient alternatives for operations with high complexity (e.g., O(n^2) or worse) when feasible.
Avoid unnecessary object creation in loops or hot paths.
Check for proper error handling and resource cleanup (e.g., using try-finally, context managers, or RAII).
Ensure functions that can fail return appropriate error codes or raise exceptions.
Validate inputs for public APIs and critical internal functions.
Add comments for complex algorithms or non-obvious logic.
Identify potential security issues (e.g., buffer overflows, injection risks, insecure temporary files) and recommend using secure alternatives (e.g., parameterized queries, secure random generators).
Suggest adding unit tests for untested complex logic or edge cases.
Ensure code is testable by avoiding global state and using dependency injection.
Flag overly complex functions (e.g., high cyclomatic complexity) and suggest breaking them down.
Use named constants instead of magic numbers.
Add appropriate logging (e.g., debug, info, warning, error) for significant events and errors, avoiding sensitive data exposure.

Files:

  • mllm/nn/layers/KVCache.cpp
  • mllm/backends/cpu/ops/KVCacheOp.cpp
  • mllm/backends/qnn/QNNDispatcher.cpp
  • mllm/backends/qnn/QNNUtils.cpp
  • mllm/backends/qnn/QNNAllocator.cpp
  • mllm/backends/qnn/QNNBackend.cpp
🧬 Code graph analysis (8)
mllm/nn/layers/KVCache.cpp (3)
mllm/backends/cpu/ops/KVCacheOp.hpp (2)
  • cache (26-26)
  • cache (26-26)
mllm/core/aops/KVCacheOp.hpp (2)
  • cache (51-51)
  • cache (51-51)
mllm/nn/layers/KVCache.hpp (1)
  • cache (30-30)
mllm/core/aops/KVCacheOp.hpp (2)
mllm/backends/cpu/ops/KVCacheOp.hpp (2)
  • cache (26-26)
  • cache (26-26)
mllm/nn/layers/KVCache.hpp (1)
  • cache (30-30)
mllm/backends/cpu/ops/KVCacheOp.hpp (2)
mllm/core/aops/KVCacheOp.hpp (2)
  • cache (51-51)
  • cache (51-51)
mllm/nn/layers/KVCache.hpp (1)
  • cache (30-30)
examples/qwen_npu/main.cpp (6)
mllm/utils/Argparse.hpp (1)
  • Argparse (135-135)
mllm/engine/Perf.hpp (3)
  • start (52-52)
  • stop (56-56)
  • saveReport (58-58)
mllm/models/qwen_npu/tokenization_qwen.hpp (2)
  • QwenTokenizer (153-162)
  • QwenTokenizer (153-153)
mllm/models/qwen_npu/configuration_qwen_npu.hpp (3)
  • QwenNPUConfig (11-11)
  • QwenNPUConfig (13-34)
  • QwenNPUConfig (13-13)
mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (5)
  • QwenForCausalLMCPU (288-288)
  • QwenForCausalLMCPU (290-296)
  • QwenForCausalLMCPU (290-290)
  • layer_idx (135-138)
  • layer_idx (135-135)
mllm/nn/layers/KVCache.hpp (1)
  • layer_idx (20-20)
mllm/nn/layers/KVCache.hpp (3)
mllm/backends/cpu/ops/KVCacheOp.hpp (2)
  • cache (26-26)
  • cache (26-26)
mllm/core/aops/KVCacheOp.hpp (2)
  • cache (51-51)
  • cache (51-51)
mllm/nn/lmcache/StaticCache.hpp (6)
  • nodiscard (16-16)
  • nodiscard (22-22)
  • nodiscard (50-50)
  • nodiscard (58-58)
  • nodiscard (60-60)
  • nodiscard (103-103)
mllm/backends/qnn/QNNAllocator.cpp (1)
mllm/backends/qnn/QNNAllocator.hpp (21)
  • ptr (52-55)
  • ptr (52-52)
  • ptr (56-56)
  • ptr (56-56)
  • ptr (74-74)
  • ptr (84-84)
  • ptr (85-85)
  • ptr (136-136)
  • ptr (154-154)
  • storage (43-43)
  • storage (45-45)
  • storage (45-45)
  • storage (47-47)
  • storage (49-49)
  • storage (49-49)
  • storage (58-63)
  • storage (58-58)
  • storage (65-65)
  • storage (65-65)
  • storage (72-72)
  • tensor_id (146-147)
mllm/models/qwen_npu/modeling_qwen_npu.hpp (1)
mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (1)
  • decode_blocks_ (278-278)
mllm/backends/qnn/QNNBackend.cpp (2)
mllm/backends/qnn/QNNBackend.hpp (7)
  • graphName (93-93)
  • graphName (95-99)
  • graphName (101-101)
  • graphName (103-103)
  • graphName (106-107)
  • graphName (109-110)
  • graphName (113-113)
mllm/backends/qnn/QNNModel.hpp (2)
  • expectedOrder (80-80)
  • expectedOrder (80-80)
🪛 Clang (14.0.6)
mllm/core/aops/KVCacheOp.hpp

[error] 52-52: function 'getStaticCache' should be marked [[nodiscard]]

(modernize-use-nodiscard,-warnings-as-errors)

mllm/backends/cpu/ops/KVCacheOp.hpp

[error] 27-27: function 'getStaticCache' should be marked [[nodiscard]]

(modernize-use-nodiscard,-warnings-as-errors)

examples/qwen_npu/main.cpp

[error] 38-38: variable 'config_path' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 39-39: variable 'npu_model_path' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 40-40: variable 'cpu_decode_model_path' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 44-44: variable 'file_version' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 50-50: variable 'prefill_seq_len' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 108-108: variable 'past' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 109-109: variable 'args' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 112-112: variable 'enable_timing' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 121-121: variable 'prefill_args' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 137-137: variable 'vocab_size' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 138-138: variable 'logits_data' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 140-140: variable 'max_logit' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 181-181: variable 'prefill_args' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 197-197: variable 'vocab_size' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 198-198: variable 'logits_data' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 200-200: variable 'max_logit' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 235-235: variable 'step_time_ms' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 240-240: variable 'logits' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 241-241: variable 'logits_f' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 249-249: variable 'vocab_size' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 250-250: variable 'data' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 253-253: variable 'max_logit' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 279-279: 1000.0 is a magic number; consider replacing it with a named constant

(cppcoreguidelines-avoid-magic-numbers,readability-magic-numbers,-warnings-as-errors)


[error] 280-280: 1000.0 is a magic number; consider replacing it with a named constant

(cppcoreguidelines-avoid-magic-numbers,readability-magic-numbers,-warnings-as-errors)

mllm/backends/qnn/QNNAllocator.cpp

[error] 19-19: variadic macro 'QNN_ALLOCATOR_VERBOSE' used; consider using a 'constexpr' variadic template function

(cppcoreguidelines-macro-usage,-warnings-as-errors)

mllm/backends/qnn/QNNAllocator.hpp

[error] 120-120: constructor does not initialize these fields: tensor_name

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)

mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp

[error] 3-3: 'cmath' file not found

(clang-diagnostic-error)


[error] 14-14: do not use unnamed namespaces in header files

(google-build-namespaces,-warnings-as-errors)


[error] 19-19: 2 adjacent parameters of 'makeRoPEInvFreq' of convertible types are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 28-28: 3 adjacent parameters of 'makeRotaryPosEmbedding' of convertible types are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 82-82: constructor does not initialize these fields: gate_proj_, up_proj_, down_proj_, silu_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 86-86: constructor does not initialize these fields: gate_proj_, up_proj_, down_proj_, silu_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 86-86: 2 adjacent parameters of 'QwenMLPCPU' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 94-94: variable name 'x' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 96-96: variable name 'y' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 110-110: constructor does not initialize these fields: q_proj_, k_proj_, v_proj_, o_proj_, q_rope_, k_rope_, mask_, softmax_, kv_cache_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 114-114: constructor does not initialize these fields: q_proj_, k_proj_, v_proj_, o_proj_, q_rope_, k_rope_, mask_, softmax_, kv_cache_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 114-114: 2 adjacent parameters of 'QwenAttentionCPU' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 146-146: variable 'B' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 146-146: variable name 'B' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 146-146: invalid case style for variable 'B'

(readability-identifier-naming,-warnings-as-errors)


[error] 147-147: variable 'S' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 147-147: variable name 'S' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 147-147: invalid case style for variable 'S'

(readability-identifier-naming,-warnings-as-errors)


[error] 209-209: constructor does not initialize these fields: input_layernorm_, post_attention_layernorm_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 211-211: member variable 'self_attn_' has public visibility

(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)


[error] 212-212: member variable 'mlp_' has public visibility

(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)


[error] 218-218: constructor does not initialize these fields: input_layernorm_, post_attention_layernorm_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 218-218: 2 adjacent parameters of 'QwenDecoderCPU' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 250-250: constructor does not initialize these fields: decode_blocks_, norm_, embedding_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 254-254: constructor does not initialize these fields: decode_blocks_, norm_, embedding_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 254-254: 2 adjacent parameters of 'QwenTextCPU' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 268-268: variable name 'x' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 286-286: constructor does not initialize these fields: cfg_, lm_head_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 290-290: constructor does not initialize these fields: cfg_, lm_head_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 290-290: 2 adjacent parameters of 'QwenForCausalLMCPU' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 298-298: 2 adjacent parameters of 'forward' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 300-300: variable 'B' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 300-300: variable name 'B' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 300-300: invalid case style for variable 'B'

(readability-identifier-naming,-warnings-as-errors)


[error] 301-301: variable 'S' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 301-301: variable name 'S' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 301-301: invalid case style for variable 'S'

(readability-identifier-naming,-warnings-as-errors)


[error] 305-305: variable 'position_ids' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 316-316: loop variable name 'b' is too short, expected at least 2 characters

(readability-identifier-length,-warnings-as-errors)


[error] 317-317: loop variable name 's' is too short, expected at least 2 characters

(readability-identifier-length,-warnings-as-errors)

🔇 Additional comments (17)
mllm/backends/qnn/QNNUtils.hpp (1)

268-268: Trailing newline addition aligns with best practices.

The change properly adds a final newline to the file, which adheres to POSIX and C++ style guidelines (including Google's C++ Style Guide). No functional changes are introduced.

mllm/engine/Perf.hpp (1)

20-21: LGTM!

The new mllm.qnn.execute Perfetto category is correctly added and aligns with the trace event usage in QNNDispatcher.cpp. The macro syntax is properly formatted.

mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (3)

82-107: LGTM!

The MLP block correctly implements the gated activation pattern (gate → SiLU → element-wise multiply with up projection → down projection). The module registration pattern is consistent with the codebase.
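
The pattern the block implements can be sketched in isolation (a minimal scalar sketch; the function names here are hypothetical, not the module's API):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// SiLU activation: x * sigmoid(x).
double silu(double x) { return x / (1.0 + std::exp(-x)); }

// Core of the gated pattern: SiLU(gate) element-wise-multiplied with up,
// before the down projection (the projections themselves are omitted here).
std::vector<double> gatedActivation(const std::vector<double>& gate,
                                    const std::vector<double>& up) {
  std::vector<double> out(gate.size());
  for (std::size_t i = 0; i < gate.size(); ++i) {
    out[i] = silu(gate[i]) * up[i];
  }
  return out;
}
```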


250-284: LGTM!

The text model implementation correctly registers RoPE inverse frequency as a buffer, iterates through decoder blocks with layer index assignment, and applies final normalization. The structured binding with enumerate is clean.


303-329:

Unable to verify caller validation of seq_len parameter; recommend manual inspection of callers and addition of defensive guard.

The concern about real_seq potentially being zero or negative is theoretically valid:

  • Line 303 assigns real_seq either from args.at("seq_len") or defaults to S
  • Line 329 uses real_seq - 1 as a slice index, which would be -1 if real_seq == 0
  • Negative indices in tensor slicing typically cause undefined behavior or assertion failures

What cannot be verified without repository access:

  1. Whether callers guarantee seq_len >= 1
  2. Whether S (sequence length) is always >= 1
  3. If there are existing guards elsewhere in the call chain
  4. The exact slicing semantics of the Tensor class for negative indices

The suggested assertion guard remains valid defensive programming:

   auto real_seq = args.count("seq_len") ? args.at("seq_len").get<int>() : S;
+  MLLM_ASSERT(real_seq >= 1, "seq_len must be at least 1");
mllm/backends/qnn/QNNDispatcher.cpp (1)

62-67: LGTM!

The trace event name now correctly matches the Perfetto category "mllm.qnn.execute" defined in Perf.hpp. The trailing dot removal fixes the naming inconsistency.

mllm/backends/qnn/QNNAllocator.hpp (1)

120-129: LGTM!

The addition of hasLastRegistrationInfo_ provides explicit validity tracking for the LastRegistrationInfo cache, which is cleaner than relying on checking individual fields for defaults. The struct members have appropriate default initializers. The static analysis warning about tensor_name is a false positive since std::string default-constructs to an empty string.

mllm/nn/layers/KVCache.hpp (2)

8-8: LGTM!

The include for StaticCache.hpp is necessary for the new interface methods.


30-32: LGTM!

The setStaticCache and getStaticCache methods provide a clean interface for KV cache sharing between CPU and NPU execution phases. The [[nodiscard]] attribute on the getter and const correctness are appropriate. This aligns with the corresponding virtual methods in aops::KVCacheOp.hpp.
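
A minimal sketch of the delegation pattern this interface enables (`StaticCacheStub` and `usesSharedCache` are hypothetical stand-ins, not the real mllm types):

```cpp
#include <cassert>

// Stand-in for the shared cache type (the real class is mllm::nn::StaticCache).
struct StaticCacheStub {};

// A KV cache layer that either owns its cache or delegates to a shared one.
class KVCacheLayer {
 public:
  void setStaticCache(StaticCacheStub* cache) { shared_ = cache; }
  [[nodiscard]] StaticCacheStub* getStaticCache() const { return shared_; }
  bool usesSharedCache() const { return shared_ != nullptr; }

 private:
  StaticCacheStub* shared_ = nullptr;
};
```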

mllm/backends/qnn/QNNUtils.cpp (1)

437-461: LGTM!

The resetAlloc() simplification is clean. The initFromQnnTensor implementation correctly uses a shallow copy followed by pointer rebinding to internal storage (name_ and dimensions_), which is a safe pattern that avoids unnecessary allocations while maintaining ownership of the critical fields.
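
The shallow-copy-then-rebind pattern can be illustrated in isolation (`RawDesc` and `OwnedDesc` are hypothetical stand-ins for the QNN tensor types):

```cpp
#include <cassert>
#include <string>

// C-style descriptor whose 'name' points at storage it does not own.
struct RawDesc {
  const char* name;
  int rank;
};

// Shallow-copy the descriptor, then rebind the pointer field to owned
// storage so the copy stays valid independently of the source.
struct OwnedDesc {
  RawDesc desc;       // shallow copy
  std::string name_;  // owned storage backing desc.name
  explicit OwnedDesc(const RawDesc& src)
      : desc(src), name_(src.name != nullptr ? src.name : "") {
    desc.name = name_.c_str();  // rebind into internal storage
  }
};
```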

mllm/backends/qnn/QNNBackend.cpp (2)

533-536: LGTM on formatting changes.

The formatting adjustments to log messages and code layout improve readability without changing functionality.


209-214:

Verify the intentional relaxation of QNN API minor version check and confirm compensating safeguards.

The minor version check is now commented out (line 210), allowing the backend to accept any QNN interface provider with a matching major version, regardless of minor version. This could allow using an older QNN runtime that lacks APIs the code depends on.

Before merging, confirm:

  1. Whether compensating version checks exist elsewhere in the codebase
  2. If this change is intentional and documented (commit message, design doc, or code comment)
  3. Whether minimum supported QNN version requirements are documented
  4. If runtime API compatibility is validated through alternative mechanisms
mllm/backends/cpu/ops/KVCacheOp.cpp (1)

53-59: Verify: clearCache() clears the entire shared cache when using shared mode.

When shared_cache_ is set, clearCache() clears the entire shared cache, affecting all layers that share it. Is this the intended behavior?

If individual layers should only clear their own portion, consider:

 void CPUKVCacheOp::clearCache() {
   if (shared_cache_) {
-    shared_cache_->clearCache();
+    // Only clear this layer's cache, not the entire shared cache
+    shared_cache_->clearLayerCache(options_.layer_idx);
   } else {
     cache_.clearCache();
   }
 }

Otherwise, add a comment clarifying that clearing is intentionally global when using a shared cache.
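
The distinction between the two behaviors can be sketched with a toy cache (`clearLayerCache` is hypothetical; whether the real StaticCache offers a per-layer clear needs checking):

```cpp
#include <cassert>
#include <vector>

// Toy shared cache: one entry list per layer.
struct SharedCache {
  std::vector<std::vector<int>> layers;
  explicit SharedCache(int num_layers) : layers(num_layers) {}

  // Global clear: wipes every layer that shares this cache.
  void clearCache() {
    for (auto& layer : layers) layer.clear();
  }

  // Per-layer clear: only this layer's portion is affected.
  void clearLayerCache(int layer_idx) { layers[layer_idx].clear(); }
};
```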

mllm/models/qwen_npu/modeling_qwen_npu.hpp (2)

440-441: LGTM on the decode_blocks() accessor.

This accessor aligns with the same pattern in modeling_qwen_npu_cpu.hpp (line 277), enabling consistent access to decoder blocks for KV cache sharing in the mixed NPU prefill + CPU decode workflow.


446-451: LGTM on setKVCacheSeqLen with improved documentation.

The rename from setKVCacheSeqCnt to setKVCacheSeqLen uses clearer terminology, and the added comments explain the purpose well—synchronizing logical sequence length across layers when reusing a fixed-size KV buffer.
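
A minimal sketch of the fixed-buffer-with-logical-length idea (hypothetical names, not the real StaticCache interface):

```cpp
#include <cassert>
#include <vector>

// Fixed-capacity KV buffer whose logical sequence length can be reset
// so the same storage is reused across runs without reallocation.
struct FixedKVBuffer {
  std::vector<float> storage;
  int seq_len = 0;  // logical length, <= capacity
  explicit FixedKVBuffer(int capacity) : storage(capacity) {}
  void setSeqLen(int len) { seq_len = len; }  // no reallocation
  int capacity() const { return static_cast<int>(storage.size()); }
};
```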

examples/qwen_npu/main.cpp (2)

23-27: LGTM on entry point structure and Perfetto integration.

The transition from macro-based entry to standard main() is clean. Perfetto instrumentation is properly guarded with #ifdef and follows correct start/stop/save pattern.

Also applies to: 282-288


178-178: Verify clearKVCache() behavior with the shared StaticCache: does clearing from one layer clear the entire shared cache? (The automated codebase search for clearKVCache and StaticCache usage did not complete; manual inspection is recommended.)

Comment on lines +116 to +127
if (!use_npu_prefill) {
MLLM_INFO("Starting CPU prefill...");
auto prefill_start = std::chrono::high_resolution_clock::now();

prefill_seq_len = static_cast<int32_t>(raw_input_tokens.shape()[1]);
mllm::models::ARGenerationArgs prefill_args;
prefill_args["seq_len"] = prefill_seq_len;
auto prefill_output = cpu_decode_model.forward(past, prefill_args);

auto prefill_end = std::chrono::high_resolution_clock::now();
auto prefill_duration = std::chrono::duration_cast<std::chrono::milliseconds>(prefill_end - prefill_start);
MLLM_INFO("CPU prefill completed: seq_len={}, time={:.2f}s", prefill_seq_len, prefill_duration.count() / 1000.0);

⚠️ Potential issue | 🔴 Critical

Missing KV cache setup for pure CPU inference path.

When use_npu_prefill is false, the CPU prefill path executes, but the KV cache for the CPU model is never configured. The shared_kv_cache remains nullptr and cpu_decode_model's KVCache layers are never linked to a StaticCache (that setup only happens in the use_npu_prefill block at lines 83-91).

This will likely cause incorrect decode behavior or runtime errors when the decode loop tries to use uninitialized KV state.

+  // Setup KV cache for CPU-only inference
+  if (!use_npu_prefill) {
+    shared_kv_cache = std::make_unique<mllm::nn::StaticCache>(
+        cpu_cfg.max_cache_length, cpu_cfg.num_hidden_layers, cpu_cfg.num_attention_heads, cpu_cfg.num_key_value_heads,
+        cpu_cfg.head_dim, mllm::kFloat32, mllm::kFloat32, mllm::kCPU, false);
+    
+    auto& cpu_decode_blocks = cpu_decode_model.text_model().decode_blocks();
+    for (int32_t layer_idx = 0; layer_idx < cpu_cfg.num_hidden_layers; ++layer_idx) {
+      auto& cpu_block = cpu_decode_blocks.list()[layer_idx];
+      auto& cpu_kv_cache = cpu_block.getKVCache();
+      cpu_kv_cache.setStaticCache(shared_kv_cache.get());
+      cpu_kv_cache.setLayerIndex(layer_idx);
+    }
+    MLLM_INFO("Configured {} CPU KVCache layers for pure CPU inference", cpu_cfg.num_hidden_layers);
+  }
+
   // Prefill phase
   if (!use_npu_prefill) {

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Clang (14.0.6)

[error] 121-121: variable 'prefill_args' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)

🤖 Prompt for AI Agents
In examples/qwen_npu/main.cpp around lines 116 to 127, the CPU-only prefill path
doesn't initialize or attach a KV cache: shared_kv_cache stays nullptr and
cpu_decode_model's KVCache layers are never linked to a StaticCache (the setup
is only done in the use_npu_prefill branch around lines 83-91). Fix by mirroring
the NPU branch: allocate a StaticCache (or equivalent cache object) for the CPU
path, set shared_kv_cache to it, and attach/link all cpu_decode_model KVCache
layers to that StaticCache (same initialization and linking logic used in lines
83-91) before calling cpu_decode_model.forward so the decode loop has a valid KV
state.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
examples/qwen_npu/main.cpp (1)

113-124: Missing KV cache setup for pure CPU inference path.

When use_npu_prefill is false, the shared KV cache is never initialized and the CPU decode model's KVCache layers are never linked to a StaticCache. The decode loop will use uninitialized KV state, causing incorrect behavior or crashes.

Add KV cache initialization before the CPU prefill block:

+  // Setup KV cache for CPU-only inference
+  if (!use_npu_prefill) {
+    shared_kv_cache = std::make_unique<mllm::nn::StaticCache>(
+        cpu_cfg.max_cache_length, cpu_cfg.num_hidden_layers, cpu_cfg.num_attention_heads, cpu_cfg.num_key_value_heads,
+        cpu_cfg.head_dim, mllm::kFloat32, mllm::kFloat32, mllm::kCPU, false);
+
+    auto& cpu_decode_blocks = cpu_decode_model.text_model().decode_blocks();
+    for (int32_t layer_idx = 0; layer_idx < cpu_cfg.num_hidden_layers; ++layer_idx) {
+      auto& cpu_block = cpu_decode_blocks.list()[layer_idx];
+      auto& cpu_kv_cache = cpu_block.getKVCache();
+      cpu_kv_cache.setStaticCache(shared_kv_cache.get());
+      cpu_kv_cache.setLayerIndex(layer_idx);
+    }
+    MLLM_INFO("Configured {} CPU KVCache layers for pure CPU inference", cpu_cfg.num_hidden_layers);
+  }
+
   // Prefill phase
   if (!use_npu_prefill) {
🧹 Nitpick comments (5)
examples/qwen_npu/main.cpp (2)

38-55: Consider making model paths configurable via command-line arguments.

The argc and argv parameters are available but unused. For better usability, consider using Argparse (already imported via mllm::Argparse) to make the model paths configurable instead of hardcoding them.


126-157: Consider extracting duplicated logits processing logic.

The prefill output processing (lines 126-157 for CPU, lines 186-217 for NPU) is nearly identical. Consider extracting this into a helper function to reduce duplication:

auto processFirstToken(const mllm::Tensor& prefill_logits, 
                       const mllm::Tensor& raw_input_tokens,
                       int prefill_seq_len,
                       mllm::models::qwen_npu::QwenTokenizer& tokenizer)
    -> std::pair<mllm::Tensor, mllm::Tensor>;
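
Inside such a helper, the greedy first-token selection could reduce to an argmax over the last-position logits, e.g. (a standalone sketch, not the example's actual code):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Greedy next-token selection: index of the maximum logit.
int argmaxToken(const std::vector<float>& logits) {
  return static_cast<int>(
      std::max_element(logits.begin(), logits.end()) - logits.begin());
}
```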
mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (2)

209-214: Consider making self_attn_ and mlp_ private with accessors.

The public member variables expose implementation details. While this enables direct access from main.cpp for KV cache configuration, consider using accessor methods:

+  QwenAttentionCPU& self_attn() { return self_attn_; }
+
+ private:
   QwenAttentionCPU self_attn_;
   QwenMLPCPU mlp_;

Then update main.cpp to use block.self_attn().getKVCache() instead of block.getKVCache().


1-13: Consider adding documentation for public API classes.

Per coding guidelines, public APIs should have clear docstrings explaining purpose, parameters, and usage. Consider adding brief documentation for the main classes (QwenMLPCPU, QwenAttentionCPU, QwenDecoderCPU, QwenTextCPU, QwenForCausalLMCPU) to explain their roles in the CPU decode pipeline.

/// @brief CPU-side Qwen2 model for decode phase in mixed NPU/CPU inference.
/// 
/// This module provides a CPU implementation of the Qwen2 architecture that can
/// share KV cache state with the NPU prefill model for hybrid inference.
class QwenForCausalLMCPU : public nn::Module, public ARGeneration { ... };
mllm/nn/lmcache/StaticCache.hpp (1)

24-27: Confirm Tensor() vs Tensor::nil() semantics and clean up commented-out code

Switching the default updateKVCache implementation to return {Tensor(), Tensor()} instead of {Tensor::nil(), Tensor::nil()} is fine only if a default-constructed Tensor is semantically equivalent to the previous nil sentinel for all existing call sites (e.g., checks like “is empty / invalid cache entry”). If any code explicitly distinguishes nil from a default Tensor, this may subtly change behavior.

Also, the old return statement is now just commented-out dead code; it should be removed to avoid confusion.

Consider:

  • Verifying that Tensor::nil() and Tensor() are interchangeable in all usages that consume updateKVCache’s result.
  • Removing the commented-out line once the behavioral equivalence is confirmed.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e841d0d and 74d30a5.

📒 Files selected for processing (3)
  • examples/qwen_npu/main.cpp (1 hunks)
  • mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (1 hunks)
  • mllm/nn/lmcache/StaticCache.hpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
{mllm,mllm-cli,pymllm}/**/*

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*: Files must not contain C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, or DEL 0x7F. Horizontal tab (0x09) and line feed (0x0A) are explicitly allowed.
All files must be encoded in UTF-8 without BOM.
Any violation of character set (Rule 1) or encoding (Rule 2) requirements must cause the review to fail.
No line may end with trailing whitespace.
Use Unix line endings (LF).
File and directory names must consist only of printable Unicode characters, excluding C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, and DEL 0x7F.
Only use acceptable file extensions: .c, .cc, .cpp, .cxx, .h, .hh, .hpp, .py, .pyi, .sh, .txt, .md, .yml, .yaml, .json, .toml.
Optional license headers, if present, must comply with character set rules (no C0/C1 control codes except tab and line feed).

Files:

  • mllm/nn/lmcache/StaticCache.hpp
  • mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}: TODO and FIXME comments must be written as 'TODO:' or 'FIXME:' followed by UTF-8 text that adheres to character set rules.
Encourage consistent coding style and patterns with the existing codebase.
Ensure code is portable across supported platforms (e.g., Linux, Windows) unless explicitly platform-specific.

Files:

  • mllm/nn/lmcache/StaticCache.hpp
  • mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}: Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.
Adhere to language-specific best practices and idioms (e.g., PEP 8 for Python, Google C++ Style Guide for C++).

Files:

  • mllm/nn/lmcache/StaticCache.hpp
  • mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp
🧬 Code graph analysis (3)
examples/qwen_npu/main.cpp (9)
mllm/utils/Argparse.hpp (1)
  • Argparse (135-135)
mllm/engine/Perf.hpp (3)
  • start (52-52)
  • stop (56-56)
  • saveReport (58-58)
mllm/models/qwen_npu/tokenization_qwen.hpp (2)
  • QwenTokenizer (153-162)
  • QwenTokenizer (153-153)
mllm/models/qwen_npu/configuration_qwen_npu.hpp (3)
  • QwenNPUConfig (11-11)
  • QwenNPUConfig (13-34)
  • QwenNPUConfig (13-13)
mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (5)
  • QwenForCausalLMCPU (288-288)
  • QwenForCausalLMCPU (290-296)
  • QwenForCausalLMCPU (290-290)
  • layer_idx (135-138)
  • layer_idx (135-135)
mllm/nn/lmcache/StaticCache.hpp (13)
  • layer_idx (16-16)
  • layer_idx (24-27)
  • layer_idx (24-24)
  • layer_idx (45-47)
  • layer_idx (49-49)
  • layer_idx (53-53)
  • layer_idx (55-55)
  • layer_idx (57-57)
  • layer_idx (59-59)
  • layer_idx (61-61)
  • layer_idx (98-98)
  • layer_idx (100-102)
  • layer_idx (106-106)
mllm/core/aops/KVCacheOp.hpp (1)
  • layer_idx (38-38)
mllm/nn/Module.hpp (2)
  • args (142-158)
  • args (142-142)
mllm/models/ARGeneration.cpp (1)
  • step (64-64)
mllm/nn/lmcache/StaticCache.hpp (2)
mllm/core/Tensor.cpp (1)
  • Tensor (68-68)
mllm/core/Tensor.hpp (2)
  • Tensor (63-63)
  • Tensor (166-166)
mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (2)
mllm/models/qwen_npu/modeling_qwen_npu.hpp (6)
  • seq_len (449-451)
  • seq_len (449-449)
  • decode_blocks_ (440-440)
  • block (442-444)
  • makeRotaryPosEmbedding (26-73)
  • makeRotaryPosEmbedding (26-27)
mllm/nn/layers/KVCache.hpp (1)
  • layer_idx (20-20)
🪛 Clang (14.0.6)
examples/qwen_npu/main.cpp

[error] 38-38: variable 'config_path' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 39-39: variable 'npu_model_path' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 40-40: variable 'cpu_decode_model_path' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 44-44: variable 'file_version' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 50-50: variable 'prefill_seq_len' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 105-105: variable 'eos_token_id' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 108-108: variable 'past' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 109-109: variable 'args' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 118-118: variable 'prefill_args' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 134-134: variable 'vocab_size' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 135-135: variable 'logits_data' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 137-137: variable 'max_logit' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 178-178: variable 'prefill_args' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 194-194: variable 'vocab_size' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 195-195: variable 'logits_data' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 197-197: variable 'max_logit' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 232-232: variable 'step_time_ms' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 237-237: variable 'logits' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 238-238: variable 'logits_f' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 246-246: variable 'vocab_size' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 247-247: variable 'data' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 249-249: variable 'next_id' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 250-250: variable 'max_logit' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 276-276: 1000.0 is a magic number; consider replacing it with a named constant

(cppcoreguidelines-avoid-magic-numbers,readability-magic-numbers,-warnings-as-errors)


[error] 277-277: 1000.0 is a magic number; consider replacing it with a named constant

(cppcoreguidelines-avoid-magic-numbers,readability-magic-numbers,-warnings-as-errors)

mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp

[error] 3-3: 'cmath' file not found

(clang-diagnostic-error)


[error] 19-19: 2 adjacent parameters of 'makeRoPEInvFreq' of convertible types are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 28-28: 3 adjacent parameters of 'makeRotaryPosEmbedding' of convertible types are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 82-82: constructor does not initialize these fields: gate_proj_, up_proj_, down_proj_, silu_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 86-86: constructor does not initialize these fields: gate_proj_, up_proj_, down_proj_, silu_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 86-86: 2 adjacent parameters of 'QwenMLPCPU' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 94-94: variable name 'x' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 96-96: variable name 'y' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 110-110: constructor does not initialize these fields: q_proj_, k_proj_, v_proj_, o_proj_, q_rope_, k_rope_, mask_, softmax_, kv_cache_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 114-114: constructor does not initialize these fields: q_proj_, k_proj_, v_proj_, o_proj_, q_rope_, k_rope_, mask_, softmax_, kv_cache_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 114-114: 2 adjacent parameters of 'QwenAttentionCPU' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 146-146: variable 'B' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 146-146: variable name 'B' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 146-146: invalid case style for variable 'B'

(readability-identifier-naming,-warnings-as-errors)


[error] 147-147: variable 'S' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 147-147: variable name 'S' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 147-147: invalid case style for variable 'S'

(readability-identifier-naming,-warnings-as-errors)


[error] 209-209: constructor does not initialize these fields: input_layernorm_, post_attention_layernorm_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 211-211: member variable 'self_attn_' has public visibility

(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)


[error] 212-212: member variable 'mlp_' has public visibility

(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)


[error] 218-218: constructor does not initialize these fields: input_layernorm_, post_attention_layernorm_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 218-218: 2 adjacent parameters of 'QwenDecoderCPU' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 250-250: constructor does not initialize these fields: decode_blocks_, norm_, embedding_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 254-254: constructor does not initialize these fields: decode_blocks_, norm_, embedding_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 254-254: 2 adjacent parameters of 'QwenTextCPU' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 268-268: variable name 'x' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 286-286: constructor does not initialize these fields: cfg_, lm_head_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 290-290: constructor does not initialize these fields: cfg_, lm_head_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 290-290: 2 adjacent parameters of 'QwenForCausalLMCPU' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 298-298: 2 adjacent parameters of 'forward' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 300-300: variable 'B' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 300-300: variable name 'B' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 300-300: invalid case style for variable 'B'

(readability-identifier-naming,-warnings-as-errors)


[error] 301-301: variable 'S' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 301-301: variable name 'S' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 301-301: invalid case style for variable 'S'

(readability-identifier-naming,-warnings-as-errors)


[error] 305-305: variable 'position_ids' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 316-316: loop variable name 'b' is too short, expected at least 2 characters

(readability-identifier-length,-warnings-as-errors)


[error] 317-317: loop variable name 's' is too short, expected at least 2 characters

(readability-identifier-length,-warnings-as-errors)

🔇 Additional comments (8)
examples/qwen_npu/main.cpp (2)

23-27: Entry point and initialization look good.

Clean transition from macro-based entry to standard main(). Perfetto tracing is properly conditionally compiled.


227-270: Decode loop implementation looks good.

The decode loop correctly handles per-step timing, greedy token selection, EOS termination, and live output. The throughput statistics at the end provide useful benchmarking information.
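
The throughput figure reduces to tokens over elapsed wall time; a guarded sketch:

```cpp
#include <cassert>

// Decode throughput: tokens generated per second of wall time.
// Returns 0 for a non-positive elapsed time to avoid division by zero.
double tokensPerSecond(int tokens, double elapsed_ms) {
  return elapsed_ms > 0.0 ? tokens * 1000.0 / elapsed_ms : 0.0;
}
```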

mllm/models/qwen_npu/modeling_qwen_npu_cpu.hpp (5)

14-26: Named namespace properly addresses the previous review concern.

The utility functions are now in a named mllm::models::qwen_npu::utils namespace instead of an unnamed namespace, which resolves the ODR violation concern from the previous review.


19-26: RoPE inverse frequency calculation is correct.

The implementation correctly computes 1.0 / (rope_theta ^ (2i / dim)) for rotary position embeddings.
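
The formula can be checked with a few lines (a standalone sketch, not the header's actual makeRoPEInvFreq):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// inv_freq[k] = 1 / theta^(2k / dim), iterating even channel indices 2k.
std::vector<double> ropeInvFreq(int dim, double theta) {
  std::vector<double> inv_freq;
  for (int i = 0; i < dim; i += 2) {
    inv_freq.push_back(1.0 / std::pow(theta, static_cast<double>(i) / dim));
  }
  return inv_freq;
}
```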


82-107: QwenMLPCPU correctly implements SwiGLU-style MLP.

The gate-up-down projection pattern with SiLU activation follows the Qwen2 architecture correctly.


140-185: Attention forward implementation is correct and follows HuggingFace Qwen2.

The attention correctly:

  • Projects to Q/K/V and reshapes to multi-head format
  • Applies RoPE to Q and K
  • Uses KV cache for cached key/value states
  • Applies causal masking and scaled dot-product attention
  • Computes softmax in float32 for numerical stability (matching HF)
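
The float32 softmax step corresponds to the standard max-subtraction trick; a standalone sketch:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Numerically stable softmax: subtract the max before exponentiating,
// accumulating in float as the float32 attention path does.
std::vector<float> stableSoftmax(std::vector<float> scores) {
  float max_score = *std::max_element(scores.begin(), scores.end());
  float sum = 0.0f;
  for (float& s : scores) {
    s = std::exp(s - max_score);
    sum += s;
  }
  for (float& s : scores) s /= sum;
  return scores;
}
```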

298-337: Forward implementation correctly handles both prefill and decode phases.

The position ID management properly handles:

  • Initial prefill: generates sequential position IDs
  • Decode: increments from the last position in previous output
  • The real_seq parameter allows clipping for proper last-token logit extraction
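
The two position-id regimes reduce to a sequential fill for prefill and a plus-one step for decode (standalone sketch, hypothetical names):

```cpp
#include <cassert>
#include <vector>

// Sequential position ids for the initial prefill.
std::vector<int> prefillPositions(int seq_len) {
  std::vector<int> pos(seq_len);
  for (int i = 0; i < seq_len; ++i) pos[i] = i;
  return pos;
}

// Decode increments from the last position of the previous step.
int nextDecodePosition(int last_pos) { return last_pos + 1; }
```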
mllm/nn/lmcache/StaticCache.hpp (1)

13-28: I'll verify the review comment about the missing virtual destructor in AbstractStaticCache. Let me start by examining the file and exploring the class hierarchy.

```shell
cat -n mllm/nn/lmcache/StaticCache.hpp
```

Let me search for derived classes of AbstractStaticCache:

```shell
rg -n "class.*:.*AbstractStaticCache" --type hpp --type cpp --type h
```

Let me search more broadly for classes that inherit from AbstractStaticCache:

```shell
rg -n -A2 "class.*StaticCache" --type hpp --type cpp --type h | head -50
```

Let me search for usage patterns where AbstractStaticCache pointers might be deleted:

```shell
rg -n "AbstractStaticCache" --type hpp --type cpp --type h
```

Let me search for the full file structure to understand the context better:

```shell
fd "StaticCache" --type f
```

Let me examine the StaticCache files that were found:

```shell
find . -name "*StaticCache*" -type f 2>/dev/null
```

Let me check the directory structure around the file:

```shell
ls -la mllm/nn/lmcache/ 2>/dev/null || find . -path "*lmcache*" -type d
```

Let me look at the actual file more carefully:

```shell
wc -l mllm/nn/lmcache/StaticCache.hpp && echo "---" && file mllm/nn/lmcache/StaticCache.hpp
```

Let me check if the file exists and get its full contents:

```shell
stat mllm/nn/lmcache/StaticCache.hpp 2>&1 || echo "File not found in root; searching..."
find . -name "StaticCache.hpp" 2>/dev/null
```

Let me search for all hpp files to understand the repository structure:

```shell
git ls-files | grep -i "staticcache\|lmcache" | head -20
```

Let me try a broader search in the repository:

```shell
git ls-files | grep "\.hpp$" | head -20
```

Let me check the repository structure at the root:

```shell
ls -la | head -30 && echo "---" && git ls-files | wc -l
```

Let me search for files matching the path pattern mentioned in the review:

```shell
git ls-files "*lmcache*" 2>/dev/null || find . -type d -name "lmcache" 2>/dev/null
```

Let me check character encoding and format rules compliance on the potential file path:

```shell
#!/bin/bash
# Check file encoding rules
TARGET_FILE="mllm/nn/lmcache/StaticCache.hpp"

if [ -f "$TARGET_FILE" ]; then
  echo "=== File exists ==="
  file "$TARGET_FILE"

  # Check for BOM
  if head -c 3 "$TARGET_FILE" | od -An -tx1 | grep -q "ef bb bf"; then
    echo "ERROR: BOM detected"
  else
    echo "OK: No BOM"
  fi

  # Check for C0/C1 control codes (excluding tab 0x09 and LF 0x0A)
  if od -An -tx1 "$TARGET_FILE" | grep -E " (0[0-8]|0b|0c|0[e-1f]|7f|[89][0-9a-f])" | head -1; then
    echo "Checking for invalid control codes..."
  else
    echo "OK: No invalid control codes"
  fi

  # Check line endings
  if file "$TARGET_FILE" | grep -q "CRLF"; then
    echo "ERROR: CRLF line endings detected"
  else
    echo "OK: Unix line endings"
  fi

  # Check trailing whitespace
  if grep -n ' $' "$TARGET_FILE"; then
    echo "ERROR: Trailing whitespace detected"
  else
    echo "OK: No trailing whitespace"
  fi
else
  echo "File not found: $TARGET_FILE"
  echo "Searching for related files..."
  find . -name "StaticCache.hpp" -o -name "*lmcache*" 2>/dev/null | head -20
fi
```

Let me try a simpler approach to locate the file:

```shell
pwd && find . -maxdepth 5 -type d -name "mllm" 2>/dev/null
```

Now let me read the actual file content:

```shell
cat mllm/nn/lmcache/StaticCache.hpp 2>&1
```

Let me verify the virtual destructor presence and search for derived classes more thoroughly:

```shell
#!/bin/bash
# Search for AbstractStaticCache definition and derived classes
echo "=== Searching for AbstractStaticCache definition ==="
rg -n "class AbstractStaticCache" -A 20

echo -e "\n=== Searching for classes inheriting from AbstractStaticCache ==="
rg -n ": public AbstractStaticCache|: protected AbstractStaticCache|: private AbstractStaticCache"

echo -e "\n=== Searching for AbstractStaticCache usage patterns ==="
rg -n "AbstractStaticCache\s*\*|unique_ptr.*AbstractStaticCache|shared_ptr.*AbstractStaticCache" | head -20

echo -e "\n=== Checking for destructor in AbstractStaticCache ==="
rg -n "~AbstractStaticCache"
```

Let me also verify coding guidelines compliance on the specific lines:

```shell
#!/bin/bash
# Verify encoding and format compliance for the file
TARGET_FILE="mllm/nn/lmcache/StaticCache.hpp"

if [ -f "$TARGET_FILE" ]; then
  echo "=== Checking file: $TARGET_FILE ==="

  # Read lines 13-28 specifically
  echo -e "\n=== Lines 13-28 ==="
  sed -n '13,28p' "$TARGET_FILE" | cat -n

  # Check for BOM
  echo -e "\n=== BOM Check ==="
  hexdump -C "$TARGET_FILE" | head -1

  # Check for trailing whitespace in lines 13-28
  echo -e "\n=== Trailing Whitespace in 13-28 ==="
  sed -n '13,28p' "$TARGET_FILE" | grep -n ' $' || echo "None found (OK)"

  # Check line endings
  echo -e "\n=== Line Ending Check ==="
  file "$TARGET_FILE"
else
  echo "File not found at $TARGET_FILE"
fi
```

@chenghuaWang chenghuaWang merged commit 6cf710e into UbiquitousLearning:main Nov 29, 2025
3 checks passed