feat: Implement Qwen NPU Decoding Support with Memory Management Fixes #537
Conversation
- Add KV cache sequence count management
- Implement decode loop with position_ids handling
- Add EOS token termination check
- Update forward method to support decode phase
- Correct multi-chunk decode loop and KV cache sequencing
- CausalMaskOp improvement by @oreomaker
Walkthrough

The PR introduces KV cache sequence count management APIs across core and backend layers, significantly enhances the QNN allocator's buffer lifecycle and registration handling, refactors the Qwen NPU example to support chunked inference with multi-phase prefill/decode, and updates the causal mask calculation logic for improved row-wise processing.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Main as main()
    participant Model as QwenForCausalLM
    participant Cache as KVCache
    participant Allocator as QNNAllocator
    participant Backend as QNNBackend
    Main->>Model: setKVCacheSeqCnt(chunk_start)
    Model->>Cache: setCurrentSeqCnt(seq)
    Cache->>Backend: Update sequence counter
    loop For each chunk
        Main->>Model: forward(prompt_chunk or position_ids)
        Model->>Cache: Store KV cache state
        Cache->>Allocator: registerQnnTensorToSharedBuffer(Storage*)
        Allocator->>Allocator: Multi-level fallback search
        alt Registration success
            Allocator->>Backend: Update tensor mem_handle
            Backend->>Backend: Copy input data with validation
            Backend->>Model: Execute graph
            Model->>Model: Generate next token
        else Registration fails
            Allocator->>Allocator: Free buffer & restore state
        end
        Main->>Model: Token emit & EOS check
    end
```

```mermaid
sequenceDiagram
    participant Input as Input Data
    participant QNNBackend as QNNBackend::graphExecute()
    participant Wrapper as QNNTensorWrapper
    participant Allocator as QNNAllocator
    participant QNN as QNN Runtime
    Input->>QNNBackend: Runtime inputs
    QNNBackend->>QNNBackend: Validate input non-nil
    QNNBackend->>Wrapper: Allocate wrapper tensor
    QNNBackend->>Wrapper: Copy data (min bytes or zero-pad)
    Wrapper->>Allocator: registerQnnTensorToSharedBuffer()
    alt Buffer registered
        Allocator->>Allocator: Reuse or create mem_handle
        Allocator->>Wrapper: Update Qnn_Tensor_t
    else Fallback exhausted
        Allocator->>Wrapper: Fail & free buffer
        Wrapper->>QNNBackend: Registration error
    end
    QNNBackend->>QNN: Execute with prepared inputs
```
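The multi-level fallback search shown in the diagrams can be sketched roughly as follows. This is an illustrative stand-in, not the actual allocator API: `RegistrySketch`, `resolve`, and the map member names are hypothetical, and the real code resolves `Qnn_MemHandle_t` values rather than plain ints.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical sketch of the multi-level fallback: try the exact buffer
// pointer first, then the tensor-id map, then the tensor-name map, then
// the last successful registration, before giving up.
struct RegistrySketch {
  std::map<const void*, int> by_ptr;             // ptr -> handle id
  std::map<uint32_t, const void*> by_tensor_id;  // tensor id -> ptr
  std::map<std::string, const void*> by_tensor_name;
  const void* last_registered = nullptr;

  // Returns the handle id, or -1 when every fallback level is exhausted.
  int resolve(const void* ptr, uint32_t id, const std::string& name) const {
    if (auto it = by_ptr.find(ptr); it != by_ptr.end()) return it->second;
    if (auto it = by_tensor_id.find(id); it != by_tensor_id.end())
      if (auto h = by_ptr.find(it->second); h != by_ptr.end()) return h->second;
    if (auto it = by_tensor_name.find(name); it != by_tensor_name.end())
      if (auto h = by_ptr.find(it->second); h != by_ptr.end()) return h->second;
    if (last_registered != nullptr)
      if (auto h = by_ptr.find(last_registered); h != by_ptr.end()) return h->second;
    return -1;  // caller frees the buffer and restores state
  }
};
```

The exhausted case corresponds to the "Registration fails" branch in the diagram, where the allocator frees the buffer and restores its prior state.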
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning) | ✅ Passed checks (2 passed)
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
mllm/backends/cpu/ops/CausalMaskOp.cpp (2)
61-61: Critical: Undefined variable `row_offset` in AVX2 path.

Line 61 references `row_offset`, which is not defined in this scope (the non-sliding-window block). The loop variable is `r`, and the addressing should match line 59 and the NEON implementation on line 76. Apply this diff to fix the undefined variable:

```diff
- float* fill_start = o_ptr + row_offset + copy_count;
+ float* fill_start = o_ptr + r * D + copy_count;
```
146-181: Float16 paths not updated with multi-chunk fix.

The float16 implementations (both AVX2 and NEON) still use the old formula `copy_count = s + 1` instead of the corrected `D - S + r + 1`. This creates an inconsistency where float32 correctly handles multi-chunk scenarios but float16 still assumes D == S. If the float16 dtype is used during multi-chunk decoding, the attention masks will be incorrect.

Apply this diff to update the float16 AVX2 path (lines 149-152):

```diff
 for (size_t s = 0; s < S; ++s) {
   const size_t row_offset = s * S;
-  const size_t copy_count = s + 1;
-  const size_t fill_count = S - copy_count;
+  const size_t copy_count = D - S + s + 1;
+  const size_t fill_count = std::max(D - copy_count, (size_t)0);
```

Apply this diff to update the float16 NEON path (lines 167-170):

```diff
 for (size_t s = 0; s < S; ++s) {
   const size_t row_offset = s * S;
-  const size_t copy_count = s + 1;
-  const size_t fill_count = S - copy_count;
+  const size_t copy_count = D - S + s + 1;
+  const size_t fill_count = std::max(D - copy_count, (size_t)0);
```
🧹 Nitpick comments (7)
mllm/nn/Module.hpp (1)
190-212: Const `list()` overload looks good; consider marking both overloads `[[nodiscard]]`.

The added `const std::vector<T>& list() const` is a clean const-correct accessor and matches existing usage of `layers_`. To keep clang-tidy happy and make intent explicit that callers shouldn't silently ignore this, you can mark both overloads as `[[nodiscard]]`:

```cpp
[[nodiscard]] std::vector<T>& list() { return layers_; }
[[nodiscard]] const std::vector<T>& list() const { return layers_; }
```

mllm/backends/cpu/ops/KVCacheOp.hpp (1)
12-28: CPUKVCacheOp seq-count overrides are consistent; consider `[[nodiscard]]`.

The `setCurrentSeqCnt`/`getCurrentSeqCnt` overrides align with the new `aops::KVCacheOp` API and look correct. To satisfy clang-tidy and make ignoring the result explicit, consider marking the base `KVCacheOp::getCurrentSeqCnt()` (and thus this override) as `[[nodiscard]]` so all backends inherit the contract.

mllm/nn/layers/KVCache.hpp (1)
19-28: KVCache seq-count API is reasonable; consider `[[nodiscard]]` on the getter.

The added `setCurrentSeqCnt`/`getCurrentSeqCnt` accessors give an appropriate hook for higher layers to manage KV cache sequence length. Given that ignoring the return value from `getCurrentSeqCnt()` is almost certainly unintended and clang-tidy is already flagging it, consider:

```cpp
[[nodiscard]] int32_t getCurrentSeqCnt() const;
```

to make the contract explicit and quiet the warning.
mllm/core/aops/KVCacheOp.hpp (1)
35-49: Base KVCacheOp seq-count hooks are well-scoped; consider `[[nodiscard]]` on the getter (and possibly `options()`).

The new virtuals with safe defaults are a good way to expose seq-count management without forcing all backends to implement it immediately. Given this is effectively a query API, it's worth marking `getCurrentSeqCnt()` as `[[nodiscard]]` in the base class so callers don't accidentally drop the value (note the attribute must precede `virtual`):

```cpp
[[nodiscard]] virtual int32_t getCurrentSeqCnt() const { return -1; }
```

You may also want to mark `options()` as `[[nodiscard]]` to address the clang-tidy hint, but that's more stylistic.
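To make the layering concrete, here is a minimal standalone sketch of the pattern this comment reviews: a base op exposes seq-count virtuals with safe defaults, and a backend override wires them to its cache state. Class names (`KVCacheOpBase`, `CpuKVCacheOpSketch`, `StaticCacheSketch`) are simplified stand-ins, not the project's actual types.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the per-op cache state (hypothetical, simplified).
struct StaticCacheSketch {
  int32_t seq_cnt = 0;
};

// Base op: virtuals with safe defaults, so backends that don't support
// seq-count management need no changes. The getter is [[nodiscard]] so
// callers can't silently drop the value.
struct KVCacheOpBase {
  virtual ~KVCacheOpBase() = default;
  virtual void setCurrentSeqCnt(int32_t /*seq*/) {}  // no-op default
  [[nodiscard]] virtual int32_t getCurrentSeqCnt() const { return -1; }
};

// CPU backend override: forwards both calls to its cache.
struct CpuKVCacheOpSketch final : KVCacheOpBase {
  StaticCacheSketch cache;
  void setCurrentSeqCnt(int32_t seq) override { cache.seq_cnt = seq; }
  [[nodiscard]] int32_t getCurrentSeqCnt() const override { return cache.seq_cnt; }
};
```

Calling through the base reference dispatches to the backend implementation, while backends without an override keep the harmless default behavior.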
567-580: Guard nullsrc_ptrcheck bybytes_to_copyto handle 0-byte inputs safelyRight now, a runtime tensor with
src_bytes == 0but a null data pointer will still trigger the error path, even though no bytes need to be copied:const void* src_ptr = runtime_input.ptr<void>(); size_t bytes_to_copy = std::min(dst_bytes, src_bytes); if (!src_ptr) { MLLM_ERROR("Runtime input tensor {} for graph '{}' has null data pointer", i, graphName); return; } if (dst_ptr && src_ptr && dst_ptr != src_ptr) { if (bytes_to_copy > 0) { std::memcpy(dst_ptr, src_ptr, bytes_to_copy); } ... }Consider only enforcing
src_ptr != nullptrwhenbytes_to_copy > 0, e.g.:- const void* src_ptr = runtime_input.ptr<void>(); - size_t bytes_to_copy = std::min(dst_bytes, src_bytes); - if (!src_ptr) { - MLLM_ERROR("Runtime input tensor {} for graph '{}' has null data pointer", i, graphName); - return; - } + const void* src_ptr = runtime_input.ptr<void>(); + size_t bytes_to_copy = std::min(dst_bytes, src_bytes); + if (bytes_to_copy > 0 && !src_ptr) { + MLLM_ERROR("Runtime input tensor {} for graph '{}' has null data pointer", i, graphName); + return; + }This keeps the error for real data copies while remaining robust if a zero-length tensor is passed.
mllm/backends/qnn/QNNAllocator.cpp (2)
94-107: Consider nulling `storage->ptr_` even when the pointer is not tracked.

In `QNNAllocator::free(Storage* storage)`, you early-return when the pointer is not in `qnnMemPtrSet_`:

```cpp
if (qnnMemPtrSet_.count(ptr) == 0) {
  QNN_ALLOCATOR_VERBOSE("QNNAllocator::free called for ptr={} that is not in qnnMemPtrSet_, ignoring", ptr);
  return;
}
...
storage->ptr_ = nullptr;
```

This leaves `storage->ptr_` untouched in that path, so callers may still see a non-null pointer after `free()` if the storage was never allocated by this allocator (or was already cleaned up elsewhere). For extra safety and clearer semantics ("after free, `ptr_` is either owned nowhere or null"), you could null it even on the early-return path:

```diff
 if (qnnMemPtrSet_.count(ptr) == 0) {
   QNN_ALLOCATOR_VERBOSE("QNNAllocator::free called for ptr={} that is not in qnnMemPtrSet_, ignoring", ptr);
-  return;
+  storage->ptr_ = nullptr;
+  return;
 }
```

This shouldn't affect the normal ownership protocol but makes misuse of the allocator interface less likely to leave stale pointers around.
Also applies to: 181-182
15-23: Macro and naming style issues are non-functional but can be aligned with lint expectations.

Static analysis complains about:

- The `kVerboseQnnAllocatorLogs` naming style.
- The variadic macro `QNN_ALLOCATOR_VERBOSE`.

These are stylistic only, but if you want to silence the lints you can:

- Rename the constant to match your project's convention (e.g. `kVerboseQnnAllocatorLogs` → `kVerboseQnnAllocatorLogsEnabled` or `VERBOSE_QNN_ALLOCATOR_LOGS`, depending on your rules).
- Replace the macro with an inline helper:

```cpp
inline void QnnAllocatorVerboseLog(const char* fmt, auto&&... args) {
  if constexpr (kVerboseQnnAllocatorLogs) {
    MLLM_INFO(fmt, std::forward<decltype(args)>(args)...);
  }
}
```

and call `QnnAllocatorVerboseLog(...)` instead of the macro. Given the flag is currently `false`, this is low priority and purely to keep tooling quiet.

Also applies to: 492-512
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
- examples/qwen_npu/main.cpp (3 hunks)
- mllm/backends/cpu/ops/CausalMaskOp.cpp (2 hunks)
- mllm/backends/cpu/ops/KVCacheOp.cpp (1 hunks)
- mllm/backends/cpu/ops/KVCacheOp.hpp (1 hunks)
- mllm/backends/qnn/QNNAllocator.cpp (4 hunks)
- mllm/backends/qnn/QNNAllocator.hpp (4 hunks)
- mllm/backends/qnn/QNNBackend.cpp (2 hunks)
- mllm/backends/qnn/QNNUtils.cpp (4 hunks)
- mllm/backends/qnn/QNNUtils.hpp (2 hunks)
- mllm/core/aops/KVCacheOp.hpp (1 hunks)
- mllm/models/qwen_npu/modeling_qwen_npu.hpp (4 hunks)
- mllm/nn/Module.hpp (1 hunks)
- mllm/nn/layers/KVCache.cpp (1 hunks)
- mllm/nn/layers/KVCache.hpp (1 hunks)
🧰 Additional context used
🪛 Clang (14.0.6)
mllm/nn/layers/KVCache.hpp
[error] 27-27: function 'getCurrentSeqCnt' should be marked [[nodiscard]]
(modernize-use-nodiscard,-warnings-as-errors)
mllm/backends/cpu/ops/KVCacheOp.hpp
[error] 24-24: function 'getCurrentSeqCnt' should be marked [[nodiscard]]
(modernize-use-nodiscard,-warnings-as-errors)
mllm/nn/Module.hpp
[error] 211-211: function 'list' should be marked [[nodiscard]]
(modernize-use-nodiscard,-warnings-as-errors)
mllm/backends/qnn/QNNUtils.cpp
[error] 376-376: variable 'currentPtr' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 376-376: invalid case style for variable 'currentPtr'
(readability-identifier-naming,-warnings-as-errors)
[error] 377-377: implicit conversion 'void *' -> bool
(readability-implicit-bool-conversion,-warnings-as-errors)
[error] 387-387: variable 'requiredBytes' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 387-387: invalid case style for variable 'requiredBytes'
(readability-identifier-naming,-warnings-as-errors)
[error] 392-392: implicit conversion 'void *' -> bool
(readability-implicit-bool-conversion,-warnings-as-errors)
[error] 401-401: variable 'registeredBytes' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 401-401: invalid case style for variable 'registeredBytes'
(readability-identifier-naming,-warnings-as-errors)
[error] 416-416: variable 'freshPtr' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 416-416: invalid case style for variable 'freshPtr'
(readability-identifier-naming,-warnings-as-errors)
[error] 417-417: variable 'bytesToCopy' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 417-417: invalid case style for variable 'bytesToCopy'
(readability-identifier-naming,-warnings-as-errors)
[error] 419-419: implicit conversion 'void *' -> bool
(readability-implicit-bool-conversion,-warnings-as-errors)
mllm/core/aops/KVCacheOp.hpp
[error] 43-43: function 'getCurrentSeqCnt' should be marked [[nodiscard]]
(modernize-use-nodiscard,-warnings-as-errors)
[error] 45-45: function 'options' should be marked [[nodiscard]]
(modernize-use-nodiscard,-warnings-as-errors)
[error] 48-48: member variable 'options_' has protected visibility
(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)
mllm/backends/qnn/QNNAllocator.cpp
[error] 16-16: invalid case style for global constant 'kVerboseQnnAllocatorLogs'
(readability-identifier-naming,-warnings-as-errors)
[error] 19-19: variadic macro 'QNN_ALLOCATOR_VERBOSE' used; consider using a 'constexpr' variadic template function
(cppcoreguidelines-macro-usage,-warnings-as-errors)
mllm/backends/qnn/QNNAllocator.hpp
[error] 77-77: constructor does not initialize these fields: count, total_bytes
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 120-120: constructor does not initialize these fields: tensor_name
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
🔇 Additional comments (6)
mllm/backends/cpu/ops/CausalMaskOp.cpp (1)
56-57: Multi-chunk causal mask formula is mathematically correct.

The updated formula `copy_count = D - S + r + 1` correctly handles both single-chunk (D == S → r + 1) and multi-chunk scenarios (D > S → allows attending to all previous chunks). The defensive `max(D - copy_count, 0)` prevents underflow.

Also applies to: 71-72
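A minimal standalone sketch of the row-wise fill this formula produces, assuming D >= S (prefix tokens plus the current chunk). `buildCausalMask` is an illustrative name, not the op's actual API:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the corrected row-wise causal mask fill.
// S = rows in this chunk, D = total key length (prefix + chunk), D >= S.
// copy_count = D - S + r + 1 keeps all prefix positions visible and
// reduces to the classic r + 1 when D == S (single chunk).
std::vector<float> buildCausalMask(size_t S, size_t D, float neg = -1e9f) {
  std::vector<float> mask(S * D, 0.0f);
  for (size_t r = 0; r < S; ++r) {
    const size_t copy_count = D - S + r + 1;               // visible positions
    const size_t fill_count = D > copy_count ? D - copy_count : 0;
    float* row = mask.data() + r * D;
    std::fill(row + copy_count, row + copy_count + fill_count, neg);
  }
  return mask;
}
```

For S == D == 2 this yields the familiar lower-triangular mask; for S = 2, D = 5 (three prefix tokens) each row additionally sees the whole prefix.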
mllm/backends/qnn/QNNUtils.hpp (1)
184-222: QNNTensorWrapper header additions look consistent with the implementation.

The `[[nodiscard]] getName()`, the new `resetAlloc()` API, and the `registeredPtr_` member cleanly mirror the allocation/registration logic implemented in `QNNUtils.cpp`. Interface shape and constness look good; no issues from the header side.

mllm/nn/layers/KVCache.cpp (1)
20-32: KVCache seq-count forwarding matches the existing delegation pattern.

`setCurrentSeqCnt` and `getCurrentSeqCnt` correctly forward to the underlying `aops::KVCacheOp` using the same `static_pointer_cast` pattern as `setLayerIndex` and `clearCache`. No functional issues from this layer wrapper.

mllm/backends/cpu/ops/KVCacheOp.cpp (1)
45-50: Please confirm `StaticCache::setCurrentSeqCnt` semantics vs per-layer usage.

`CPUKVCacheOp::getCurrentSeqCnt()` queries `cache_.getCurrentSeqCnt(options_.layer_idx)`, while `setCurrentSeqCnt(int32_t seq)` calls `cache_.setCurrentSeqCnt(seq)` without the layer index. If `nn::StaticCache` tracks sequence counts per layer (as the getter signature suggests), you may want a symmetric API that also keys `setCurrentSeqCnt` by `layer_idx`. If instead `setCurrentSeqCnt` is intentionally global/shared, this is fine; just worth confirming to avoid mixing global and per-layer state by accident.

mllm/backends/qnn/QNNUtils.cpp (1)
14-15: QNNTensorWrapper registered-buffer reuse logic looks consistent and defensive.

Including `<cstring>` for `std::memcpy`, tracking `registeredPtr_` for static tensors, and the expanded `alloc()` logic collectively give you:

- Safe reuse of an existing registered buffer when it's still valid and large enough.
- A clear path to drop and re-register when the old buffer is too small.
- Protection against dangling `registeredPtr_` via `isRegistered()` checks.
- An early return when the current storage is already the registered buffer, avoiding redundant registration.

The separation between `isAlloc_` (binding state) and `registeredPtr_` (last successful buffer) also aligns with the allocator-level "remember last registration" behavior mentioned in the PR description. From this file alone, the control flow and memory handling look sound.

Also applies to: 352-371, 373-439
mllm/models/qwen_npu/modeling_qwen_npu.hpp (1)
270-272: KV cache accessors and seq-count plumbing look consistent.

The new const `getKVCache` overloads, `QwenText::setKVCacheSeqCnt`/`getKVCacheSeqCnt`, and the corresponding `QwenForCausalLM` forwards form a clean, minimal surface for external KV cache sequence management, with reasonable bounds checking on `layer_idx`. No issues from this header-level change.

Also applies to: 446-455, 469-473
```cpp
while (!reached_eos && current_chunk_len < chunk_size) {
  total_decode_steps++;

  // Calculate absolute sequence length from the start of the entire sequence
  const int absolute_seq_len = chunk_start + current_chunk_len;

  // MLLM_INFO("--- Chunk {} Decode Step {} ---", chunk_index, total_decode_steps);
  // MLLM_INFO("Current chunk length: {} (relative), Absolute sequence length: {} (absolute)", current_chunk_len, absolute_seq_len);

  // Keep padding clean for the remaining area
  for (int i = current_chunk_len; i < chunk_size; ++i) { sequence_ptr[i] = -1; }

  // Set KV cache to absolute sequence length (where the next token will be written)
  // [Maybe Wrong]
  model.setKVCacheSeqCnt(chunk_start);
  // MLLM_INFO("KV cache seq_cnt set to: {} (relative position)", chunk_start);

  // Prepare decode input with position_ids from previous step
  mllm::models::ARGenerationOutputPast decode_inputs{
      {"sequence", sequence_tensor},
      {"position_ids", position_ids}};

  // real_seq should be the effective length in the current input tensor (relative position)
  // hidden_states shape is [1, chunk_size, hidden_size], we need to index it with current_chunk_len - 1
  auto decode_output = model.forward(
      decode_inputs, {{"seq_len", mllm::AnyValue(mllm::any_copy_tag, current_chunk_len)}});

  auto& decode_logits = decode_output["sequence"];
  next_token = model.sampleGreedy(decode_logits);
  decode_logits.delete_();
  decode_output.erase("sequence");
  decode_output.clear();

  emit_token(next_token);
  if (reached_eos) { break; }

  sequence_ptr[current_chunk_len] = next_token;
  current_chunk_len++;
}
```
🧩 Analysis chain
KV cache seq count is likely wrong in the decode loop (comment already says [Maybe Wrong])
In the per-chunk decode loop you recompute the absolute sequence length:
```cpp
const int absolute_seq_len = chunk_start + current_chunk_len;
...
// Set KV cache to absolute sequence length (where the next token will be written)
// [Maybe Wrong]
model.setKVCacheSeqCnt(chunk_start);
```

Using `chunk_start` here ignores already-consumed tokens in the current chunk and appears inconsistent with the comment and with the new KV cache APIs. This can misalign the KV cache during multi-step decode, especially after the first generated token within the last chunk.
A more consistent approach would be to base the sequence count on the absolute number of valid tokens processed so far, e.g.:
```diff
- // Set KV cache to absolute sequence length (where the next token will be written)
- // [Maybe Wrong]
- model.setKVCacheSeqCnt(chunk_start);
+ // Set KV cache to the absolute sequence length so far (prefill + decoded tokens)
+ // The next token will be written after `absolute_seq_len - 1`.
+ model.setKVCacheSeqCnt(absolute_seq_len);
```

or, if `setKVCacheSeqCnt` is defined to take "index of next write" rather than "current valid length", adjust accordingly (e.g. `absolute_seq_len - 1`).
Please double-check this against nn::KVCache::setCurrentSeqCnt semantics; the current implementation is very likely incorrect for multi-step decode.
KV cache seq count must track absolute sequence length, not just chunk offset
The bug is confirmed. In the decode loop at line 176, you reset the KV cache seq count to `chunk_start` on every iteration, but `current_chunk_len` increments. This causes the cache to lose track of tokens decoded within the chunk:

- Iteration 1: `current_chunk_len = chunk_prompt_len + 1`, but the cache is set to `chunk_start` → off by `chunk_prompt_len` tokens
- Iteration 2: `current_chunk_len = chunk_prompt_len + 2`, but the cache is still `chunk_start` → off by `chunk_prompt_len + 1` tokens
The computed `absolute_seq_len` variable (line 163) is never used, indicating an incomplete implementation. The fix is to use it:

```diff
- model.setKVCacheSeqCnt(chunk_start);
+ model.setKVCacheSeqCnt(absolute_seq_len);
```

Additionally, the prefill at line 97 should similarly use `chunk_start + chunk_prompt_len` (the absolute valid length after prefill) instead of just `chunk_start`.
```cpp
struct BufferStats {
  size_t count;
  size_t total_bytes;
};
[[nodiscard]] BufferStats getRegisteredBufferStats() const;
```
Initialize BufferStats and LastRegistrationInfo members to satisfy static analysis and avoid UB
`BufferStats` currently has uninitialized members, and `LastRegistrationInfo::tensor_name` relies on implicit default construction. Clang-tidy is already flagging these. Adding explicit in-class initializers is cheap and removes any doubt about default state:
```diff
- struct BufferStats {
-   size_t count;
-   size_t total_bytes;
- };
+ struct BufferStats {
+   size_t count = 0;
+   size_t total_bytes = 0;
+ };
```

For `LastRegistrationInfo`, you can also make the initialization explicit:
```diff
- struct LastRegistrationInfo {
-   uint32_t tensor_id = 0;               // Tensor ID of the registered tensor
-   std::string tensor_name;              // Tensor name of the registered tensor
-   void* ptr = nullptr;                  // Buffer pointer that was successfully registered
-   Qnn_MemHandle_t mem_handle = nullptr; // QNN memory handle from successful registration
-   size_t bytes = 0;                     // Size of the registered buffer in bytes
- };
+ struct LastRegistrationInfo {
+   uint32_t tensor_id = 0;               // Tensor ID of the registered tensor
+   std::string tensor_name{};            // Tensor name of the registered tensor
+   void* ptr = nullptr;                  // Buffer pointer that was successfully registered
+   Qnn_MemHandle_t mem_handle = nullptr; // QNN memory handle from successful registration
+   size_t bytes = 0;                     // Size of the registered buffer in bytes
+ };
```

This aligns with the static-analysis expectations and guarantees well-defined default values.
Also applies to: 120-129
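As a quick standalone check of the suggested in-class initializers (`Qnn_MemHandle_t` is stood in by `void*` here, since the real typedef lives in the QNN SDK):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Sketch of the suggested defaults: every member has a well-defined
// value after default construction, so stats and registration state
// can never be read as indeterminate.
struct BufferStats {
  size_t count = 0;
  size_t total_bytes = 0;
};

struct LastRegistrationInfo {
  uint32_t tensor_id = 0;
  std::string tensor_name{};
  void* ptr = nullptr;
  void* mem_handle = nullptr;  // stand-in for Qnn_MemHandle_t
  size_t bytes = 0;
};
```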
Summary
This PR implements complete decoding support for Qwen NPU models on the QNN backend, including both single-chunk and multi-chunk decoding. It also fixes critical memory management issues encountered during the decode phase and improves CausalMaskOp for multi-chunk scenarios.
Features Implemented
1. Single-Chunk Decoding Support
Implemented basic decoding functionality for input sequences shorter than the chunk size (128 tokens):
- KV Cache Sequence Management: Added `setKVCacheSeqCnt()` and `getKVCacheSeqCnt()` methods across the KV cache hierarchy
  - `aops::KVCacheOp`: Added virtual `setCurrentSeqCnt()` and `getCurrentSeqCnt()` methods
  - `CPUKVCacheOp`: Implemented sequence count management using `StaticCache`
  - `nn::KVCache`: Added layer interface for sequence count control
  - `QwenText` and `QwenForCausalLM`: Added model-level APIs for KV cache management
- Decode Loop Implementation: `examples/qwen_npu/main.cpp`
- Forward Method Updates: `QwenForCausalLM::forward()` to support the decode phase

2. Multi-Chunk Decoding Support
Extended decoding to handle long input sequences that exceed the chunk size.
3. CausalMaskOp Improvement
CausalMaskOp improvement by @oreomaker.
Fixed causal mask calculation for multi-chunk decoding scenarios:
- Problem: The original mask calculation `copy_count = std::min(r + 1, (size_t)D)` was incorrect for multi-chunk scenarios where the sequence length (S) and dimension (D) differ.
- Solution: Changed to `copy_count = D - S + r + 1` to correctly handle cases where S < D (multi-chunk scenarios with padding).

4. Memory Management Fixes
Fixed critical memory management issues in the QNN backend during the decode phase:

- Problem 1: `Failed to find memHandle 0x1`
- Problem 2: FastRPC memory mapping failures
- Problem 3: `memDeRegister` failures
Key Changes
Core KV Cache Interface (`mllm/core/aops/KVCacheOp.hpp`, `mllm/backends/cpu/ops/KVCacheOp.{hpp,cpp}`)

- Added `setCurrentSeqCnt(int32_t seq)` virtual method to `aops::KVCacheOp`
- Added `getCurrentSeqCnt()` const method to `aops::KVCacheOp`
- Implemented both in `CPUKVCacheOp` using the `StaticCache` API

Layer Interface (`mllm/nn/layers/KVCache.{hpp,cpp}`)

- Added `setCurrentSeqCnt(int32_t seq)` method
- Added `getCurrentSeqCnt(int32_t layer_idx)` const method

Model Interface (`mllm/models/qwen_npu/modeling_qwen_npu.hpp`)

- Added `setKVCacheSeqCnt(int32_t seq)` to `QwenText` and `QwenForCausalLM`
- Added `getKVCacheSeqCnt(int32_t layer_idx)` const method
- Updated `forward()` method to handle the decode phase with position_ids

QNN Backend Memory Management (`mllm/backends/qnn/QNNAllocator.{hpp,cpp}`)

- Added `tensorIdToPtrMap_` and `tensorNameToPtrMap_` for buffer lookup by tensor identity
- Added `reuseExistingBuffer()` lambda with multi-level fallback
- Added `LastRegistrationInfo` structure to track the last successful registration
- `eraseTensorMappingsForPtr()`: Clean up tensor ID/name mappings
- `rememberLastRegistration()`: Track successful registrations
- `clearLastRegistrationIfMatches()`: Clean up last registration info
- Updated `free()` method with alias detection and reference counting
- `registerQnnTensorToSharedBuffer()`: frees the buffer and restores state when registration fails

QNN Backend Execution (`mllm/backends/qnn/QNNBackend.cpp`)

- Updated `graphExecute()` with input validation and defensive data copies

QNN Utils (`mllm/backends/qnn/QNNUtils.cpp`)

- Reworked `QNNTensorWrapper::alloc()` to reuse or re-register shared buffers

CausalMaskOp (`mllm/backends/cpu/ops/CausalMaskOp.cpp`)

- Changed `copy_count = std::min(r + 1, (size_t)D)` to `copy_count = D - S + r + 1`

Example Implementation (`examples/qwen_npu/main.cpp`)

Related Commits
This PR consolidates the following commits:
- `1d5d253`: feat: implement Qwen NPU simple single-chunk decoding support and memory management fixes
- `b438b3d`: implement Qwen NPU simple multi-chunk decoding support
- `e26b11b`: fix: stabilize QNN multi-chunk decoding (including CausalMaskOp improvement)

Co-authors
This PR is a collaborative effort.
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Improvements