
feat: Implement Qwen NPU Decoding Support with Memory Management Fixes#537

Merged
oreomaker merged 12 commits into UbiquitousLearning:v2 from jialilve:feature/qwen-npu-decoding
Nov 20, 2025

Conversation


@jialilve (Contributor) commented Nov 20, 2025

Summary

This PR implements complete decoding support for Qwen NPU models on the QNN backend, covering both single-chunk and multi-chunk decoding. It also fixes critical memory management issues encountered during the decode phase and improves CausalMaskOp for multi-chunk scenarios.

Features Implemented

1. Single-Chunk Decoding Support

Implemented basic decoding functionality for input sequences shorter than chunk size (128 tokens):

  • KV Cache Sequence Management: Added setKVCacheSeqCnt() and getKVCacheSeqCnt() methods across the KV cache hierarchy

    • aops::KVCacheOp: Added virtual setCurrentSeqCnt() and getCurrentSeqCnt() methods
    • CPUKVCacheOp: Implemented sequence count management using StaticCache
    • nn::KVCache: Added layer interface for sequence count control
    • QwenText and QwenForCausalLM: Added model-level APIs for KV cache management
  • Decode Loop Implementation:

    • Implemented iterative token generation loop in examples/qwen_npu/main.cpp
    • Handles position_ids correctly for decode phase
    • Supports EOS token (151645) termination check
    • Manages input sequence buffer with padding area for new tokens
  • Forward Method Updates:

    • Enhanced QwenForCausalLM::forward() to support decode phase
    • Proper handling of position_ids increment for decode iterations
    • Support for variable sequence lengths during decode

2. Multi-Chunk Decoding Support

Extended decoding to handle long input sequences that exceed chunk size:

  • Chunked Prefill: Processes long prompts in 128-token chunks
  • KV Cache Alignment: Correctly aligns KV cache offsets for multi-chunk scenarios
    • Uses absolute sequence length from start of entire sequence
    • Sets KV cache sequence count to chunk start offset before each prefill
  • Decode Continuation: Continues decoding after processing all prompt chunks
  • Position IDs Generation: Generates position_ids starting from chunk offset for multi-chunk prefill

3. CausalMaskOp Improvement

This improvement was contributed by @oreomaker. It fixes the causal mask calculation for multi-chunk decoding scenarios:

Problem: Original mask calculation copy_count = std::min(r + 1, (size_t)D) was incorrect for multi-chunk scenarios where sequence length (S) and dimension (D) differ.

Solution: Changed to copy_count = D - S + r + 1 to correctly handle cases where S < D (multi-chunk scenarios with padding).

  • Applied to both AVX2 (x86_64) and NEON (ARM64) implementations
  • Ensures correct masking behavior across chunk boundaries
  • Maintains backward compatibility with single-chunk scenarios

4. Memory Management Fixes

Fixed critical memory management issues in QNN backend during decode phase:

Problem 1: Failed to find memHandle 0x1

  • Root Cause: Same tensor (by ID/name) was registered multiple times, causing QNN to lose track of memory handles
  • Solution: Implemented buffer reuse mechanism with multi-level fallback

Problem 2: FastRPC Memory Mapping Failures

  • Root Cause: QNN HTP device memory exhausted (~2.5GB limit) when registering too many buffers
  • Solution: Multi-level fallback strategy to reuse existing buffers when registration fails

Problem 3: memDeRegister Failures

  • Root Cause: Attempts to de-register memory handles that were already de-registered or shared by multiple pointers
  • Solution: Implemented alias detection and reference counting for memory handle lifecycle management

Key Changes

Core KV Cache Interface (mllm/core/aops/KVCacheOp.hpp, mllm/backends/cpu/ops/KVCacheOp.{hpp,cpp})

  • Added setCurrentSeqCnt(int32_t seq) virtual method to aops::KVCacheOp
  • Added getCurrentSeqCnt() const method to aops::KVCacheOp
  • Implemented both methods in CPUKVCacheOp using StaticCache API

Layer Interface (mllm/nn/layers/KVCache.{hpp,cpp})

  • Added setCurrentSeqCnt(int32_t seq) method
  • Added getCurrentSeqCnt(int32_t layer_idx) const method

Model Interface (mllm/models/qwen_npu/modeling_qwen_npu.hpp)

  • Added setKVCacheSeqCnt(int32_t seq) to QwenText and QwenForCausalLM
  • Added getKVCacheSeqCnt(int32_t layer_idx) const method
  • Updated forward() method to handle decode phase with position_ids

QNN Backend Memory Management (mllm/backends/qnn/QNNAllocator.{hpp,cpp})

  • Added tensorIdToPtrMap_ and tensorNameToPtrMap_ for buffer lookup by tensor identity
  • Implemented reuseExistingBuffer() lambda with multi-level fallback:
    • Level 1: Check exact buffer pointer
    • Level 2: Lookup by tensor ID
    • Level 3: Lookup by tensor name
    • Level 4: Reuse last successfully registered buffer
  • Added LastRegistrationInfo structure to track last successful registration
  • Implemented helper functions:
    • eraseTensorMappingsForPtr(): Clean up tensor ID/name mappings
    • rememberLastRegistration(): Track successful registrations
    • clearLastRegistrationIfMatches(): Clean up last registration info
  • Enhanced free() method with alias detection and reference counting
  • Added multi-level fallback in registerQnnTensorToSharedBuffer() when registration fails

QNN Backend Execution (mllm/backends/qnn/QNNBackend.cpp)

  • Improved input tensor data copying in graphExecute()
  • Added size mismatch detection and zero-padding for decode phase inputs
  • Enhanced error messages with detailed tensor information

QNN Utils (mllm/backends/qnn/QNNUtils.cpp)

  • Added buffer size validation in QNNTensorWrapper::alloc()
  • Implemented automatic de-registration when registered buffer is too small
  • Added checks for buffer validity before reuse

CausalMaskOp (mllm/backends/cpu/ops/CausalMaskOp.cpp)

  • Fixed mask calculation for multi-chunk scenarios:
    • Changed from copy_count = std::min(r + 1, (size_t)D)
    • To copy_count = D - S + r + 1
  • Applied fix to both AVX2 and NEON implementations

Example Implementation (examples/qwen_npu/main.cpp)

  • Implemented single-chunk decoding loop with:
    • KV cache sequence count management
    • Position IDs handling
    • EOS token termination
    • Input sequence buffer management
  • Extended to multi-chunk decoding with:
    • Chunked prefill processing
    • KV cache alignment across chunks
    • Decode continuation after all chunks processed
    • Proper position IDs generation for multi-chunk scenarios

Related Commits

This PR consolidates the following commits:

  • 1d5d253: feat: implement Qwen NPU simple single chunk decoding support and Memory management fixes
  • b438b3d: implement Qwen NPU simple muti-chunk decoding support
  • e26b11b: fix: stabilize QNN multi-chunk decoding (including CausalMaskOp improvement)

Co-authors

This PR is a collaborative effort:

  • @oreomaker - Technical guidance, CausalMaskOp improvement for multi-chunk decoding, and code review
  • @jialilve - Main implementation including single-chunk/multi-chunk decoding support and memory management fixes

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced multi-chunk sequence processing with support for larger input sequences.
    • Improved memory management with robust buffer registration and fallback handling.
  • Bug Fixes

    • Refined input validation and data handling in graph execution.
    • Fixed causal mask calculations for improved attention mask accuracy.
  • Improvements

    • Added KV cache sequence count management for better cache tracking during inference.



coderabbitai Bot commented Nov 20, 2025

Walkthrough

The PR introduces KV cache sequence count management APIs across core and backend layers, significantly enhances QNN allocator buffer lifecycle and registration handling, refactors the Qwen NPU example to support chunked inference with multi-phase prefill/decode, and updates causal mask calculation logic for improved row-wise processing.

Changes

Cohort / File(s) Summary
KV Cache Sequence Management
mllm/core/aops/KVCacheOp.hpp, mllm/nn/layers/KVCache.hpp, mllm/nn/layers/KVCache.cpp, mllm/backends/cpu/ops/KVCacheOp.hpp, mllm/backends/cpu/ops/KVCacheOp.cpp
Added virtual methods setCurrentSeqCnt(int32_t) and getCurrentSeqCnt() const with default implementations; CPU backend delegates to underlying cache; enables sequence counter queries across the KV cache hierarchy.
Qwen NPU Model Accessors
mllm/models/qwen_npu/modeling_qwen_npu.hpp
Added const-qualified getKVCache() const overloads to QwenAttentionMatmul and QwenDecoder; added setKVCacheSeqCnt() and getKVCacheSeqCnt() to QwenText and QwenForCausalLM for KV cache sequence management propagation through model hierarchy.
QNN Allocator Lifecycle & Buffer Management
mllm/backends/qnn/QNNAllocator.hpp, mllm/backends/qnn/QNNAllocator.cpp
Added public destructor, free() method, registerQnnTensorToSharedBuffer(Storage*, Qnn_Tensor_t&) returning bool with multi-level fallback logic, deRegisterQnnTensorFromSharedBuffer(), buffer inspection methods (isRegistered(), getRegisteredBufferSize(), getRegisteredBufferStats()), and internal tracking helpers; changed registration signature from void pointer to Storage object.
QNN Tensor Wrapper Allocation
mllm/backends/qnn/QNNUtils.hpp, mllm/backends/qnn/QNNUtils.cpp
Added registeredPtr_ private field, resetAlloc() method, and robust shared-buffer registration with multi-step lifecycle management for static tensors; marked getName() with [[nodiscard]].
QNN Backend Input Handling
mllm/backends/qnn/QNNBackend.cpp
Added defensive input preparation in graphExecute() with byte-size validation, data copying with zero-padding for undersized inputs, wrapper tensor allocation, and QNN allocator registration before execution.
Causal Mask Calculation
mllm/backends/cpu/ops/CausalMaskOp.cpp
Updated copy/fill count arithmetic from min(r+1, D) and D > copy_count ? (D-copy_count) : 0 to copy_count = D - S + r + 1 and fill_count = max(D - copy_count, 0) in both AVX2 and NEON code paths.
Chunked Inference Pipeline
examples/qwen_npu/main.cpp
Refactored to support chunked inference: widened input tensor from {1, 32} to {1, 128}; added chunking logic with chunk_size=128 and prompt_chunks; introduced KV-cache sequence alignment via setKVCacheSeqCnt(chunk_start); implemented two-phase per-chunk process (prefill + decode); added position_ids handling, EOS detection (token_id 151645), per-chunk decode looping, and verbose logging; replaced single-step forward with persistent sequence/position management across chunks.
Const-Correctness Utilities
mllm/nn/Module.hpp
Added const-qualified list() const overload to ModuleList for accessing internal layers_ vector without mutation.

Sequence Diagram(s)

sequenceDiagram
    participant Main as main()
    participant Model as QwenForCausalLM
    participant Cache as KVCache
    participant Allocator as QNNAllocator
    participant Backend as QNNBackend
    
    Main->>Model: setKVCacheSeqCnt(chunk_start)
    Model->>Cache: setCurrentSeqCnt(seq)
    Cache->>Backend: Update sequence counter
    
    loop For each chunk
        Main->>Model: forward(prompt_chunk or position_ids)
        Model->>Cache: Store KV cache state
        Cache->>Allocator: registerQnnTensorToSharedBuffer(Storage*)
        Allocator->>Allocator: Multi-level fallback search
        alt Registration success
            Allocator->>Backend: Update tensor mem_handle
            Backend->>Backend: Copy input data with validation
            Backend->>Model: Execute graph
            Model->>Model: Generate next token
        else Registration fails
            Allocator->>Allocator: Free buffer & restore state
        end
        Main->>Model: Token emit & EOS check
    end
sequenceDiagram
    participant Input as Input Data
    participant QNNBackend as QNNBackend::graphExecute()
    participant Wrapper as QNNTensorWrapper
    participant Allocator as QNNAllocator
    participant QNN as QNN Runtime
    
    Input->>QNNBackend: Runtime inputs
    QNNBackend->>QNNBackend: Validate input non-nil
    QNNBackend->>Wrapper: Allocate wrapper tensor
    QNNBackend->>Wrapper: Copy data (min bytes or zero-pad)
    Wrapper->>Allocator: registerQnnTensorToSharedBuffer()
    alt Buffer registered
        Allocator->>Allocator: Reuse or create mem_handle
        Allocator->>Wrapper: Update Qnn_Tensor_t
    else Fallback exhausted
        Allocator->>Wrapper: Fail & free buffer
        Wrapper->>QNNBackend: Registration error
    end
    QNNBackend->>QNN: Execute with prepared inputs

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

  • QNN Allocator buffer lifecycle management (QNNAllocator.hpp/.cpp): Multi-level fallback registration logic, pointer aliasing detection, and memory deregistration require careful verification of edge cases and state consistency.
  • Chunked inference refactoring (examples/qwen_npu/main.cpp): Significant architectural change with new chunking logic, KV-cache alignment per chunk, position_ids handling, and decode looping; requires validation of token emission, EOS detection, and padding correctness.
  • Cross-layer KVCache API propagation (KVCacheOp, KVCache, QwenText, QwenForCausalLM): Verify consistency of sequence count management throughout the hierarchy and interaction with QNN allocator registration.
  • Input preparation and validation (QNNBackend.cpp): Byte-size mismatch handling and zero-padding logic must align with expected tensor layouts and prevent buffer overruns.

Possibly related PRs

Suggested reviewers

  • liang1232018
  • yirongjie
  • oreomaker

Poem

🐰 Hops with glee through chunks so fine,
KV caches aligned in perfect line,
Buffers freed and registered with care,
Sequences managed everywhere,
A feast of frames, decoded right,
The NPU dances through the night!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 20.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title accurately captures the main objectives of the PR: implementing Qwen NPU decoding support and fixing memory management issues, which align with the substantial changes across decoder APIs, KV cache management, CausalMaskOp, and QNN backend memory handling.


@chenghuaWang
Collaborator

@coderabbitai review


coderabbitai Bot commented Nov 20, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
mllm/backends/cpu/ops/CausalMaskOp.cpp (2)

61-61: Critical: Undefined variable row_offset in AVX2 path.

Line 61 references row_offset which is not defined in this scope (the non-sliding-window block). The loop variable is r, and this should match line 59's addressing and the NEON implementation on line 76.

Apply this diff to fix the undefined variable:

-            float* fill_start = o_ptr + row_offset + copy_count;
+            float* fill_start = o_ptr + r * D + copy_count;

146-181: Float16 paths not updated with multi-chunk fix.

The float16 implementations (both AVX2 and NEON) still use the old formula copy_count = s + 1 instead of the corrected D - S + r + 1. This creates an inconsistency where float32 correctly handles multi-chunk scenarios but float16 still assumes D == S.

If float16 dtype is used during multi-chunk decoding, the attention masks will be incorrect.

Apply this diff to update the float16 AVX2 path (lines 149-152):

           for (size_t s = 0; s < S; ++s) {
             const size_t row_offset = s * S;
-            const size_t copy_count = s + 1;
-            const size_t fill_count = S - copy_count;
+            const size_t copy_count = D - S + s + 1;
+            const size_t fill_count = std::max(D - copy_count, (size_t)0);

Apply this diff to update the float16 NEON path (lines 167-170):

           for (size_t s = 0; s < S; ++s) {
             const size_t row_offset = s * S;
-            const size_t copy_count = s + 1;
-            const size_t fill_count = S - copy_count;
+            const size_t copy_count = D - S + s + 1;
+            const size_t fill_count = std::max(D - copy_count, (size_t)0);
🧹 Nitpick comments (7)
mllm/nn/Module.hpp (1)

190-212: Const list() overload looks good; consider marking both overloads [[nodiscard]]

The added const std::vector<T>& list() const is a clean const‑correct accessor and matches existing usage of layers_. To keep clang-tidy happy and make intent explicit that callers shouldn’t silently ignore this, you can mark both overloads as [[nodiscard]]:

[[nodiscard]] std::vector<T>& list() { return layers_; }
[[nodiscard]] const std::vector<T>& list() const { return layers_; }
mllm/backends/cpu/ops/KVCacheOp.hpp (1)

12-28: CPUKVCacheOp seq-count overrides are consistent; consider [[nodiscard]]

The setCurrentSeqCnt / getCurrentSeqCnt overrides align with the new aops::KVCacheOp API and look correct.

To satisfy clang-tidy and make ignoring the result explicit, consider marking the base KVCacheOp::getCurrentSeqCnt() (and thus this override) as [[nodiscard]] so all backends inherit the contract.

mllm/nn/layers/KVCache.hpp (1)

19-28: KVCache seq-count API is reasonable; consider [[nodiscard]] on getter

The added setCurrentSeqCnt / getCurrentSeqCnt accessors give an appropriate hook for higher layers to manage KV cache sequence length.

Given that ignoring the return value from getCurrentSeqCnt() is almost certainly unintended and clang-tidy is already flagging it, consider:

[[nodiscard]] int32_t getCurrentSeqCnt() const;

to make the contract explicit and quiet the warning.

mllm/core/aops/KVCacheOp.hpp (1)

35-49: Base KVCacheOp seq-count hooks are well-scoped; consider [[nodiscard]] on getter (and possibly options())

The new virtuals with safe defaults are a good way to expose seq-count management without forcing all backends to implement it immediately.

Given this is effectively a query API, it’s worth marking getCurrentSeqCnt() as [[nodiscard]] in the base class so callers don’t accidentally drop the value:

[[nodiscard]] virtual int32_t getCurrentSeqCnt() const { return -1; }

You may also want to mark options() as [[nodiscard]] to address the clang-tidy hint, but that’s more stylistic.

mllm/backends/qnn/QNNBackend.cpp (1)

567-580: Guard null src_ptr check by bytes_to_copy to handle 0-byte inputs safely

Right now, a runtime tensor with src_bytes == 0 but a null data pointer will still trigger the error path, even though no bytes need to be copied:

const void* src_ptr = runtime_input.ptr<void>();
size_t bytes_to_copy = std::min(dst_bytes, src_bytes);
if (!src_ptr) {
  MLLM_ERROR("Runtime input tensor {} for graph '{}' has null data pointer", i, graphName);
  return;
}
if (dst_ptr && src_ptr && dst_ptr != src_ptr) {
  if (bytes_to_copy > 0) {
    std::memcpy(dst_ptr, src_ptr, bytes_to_copy);
  }
  ...
}

Consider only enforcing src_ptr != nullptr when bytes_to_copy > 0, e.g.:

-      const void* src_ptr = runtime_input.ptr<void>();
-      size_t bytes_to_copy = std::min(dst_bytes, src_bytes);
-      if (!src_ptr) {
-        MLLM_ERROR("Runtime input tensor {} for graph '{}' has null data pointer", i, graphName);
-        return;
-      }
+      const void* src_ptr = runtime_input.ptr<void>();
+      size_t bytes_to_copy = std::min(dst_bytes, src_bytes);
+      if (bytes_to_copy > 0 && !src_ptr) {
+        MLLM_ERROR("Runtime input tensor {} for graph '{}' has null data pointer", i, graphName);
+        return;
+      }

This keeps the error for real data copies while remaining robust if a zero-length tensor is passed.

mllm/backends/qnn/QNNAllocator.cpp (2)

94-107: Consider nulling storage->ptr_ even when the pointer is not tracked

In QNNAllocator::free(Storage* storage), you early-return when the pointer is not in qnnMemPtrSet_:

if (qnnMemPtrSet_.count(ptr) == 0) {
  QNN_ALLOCATOR_VERBOSE("QNNAllocator::free called for ptr={} that is not in qnnMemPtrSet_, ignoring", ptr);
  return;
}
...
storage->ptr_ = nullptr;

This leaves storage->ptr_ untouched in that path, so callers may still see a non-null pointer after free() if the storage was never allocated by this allocator (or was already cleaned up elsewhere).

For extra safety and clearer semantics (“after free, ptr_ is either owned nowhere or null”), you could null it even on the early-return path:

  if (qnnMemPtrSet_.count(ptr) == 0) {
    QNN_ALLOCATOR_VERBOSE("QNNAllocator::free called for ptr={} that is not in qnnMemPtrSet_, ignoring", ptr);
-    return;
+    storage->ptr_ = nullptr;
+    return;
  }

This shouldn’t affect the normal ownership protocol but makes misuses of the allocator interface less likely to leave stale pointers around.

Also applies to: 181-182


15-23: Macro and naming style issues are non-functional but can be aligned with lint expectations

Static analysis complains about:

  • kVerboseQnnAllocatorLogs naming style.
  • The variadic macro QNN_ALLOCATOR_VERBOSE.

These are stylistic only, but if you want to silence the lints you can:

  • Rename the constant to match your project’s convention (e.g. kVerboseQnnAllocatorLogskVerboseQnnAllocatorLogsEnabled or VERBOSE_QNN_ALLOCATOR_LOGS depending on your rules).
  • Replace the macro with an inline helper:
inline void QnnAllocatorVerboseLog(const char* fmt, auto&&... args) {
  if constexpr (kVerboseQnnAllocatorLogs) {
    MLLM_INFO(fmt, std::forward<decltype(args)>(args)...);
  }
}

and call QnnAllocatorVerboseLog(...) instead of the macro.

Given the flag is currently false, this is low priority and purely to keep tooling quiet.

Also applies to: 492-512

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 31ab2ff and 2f6077b.

📒 Files selected for processing (14)
  • examples/qwen_npu/main.cpp (3 hunks)
  • mllm/backends/cpu/ops/CausalMaskOp.cpp (2 hunks)
  • mllm/backends/cpu/ops/KVCacheOp.cpp (1 hunks)
  • mllm/backends/cpu/ops/KVCacheOp.hpp (1 hunks)
  • mllm/backends/qnn/QNNAllocator.cpp (4 hunks)
  • mllm/backends/qnn/QNNAllocator.hpp (4 hunks)
  • mllm/backends/qnn/QNNBackend.cpp (2 hunks)
  • mllm/backends/qnn/QNNUtils.cpp (4 hunks)
  • mllm/backends/qnn/QNNUtils.hpp (2 hunks)
  • mllm/core/aops/KVCacheOp.hpp (1 hunks)
  • mllm/models/qwen_npu/modeling_qwen_npu.hpp (4 hunks)
  • mllm/nn/Module.hpp (1 hunks)
  • mllm/nn/layers/KVCache.cpp (1 hunks)
  • mllm/nn/layers/KVCache.hpp (1 hunks)
🧰 Additional context used
🪛 Clang (14.0.6)
mllm/nn/layers/KVCache.hpp

[error] 27-27: function 'getCurrentSeqCnt' should be marked [[nodiscard]]

(modernize-use-nodiscard,-warnings-as-errors)

mllm/backends/cpu/ops/KVCacheOp.hpp

[error] 24-24: function 'getCurrentSeqCnt' should be marked [[nodiscard]]

(modernize-use-nodiscard,-warnings-as-errors)

mllm/nn/Module.hpp

[error] 211-211: function 'list' should be marked [[nodiscard]]

(modernize-use-nodiscard,-warnings-as-errors)

mllm/backends/qnn/QNNUtils.cpp

[error] 376-376: variable 'currentPtr' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 376-376: invalid case style for variable 'currentPtr'

(readability-identifier-naming,-warnings-as-errors)


[error] 377-377: implicit conversion 'void *' -> bool

(readability-implicit-bool-conversion,-warnings-as-errors)


[error] 387-387: variable 'requiredBytes' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 387-387: invalid case style for variable 'requiredBytes'

(readability-identifier-naming,-warnings-as-errors)


[error] 392-392: implicit conversion 'void *' -> bool

(readability-implicit-bool-conversion,-warnings-as-errors)


[error] 401-401: variable 'registeredBytes' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 401-401: invalid case style for variable 'registeredBytes'

(readability-identifier-naming,-warnings-as-errors)


[error] 416-416: variable 'freshPtr' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 416-416: invalid case style for variable 'freshPtr'

(readability-identifier-naming,-warnings-as-errors)


[error] 417-417: variable 'bytesToCopy' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 417-417: invalid case style for variable 'bytesToCopy'

(readability-identifier-naming,-warnings-as-errors)


[error] 419-419: implicit conversion 'void *' -> bool

(readability-implicit-bool-conversion,-warnings-as-errors)

mllm/core/aops/KVCacheOp.hpp

[error] 43-43: function 'getCurrentSeqCnt' should be marked [[nodiscard]]

(modernize-use-nodiscard,-warnings-as-errors)


[error] 45-45: function 'options' should be marked [[nodiscard]]

(modernize-use-nodiscard,-warnings-as-errors)


[error] 48-48: member variable 'options_' has protected visibility

(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)

mllm/backends/qnn/QNNAllocator.cpp

[error] 16-16: invalid case style for global constant 'kVerboseQnnAllocatorLogs'

(readability-identifier-naming,-warnings-as-errors)


[error] 19-19: variadic macro 'QNN_ALLOCATOR_VERBOSE' used; consider using a 'constexpr' variadic template function

(cppcoreguidelines-macro-usage,-warnings-as-errors)

mllm/backends/qnn/QNNAllocator.hpp

[error] 77-77: constructor does not initialize these fields: count, total_bytes

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 120-120: constructor does not initialize these fields: tensor_name

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)

🔇 Additional comments (6)
mllm/backends/cpu/ops/CausalMaskOp.cpp (1)

56-57: Multi-chunk causal mask formula is mathematically correct.

The updated formula copy_count = D - S + r + 1 correctly handles both single-chunk (D == S → r + 1) and multi-chunk scenarios (D > S → allows attending to all previous chunks). The defensive max(D - copy_count, 0) prevents underflow.

Also applies to: 71-72

mllm/backends/qnn/QNNUtils.hpp (1)

184-222: QNNTensorWrapper header additions look consistent with implementation

The [[nodiscard]] getName(), the new resetAlloc() API, and the registeredPtr_ member cleanly mirror the allocation/registration logic implemented in QNNUtils.cpp. Interface shape and constness look good; no issues from the header side.

mllm/nn/layers/KVCache.cpp (1)

20-32: KVCache seq-count forwarding matches existing delegation pattern

setCurrentSeqCnt and getCurrentSeqCnt correctly forward to the underlying aops::KVCacheOp using the same static_pointer_cast pattern as setLayerIndex and clearCache. No functional issues from this layer wrapper.

mllm/backends/cpu/ops/KVCacheOp.cpp (1)

45-50: Please confirm StaticCache::setCurrentSeqCnt semantics vs per-layer usage

CPUKVCacheOp::getCurrentSeqCnt() queries cache_.getCurrentSeqCnt(options_.layer_idx), while setCurrentSeqCnt(int32_t seq) calls cache_.setCurrentSeqCnt(seq) without the layer index.

If nn::StaticCache tracks sequence counts per layer (as the getter signature suggests), you may want a symmetric API that also keys setCurrentSeqCnt by layer_idx. If instead setCurrentSeqCnt is intentionally global/shared, this is fine—just worth confirming to avoid mixing global and per-layer state by accident.

mllm/backends/qnn/QNNUtils.cpp (1)

14-15: QNNTensorWrapper registered-buffer reuse logic looks consistent and defensive

Including <cstring> for std::memcpy, tracking registeredPtr_ for static tensors, and the expanded alloc() logic collectively give you:

  • Safe reuse of an existing registered buffer when it’s still valid and large enough.
  • A clear path to drop and re-register when the old buffer is too small.
  • Protection against dangling registeredPtr_ via isRegistered() checks.
  • An early return when the current storage is already the registered buffer, avoiding redundant registration.

The separation between isAlloc_ (binding state) and registeredPtr_ (last successful buffer) also aligns with the allocator-level “remember last registration” behavior mentioned in the PR description. From this file alone, the control flow and memory handling look sound.

Also applies to: 352-371, 373-439

mllm/models/qwen_npu/modeling_qwen_npu.hpp (1)

270-272: KV cache accessors and seq-count plumbing look consistent

The new const getKVCache overloads, QwenText::setKVCacheSeqCnt / getKVCacheSeqCnt, and the corresponding QwenForCausalLM forwards form a clean, minimal surface for external KV cache sequence management, with reasonable bounds checking on layer_idx. No issues from this header-level change.

Also applies to: 446-455, 469-473

Comment on lines +162 to +200
while (!reached_eos && current_chunk_len < chunk_size) {
  total_decode_steps++;

  // Calculate absolute sequence length from the start of the entire sequence
  const int absolute_seq_len = chunk_start + current_chunk_len;

  // MLLM_INFO("--- Chunk {} Decode Step {} ---", chunk_index, total_decode_steps);
  // MLLM_INFO("Current chunk length: {} (relative), Absolute sequence length: {} (absolute)", current_chunk_len, absolute_seq_len);

  // Keep padding clean for the remaining area
  for (int i = current_chunk_len; i < chunk_size; ++i) { sequence_ptr[i] = -1; }

  // Set KV cache to absolute sequence length (where the next token will be written)
  // [Maybe Wrong]
  model.setKVCacheSeqCnt(chunk_start);
  // MLLM_INFO("KV cache seq_cnt set to: {} (relative position)", chunk_start);

  // Prepare decode input with position_ids from previous step
  mllm::models::ARGenerationOutputPast decode_inputs{
      {"sequence", sequence_tensor},
      {"position_ids", position_ids}};

  // real_seq should be the effective length in the current input tensor (relative position)
  // hidden_states shape is [1, chunk_size, hidden_size], we need to index it with current_chunk_len - 1
  auto decode_output = model.forward(
      decode_inputs, {{"seq_len", mllm::AnyValue(mllm::any_copy_tag, current_chunk_len)}});

  auto& decode_logits = decode_output["sequence"];
  next_token = model.sampleGreedy(decode_logits);
  decode_logits.delete_();
  decode_output.erase("sequence");
  decode_output.clear();

  emit_token(next_token);
  if (reached_eos) { break; }

  sequence_ptr[current_chunk_len] = next_token;
  current_chunk_len++;
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

KV cache seq count is likely wrong in the decode loop (comment already says [Maybe Wrong])

In the per-chunk decode loop you recompute the absolute sequence length:

```cpp
      const int absolute_seq_len = chunk_start + current_chunk_len;
      ...
      // Set KV cache to absolute sequence length (where the next token will be written)
      // [Maybe Wrong]
      model.setKVCacheSeqCnt(chunk_start);
```

Using chunk_start here ignores already-consumed tokens in the current chunk and appears inconsistent with the comment and with the new KV cache APIs. This can misalign the KV cache during multi-step decode, especially after the first generated token within the last chunk.

A more consistent approach would be to base the sequence count on the absolute number of valid tokens processed so far, e.g.:

```diff
-      // Set KV cache to absolute sequence length (where the next token will be written)
-      // [Maybe Wrong]
-      model.setKVCacheSeqCnt(chunk_start);
+      // Set KV cache to the absolute sequence length so far (prefill + decoded tokens)
+      // The next token will be written after `absolute_seq_len - 1`.
+      model.setKVCacheSeqCnt(absolute_seq_len);
```

or, if setKVCacheSeqCnt is defined to take “index of next write” rather than “current valid length”, adjust accordingly (e.g. absolute_seq_len - 1).

Please double-check this against nn::KVCache::setCurrentSeqCnt semantics; the current implementation is very likely incorrect for multi-step decode.



---

🏁 Script executed:

```shell
#!/bin/bash
# Find setKVCacheSeqCnt definition and usage
rg -n 'setKVCacheSeqCnt' --type=cpp -B3 -A3
```

Length of output: 2302


🏁 Script executed:

```shell
#!/bin/bash
# Examine the full context of main.cpp around the decode loop (lines 140-210)
sed -n '140,210p' examples/qwen_npu/main.cpp
```

Length of output: 2654


🏁 Script executed:

```shell
#!/bin/bash
# Search for prefill loop or chunked processing setup
rg -n 'chunk_start\|chunk_size\|prefill' examples/qwen_npu/main.cpp --type=cpp -B2 -A2
```

Length of output: 0


🏁 Script executed:

```shell
#!/bin/bash
# Find KVCache class definition
find . -name '*.hpp' -o -name '*.h' | xargs rg -l 'class KVCache\|struct KVCache' 2>/dev/null | head -5
```

Length of output: 0



---

🏁 Script executed:

```shell
#!/bin/bash
# Get the prefill loop context before the decode loop
sed -n '70,145p' examples/qwen_npu/main.cpp
```

Length of output: 3484


🏁 Script executed:

```shell
#!/bin/bash
# Search for setCurrentSeqCnt definition and usage
rg -n 'setCurrentSeqCnt' --type=cpp --type=hpp -B2 -A5
```

Length of output: 27


🏁 Script executed:

```shell
#!/bin/bash
# Look for any StaticCache or KVCache definition in the mllm directory
find mllm -name '*.hpp' -o -name '*.h' | xargs grep -l 'setCurrentSeqCnt' 2>/dev/null | head -3
```

Length of output: 128


🏁 Script executed:

```shell
#!/bin/bash
# Search for how prefill is called and initialized
rg -n 'forward.*prefill\|prefill.*forward' examples/qwen_npu/main.cpp --type=cpp -B5 -A5
```

Length of output: 0



---

🏁 Script executed:

```shell
#!/bin/bash
# Find setCurrentSeqCnt definition in KVCacheOp.hpp
cat mllm/backends/cpu/ops/KVCacheOp.hpp | grep -A10 'setCurrentSeqCnt'
```

Length of output: 326


🏁 Script executed:

```shell
#!/bin/bash
# Look for StaticCache or KVCache class definition
rg -n 'class.*KVCache\|class.*StaticCache' mllm/backends/cpu/ops/KVCacheOp.hpp -A15
```

Length of output: 0



---

🏁 Script executed:

```shell
#!/bin/bash
# Find nn::StaticCache definition to confirm setCurrentSeqCnt semantics
find mllm -name '*.hpp' -o -name '*.h' | xargs grep -l 'class StaticCache\|struct StaticCache' 2>/dev/null
```

Length of output: 31


🏁 Script executed:

```shell
#!/bin/bash
# Get broader context of StaticCache to understand seq_cnt semantics
rg -n 'setCurrentSeqCnt|getCurrentSeqCnt' mllm/backends/cpu/ -B3 -A8 | head -40
```

Length of output: 2657



---

🏁 Script executed:

```shell
#!/bin/bash
# Check StaticCache implementation
sed -n '1,100p' mllm/nn/lmcache/StaticCache.hpp
```

Length of output: 3312


🏁 Script executed:

```shell
#!/bin/bash
# Search for setCurrentSeqCnt in StaticCache
rg -n 'setCurrentSeqCnt' mllm/nn/lmcache/StaticCache.hpp -B2 -A8
```

Length of output: 943


KV cache seq count must track absolute sequence length, not just chunk offset

The bug is confirmed. In the decode loop at line 176, you reset the KV cache seq count to chunk_start on every iteration, but current_chunk_len increments. This causes the cache to lose track of tokens decoded within the chunk:

  • Iteration 1: current_chunk_len = chunk_prompt_len + 1, but the cache is reset to chunk_start, off by chunk_prompt_len tokens
  • Iteration 2: current_chunk_len = chunk_prompt_len + 2, but the cache is still at chunk_start, off by chunk_prompt_len + 1 tokens

The computed absolute_seq_len variable (line 163) is never used, indicating an incomplete implementation. The fix is to use it:

```diff
-      model.setKVCacheSeqCnt(chunk_start);
+      model.setKVCacheSeqCnt(absolute_seq_len);
```

Additionally, the prefill at line 97 should similarly use chunk_start + chunk_prompt_len (the absolute valid length after prefill) instead of just chunk_start.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```diff
     while (!reached_eos && current_chunk_len < chunk_size) {
       total_decode_steps++;
       // Calculate absolute sequence length from the start of the entire sequence
       const int absolute_seq_len = chunk_start + current_chunk_len;
       // MLLM_INFO("--- Chunk {} Decode Step {} ---", chunk_index, total_decode_steps);
       // MLLM_INFO("Current chunk length: {} (relative), Absolute sequence length: {} (absolute)", current_chunk_len, absolute_seq_len);
       // Keep padding clean for the remaining area
       for (int i = current_chunk_len; i < chunk_size; ++i) { sequence_ptr[i] = -1; }
       // Set KV cache to absolute sequence length (where the next token will be written)
       // [Maybe Wrong]
-      model.setKVCacheSeqCnt(chunk_start);
+      model.setKVCacheSeqCnt(absolute_seq_len);
       // MLLM_INFO("KV cache seq_cnt set to: {} (relative position)", chunk_start);
       // Prepare decode input with position_ids from previous step
       mllm::models::ARGenerationOutputPast decode_inputs{
           {"sequence", sequence_tensor},
           {"position_ids", position_ids}};
       // real_seq should be the effective length in the current input tensor (relative position)
       // hidden_states shape is [1, chunk_size, hidden_size], we need to index it with current_chunk_len - 1
       auto decode_output = model.forward(
           decode_inputs, {{"seq_len", mllm::AnyValue(mllm::any_copy_tag, current_chunk_len)}});
       auto& decode_logits = decode_output["sequence"];
       next_token = model.sampleGreedy(decode_logits);
       decode_logits.delete_();
       decode_output.erase("sequence");
       decode_output.clear();
       emit_token(next_token);
       if (reached_eos) { break; }
       sequence_ptr[current_chunk_len] = next_token;
       current_chunk_len++;
     }
```
🤖 Prompt for AI Agents
In examples/qwen_npu/main.cpp around lines 162 to 200, the KV cache seq count is
being set to chunk_start each decode iteration which ignores tokens already
decoded in the current chunk; replace the call to
model.setKVCacheSeqCnt(chunk_start) with
model.setKVCacheSeqCnt(absolute_seq_len) (where absolute_seq_len = chunk_start +
current_chunk_len) so the cache uses the true absolute write position, and also
update the prefill call near line 97 to set the seq count to chunk_start +
chunk_prompt_len (the absolute length after prefill) instead of just
chunk_start.

Comment on lines +77 to +82
```cpp
  struct BufferStats {
    size_t count;
    size_t total_bytes;
  };
  [[nodiscard]] BufferStats getRegisteredBufferStats() const;
```


⚠️ Potential issue | 🟠 Major

Initialize BufferStats and LastRegistrationInfo members to satisfy static analysis and avoid UB

BufferStats currently has uninitialized members, and LastRegistrationInfo::tensor_name relies on implicit default construction. Clang-tidy is already flagging these. Adding explicit in-class initializers is cheap and removes any doubt about default state:

```diff
-  struct BufferStats {
-    size_t count;
-    size_t total_bytes;
-  };
+  struct BufferStats {
+    size_t count = 0;
+    size_t total_bytes = 0;
+  };
```

For LastRegistrationInfo, you can also make the initialization explicit:

```diff
-  struct LastRegistrationInfo {
-    uint32_t tensor_id = 0;           // Tensor ID of the registered tensor
-    std::string tensor_name;          // Tensor name of the registered tensor
-    void* ptr = nullptr;              // Buffer pointer that was successfully registered
-    Qnn_MemHandle_t mem_handle = nullptr;  // QNN memory handle from successful registration
-    size_t bytes = 0;                 // Size of the registered buffer in bytes
-  };
+  struct LastRegistrationInfo {
+    uint32_t tensor_id = 0;                // Tensor ID of the registered tensor
+    std::string tensor_name{};             // Tensor name of the registered tensor
+    void* ptr = nullptr;                   // Buffer pointer that was successfully registered
+    Qnn_MemHandle_t mem_handle = nullptr;  // QNN memory handle from successful registration
+    size_t bytes = 0;                      // Size of the registered buffer in bytes
+  };
```

This aligns with the static-analysis expectations and guarantees well-defined default values.

Also applies to: 120-129

🧰 Tools
🪛 Clang (14.0.6)

[error] 77-77: constructor does not initialize these fields: count, total_bytes

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)

🤖 Prompt for AI Agents
In mllm/backends/qnn/QNNAllocator.hpp around lines 77-82 (and also apply same
change for lines ~120-129), BufferStats and LastRegistrationInfo have members
that are not explicitly initialized; update the class/struct declarations to
provide in-class default initializers (e.g., set BufferStats::count = 0 and
BufferStats::total_bytes = 0, and initialize LastRegistrationInfo::tensor_name
to an empty string) so that default construction yields well-defined values and
satisfies static analysis.

@oreomaker oreomaker merged commit 58da27e into UbiquitousLearning:v2 Nov 20, 2025
3 checks passed