feat: Implement Qwen NPU Decoding Support with Memory Management Fixes #537
Conversation
- Add KV cache sequence count management
- Implement decode loop with position_ids handling
- Add EOS token termination check
- Update forward method to support decode phase
- Correct multi-chunk decode loop and KV cache sequencing
- CausalMaskOp improvement by @oreomaker
Walkthrough

The PR introduces KV cache sequence count management APIs across core and backend layers, significantly enhances the QNN allocator's buffer lifecycle and registration handling, refactors the Qwen NPU example to support chunked inference with multi-phase prefill/decode, and updates the causal mask calculation logic for improved row-wise processing.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Main as main()
    participant Model as QwenForCausalLM
    participant Cache as KVCache
    participant Allocator as QNNAllocator
    participant Backend as QNNBackend
    Main->>Model: setKVCacheSeqCnt(chunk_start)
    Model->>Cache: setCurrentSeqCnt(seq)
    Cache->>Backend: Update sequence counter
    loop For each chunk
        Main->>Model: forward(prompt_chunk or position_ids)
        Model->>Cache: Store KV cache state
        Cache->>Allocator: registerQnnTensorToSharedBuffer(Storage*)
        Allocator->>Allocator: Multi-level fallback search
        alt Registration success
            Allocator->>Backend: Update tensor mem_handle
            Backend->>Backend: Copy input data with validation
            Backend->>Model: Execute graph
            Model->>Model: Generate next token
        else Registration fails
            Allocator->>Allocator: Free buffer & restore state
        end
        Main->>Model: Token emit & EOS check
    end
```

```mermaid
sequenceDiagram
    participant Input as Input Data
    participant QNNBackend as QNNBackend::graphExecute()
    participant Wrapper as QNNTensorWrapper
    participant Allocator as QNNAllocator
    participant QNN as QNN Runtime
    Input->>QNNBackend: Runtime inputs
    QNNBackend->>QNNBackend: Validate input non-nil
    QNNBackend->>Wrapper: Allocate wrapper tensor
    QNNBackend->>Wrapper: Copy data (min bytes or zero-pad)
    Wrapper->>Allocator: registerQnnTensorToSharedBuffer()
    alt Buffer registered
        Allocator->>Allocator: Reuse or create mem_handle
        Allocator->>Wrapper: Update Qnn_Tensor_t
    else Fallback exhausted
        Allocator->>Wrapper: Fail & free buffer
        Wrapper->>QNNBackend: Registration error
    end
    QNNBackend->>QNN: Execute with prepared inputs
```
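The multi-level fallback search shown in the diagrams can be sketched roughly as follows. This is an illustrative stand-in, not the actual allocator API: `RegistrySketch`, `resolve`, and the map member names are hypothetical, and the real code resolves `Qnn_MemHandle_t` values rather than plain ints.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical sketch of the multi-level fallback: try the exact buffer
// pointer first, then the tensor-id map, then the tensor-name map, then
// the last successful registration, before giving up.
struct RegistrySketch {
  std::map<const void*, int> by_ptr;             // ptr -> handle id
  std::map<uint32_t, const void*> by_tensor_id;  // tensor id -> ptr
  std::map<std::string, const void*> by_tensor_name;
  const void* last_registered = nullptr;

  // Returns the handle id, or -1 when every fallback level is exhausted.
  int resolve(const void* ptr, uint32_t id, const std::string& name) const {
    if (auto it = by_ptr.find(ptr); it != by_ptr.end()) return it->second;
    if (auto it = by_tensor_id.find(id); it != by_tensor_id.end())
      if (auto h = by_ptr.find(it->second); h != by_ptr.end()) return h->second;
    if (auto it = by_tensor_name.find(name); it != by_tensor_name.end())
      if (auto h = by_ptr.find(it->second); h != by_ptr.end()) return h->second;
    if (last_registered != nullptr)
      if (auto h = by_ptr.find(last_registered); h != by_ptr.end()) return h->second;
    return -1;  // caller frees the buffer and restores state
  }
};
```

The exhausted case corresponds to the "Registration fails" branch in the diagram, where the allocator frees the buffer and restores its prior state.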
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning) | ✅ Passed checks (2 passed)
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
mllm/backends/cpu/ops/CausalMaskOp.cpp (2)
61-61: Critical: Undefined variable `row_offset` in AVX2 path.

Line 61 references `row_offset`, which is not defined in this scope (the non-sliding-window block). The loop variable is `r`, and the addressing should match line 59 and the NEON implementation on line 76. Apply this diff to fix the undefined variable:

```diff
- float* fill_start = o_ptr + row_offset + copy_count;
+ float* fill_start = o_ptr + r * D + copy_count;
```
146-181: Float16 paths not updated with multi-chunk fix.

The float16 implementations (both AVX2 and NEON) still use the old formula `copy_count = s + 1` instead of the corrected `D - S + r + 1`. This creates an inconsistency where float32 correctly handles multi-chunk scenarios but float16 still assumes D == S. If the float16 dtype is used during multi-chunk decoding, the attention masks will be incorrect.

Apply this diff to update the float16 AVX2 path (lines 149-152):

```diff
 for (size_t s = 0; s < S; ++s) {
   const size_t row_offset = s * S;
-  const size_t copy_count = s + 1;
-  const size_t fill_count = S - copy_count;
+  const size_t copy_count = D - S + s + 1;
+  const size_t fill_count = std::max(D - copy_count, (size_t)0);
```

Apply this diff to update the float16 NEON path (lines 167-170):

```diff
 for (size_t s = 0; s < S; ++s) {
   const size_t row_offset = s * S;
-  const size_t copy_count = s + 1;
-  const size_t fill_count = S - copy_count;
+  const size_t copy_count = D - S + s + 1;
+  const size_t fill_count = std::max(D - copy_count, (size_t)0);
```
🧹 Nitpick comments (7)
mllm/nn/Module.hpp (1)
190-212: Const `list()` overload looks good; consider marking both overloads `[[nodiscard]]`.

The added `const std::vector<T>& list() const` is a clean const-correct accessor and matches existing usage of `layers_`. To keep clang-tidy happy and make intent explicit that callers shouldn't silently ignore this, you can mark both overloads as `[[nodiscard]]`:

```cpp
[[nodiscard]] std::vector<T>& list() { return layers_; }
[[nodiscard]] const std::vector<T>& list() const { return layers_; }
```

mllm/backends/cpu/ops/KVCacheOp.hpp (1)
12-28: CPUKVCacheOp seq-count overrides are consistent; consider `[[nodiscard]]`.

The `setCurrentSeqCnt`/`getCurrentSeqCnt` overrides align with the new `aops::KVCacheOp` API and look correct. To satisfy clang-tidy and make ignoring the result explicit, consider marking the base `KVCacheOp::getCurrentSeqCnt()` (and thus this override) as `[[nodiscard]]` so all backends inherit the contract.

mllm/nn/layers/KVCache.hpp (1)
19-28: KVCache seq-count API is reasonable; consider `[[nodiscard]]` on the getter.

The added `setCurrentSeqCnt`/`getCurrentSeqCnt` accessors give an appropriate hook for higher layers to manage KV cache sequence length. Given that ignoring the return value from `getCurrentSeqCnt()` is almost certainly unintended and clang-tidy is already flagging it, consider:

```cpp
[[nodiscard]] int32_t getCurrentSeqCnt() const;
```

to make the contract explicit and quiet the warning.
mllm/core/aops/KVCacheOp.hpp (1)
35-49: Base KVCacheOp seq-count hooks are well-scoped; consider `[[nodiscard]]` on the getter (and possibly `options()`).

The new virtuals with safe defaults are a good way to expose seq-count management without forcing all backends to implement it immediately. Given this is effectively a query API, it's worth marking `getCurrentSeqCnt()` as `[[nodiscard]]` in the base class so callers don't accidentally drop the value (note the attribute must precede `virtual`):

```cpp
[[nodiscard]] virtual int32_t getCurrentSeqCnt() const { return -1; }
```

You may also want to mark `options()` as `[[nodiscard]]` to address the clang-tidy hint, but that's more stylistic.
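To make the layering concrete, here is a minimal standalone sketch of the pattern this comment reviews: a base op exposes seq-count virtuals with safe defaults, and a backend override wires them to its cache state. Class names (`KVCacheOpBase`, `CpuKVCacheOpSketch`, `StaticCacheSketch`) are simplified stand-ins, not the project's actual types.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the per-op cache state (hypothetical, simplified).
struct StaticCacheSketch {
  int32_t seq_cnt = 0;
};

// Base op: virtuals with safe defaults, so backends that don't support
// seq-count management need no changes. The getter is [[nodiscard]] so
// callers can't silently drop the value.
struct KVCacheOpBase {
  virtual ~KVCacheOpBase() = default;
  virtual void setCurrentSeqCnt(int32_t /*seq*/) {}  // no-op default
  [[nodiscard]] virtual int32_t getCurrentSeqCnt() const { return -1; }
};

// CPU backend override: forwards both calls to its cache.
struct CpuKVCacheOpSketch final : KVCacheOpBase {
  StaticCacheSketch cache;
  void setCurrentSeqCnt(int32_t seq) override { cache.seq_cnt = seq; }
  [[nodiscard]] int32_t getCurrentSeqCnt() const override { return cache.seq_cnt; }
};
```

Calling through the base reference dispatches to the backend implementation, while backends without an override keep the harmless default behavior.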
567-580: Guard nullsrc_ptrcheck bybytes_to_copyto handle 0-byte inputs safelyRight now, a runtime tensor with
src_bytes == 0but a null data pointer will still trigger the error path, even though no bytes need to be copied:const void* src_ptr = runtime_input.ptr<void>(); size_t bytes_to_copy = std::min(dst_bytes, src_bytes); if (!src_ptr) { MLLM_ERROR("Runtime input tensor {} for graph '{}' has null data pointer", i, graphName); return; } if (dst_ptr && src_ptr && dst_ptr != src_ptr) { if (bytes_to_copy > 0) { std::memcpy(dst_ptr, src_ptr, bytes_to_copy); } ... }Consider only enforcing
src_ptr != nullptrwhenbytes_to_copy > 0, e.g.:- const void* src_ptr = runtime_input.ptr<void>(); - size_t bytes_to_copy = std::min(dst_bytes, src_bytes); - if (!src_ptr) { - MLLM_ERROR("Runtime input tensor {} for graph '{}' has null data pointer", i, graphName); - return; - } + const void* src_ptr = runtime_input.ptr<void>(); + size_t bytes_to_copy = std::min(dst_bytes, src_bytes); + if (bytes_to_copy > 0 && !src_ptr) { + MLLM_ERROR("Runtime input tensor {} for graph '{}' has null data pointer", i, graphName); + return; + }This keeps the error for real data copies while remaining robust if a zero-length tensor is passed.
mllm/backends/qnn/QNNAllocator.cpp (2)
94-107: Consider nulling `storage->ptr_` even when the pointer is not tracked.

In `QNNAllocator::free(Storage* storage)`, you early-return when the pointer is not in `qnnMemPtrSet_`:

```cpp
if (qnnMemPtrSet_.count(ptr) == 0) {
  QNN_ALLOCATOR_VERBOSE("QNNAllocator::free called for ptr={} that is not in qnnMemPtrSet_, ignoring", ptr);
  return;
}
...
storage->ptr_ = nullptr;
```

This leaves `storage->ptr_` untouched in that path, so callers may still see a non-null pointer after `free()` if the storage was never allocated by this allocator (or was already cleaned up elsewhere). For extra safety and clearer semantics ("after free, `ptr_` is either owned nowhere or null"), you could null it even on the early-return path:

```diff
 if (qnnMemPtrSet_.count(ptr) == 0) {
   QNN_ALLOCATOR_VERBOSE("QNNAllocator::free called for ptr={} that is not in qnnMemPtrSet_, ignoring", ptr);
-  return;
+  storage->ptr_ = nullptr;
+  return;
 }
```

This shouldn't affect the normal ownership protocol but makes misuse of the allocator interface less likely to leave stale pointers around.
Also applies to: 181-182
15-23: Macro and naming style issues are non-functional but can be aligned with lint expectations.

Static analysis complains about:

- The `kVerboseQnnAllocatorLogs` naming style.
- The variadic macro `QNN_ALLOCATOR_VERBOSE`.

These are stylistic only, but if you want to silence the lints you can:

- Rename the constant to match your project's convention (e.g. `kVerboseQnnAllocatorLogs` → `kVerboseQnnAllocatorLogsEnabled` or `VERBOSE_QNN_ALLOCATOR_LOGS`, depending on your rules).
- Replace the macro with an inline helper:

```cpp
inline void QnnAllocatorVerboseLog(const char* fmt, auto&&... args) {
  if constexpr (kVerboseQnnAllocatorLogs) {
    MLLM_INFO(fmt, std::forward<decltype(args)>(args)...);
  }
}
```

and call `QnnAllocatorVerboseLog(...)` instead of the macro. Given the flag is currently `false`, this is low priority and purely to keep tooling quiet.

Also applies to: 492-512
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
- examples/qwen_npu/main.cpp (3 hunks)
- mllm/backends/cpu/ops/CausalMaskOp.cpp (2 hunks)
- mllm/backends/cpu/ops/KVCacheOp.cpp (1 hunks)
- mllm/backends/cpu/ops/KVCacheOp.hpp (1 hunks)
- mllm/backends/qnn/QNNAllocator.cpp (4 hunks)
- mllm/backends/qnn/QNNAllocator.hpp (4 hunks)
- mllm/backends/qnn/QNNBackend.cpp (2 hunks)
- mllm/backends/qnn/QNNUtils.cpp (4 hunks)
- mllm/backends/qnn/QNNUtils.hpp (2 hunks)
- mllm/core/aops/KVCacheOp.hpp (1 hunks)
- mllm/models/qwen_npu/modeling_qwen_npu.hpp (4 hunks)
- mllm/nn/Module.hpp (1 hunks)
- mllm/nn/layers/KVCache.cpp (1 hunks)
- mllm/nn/layers/KVCache.hpp (1 hunks)
🧰 Additional context used
🪛 Clang (14.0.6)
mllm/nn/layers/KVCache.hpp
[error] 27-27: function 'getCurrentSeqCnt' should be marked [[nodiscard]]
(modernize-use-nodiscard,-warnings-as-errors)
mllm/backends/cpu/ops/KVCacheOp.hpp
[error] 24-24: function 'getCurrentSeqCnt' should be marked [[nodiscard]]
(modernize-use-nodiscard,-warnings-as-errors)
mllm/nn/Module.hpp
[error] 211-211: function 'list' should be marked [[nodiscard]]
(modernize-use-nodiscard,-warnings-as-errors)
mllm/backends/qnn/QNNUtils.cpp
[error] 376-376: variable 'currentPtr' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 376-376: invalid case style for variable 'currentPtr'
(readability-identifier-naming,-warnings-as-errors)
[error] 377-377: implicit conversion 'void *' -> bool
(readability-implicit-bool-conversion,-warnings-as-errors)
[error] 387-387: variable 'requiredBytes' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 387-387: invalid case style for variable 'requiredBytes'
(readability-identifier-naming,-warnings-as-errors)
[error] 392-392: implicit conversion 'void *' -> bool
(readability-implicit-bool-conversion,-warnings-as-errors)
[error] 401-401: variable 'registeredBytes' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 401-401: invalid case style for variable 'registeredBytes'
(readability-identifier-naming,-warnings-as-errors)
[error] 416-416: variable 'freshPtr' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 416-416: invalid case style for variable 'freshPtr'
(readability-identifier-naming,-warnings-as-errors)
[error] 417-417: variable 'bytesToCopy' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 417-417: invalid case style for variable 'bytesToCopy'
(readability-identifier-naming,-warnings-as-errors)
[error] 419-419: implicit conversion 'void *' -> bool
(readability-implicit-bool-conversion,-warnings-as-errors)
mllm/core/aops/KVCacheOp.hpp
[error] 43-43: function 'getCurrentSeqCnt' should be marked [[nodiscard]]
(modernize-use-nodiscard,-warnings-as-errors)
[error] 45-45: function 'options' should be marked [[nodiscard]]
(modernize-use-nodiscard,-warnings-as-errors)
[error] 48-48: member variable 'options_' has protected visibility
(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)
mllm/backends/qnn/QNNAllocator.cpp
[error] 16-16: invalid case style for global constant 'kVerboseQnnAllocatorLogs'
(readability-identifier-naming,-warnings-as-errors)
[error] 19-19: variadic macro 'QNN_ALLOCATOR_VERBOSE' used; consider using a 'constexpr' variadic template function
(cppcoreguidelines-macro-usage,-warnings-as-errors)
mllm/backends/qnn/QNNAllocator.hpp
[error] 77-77: constructor does not initialize these fields: count, total_bytes
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 120-120: constructor does not initialize these fields: tensor_name
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
🔇 Additional comments (6)
mllm/backends/cpu/ops/CausalMaskOp.cpp (1)
56-57: Multi-chunk causal mask formula is mathematically correct.

The updated formula `copy_count = D - S + r + 1` correctly handles both single-chunk (D == S → r + 1) and multi-chunk scenarios (D > S → allows attending to all previous chunks). The defensive `max(D - copy_count, 0)` prevents underflow.

Also applies to: 71-72
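A minimal standalone sketch of the row-wise fill this formula produces, assuming D >= S (prefix tokens plus the current chunk). `buildCausalMask` is an illustrative name, not the op's actual API:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the corrected row-wise causal mask fill.
// S = rows in this chunk, D = total key length (prefix + chunk), D >= S.
// copy_count = D - S + r + 1 keeps all prefix positions visible and
// reduces to the classic r + 1 when D == S (single chunk).
std::vector<float> buildCausalMask(size_t S, size_t D, float neg = -1e9f) {
  std::vector<float> mask(S * D, 0.0f);
  for (size_t r = 0; r < S; ++r) {
    const size_t copy_count = D - S + r + 1;               // visible positions
    const size_t fill_count = D > copy_count ? D - copy_count : 0;
    float* row = mask.data() + r * D;
    std::fill(row + copy_count, row + copy_count + fill_count, neg);
  }
  return mask;
}
```

For S == D == 2 this yields the familiar lower-triangular mask; for S = 2, D = 5 (three prefix tokens) each row additionally sees the whole prefix.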
mllm/backends/qnn/QNNUtils.hpp (1)
184-222: QNNTensorWrapper header additions look consistent with the implementation.

The `[[nodiscard]] getName()`, the new `resetAlloc()` API, and the `registeredPtr_` member cleanly mirror the allocation/registration logic implemented in `QNNUtils.cpp`. Interface shape and constness look good; no issues from the header side.

mllm/nn/layers/KVCache.cpp (1)
20-32: KVCache seq-count forwarding matches the existing delegation pattern.

`setCurrentSeqCnt` and `getCurrentSeqCnt` correctly forward to the underlying `aops::KVCacheOp` using the same `static_pointer_cast` pattern as `setLayerIndex` and `clearCache`. No functional issues from this layer wrapper.

mllm/backends/cpu/ops/KVCacheOp.cpp (1)
45-50: Please confirm `StaticCache::setCurrentSeqCnt` semantics vs per-layer usage.

`CPUKVCacheOp::getCurrentSeqCnt()` queries `cache_.getCurrentSeqCnt(options_.layer_idx)`, while `setCurrentSeqCnt(int32_t seq)` calls `cache_.setCurrentSeqCnt(seq)` without the layer index. If `nn::StaticCache` tracks sequence counts per layer (as the getter signature suggests), you may want a symmetric API that also keys `setCurrentSeqCnt` by `layer_idx`. If instead `setCurrentSeqCnt` is intentionally global/shared, this is fine; just worth confirming to avoid mixing global and per-layer state by accident.

mllm/backends/qnn/QNNUtils.cpp (1)
14-15: QNNTensorWrapper registered-buffer reuse logic looks consistent and defensive.

Including `<cstring>` for `std::memcpy`, tracking `registeredPtr_` for static tensors, and the expanded `alloc()` logic collectively give you:

- Safe reuse of an existing registered buffer when it's still valid and large enough.
- A clear path to drop and re-register when the old buffer is too small.
- Protection against dangling `registeredPtr_` via `isRegistered()` checks.
- An early return when the current storage is already the registered buffer, avoiding redundant registration.

The separation between `isAlloc_` (binding state) and `registeredPtr_` (last successful buffer) also aligns with the allocator-level "remember last registration" behavior mentioned in the PR description. From this file alone, the control flow and memory handling look sound.

Also applies to: 352-371, 373-439
mllm/models/qwen_npu/modeling_qwen_npu.hpp (1)
270-272: KV cache accessors and seq-count plumbing look consistent.

The new const `getKVCache` overloads, `QwenText::setKVCacheSeqCnt`/`getKVCacheSeqCnt`, and the corresponding `QwenForCausalLM` forwards form a clean, minimal surface for external KV cache sequence management, with reasonable bounds checking on `layer_idx`. No issues from this header-level change.

Also applies to: 446-455, 469-473
```cpp
while (!reached_eos && current_chunk_len < chunk_size) {
  total_decode_steps++;

  // Calculate absolute sequence length from the start of the entire sequence
  const int absolute_seq_len = chunk_start + current_chunk_len;

  // MLLM_INFO("--- Chunk {} Decode Step {} ---", chunk_index, total_decode_steps);
  // MLLM_INFO("Current chunk length: {} (relative), Absolute sequence length: {} (absolute)", current_chunk_len, absolute_seq_len);

  // Keep padding clean for the remaining area
  for (int i = current_chunk_len; i < chunk_size; ++i) { sequence_ptr[i] = -1; }

  // Set KV cache to absolute sequence length (where the next token will be written)
  // [Maybe Wrong]
  model.setKVCacheSeqCnt(chunk_start);
  // MLLM_INFO("KV cache seq_cnt set to: {} (relative position)", chunk_start);

  // Prepare decode input with position_ids from previous step
  mllm::models::ARGenerationOutputPast decode_inputs{
      {"sequence", sequence_tensor},
      {"position_ids", position_ids}};

  // real_seq should be the effective length in the current input tensor (relative position)
  // hidden_states shape is [1, chunk_size, hidden_size], we need to index it with current_chunk_len - 1
  auto decode_output = model.forward(
      decode_inputs, {{"seq_len", mllm::AnyValue(mllm::any_copy_tag, current_chunk_len)}});

  auto& decode_logits = decode_output["sequence"];
  next_token = model.sampleGreedy(decode_logits);
  decode_logits.delete_();
  decode_output.erase("sequence");
  decode_output.clear();

  emit_token(next_token);
  if (reached_eos) { break; }

  sequence_ptr[current_chunk_len] = next_token;
  current_chunk_len++;
}
```
🧩 Analysis chain
KV cache seq count is likely wrong in the decode loop (comment already says [Maybe Wrong])
In the per-chunk decode loop you recompute the absolute sequence length:
```cpp
const int absolute_seq_len = chunk_start + current_chunk_len;
...
// Set KV cache to absolute sequence length (where the next token will be written)
// [Maybe Wrong]
model.setKVCacheSeqCnt(chunk_start);
```

Using `chunk_start` here ignores already-consumed tokens in the current chunk and appears inconsistent with the comment and with the new KV cache APIs. This can misalign the KV cache during multi-step decode, especially after the first generated token within the last chunk.
A more consistent approach would be to base the sequence count on the absolute number of valid tokens processed so far, e.g.:
```diff
- // Set KV cache to absolute sequence length (where the next token will be written)
- // [Maybe Wrong]
- model.setKVCacheSeqCnt(chunk_start);
+ // Set KV cache to the absolute sequence length so far (prefill + decoded tokens)
+ // The next token will be written after `absolute_seq_len - 1`.
+ model.setKVCacheSeqCnt(absolute_seq_len);
```

or, if `setKVCacheSeqCnt` is defined to take "index of next write" rather than "current valid length", adjust accordingly (e.g. `absolute_seq_len - 1`).
Please double-check this against nn::KVCache::setCurrentSeqCnt semantics; the current implementation is very likely incorrect for multi-step decode.
KV cache seq count must track absolute sequence length, not just chunk offset
The bug is confirmed. In the decode loop at line 176, you reset the KV cache seq count to `chunk_start` on every iteration, but `current_chunk_len` increments. This causes the cache to lose track of tokens decoded within the chunk:

- Iteration 1: `current_chunk_len = chunk_prompt_len + 1`, but the cache is set to `chunk_start` → off by `chunk_prompt_len` tokens
- Iteration 2: `current_chunk_len = chunk_prompt_len + 2`, but the cache is still `chunk_start` → off by `chunk_prompt_len + 1` tokens
The computed `absolute_seq_len` variable (line 163) is never used, indicating an incomplete implementation. The fix is to use it:

```diff
- model.setKVCacheSeqCnt(chunk_start);
+ model.setKVCacheSeqCnt(absolute_seq_len);
```

Additionally, the prefill at line 97 should similarly use `chunk_start + chunk_prompt_len` (the absolute valid length after prefill) instead of just `chunk_start`.
```cpp
struct BufferStats {
  size_t count;
  size_t total_bytes;
};
[[nodiscard]] BufferStats getRegisteredBufferStats() const;
```
Initialize BufferStats and LastRegistrationInfo members to satisfy static analysis and avoid UB
`BufferStats` currently has uninitialized members, and `LastRegistrationInfo::tensor_name` relies on implicit default construction. Clang-tidy is already flagging these. Adding explicit in-class initializers is cheap and removes any doubt about default state:
```diff
- struct BufferStats {
-   size_t count;
-   size_t total_bytes;
- };
+ struct BufferStats {
+   size_t count = 0;
+   size_t total_bytes = 0;
+ };
```

For `LastRegistrationInfo`, you can also make the initialization explicit:
```diff
- struct LastRegistrationInfo {
-   uint32_t tensor_id = 0;               // Tensor ID of the registered tensor
-   std::string tensor_name;              // Tensor name of the registered tensor
-   void* ptr = nullptr;                  // Buffer pointer that was successfully registered
-   Qnn_MemHandle_t mem_handle = nullptr; // QNN memory handle from successful registration
-   size_t bytes = 0;                     // Size of the registered buffer in bytes
- };
+ struct LastRegistrationInfo {
+   uint32_t tensor_id = 0;               // Tensor ID of the registered tensor
+   std::string tensor_name{};            // Tensor name of the registered tensor
+   void* ptr = nullptr;                  // Buffer pointer that was successfully registered
+   Qnn_MemHandle_t mem_handle = nullptr; // QNN memory handle from successful registration
+   size_t bytes = 0;                     // Size of the registered buffer in bytes
+ };
```

This aligns with the static-analysis expectations and guarantees well-defined default values.
Also applies to: 120-129
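As a quick standalone check of the suggested in-class initializers (`Qnn_MemHandle_t` is stood in by `void*` here, since the real typedef lives in the QNN SDK):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Sketch of the suggested defaults: every member has a well-defined
// value after default construction, so stats and registration state
// can never be read as indeterminate.
struct BufferStats {
  size_t count = 0;
  size_t total_bytes = 0;
};

struct LastRegistrationInfo {
  uint32_t tensor_id = 0;
  std::string tensor_name{};
  void* ptr = nullptr;
  void* mem_handle = nullptr;  // stand-in for Qnn_MemHandle_t
  size_t bytes = 0;
};
```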
Summary
This PR implements complete decoding support for Qwen NPU models on the QNN backend, including both single-chunk and multi-chunk decoding. It also fixes critical memory management issues encountered during the decode phase and improves CausalMaskOp for multi-chunk scenarios.
Features Implemented
1. Single-Chunk Decoding Support
Implemented basic decoding functionality for input sequences shorter than the chunk size (128 tokens):
- KV Cache Sequence Management: Added `setKVCacheSeqCnt()` and `getKVCacheSeqCnt()` methods across the KV cache hierarchy
  - `aops::KVCacheOp`: Added virtual `setCurrentSeqCnt()` and `getCurrentSeqCnt()` methods
  - `CPUKVCacheOp`: Implemented sequence count management using `StaticCache`
  - `nn::KVCache`: Added layer interface for sequence count control
  - `QwenText` and `QwenForCausalLM`: Added model-level APIs for KV cache management
- Decode Loop Implementation: `examples/qwen_npu/main.cpp`
- Forward Method Updates: `QwenForCausalLM::forward()` to support the decode phase

2. Multi-Chunk Decoding Support
Extended decoding to handle long input sequences that exceed the chunk size.
3. CausalMaskOp Improvement
CausalMaskOp improvement by @oreomaker.
Fixed causal mask calculation for multi-chunk decoding scenarios:
- Problem: The original mask calculation `copy_count = std::min(r + 1, (size_t)D)` was incorrect for multi-chunk scenarios where the sequence length (S) and dimension (D) differ.
- Solution: Changed to `copy_count = D - S + r + 1` to correctly handle cases where S < D (multi-chunk scenarios with padding).

4. Memory Management Fixes
Fixed critical memory management issues in the QNN backend during the decode phase:

- Problem 1: `Failed to find memHandle 0x1`
- Problem 2: FastRPC memory mapping failures
- Problem 3: `memDeRegister` failures
Key Changes
Core KV Cache Interface (`mllm/core/aops/KVCacheOp.hpp`, `mllm/backends/cpu/ops/KVCacheOp.{hpp,cpp}`)

- Added `setCurrentSeqCnt(int32_t seq)` virtual method to `aops::KVCacheOp`
- Added `getCurrentSeqCnt()` const method to `aops::KVCacheOp`
- Implemented both in `CPUKVCacheOp` using the `StaticCache` API

Layer Interface (`mllm/nn/layers/KVCache.{hpp,cpp}`)

- Added `setCurrentSeqCnt(int32_t seq)` method
- Added `getCurrentSeqCnt(int32_t layer_idx)` const method

Model Interface (`mllm/models/qwen_npu/modeling_qwen_npu.hpp`)

- Added `setKVCacheSeqCnt(int32_t seq)` to `QwenText` and `QwenForCausalLM`
- Added `getKVCacheSeqCnt(int32_t layer_idx)` const method
- Updated `forward()` method to handle the decode phase with position_ids

QNN Backend Memory Management (`mllm/backends/qnn/QNNAllocator.{hpp,cpp}`)

- Added `tensorIdToPtrMap_` and `tensorNameToPtrMap_` for buffer lookup by tensor identity
- Added `reuseExistingBuffer()` lambda with multi-level fallback
- Added `LastRegistrationInfo` structure to track the last successful registration
- `eraseTensorMappingsForPtr()`: Clean up tensor ID/name mappings
- `rememberLastRegistration()`: Track successful registrations
- `clearLastRegistrationIfMatches()`: Clean up last registration info
- Updated `free()` method with alias detection and reference counting
- `registerQnnTensorToSharedBuffer()`: frees the buffer and restores state when registration fails

QNN Backend Execution (`mllm/backends/qnn/QNNBackend.cpp`)

- Updated `graphExecute()` with input validation and defensive data copies

QNN Utils (`mllm/backends/qnn/QNNUtils.cpp`)

- Reworked `QNNTensorWrapper::alloc()` to reuse or re-register shared buffers

CausalMaskOp (`mllm/backends/cpu/ops/CausalMaskOp.cpp`)

- Changed `copy_count = std::min(r + 1, (size_t)D)` to `copy_count = D - S + r + 1`

Example Implementation (`examples/qwen_npu/main.cpp`)

Related Commits
This PR consolidates the following commits:
- `1d5d253`: feat: implement Qwen NPU simple single-chunk decoding support and memory management fixes
- `b438b3d`: implement Qwen NPU simple multi-chunk decoding support
- `e26b11b`: fix: stabilize QNN multi-chunk decoding (including CausalMaskOp improvement)

Co-authors
This PR is a collaborative effort.
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Improvements