feat(service): implement interactive chat loop in qwen3 service #462
Merged
chenghuaWang merged 8 commits into UbiquitousLearning:v2 on Oct 10, 2025
Conversation
… references

- Rename directory and files from `paged_attn_x` to `radix_attn`
- Update namespace from `mllm::cpu::paged_attn_x` to `mllm::cpu::radix_attn`
- Remove unused includes and context-related code in fwd_bshd.hpp
- Add new ops: RadixAttnOp and Scatter2ShardsOp with CPU implementations
- Introduce tensor creation utilities: fromVector and refVectorData
- Add kRadixAttn and kScatter2Shards to OpTypes enum and optype2Str
- Remove kPrefixCache from PagedAttnImplType
- Add necessary headers and mark the tensor ptr attribute `[[nodiscard]]`
- Register `RadixAttnOp` and `Scatter2ShardsOp` in the CPU backend
- Implement forward logic for `RadixAttnOp` with architecture-specific kernels
- Update `RadixAttnOpOptions` to include head count parameters
- Fix const-correctness and template usage in the radix attention kernel
- Remove unnecessary `.to(kFloat16)` calls in the Qwen3 attention module
- Adjust function signatures for better type safety in the radix attention kernel
- Refactor kernel calls to use namespaced static dispatch
The Q tensor layout was incorrectly specified as [B, H_Q, S_Q, D] when it should be [B, S_Q, H_Q, D]. This change updates the documentation and adjusts the indexing logic accordingly. A FIXME comment is added to flag potential performance improvements, and a TODO comment notes that the kernel's layout needs further review.

feat(cpu): add input layout support for RoPE operation

Introduces support for different input layouts (BHSD and BSHD) in the RoPE operation. The forward method now accepts an input_layout_type parameter, allowing the RoPE operation to correctly process tensors with varying memory layouts. The implementation covers both float32 and float16 data types across x86 and ARM architectures, and vectorized processing is preserved for both layout types.

Additionally, this commit:

- Adds enum class RoPEOpOptionsInputType to specify layout types
- Updates RoPEOpOptions to include input_type configuration
- Modifies Qwen3Attention to use the BSHD layout for RoPE operations
- Adjusts tensor scattering indices in Qwen3Attention from 2 to 1
- Implements the RadixAttn layer and registers it in Qwen3Attention
- Updates Qwen3Session::applyChatTemplate to use nlohmann::json

fix(include): update prefix cache include directive

Updates the include directive in PrefixCache.hpp to properly export the cache header, ensuring correct usage throughout the codebase.
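The layout-dependent part of a RoPE kernel is only the offset arithmetic. As a minimal sketch (the enum and function names here are hypothetical illustrations, not the actual mllm API), the same rotation loop can serve both BHSD and BSHD tensors by switching how the flat index is computed:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical layout enum, loosely mirroring RoPEOpOptionsInputType.
enum class RoPELayout { kBHSD, kBSHD };

// Apply rotary embedding in-place to a contiguous float tensor.
// Only the base-offset formula depends on the layout:
//   BHSD: index = ((b*H + h)*S + s)*D + d
//   BSHD: index = ((b*S + s)*H + h)*D + d
void applyRoPE(std::vector<float>& x, size_t B, size_t H, size_t S, size_t D,
               RoPELayout layout, float theta = 10000.f) {
  for (size_t b = 0; b < B; ++b)
    for (size_t h = 0; h < H; ++h)
      for (size_t s = 0; s < S; ++s) {
        const size_t base = (layout == RoPELayout::kBHSD)
                                ? ((b * H + h) * S + s) * D
                                : ((b * S + s) * H + h) * D;
        // Rotate the pair (d, d + D/2) by a position-dependent angle.
        for (size_t d = 0; d < D / 2; ++d) {
          const float freq = std::pow(theta, -2.f * float(d) / float(D));
          const float ang = float(s) * freq;
          const float c = std::cos(ang), sn = std::sin(ang);
          const float lo = x[base + d], hi = x[base + d + D / 2];
          x[base + d] = lo * c - hi * sn;
          x[base + d + D / 2] = lo * sn + hi * c;
        }
      }
}
```

With H = 1 the two index formulas coincide, which gives a cheap sanity check that both branches perform the same rotation.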
- Add FilledWithConst struct template in arch.hpp
- Implement FilledWithConst specializations for __AnyArchTag and __ArmArchTag
- Replace manual loop initialization with FilledWithConst in fwd_bshd.hpp
- Update Q tensor shape indexing in RadixAttnOp.cpp
- Improve VectorDotProduct, MulFromConst, and FMAConstArray ARM implementations
- Fix softmax computation logic in the radix attention forward pass
- Add proper mask handling in the RadixAttnKernel test
- Add Scatter2ShardsKernelTest for shard scattering validation
- Fix random state management in the Context class
- Add RandomStatesTest for verifying random seed behavior
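The FilledWithConst dispatch described above can be sketched as a tag-specialized struct template. The tag names follow the commit text, but the signature of `run` is an assumption for illustration; the real ARM specialization would use NEON intrinsics (e.g. vdupq_n_f32/vst1q_f32), shown here as the same scalar loop so the sketch stays portable:

```cpp
#include <cstddef>

// Architecture tags as named in the commit; definitions here are stand-ins.
struct __AnyArchTag {};
struct __ArmArchTag {};

// Generic fallback: a plain scalar fill loop.
template <typename ArchTag>
struct FilledWithConst {
  static void run(float* dst, size_t n, float value) {
    for (size_t i = 0; i < n; ++i) dst[i] = value;
  }
};

// ARM specialization. A real build would vectorize with NEON; the scalar
// body below keeps this sketch compilable everywhere.
template <>
struct FilledWithConst<__ArmArchTag> {
  static void run(float* dst, size_t n, float value) {
    for (size_t i = 0; i < n; ++i) dst[i] = value;
  }
};
```

Call sites then pick the implementation at compile time via the tag, e.g. `FilledWithConst<__ArmArchTag>::run(buf, n, 0.f)`, replacing hand-written initialization loops.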
- Add new `qwen3_service` example with CMake build support
- Introduce `startService`, `stopService`, and `insertSession` APIs for simplified service control
- Refactor `Session` class to remove dependency on `ARGeneration` and improve flexibility
- Enhance `Qwen3Session` with thinking token handling and improved cache management
- Improve error messages in model loading and extend rotary positional embedding support
- Update radix attention kernel comments and parallelization hints
- Remove obsolete PyPI README content

The changes enable better service deployment and model session handling, especially for Qwen3-based models with thinking token capabilities.
Ensure that `k_cache_addresses` and `v_cache_addresses` are properly resized to match the number of transformer blocks during radix tree search failure.

fix(qwen3): correct cache address gathering and physical address mapping

Replace `std::ranges::copy` with `insert` for efficient appending of cache addresses. Fix incorrect use of `k_cache_addr` where `v_cache_addr` should be used when mapping physical addresses for value cache.
- Replace hardcoded request with interactive user input loop
- Add support for multi-turn conversations with history tracking
- Integrate thinking state visualization using the fmt library
- Handle graceful exit with /exit or /quit commands
- Improve response formatting with proper JSON structure

fix(cpu): support BSHD layout in RoPE operation

- Extend RoPE operator to handle both BHSD and BSHD tensor layouts
- Add layout type checking and dimension mapping logic
- Fix assertion to allow multiple input layout types

refactor(service): improve request pool shutdown handling

- Change RequestPool::pop() to return std::optional<RequestItem>
- Update service worker loop to handle optional requests
- Ensure worker threads join properly during shutdown
- Add timestamp to response payloads
- Modify response format to follow chat completion chunk structure

perf(cache): optimize ZenFS blob size calculation

- Simplify bit calculation using literal 64 instead of sizeof(uint64_t)
- Correct blob size computation by using consistent dtype lanes

fix(qwen3): correct loop initialization and template processing

- Initialize loop counters properly in attention mechanisms
- Refine chat template processing for better tool and thinking support
- Adjust tensor shapes for sequence and position IDs
- Update default sampling parameters and EOS token ID
- Change cache data types from float16 to float32 for better precision

style(qwen3): reformat function signature for readability

- Break long function declaration into multiple lines
- Improve code formatting and parameter alignment
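The request-pool shutdown pattern above can be sketched as follows. This is a minimal illustration, not the actual mllm implementation: `RequestItem`'s fields and the class internals are assumptions. The key idea is that `pop()` returns `std::nullopt` once the pool is shut down and drained, so worker loops exit naturally and their threads can be joined:

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <utility>

// Hypothetical request item; the real payload type is richer.
struct RequestItem {
  std::string payload;
};

class RequestPool {
 public:
  void push(RequestItem item) {
    {
      std::lock_guard<std::mutex> lk(mu_);
      q_.push(std::move(item));
    }
    cv_.notify_one();
  }

  // Blocks until an item is available or shutdown() is called.
  // Returns std::nullopt only when the pool is stopped and drained.
  std::optional<RequestItem> pop() {
    std::unique_lock<std::mutex> lk(mu_);
    cv_.wait(lk, [this] { return stopped_ || !q_.empty(); });
    if (q_.empty()) return std::nullopt;
    RequestItem item = std::move(q_.front());
    q_.pop();
    return item;
  }

  void shutdown() {
    {
      std::lock_guard<std::mutex> lk(mu_);
      stopped_ = true;
    }
    cv_.notify_all();  // wake every blocked worker so it can exit
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<RequestItem> q_;
  bool stopped_ = false;
};
```

A worker loop then reduces to `while (auto req = pool.pop()) { handle(*req); }`: once `shutdown()` runs and the queue empties, `pop()` yields `std::nullopt`, the loop ends, and the service can `join()` its worker threads.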
Delete outdated GitHub Actions workflows for macOS and x86 builds that are no longer needed in the current CI/CD pipeline configuration.