feat(service): implement interactive chat loop in qwen3 service #462
Merged
chenghuaWang merged 8 commits into UbiquitousLearning:v2 on Oct 10, 2025
Conversation
… references

- Rename directory and files from `paged_attn_x` to `radix_attn`
- Update namespace from `mllm::cpu::paged_attn_x` to `mllm::cpu::radix_attn`
- Remove unused includes and context-related code in fwd_bshd.hpp
- Add new ops: RadixAttnOp and Scatter2ShardsOp with CPU implementations
- Introduce tensor creation utilities: fromVector and refVectorData
- Add kRadixAttn and kScatter2Shards to OpTypes enum and optype2Str
- Remove kPrefixCache from PagedAttnImplType
- Add necessary headers and mark the tensor ptr attribute `[[nodiscard]]`
- Register `RadixAttnOp` and `Scatter2ShardsOp` in the CPU backend
- Implement forward logic for `RadixAttnOp` with architecture-specific kernels
- Update `RadixAttnOpOptions` to include head count parameters
- Fix const-correctness and template usage in the radix attention kernel
- Remove unnecessary `.to(kFloat16)` calls in the Qwen3 attention module
- Adjust function signatures for better type safety in the radix attention kernel
- Refactor kernel calls to use namespaced static dispatch
The Q tensor layout was incorrectly specified as [B, H_Q, S_Q, D] when it should be [B, S_Q, H_Q, D]. This change updates the documentation and adjusts the indexing logic accordingly. A FIXME comment is added to flag potential performance improvements, and a TODO comment notes that the kernel's layout needs further review.

feat(cpu): add input layout support for RoPE operation

Introduces support for different input layouts (BHSD and BSHD) in the RoPE operation. The forward method now accepts an input_layout_type parameter, allowing the RoPE operation to correctly process tensors with varying memory layouts. The implementation covers both float32 and float16 data types across x86 and ARM architectures, and vectorized processing is preserved for both layout types.

Additionally, this commit:

- Adds enum class RoPEOpOptionsInputType to specify layout types
- Updates RoPEOpOptions to include input_type configuration
- Modifies Qwen3Attention to use the BSHD layout for RoPE operations
- Adjusts tensor scattering indices in Qwen3Attention from 2 to 1
- Implements the RadixAttn layer and registers it in Qwen3Attention
- Updates Qwen3Session::applyChatTemplate to use nlohmann::json

fix(include): update prefix cache include directive

Updates the include directive in PrefixCache.hpp to properly export the cache header, ensuring correct usage throughout the codebase.
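The layout-dependent part of a RoPE kernel is only the offset arithmetic. As a minimal sketch (the enum and function names here are hypothetical illustrations, not the actual mllm API), the same rotation loop can serve both BHSD and BSHD tensors by switching how the flat index is computed:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical layout enum, loosely mirroring RoPEOpOptionsInputType.
enum class RoPELayout { kBHSD, kBSHD };

// Apply rotary embedding in-place to a contiguous float tensor.
// Only the base-offset formula depends on the layout:
//   BHSD: index = ((b*H + h)*S + s)*D + d
//   BSHD: index = ((b*S + s)*H + h)*D + d
void applyRoPE(std::vector<float>& x, size_t B, size_t H, size_t S, size_t D,
               RoPELayout layout, float theta = 10000.f) {
  for (size_t b = 0; b < B; ++b)
    for (size_t h = 0; h < H; ++h)
      for (size_t s = 0; s < S; ++s) {
        const size_t base = (layout == RoPELayout::kBHSD)
                                ? ((b * H + h) * S + s) * D
                                : ((b * S + s) * H + h) * D;
        // Rotate the pair (d, d + D/2) by a position-dependent angle.
        for (size_t d = 0; d < D / 2; ++d) {
          const float freq = std::pow(theta, -2.f * float(d) / float(D));
          const float ang = float(s) * freq;
          const float c = std::cos(ang), sn = std::sin(ang);
          const float lo = x[base + d], hi = x[base + d + D / 2];
          x[base + d] = lo * c - hi * sn;
          x[base + d + D / 2] = lo * sn + hi * c;
        }
      }
}
```

With H = 1 the two index formulas coincide, which gives a cheap sanity check that both branches perform the same rotation.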
- Add FilledWithConst struct template in arch.hpp
- Implement FilledWithConst specializations for __AnyArchTag and __ArmArchTag
- Replace manual loop initialization with FilledWithConst in fwd_bshd.hpp
- Update Q tensor shape indexing in RadixAttnOp.cpp
- Improve VectorDotProduct, MulFromConst, and FMAConstArray ARM implementations
- Fix softmax computation logic in the radix attention forward pass
- Add proper mask handling in the RadixAttnKernel test
- Add Scatter2ShardsKernelTest for shard scattering validation
- Fix random state management in the Context class
- Add RandomStatesTest for verifying random seed behavior
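The FilledWithConst dispatch described above can be sketched as a tag-specialized struct template. The tag names follow the commit text, but the signature of `run` is an assumption for illustration; the real ARM specialization would use NEON intrinsics (e.g. vdupq_n_f32/vst1q_f32), shown here as the same scalar loop so the sketch stays portable:

```cpp
#include <cstddef>

// Architecture tags as named in the commit; definitions here are stand-ins.
struct __AnyArchTag {};
struct __ArmArchTag {};

// Generic fallback: a plain scalar fill loop.
template <typename ArchTag>
struct FilledWithConst {
  static void run(float* dst, size_t n, float value) {
    for (size_t i = 0; i < n; ++i) dst[i] = value;
  }
};

// ARM specialization. A real build would vectorize with NEON; the scalar
// body below keeps this sketch compilable everywhere.
template <>
struct FilledWithConst<__ArmArchTag> {
  static void run(float* dst, size_t n, float value) {
    for (size_t i = 0; i < n; ++i) dst[i] = value;
  }
};
```

Call sites then pick the implementation at compile time via the tag, e.g. `FilledWithConst<__ArmArchTag>::run(buf, n, 0.f)`, replacing hand-written initialization loops.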
- Add new `qwen3_service` example with CMake build support
- Introduce `startService`, `stopService`, and `insertSession` APIs for simplified service control
- Refactor `Session` class to remove dependency on `ARGeneration` and improve flexibility
- Enhance `Qwen3Session` with thinking token handling and improved cache management
- Improve error messages in model loading and extend rotary positional embedding support
- Update radix attention kernel comments and parallelization hints
- Remove obsolete PyPI README content

The changes enable better service deployment and model session handling, especially for Qwen3-based models with thinking token capabilities.
Ensure that `k_cache_addresses` and `v_cache_addresses` are properly resized to match the number of transformer blocks during radix tree search failure.

fix(qwen3): correct cache address gathering and physical address mapping

Replace `std::ranges::copy` with `insert` for efficient appending of cache addresses. Fix incorrect use of `k_cache_addr` where `v_cache_addr` should be used when mapping physical addresses for value cache.
- Replace hardcoded request with interactive user input loop
- Add support for multi-turn conversations with history tracking
- Integrate thinking state visualization using the fmt library
- Handle graceful exit with /exit or /quit commands
- Improve response formatting with proper JSON structure

fix(cpu): support BSHD layout in RoPE operation

- Extend RoPE operator to handle both BHSD and BSHD tensor layouts
- Add layout type checking and dimension mapping logic
- Fix assertion to allow multiple input layout types

refactor(service): improve request pool shutdown handling

- Change RequestPool::pop() to return std::optional<RequestItem>
- Update service worker loop to handle optional requests
- Ensure worker threads join properly during shutdown
- Add timestamp to response payloads
- Modify response format to follow chat completion chunk structure

perf(cache): optimize ZenFS blob size calculation

- Simplify bit calculation using literal 64 instead of sizeof(uint64_t)
- Correct blob size computation by using consistent dtype lanes

fix(qwen3): correct loop initialization and template processing

- Initialize loop counters properly in attention mechanisms
- Refine chat template processing for better tool and thinking support
- Adjust tensor shapes for sequence and position IDs
- Update default sampling parameters and EOS token ID
- Change cache data types from float16 to float32 for better precision

style(qwen3): reformat function signature for readability

- Break long function declaration into multiple lines
- Improve code formatting and parameter alignment
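The request-pool shutdown pattern above can be sketched as follows. This is a minimal illustration, not the actual mllm implementation: `RequestItem`'s fields and the class internals are assumptions. The key idea is that `pop()` returns `std::nullopt` once the pool is shut down and drained, so worker loops exit naturally and their threads can be joined:

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <utility>

// Hypothetical request item; the real payload type is richer.
struct RequestItem {
  std::string payload;
};

class RequestPool {
 public:
  void push(RequestItem item) {
    {
      std::lock_guard<std::mutex> lk(mu_);
      q_.push(std::move(item));
    }
    cv_.notify_one();
  }

  // Blocks until an item is available or shutdown() is called.
  // Returns std::nullopt only when the pool is stopped and drained.
  std::optional<RequestItem> pop() {
    std::unique_lock<std::mutex> lk(mu_);
    cv_.wait(lk, [this] { return stopped_ || !q_.empty(); });
    if (q_.empty()) return std::nullopt;
    RequestItem item = std::move(q_.front());
    q_.pop();
    return item;
  }

  void shutdown() {
    {
      std::lock_guard<std::mutex> lk(mu_);
      stopped_ = true;
    }
    cv_.notify_all();  // wake every blocked worker so it can exit
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<RequestItem> q_;
  bool stopped_ = false;
};
```

A worker loop then reduces to `while (auto req = pool.pop()) { handle(*req); }`: once `shutdown()` runs and the queue empties, `pop()` yields `std::nullopt`, the loop ends, and the service can `join()` its worker threads.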
Delete outdated GitHub Actions workflows for macOS and x86 builds that are no longer needed in the current CI/CD pipeline configuration.