feat(service): implement interactive chat loop in qwen3 service #462

Merged
chenghuaWang merged 8 commits into UbiquitousLearning:v2 from chenghuaWang:v2
Oct 10, 2025

Conversation

@chenghuaWang (Collaborator)

  • Replace hardcoded request with interactive user input loop (see the sketch below)
  • Add support for multi-turn conversations with history tracking
  • Integrate thinking state visualization using fmt library
  • Handle graceful exit with /exit or /quit commands
  • Improve response formatting with proper JSON structure

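A minimal sketch of the loop's shape (the fmt-based thinking-state display and the JSON response assembly are omitted; `Message` and `generate` are illustrative stand-ins, not the service's actual types):

```cpp
#include <iostream>
#include <string>
#include <vector>

// Illustrative message record; the real service keeps richer per-turn state.
struct Message {
  std::string role;     // "user" or "assistant"
  std::string content;
};

int main() {
  std::vector<Message> history;  // multi-turn conversation history
  std::string line;
  while (true) {
    std::cout << ">>> " << std::flush;
    if (!std::getline(std::cin, line)) break;        // EOF exits too
    if (line == "/exit" || line == "/quit") break;   // graceful exit commands
    history.push_back({"user", line});
    // Stand-in for the qwen3 session call; it consumes the full history so
    // earlier turns condition the next response.
    std::string reply = /* generate(history) */ "...";
    history.push_back({"assistant", reply});
    std::cout << reply << "\n";
  }
}
```
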
fix(cpu): support BSHD layout in RoPE operation

  • Extend RoPE operator to handle both BHSD and BSHD tensor layouts
  • Add layout type checking and dimension mapping logic (see the sketch below)
  • Fix assertion to allow multiple input layout types

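The dimension mapping amounts to picking the right axes per layout. A sketch, with the enum and helper invented for illustration (the operator's real option names appear in a later commit):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative layout tag; the operator defines its own options enum.
enum class InputLayout { kBHSD, kBSHD };

// Map a layout to the (sequence, head) axis indices of a 4-D tensor so the
// RoPE kernel can walk positions identically for either layout.
struct DimMap { int32_t seq_dim; int32_t head_dim; };

inline DimMap ropeDims(InputLayout layout) {
  switch (layout) {
    case InputLayout::kBHSD: return {/*seq_dim=*/2, /*head_dim=*/1};
    case InputLayout::kBSHD: return {/*seq_dim=*/1, /*head_dim=*/2};
  }
  assert(false && "unsupported input layout");  // mirrors the relaxed assert
  return {};
}
```
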
refactor(service): improve request pool shutdown handling

  • Change RequestPool::pop() to return std::optional<RequestItem> (see the sketch below)
  • Update service worker loop to handle optional requests
  • Ensure worker threads join properly during shutdown
  • Add timestamp to response payloads
  • Modify response format to follow chat completion chunk structure

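The shutdown handshake follows the usual optional-returning queue pattern. A self-contained sketch (`RequestItem` and the member names are placeholders):

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <utility>

struct RequestItem { /* prompt, session id, ... */ };

class RequestPool {
 public:
  void push(RequestItem item) {
    { std::lock_guard lock(mu_); queue_.push(std::move(item)); }
    cv_.notify_one();
  }

  // Blocks until a request arrives or the pool is stopped; std::nullopt
  // tells a worker to leave its loop, so threads can be joined cleanly.
  std::optional<RequestItem> pop() {
    std::unique_lock lock(mu_);
    cv_.wait(lock, [this] { return stopped_ || !queue_.empty(); });
    if (queue_.empty()) return std::nullopt;  // stopped and drained
    RequestItem item = std::move(queue_.front());
    queue_.pop();
    return item;
  }

  void stop() {
    { std::lock_guard lock(mu_); stopped_ = true; }
    cv_.notify_all();  // wake every worker blocked in pop()
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<RequestItem> queue_;
  bool stopped_ = false;
};

// Worker loop: while (auto req = pool.pop()) { handle(*req); }
```
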
perf(cache): optimize ZenFS blob size calculation

  • Simplify bit calculation by using the literal 64 instead of deriving it from sizeof(uint64_t) (see the sketch below)
  • Correct blob size computation by using consistent dtype lanes

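The commit message is terse, so the following only illustrates the kind of arithmetic involved, assuming the blob tracks occupancy in 64-bit words; the actual ZenFS code may differ:

```cpp
#include <cstddef>

// Round a bit count up to whole 64-bit words. Writing the word width as the
// literal 64 avoids confusing it with sizeof(uint64_t), which is 8 (bytes).
inline size_t wordsForBits(size_t nbits) {
  return (nbits + 64 - 1) / 64;
}

// Blob bytes for `elems` elements whose dtype packs `lanes` lanes of
// `bytes_per_lane` each; using one lane count throughout keeps the size
// consistent with how the data is later read back.
inline size_t blobBytes(size_t elems, size_t lanes, size_t bytes_per_lane) {
  return elems * lanes * bytes_per_lane;
}
```
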
fix(qwen3): correct loop initialization and template processing

  • Initialize loop counters properly in attention mechanisms
  • Refine chat template processing for better tool and thinking support
  • Adjust tensor shapes for sequence and position IDs
  • Update default sampling parameters and EOS token ID
  • Change cache data types from float16 to float32 for better precision

style(qwen3): reformat function signature for readability

  • Break long function declaration into multiple lines
  • Improve code formatting and parameter alignment

chenghuaWang and others added 8 commits on October 7, 2025 07:06

… references

- Rename directory and files from `paged_attn_x` to `radix_attn`
- Update namespace from `mllm::cpu::paged_attn_x` to `mllm::cpu::radix_attn`
- Remove unused includes and context-related code in fwd_bshd.hpp
- Add new ops: RadixAttnOp and Scatter2ShardsOp with CPU implementations
- Introduce tensor creation utilities: `fromVector` and `refVectorData` (sketched below)
- Add kRadixAttn and kScatter2Shards to OpTypes enum and optype2Str
- Remove kPrefixCache from PagedAttnImplType
- Add necessary headers and fix tensor ptr attribute to [[nodiscard]]
- Register `RadixAttnOp` and `Scatter2ShardsOp` in CPU backend
- Implement forward logic for `RadixAttnOp` with architecture-specific kernels
- Update `RadixAttnOpOptions` to include head count parameters
- Fix const-correctness and template usage in radix attention kernel
- Remove unnecessary `.to(kFloat16)` calls in Qwen3 attention module
- Adjust function signatures for better type safety in radix attention kernel
- Refactor kernel calls to use namespaced static dispatch
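
A sketch of the likely contract of the two utilities named above, assuming an owning copy versus a zero-copy view (the `Tensor` type and signatures here are illustrative, not mllm's):

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Illustrative stand-in for the framework's tensor type.
struct Tensor {
  void* data = nullptr;
  size_t numel = 0;
  bool owns = false;  // deallocation bookkeeping omitted for brevity
};

// fromVector: allocate and copy, so the tensor outlives the vector.
inline Tensor fromVector(const std::vector<float>& v) {
  float* buf = new float[v.size()];
  std::memcpy(buf, v.data(), v.size() * sizeof(float));
  return Tensor{buf, v.size(), /*owns=*/true};
}

// refVectorData: wrap the vector's buffer without copying; the caller must
// keep the vector alive (and un-reallocated) while the tensor is in use.
inline Tensor refVectorData(std::vector<float>& v) {
  return Tensor{v.data(), v.size(), /*owns=*/false};
}
```
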
The Q tensor layout was incorrectly specified as [B, H_Q, S_Q, D] when it should be
[B, S_Q, H_Q, D]. This change updates the documentation and adjusts the indexing logic
accordingly. A FIXME comment is added to indicate potential performance improvements.

Also adds a TODO comment indicating that the kernel's layout needs further review.

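For reference, the corrected [B, S_Q, H_Q, D] ordering indexes a contiguous buffer like this (hypothetical helper; the kernel computes offsets inline):

```cpp
#include <cstddef>

// Flat offset into a contiguous [B, S, H, D] tensor: the rightmost
// dimension varies fastest, so heads are interleaved within a position.
inline size_t qOffsetBSHD(size_t b, size_t s, size_t h, size_t d,
                          size_t S, size_t H, size_t D) {
  return ((b * S + s) * H + h) * D + d;
}
```
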
feat(cpu): add input layout support for RoPE operation

Introduces support for different input layouts (BHSD and BSHD) in the RoPE operation.
This change modifies the forward method to accept an input_layout_type parameter,
allowing the RoPE operation to correctly process tensors with varying memory layouts.

The implementation includes updates to both float32 and float16 data types, ensuring
compatibility across different architectures (x86 and ARM). Vectorized processing
is preserved for both layout types.

Additionally, this commit:
- Adds enum class RoPEOpOptionsInputType to specify layout types (sketched below)
- Updates RoPEOpOptions to include input_type configuration
- Modifies Qwen3Attention to use BSHD layout for RoPE operations
- Adjusts tensor scattering indices in Qwen3Attention from 2 to 1
- Implements RadixAttn layer and registers it in Qwen3Attention
- Updates Qwen3Session::applyChatTemplate to use nlohmann::json

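The enum and struct names below come from this commit; the members shown are assumptions about how the plumbing fits together:

```cpp
// Layout choices the RoPE forward method can now accept.
enum class RoPEOpOptionsInputType {
  kBHSD,  // [batch, heads, sequence, dim]
  kBSHD,  // [batch, sequence, heads, dim]
};

struct RoPEOpOptions {
  float theta = 10000.f;  // illustrative; the real options carry more fields
  RoPEOpOptionsInputType input_type = RoPEOpOptionsInputType::kBHSD;
};
```
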
fix(include): update prefix cache include directive

Updates the include directive in PrefixCache.hpp to properly export the cache header,
ensuring correct usage throughout the codebase.

- Add `FilledWithConst` struct template in arch.hpp (sketched below)
- Implement FilledWithConst specializations for __AnyArchTag and __ArmArchTag
- Replace manual loop initialization with FilledWithConst in fwd_bshd.hpp
- Update Q tensor shape indexing in RadixAttnOp.cpp
- Improve VectorDotProduct, MulFromConst and FMAConstArray ARM implementations
- Fix softmax computation logic in radix attention forward pass
- Add proper mask handling in RadixAttnKernel test
- Add Scatter2ShardsKernelTest for shard scattering validation
- Fix random state management in Context class
- Add RandomStatesTest for verifying random seed behavior
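
A sketch of the tag-dispatch pattern behind `FilledWithConst` (tag names from the bullets; a real ARM specialization would fill via NEON intrinsics such as vdupq_n_f32/vst1q_f32 rather than the portable loop shown):

```cpp
#include <cstddef>

// Architecture tags, as named in the commit.
struct __AnyArchTag {};
struct __ArmArchTag {};

// Primary template: portable scalar fill.
template <typename ArchTag>
struct FilledWithConst {
  static void run(float* dst, size_t n, float value) {
    for (size_t i = 0; i < n; ++i) dst[i] = value;
  }
};

// ARM specialization: same contract; a real kernel would vectorize the body.
template <>
struct FilledWithConst<__ArmArchTag> {
  static void run(float* dst, size_t n, float value) {
    for (size_t i = 0; i < n; ++i) dst[i] = value;
  }
};

// Call sites replace manual init loops with namespaced static dispatch:
//   FilledWithConst<ArchTag>::run(acc, head_dim, 0.0f);
```
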
- Add new `qwen3_service` example with CMake build support
- Introduce `startService`, `stopService`, and `insertSession` APIs for simplified service control (usage sketched below)
- Refactor `Session` class to remove dependency on `ARGeneration` and improve flexibility
- Enhance `Qwen3Session` with thinking token handling and improved cache management
- Improve error messages in model loading and extend rotary positional embedding support
- Update radix attention kernel comments and parallelization hints
- Remove obsolete PyPI README content

The changes enable better service deployment and model session handling, especially for Qwen3-based models with thinking token capabilities.
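
A hypothetical usage sketch of the simplified control flow; the three function names come from the bullets, while the signatures and stub bodies are guesses:

```cpp
#include <iostream>
#include <map>
#include <memory>
#include <string>

struct Qwen3Session { /* model state, cache, thinking-token handling */ };

namespace {
std::map<std::string, std::shared_ptr<Qwen3Session>> g_sessions;
}

void startService() { std::cout << "service up\n"; /* spawn workers */ }

void insertSession(const std::string& id, std::shared_ptr<Qwen3Session> s) {
  g_sessions[id] = std::move(s);  // register a session for request routing
}

void stopService() { std::cout << "service down\n"; /* join workers */ }

int main() {
  startService();
  insertSession("default", std::make_shared<Qwen3Session>());
  // ... serve requests until shutdown is requested ...
  stopService();
}
```
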
Ensure that `k_cache_addresses` and `v_cache_addresses` are properly resized
to match the number of transformer blocks when the radix tree search fails.

fix(qwen3): correct cache address gathering and physical address mapping

Replace `std::ranges::copy` with `insert` for efficient appending of cache
addresses (see the sketch below). Fix the incorrect use of `k_cache_addr`
where `v_cache_addr` should be used when mapping physical addresses for the
value cache.
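
The append fix in one picture: `std::ranges::copy` writes through an output iterator and never grows the destination, so appending needs `insert` (or a `back_inserter`):

```cpp
#include <cstdint>
#include <vector>

int main() {
  std::vector<uint64_t> dst = {1, 2};
  std::vector<uint64_t> src = {3, 4};

  // Buggy pattern: std::ranges::copy(src, dst.end()) assumes room already
  // exists at the write position; writing past end() is undefined behavior.

  // Correct, efficient append: one call, and the vector grows as needed.
  dst.insert(dst.end(), src.begin(), src.end());

  // Equivalent alternative:
  //   std::ranges::copy(src, std::back_inserter(dst));
}
```
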
- Replace hardcoded request with interactive user input loop
- Add support for multi-turn conversations with history tracking
- Integrate thinking state visualization using fmt library
- Handle graceful exit with /exit or /quit commands
- Improve response formatting with proper JSON structure

fix(cpu): support BSHD layout in RoPE operation

- Extend RoPE operator to handle both BHSD and BSHD tensor layouts
- Add layout type checking and dimension mapping logic
- Fix assertion to allow multiple input layout types

refactor(service): improve request pool shutdown handling

- Change RequestPool::pop() to return std::optional<RequestItem>
- Update service worker loop to handle optional requests
- Ensure worker threads join properly during shutdown
- Add timestamp to response payloads
- Modify response format to follow chat completion chunk structure

perf(cache): optimize ZenFS blob size calculation

- Simplify bit calculation using literal 64 instead of sizeof(uint64_t)
- Correct blob size computation by using consistent dtype lanes

fix(qwen3): correct loop initialization and template processing

- Initialize loop counters properly in attention mechanisms
- Refine chat template processing for better tool and thinking support
- Adjust tensor shapes for sequence and position IDs
- Update default sampling parameters and EOS token ID
- Change cache data types from float16 to float32 for better precision

style(qwen3): reformat function signature for readability

- Break long function declaration into multiple lines
- Improve code formatting and parameter alignment

Delete outdated GitHub Actions workflows for macOS and x86 builds that are no longer
needed in the current CI/CD pipeline configuration.
@chenghuaWang merged commit 6694af8 into UbiquitousLearning:v2 on Oct 10, 2025
2 checks passed