feat(search): add CJK character bigram tokenization | 添加 CJK Bigram 分词 #224
mechanic-Q wants to merge 1 commit into
Extract overlapping 2-character bigrams from CJK Unified Ideograph sequences (U+4E00-U+9FFF, U+3400-U+4DBF, U+F900-U+FAFF) so that Chinese/Japanese/Korean text is searchable by character pairs. This fixes the issue where CJK content was effectively invisible to the BM25 index because the standard whitespace tokenizer treats entire CJK sentences as single tokens that never match query terms. 添加 CJK 统一表意文字二字组(Bigram)分词支持,使中/日/韩文本在 BM25 全文索引中可被正常检索。
🧹 Nitpick comments (1)
src/state/search-index.ts (1)
227-229: Low value. Remove the WHAT-style comments.
These comments restate the implementation and conflict with the repo guideline for `src/**/*.ts`: “Avoid code comments explaining WHAT — use clear naming instead.” The regex and loop naming already make the intent clear.
Proposed cleanup:
    - // CJK character bigrams: for languages without word boundaries
    - // (Chinese, Japanese, Korean), extract overlapping 2-character
    - // sequences so queries like "人工智能" match "人工" and "智能".
      const cjkRange = /[\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]+/g;
Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/state/search-index.ts` around lines 227-229, remove the WHAT-style explanatory comment block about "CJK character bigrams" that sits immediately above the regex and the loop which extract overlapping 2-character sequences. The regex and loop names already convey intent, so delete those three comment lines (the block explaining Chinese/Japanese/Korean bigrams) to comply with the `src/**/*.ts` guideline.
Reviewed — clean addition. The CJK Unicode ranges ( A few observations, all non-blocking:
Approving for merge; holding for @rohitg00 to land. Real value-add for any non-English-language agentmemory user.
Summary | 概述
Add CJK (Chinese / Japanese / Korean) character bigram tokenization to the BM25 search index. This enables CJK text to be searchable by character pairs, matching the same granularity that the whitespace-based tokenizer achieves for Latin-script languages.
为 BM25 搜索索引添加 CJK 字符 Bigram 分词支持,使中/日/韩文本可按字符对进行检索。
Motivation | 动机
In production use with Chinese-language content (news articles, technical documents, daily notes), we observed that `memory_search` and `memory_smart_search` consistently returned zero results for Chinese queries, even when the relevant memories existed. The root cause was the tokenizer: without whitespace between words, the entire Chinese sentence "人工智能技术发展迅速" became a single index token that never matched partial queries like "人工智能" or "技术".
This effectively made agentmemory's BM25 index useless for CJK content. Since agentmemory supports multiple languages via its embedding layer (configurable model), the text search layer should not be the bottleneck. This is a well-known problem in information retrieval: virtually all production search engines (Elasticsearch, Lucene, Meilisearch) include CJK bigram/ngram tokenizers for exactly this reason.
在实际使用中,中文记忆内容完全无法被 BM25 搜索命中。原因:分词器将整句中文视为单个 token,查询"人工智能"永远匹配不到含有"人工智能技术"的记忆。这是信息检索领域的经典问题,Elasticsearch/Lucene/Meilisearch 等生产搜索引擎均内置 CJK ngram 分词器。
Problem | 问题
The current tokenizer splits text on whitespace and filters tokens shorter than 2 characters. For CJK text (which typically has no whitespace between words), entire sentences are treated as single tokens that never match multi-character queries. A memory containing "人工智能发展迅速" would not match a query for "人工智能".
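The failure described above can be reproduced with a few lines. This is a minimal sketch, assuming a simplified stand-in for the pre-change tokenizer; `whitespaceTokenize` is a hypothetical name and not the actual code in `src/state/search-index.ts`.

```typescript
// Hypothetical stand-in for the pre-change tokenizer: lowercase, split on
// whitespace, drop tokens shorter than 2 characters. Not the real code.
function whitespaceTokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/\s+/)
    .filter((token) => token.length >= 2);
}

// The sentence contains no whitespace, so it stays one 8-character token.
const indexedTerms = whitespaceTokenize("人工智能发展迅速");
const queryTerms = whitespaceTokenize("人工智能");

// Term matching is exact: a query term must equal an indexed term.
const hasMatch = queryTerms.some((term) => indexedTerms.indexOf(term) !== -1);
// hasMatch is false: "人工智能" is not equal to "人工智能发展迅速".
```

Because the lookup compares whole terms, a query that is a substring of the single indexed token never scores.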
Solution | 方案
After the standard whitespace tokenization, detect CJK Unified Ideograph sequences (U+4E00-U+9FFF, U+3400-U+4DBF, U+F900-U+FAFF) and extract overlapping 2-character bigrams. These bigrams are added to the term list alongside English tokens and go through the same stemming pipeline.
在标准空白分词后,检测 CJK 统一表意文字序列并提取重叠的二字组(Bigram),与英文词条一同进入 BM25 索引。
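The approach described above could look roughly like the following. It is a sketch under stated assumptions, not the PR's implementation: `overlappingBigrams` and `tokenizeWithCjkBigrams` are hypothetical names, the whitespace pass is simplified, and the stemming step that the description mentions is left out. Only the regex ranges are taken from the PR.

```typescript
// Ranges quoted from the PR: CJK Unified Ideographs, Extension A,
// and CJK Compatibility Ideographs.
const CJK_RUN = /[\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]+/g;

// Overlapping 2-character windows over one run:
// "人工智能" yields 人工, 工智, 智能.
function overlappingBigrams(run: string): string[] {
  const bigrams: string[] = [];
  for (let i = 0; i + 1 < run.length; i++) {
    bigrams.push(run.slice(i, i + 2));
  }
  return bigrams;
}

// Hypothetical tokenizer: whitespace pass first, then bigrams appended.
// Stemming is intentionally omitted from this sketch.
function tokenizeWithCjkBigrams(text: string): string[] {
  const lowered = text.toLowerCase();
  const terms = lowered.split(/\s+/).filter((t) => t.length >= 2);
  const runs = lowered.match(CJK_RUN) || [];
  for (let r = 0; r < runs.length; r++) {
    const bigrams = overlappingBigrams(runs[r]);
    for (let b = 0; b < bigrams.length; b++) {
      terms.push(bigrams[b]);
    }
  }
  return terms;
}
```

In this sketch the original whitespace tokens are kept and the bigrams are only appended, which mirrors the claim that existing terms are not removed or modified. One side effect is that a two-character CJK word surrounded by spaces appears twice in the term list, once as a token and once as a bigram.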
Changes | 改动
- `src/state/search-index.ts`: Modified `tokenize()` to extract CJK bigrams
- Applies to both indexing (`extractTerms`) and query (`search`) since both call `tokenize()`
Example | 示例
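As an illustration of why applying the same tokenization on both sides matters, the sketch below bigrams a stored memory and a few queries and lists the terms they share. `cjkBigrams` and `sharedTerms` are hypothetical helpers written for this example; the real index scores matches with BM25 rather than returning a plain overlap list.

```typescript
// Same ranges as in the PR description.
const CJK_SEGMENT = /[\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]+/g;

// Hypothetical helper: all overlapping bigrams from every CJK run in the text.
function cjkBigrams(text: string): string[] {
  const out: string[] = [];
  const segments = text.match(CJK_SEGMENT) || [];
  for (let s = 0; s < segments.length; s++) {
    const seg = segments[s];
    for (let i = 0; i + 1 < seg.length; i++) {
      out.push(seg.slice(i, i + 2));
    }
  }
  return out;
}

// Hypothetical helper: query terms that also occur among the memory's terms.
function sharedTerms(memory: string, query: string): string[] {
  const memoryTerms = cjkBigrams(memory);
  return cjkBigrams(query).filter((t) => memoryTerms.indexOf(t) !== -1);
}

const memoryText = "人工智能技术发展迅速";
const sharedForAi = sharedTerms(memoryText, "人工智能");    // 人工, 工智, 智能
const sharedForTech = sharedTerms(memoryText, "技术");      // 技术
const sharedForOther = sharedTerms(memoryText, "天气预报"); // no shared terms
```

The 10-character memory produces 9 bigrams, and both partial queries from the Motivation section now share at least one term with it, while an unrelated query shares none.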
Performance | 性能
Overlapping bigrams on a CJK segment of length N produce N-1 additional terms. For typical observation narratives (a few hundred characters), this adds negligible overhead to the BM25 index.
Backwards Compatibility | 向后兼容
Existing English-only indexes are unaffected. The change adds terms but does not remove or modify existing ones. No configuration changes needed.
Scope | 范围
Covers the core CJK Unified Ideographs block plus Ext-A and Compatibility Ideographs. Does not cover Hangul (Korean alphabet), Kana (Japanese syllabaries), or Bopomofo. Hangul text is typically whitespace-delimited and handled by the standard path; Kana is usually written without spaces, so kana-only runs are not bigrammed by this change.
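To make the scope concrete, the sketch below shows which runs the PR's character class picks up in mixed-script text. `ideographRuns` is a hypothetical helper for illustration; only the ranges come from the PR.

```typescript
// Ranges from the PR: Unified Ideographs, Extension A, Compatibility Ideographs.
const IDEOGRAPH_RUN = /[\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]+/g;

// Hypothetical helper: the ideograph runs that would be eligible for bigramming.
function ideographRuns(text: string): string[] {
  const runs = text.match(IDEOGRAPH_RUN);
  return runs ? runs.slice() : [];
}

// Japanese: kanji runs are found, katakana and hiragana are skipped.
const japaneseRuns = ideographRuns("東京タワーに行く"); // ["東京", "行"]

// Korean Hangul lies outside all three ranges.
const koreanRuns = ideographRuns("인공지능 기술"); // []

// Chinese: the whole word is one run.
const chineseRuns = ideographRuns("人工智能"); // ["人工智能"]
```

A single-ideograph run such as "行" cannot form a bigram, so under these ranges it contributes no extra term.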