feat(search): add CJK character bigram tokenization #224

Open
mechanic-Q wants to merge 1 commit into rohitg00:main from mechanic-Q:feature/cjk-tokenizer

Conversation

@mechanic-Q mechanic-Q commented May 2, 2026

Summary

Add CJK (Chinese / Japanese / Korean) character bigram tokenization to the BM25 search index. This makes CJK text searchable by overlapping character pairs, giving it sub-sentence matching granularity comparable to what whitespace tokenization provides for Latin-script languages.

Motivation

In production use with Chinese-language content (news articles, technical documents, daily notes), we observed that memory_search and memory_smart_search consistently returned zero results for Chinese queries, even when the relevant memories existed. The root cause was the tokenizer: without whitespace between words, the entire Chinese sentence "人工智能技术发展迅速" became a single index token that never matched partial queries like "人工智能" or "技术".

This effectively made agentmemory's BM25 index useless for CJK content. Since agentmemory supports multiple languages via its embedding layer (configurable model), the text search layer should not be the bottleneck. This is a well-known problem in information retrieval: production search engines such as Elasticsearch, Lucene, and Meilisearch ship dedicated CJK tokenization (bigram/n-gram filters or dictionary-based segmentation) for exactly this reason.

Problem

The current tokenizer splits text on whitespace and filters tokens shorter than 2 characters. For CJK text (which typically has no whitespace between words), entire sentences are treated as single tokens that never match multi-character queries. A memory containing "人工智能发展迅速" would not match a query for "人工智能".
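
The failure mode can be reproduced with a minimal sketch. `whitespaceTokenize` below is a hypothetical stand-in for the pre-change tokenizer (split on whitespace, drop tokens shorter than 2 characters); the real `tokenize()` in src/state/search-index.ts also stems tokens and may differ in other details.

```typescript
// Hypothetical stand-in for the pre-change tokenizer: lowercase, split on
// whitespace, drop tokens shorter than 2 characters. Stemming is omitted.
function whitespaceTokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/\s+/)
    .filter((token) => token.length >= 2);
}

const memoryTerms = whitespaceTokenize("人工智能发展迅速");
const queryTerms = whitespaceTokenize("人工智能");

// Term lookup is exact, so the sentence-sized token never equals the query token.
const sharedTerms = queryTerms.filter(
  (term) => memoryTerms.indexOf(term) !== -1
);

console.log(memoryTerms.length); // 1 (the whole sentence is one token)
console.log(sharedTerms.length); // 0 (no term in common, so no hit)
```

Because the memory and the query share no term, a term-based index has nothing to score, regardless of how the ranking function is tuned.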

Solution

After the standard whitespace tokenization, detect runs of CJK ideographs (U+4E00-U+9FFF, U+3400-U+4DBF, U+F900-U+FAFF) and extract overlapping 2-character bigrams. These bigrams are added to the term list alongside the whitespace-delimited tokens and go through the same stemming pipeline.
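
A minimal sketch of this approach, assuming a simplified tokenizer. The names `extractCjkBigrams` and `CJK_IDEOGRAPH_RUN` and the identity `stem` placeholder are illustrative assumptions, not the code in the PR; only the Unicode ranges are taken from the description above.

```typescript
// Same ranges the PR lists: CJK Unified Ideographs, Extension A,
// and CJK Compatibility Ideographs. All are in the BMP, so one
// ideograph is one UTF-16 code unit and slice() is safe here.
const CJK_IDEOGRAPH_RUN = /[\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]+/g;

// Placeholder for the project's stemmer (an assumption): identity here.
function stem(token: string): string {
  return token;
}

function extractCjkBigrams(text: string): string[] {
  const bigrams: string[] = [];
  const runs = text.match(CJK_IDEOGRAPH_RUN) || [];
  for (let r = 0; r < runs.length; r++) {
    const run = runs[r];
    // A run of length N yields N-1 overlapping pairs; a lone ideograph yields none.
    for (let i = 0; i + 1 < run.length; i++) {
      bigrams.push(run.slice(i, i + 2));
    }
  }
  return bigrams;
}

function tokenize(text: string): string[] {
  const words = text
    .toLowerCase()
    .split(/\s+/)
    .filter((token) => token.length >= 2);
  // Bigrams are appended after the whitespace tokens; both pass through stem().
  return words.concat(extractCjkBigrams(text)).map(stem);
}

console.log(tokenize("人工智能发展迅速"));
// ["人工智能发展迅速", "人工", "工智", "智能", "能发", "发展", "展迅", "迅速"]
```

The original whole-sentence token is kept, so any exact match that worked before should still work; the bigrams only add ways to match.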

Changes

  • src/state/search-index.ts: Modified tokenize() to extract CJK bigrams
  • Additive change — existing English tokenization path is unchanged
  • Affects both indexing (extractTerms) and query (search) since both call tokenize()

Example

Input: "人工智能发展迅速"
Before: ["人工智能发展迅速"]  (single token, never matches queries)
After:  ["人工智能发展迅速", "人工", "工智", "智能", "能发", "发展", "展迅", "迅速"]
       + standard English tokens

Query "人工智能" → matches via bigrams "人工", "工智", and "智能"
Query "发展"     → matches via bigram "发展"
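
The two queries above can be checked with a small sketch that applies the same term extraction to both the memory and the query and lists the terms they share. `terms` and `sharedTerms` are hypothetical helpers for illustration; stemming and BM25 scoring are left out.

```typescript
const CJK_RUN = /[\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]+/g;

// Whitespace tokens plus overlapping CJK bigrams (stemming omitted).
function terms(text: string): string[] {
  const words = text.split(/\s+/).filter((t) => t.length >= 2);
  const runs = text.match(CJK_RUN) || [];
  const bigrams: string[] = [];
  for (let r = 0; r < runs.length; r++) {
    for (let i = 0; i + 1 < runs[r].length; i++) {
      bigrams.push(runs[r].slice(i, i + 2));
    }
  }
  return words.concat(bigrams);
}

// Distinct query terms that also occur among the memory's terms.
function sharedTerms(query: string, memory: string): string[] {
  const memoryTerms = terms(memory);
  return terms(query).filter(
    // A two-character query yields the same term twice (word and bigram),
    // so keep only the first occurrence.
    (t, i, all) => all.indexOf(t) === i && memoryTerms.indexOf(t) !== -1
  );
}

console.log(sharedTerms("人工智能", "人工智能发展迅速")); // ["人工", "工智", "智能"]
console.log(sharedTerms("发展", "人工智能发展迅速")); // ["发展"]
```

Bigrams that straddle a word boundary, such as "能发", also match. That is the usual trade-off of bigram indexing: recall improves at some cost in precision, and the ranking function is left to sort out the difference.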

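Performance

Overlapping bigrams on a CJK segment of length N produce N-1 additional terms, so index growth is linear in the amount of CJK text. For typical observation narratives (a few hundred characters), that is at most a few hundred extra terms per document, which should add little overhead to the BM25 index.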
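
To make the N-1 figure concrete, the hypothetical helper below counts how many bigram terms a piece of text would add. It assumes each contiguous run of ideographs is handled separately, so whitespace, punctuation, and Latin text all break runs.

```typescript
const IDEOGRAPH_RUN = /[\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]+/g;

// Number of extra terms: (run length - 1) summed over every ideograph run.
function countAddedBigrams(text: string): number {
  const runs = text.match(IDEOGRAPH_RUN) || [];
  let added = 0;
  for (let r = 0; r < runs.length; r++) {
    added += runs[r].length - 1;
  }
  return added;
}

console.log(countAddedBigrams("人工智能发展迅速")); // 7 (one run of 8)
console.log(countAddedBigrams("搜索 index 分词器")); // 3 (runs of 2 and 3)
console.log(countAddedBigrams("hello world")); // 0 (no ideographs)
```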

Backwards Compatibility

Existing English-only indexes are unaffected. The change adds terms but does not remove or modify existing ones. No configuration changes needed.

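Scope

Covers the core CJK Unified Ideographs block plus Extension A and CJK Compatibility Ideographs. Does not cover Hangul (Korean alphabet), Kana (Japanese syllabaries), or Bopomofo. Hangul text is typically whitespace-delimited and is handled by the standard path; kana is not whitespace-delimited in running Japanese text, so kana runs still go through the standard path as whole tokens and get no bigrams.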
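
The scope boundary can be illustrated by running the same bigram extraction with two patterns. `COVERED_RUN` uses the ranges listed in this PR; `WITH_KANA_RUN` is a hypothetical extension that is not part of this change and simply adds the Hiragana (U+3040-U+309F) and Katakana (U+30A0-U+30FF) blocks.

```typescript
// Ranges from this PR: Unified Ideographs, Extension A, Compatibility Ideographs.
const COVERED_RUN = /[\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]+/g;

// Hypothetical extension (NOT in this PR): also include Hiragana and Katakana.
const WITH_KANA_RUN = /[\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff\u3040-\u30ff]+/g;

function bigramsFor(text: string, runPattern: RegExp): string[] {
  const runs = text.match(runPattern) || [];
  const bigrams: string[] = [];
  for (let r = 0; r < runs.length; r++) {
    for (let i = 0; i + 1 < runs[r].length; i++) {
      bigrams.push(runs[r].slice(i, i + 2));
    }
  }
  return bigrams;
}

// Kanji followed by katakana: only the kanji run is bigrammed today.
console.log(bigramsFor("東京タワー", COVERED_RUN)); // ["東京"]
console.log(bigramsFor("東京タワー", WITH_KANA_RUN)); // ["東京", "京タ", "タワ", "ワー"]

// Hangul falls outside the covered ranges, so no bigrams are produced.
console.log(bigramsFor("한국어 검색", COVERED_RUN)); // []
```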

Extract overlapping 2-character bigrams from CJK Unified Ideograph
sequences (U+4E00-U+9FFF, U+3400-U+4DBF, U+F900-U+FAFF) so that
Chinese/Japanese/Korean text is searchable by character pairs.

This fixes the issue where CJK content was effectively invisible to
the BM25 index because the standard whitespace tokenizer treats entire
CJK sentences as single tokens that never match query terms.

vercel Bot commented May 2, 2026

@mechanic-Q is attempting to deploy a commit to the rohitg00's projects Team on Vercel.

A member of the Team first needs to authorize it.

coderabbitai Bot commented May 2, 2026

Walkthrough

The tokenize() method in SearchIndex now extracts overlapping 2-character bigrams from CJK text segments and appends their stemmed forms to the token list, enhancing search indexing for character-based languages alongside the existing whitespace-delimited word tokenization.

Changes

Search Index CJK Enhancement

| Layer / File(s) | Summary |
| --- | --- |
| Core Tokenization (src/state/search-index.ts) | tokenize() adds CJK bigram extraction: detects CJK character ranges via Unicode regex, generates overlapping 2-character bigrams for each segment, stems each bigram, and appends to the token results. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A search that once stumbled on characters so tight,
Now glides through CJK bigrams with linguistic delight,
Two-character whispers, each stemmed with care,
In indices that dance through Asian text layers rare! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding CJK character bigram tokenization to the search indexing functionality, which is the core modification to src/state/search-index.ts. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |


@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
src/state/search-index.ts (1)

227-229: 💤 Low value

Remove the WHAT-style comments.

These comments restate the implementation and conflict with the repo guideline for src/**/*.ts: “Avoid code comments explaining WHAT — use clear naming instead.” The regex and loop naming already make the intent clear.

Proposed cleanup
-    // CJK character bigrams: for languages without word boundaries
-    // (Chinese, Japanese, Korean), extract overlapping 2-character
-    // sequences so queries like "人工智能" match "人工" and "智能".
     const cjkRange = /[\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]+/g;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/state/search-index.ts` around lines 227 - 229, Remove the WHAT-style
explanatory comment block about "CJK character bigrams" that sits immediately
above the regex and the loop which extract overlapping 2-character sequences;
the regex and loop names already convey intent, so delete those three comment
lines (the block explaining Chinese/Japanese/Korean bigrams) to comply with the
src/**/*.ts guideline.

📥 Commits

Reviewing files that changed from the base of the PR and between 94fc119 and f7ae268.

📒 Files selected for processing (1)
  • src/state/search-index.ts

rohitg00 commented May 8, 2026

Reviewed: clean addition. The CJK Unicode ranges (U+4E00-U+9FFF Han, U+3400-U+4DBF Extension A, U+F900-U+FAFF Compatibility Ideographs) cover the bulk of Chinese / Japanese kanji / Korean hanja content. Bigram fallback is the standard approach for languages without word boundaries, the same trick Lucene's CJKAnalyzer uses.

A few observations, all non-blocking:

  • The CJK bigrams pass through stem() along with the Latin tokens. Porter stemming on non-ASCII is effectively a no-op (the algorithm only strips Latin suffixes), so this is safe but slightly wasteful — feel free to skip stemming for bigrams if you want a small perf win.
  • Hiragana / katakana ranges (U+3040-U+309F, U+30A0-U+30FF) aren't included. Japanese text mixes kana with kanji freely; queries like "東京タワー" would only bigram the kanji portion. Worth a follow-up if you have Japanese users, but not a regression: current behavior is to whitespace-tokenize, and that's what kana would still get.
  • A unit test in test/state/search-index.test.ts (or wherever the existing tokenizer tests live) confirming "人工智能" tokenizes to bigrams 人工, 工智, 智能 would lock the behavior in. Not blocking but appreciated.
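
The first and third observations could be sketched together as follows. Everything here is an assumption for illustration: the toy `stem` is a crude suffix stripper standing in for the project's Porter stemmer, `tokenize` is a simplified copy rather than the method in src/state/search-index.ts, and the checks throw plain errors because the repository's test runner is not shown in this thread.

```typescript
const CJK_RUN = /[\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]+/g;

// Toy stemmer (assumption): strips a few Latin suffixes only.
function stem(token: string): string {
  return token.replace(/(ing|ed|s)$/, "");
}

// Simplified tokenizer that stems words but lets bigrams bypass stem().
function tokenize(text: string): string[] {
  const words = text
    .toLowerCase()
    .split(/\s+/)
    .filter((t) => t.length >= 2)
    .map(stem);
  const runs = text.match(CJK_RUN) || [];
  const bigrams: string[] = [];
  for (let r = 0; r < runs.length; r++) {
    for (let i = 0; i + 1 < runs[r].length; i++) {
      bigrams.push(runs[r].slice(i, i + 2));
    }
  }
  return words.concat(bigrams);
}

function expectEqual(actual: string[], expected: string[]): void {
  if (JSON.stringify(actual) !== JSON.stringify(expected)) {
    throw new Error("expected " + JSON.stringify(expected) +
      " but got " + JSON.stringify(actual));
  }
}

// Locks in the bigram behavior the review asks for.
const bigramsOnly = tokenize("人工智能").filter((t) => t.length === 2);
expectEqual(bigramsOnly, ["人工", "工智", "智能"]);

// Mixed input: the Latin word is stemmed, the bigrams are left untouched.
expectEqual(tokenize("searching 人工智能"),
  ["search", "人工智能", "人工", "工智", "智能"]);

// Stemming a bigram changes nothing, which is why skipping it is safe.
expectEqual([stem("人工")], ["人工"]);
```

In the real suite these checks would presumably use whatever assertion style the existing tokenizer tests use.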

Approving for merge — holding for @rohitg00 to land. Real value-add for any non-English-language agentmemory user.
