@di72nn di72nn commented Oct 25, 2020

Probably fixes #1090.

The FTS tokenizer currently in use (unicode61) knows nothing about CJK, so it doesn't split text in these languages into words.
I can't vouch for the quality, but the icu tokenizer seems to do a better job here. As I understand it, unicode61 is still the better choice for Latin-based languages, which is why it is the default.

Here are some tests I ran on an emulator (Android 8.1):

adb shell
sqlite3

CREATE VIRTUAL TABLE ft3_tokenize_test_unicode USING fts3tokenize(unicode61);
CREATE VIRTUAL TABLE ft3_tokenize_test_icu USING fts3tokenize(icu);
CREATE VIRTUAL TABLE ft3_tokenize_test_icu_cn_simplified USING fts3tokenize(icu, zh_CN);
CREATE VIRTUAL TABLE ft3_tokenize_test_icu_cn_traditional USING fts3tokenize(icu, zh_TW);

SELECT token, start, end, position FROM ft3_tokenize_test_unicode WHERE input='为什么不支持中文 fts test';
SELECT token, start, end, position FROM ft3_tokenize_test_icu WHERE input='为什么不支持中文 fts test';
SELECT token, start, end, position FROM ft3_tokenize_test_icu_cn_simplified WHERE input='为什么不支持中文 fts test';
SELECT token, start, end, position FROM ft3_tokenize_test_icu_cn_traditional WHERE input='为什么不支持中文 fts test';

SELECT token, start, end, position FROM ft3_tokenize_test_unicode WHERE input='据台湾中时新闻网报道,一份最新民调今天(24日)出炉 fts test 2';
SELECT token, start, end, position FROM ft3_tokenize_test_icu WHERE input='据台湾中时新闻网报道,一份最新民调今天(24日)出炉 fts test 2';
SELECT token, start, end, position FROM ft3_tokenize_test_icu_cn_simplified WHERE input='据台湾中时新闻网报道,一份最新民调今天(24日)出炉 fts test 2';
SELECT token, start, end, position FROM ft3_tokenize_test_icu_cn_traditional WHERE input='据台湾中时新闻网报道,一份最新民调今天(24日)出炉 fts test 2';

icu, icu zh_CN and icu zh_TW produced the same result in this case.

I also tried to find this article using this query: 据台湾.
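
The unicode61 half of these tests can probably be reproduced without an emulator. Below is a minimal sketch using Python's bundled `sqlite3` module. It assumes the local SQLite build has FTS3/FTS4 with unicode61 compiled in; the icu tokenizer is usually not available in such builds, so it is left out. The function name `unicode61_demo` and the tables `tok` and `docs` are made up for illustration and are not the app's real schema.

```python
import sqlite3


def unicode61_demo(text, queries):
    """Return (tokens, hits) for `text` under the unicode61 tokenizer.

    tokens: what unicode61 splits `text` into, in order.
    hits:   for each query string, how many rows an FTS4 MATCH finds.
    """
    db = sqlite3.connect(":memory:")

    # Same kind of tokenizer inspection table as in the tests above.
    db.execute("CREATE VIRTUAL TABLE tok USING fts3tokenize(unicode61)")
    tokens = [
        row[0]
        for row in db.execute(
            "SELECT token FROM tok WHERE input = ? ORDER BY position", (text,)
        )
    ]

    # Hypothetical one-column FTS4 table standing in for a real search index.
    db.execute("CREATE VIRTUAL TABLE docs USING fts4(body, tokenize=unicode61)")
    db.execute("INSERT INTO docs(body) VALUES (?)", (text,))
    hits = {
        q: db.execute(
            "SELECT count(*) FROM docs WHERE body MATCH ?", (q,)
        ).fetchone()[0]
        for q in queries
    }

    db.close()
    return tokens, hits


if __name__ == "__main__":
    text = "据台湾中时新闻网报道,一份最新民调今天(24日)出炉 fts test 2"
    tokens, hits = unicode61_demo(text, ["据台湾", "据台湾*", "fts"])
    print(tokens)
    print(hits)
```

If the assumption about the build holds, each unbroken run of CJK characters should come back as a single token, so a plain query for 据台湾 should find nothing while the prefix query 据台湾* still matches.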

@di72nn di72nn added this to the 2.4.2 milestone Nov 25, 2020
@tcitworld tcitworld merged commit 3664c2c into master Dec 1, 2020
@tcitworld tcitworld deleted the fts_icu branch December 1, 2020 11:38

Successfully merging this pull request may close these issues.

"The search cannot be completed when inputting Chinese"