diff --git a/docs/src/format/table/index/scalar/fts.md b/docs/src/format/table/index/scalar/fts.md
index 15f3133210e..5af36d294b8 100644
--- a/docs/src/format/table/index/scalar/fts.md
+++ b/docs/src/format/table/index/scalar/fts.md
@@ -129,6 +129,66 @@ Token filters are applied in sequence after the base tokenizer:
 For stemming and stop word removal, the following languages are supported:
 Arabic, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, Turkish
 
+## Document Type
+Lance supports two document types: text and JSON. Each type has its own tokenization rules and
+produces tokens in a different format.
+
+### Text Type
+The text type covers text and lists of text. Tokens are generated by the base_tokenizer.
+
+The example below shows how a text document is parsed into tokens.
+```text
+Tom lives in San Francisco.
+```
+
+The resulting tokens are:
+```text
+Tom
+lives
+in
+San
+Francisco
+```
+
+### Json Type
+JSON is a nested structure, so Lance breaks a JSON document down into tokens in the triplet format `path,type,value`. The valid
+types are: str, number, bool, null.
+
+When the triplet value is a str, the text value is further tokenized using the base_tokenizer,
+resulting in multiple triplet tokens.
+
+During querying, the Json Tokenizer uses the triplet format instead of the JSON format, which simplifies the query
+syntax.
+
+The example below shows how a JSON document is tokenized. Assume we have the following JSON document:
+```json
+{
+  "name": "Lance",
+  "legal.age": 30,
+  "address": {
+    "city": "San Francisco",
+    "zip:us": 94102
+  }
+}
+```
+
+After parsing, the document is tokenized into the following tokens:
+```
+name,str,Lance
+legal.age,number,30
+address.city,str,San
+address.city,str,Francisco
+address.zip:us,number,94102
+```
+
+Then we do full text search in triplet format. To search for "San Francisco", we can use any of the queries
+below:
+```
+address.city:San Francisco
+address.city:San
+address.city:Francisco
+```
+
 ## Accelerated Queries
 
 Lance SDKs provide dedicated full text search APIs to leverage the FTS index capabilities.
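
The JSON tokenization described in this patch can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not Lance's implementation: `simple_tokenize` is a hypothetical stand-in for the base_tokenizer, `json_to_triplets` is a hypothetical helper, and the rendering of bool and null values is assumed, since the documentation only shows str and number examples. Arrays are not covered by the description, so the sketch rejects them.

```python
import json
import re


def simple_tokenize(text):
    # Hypothetical stand-in for the base_tokenizer:
    # split on runs of non-word characters and drop empty pieces.
    return [tok for tok in re.split(r"\W+", text) if tok]


def json_to_triplets(value, path=""):
    """Flatten a parsed JSON value into `path,type,value` strings."""
    if isinstance(value, dict):
        triplets = []
        for key, child in value.items():
            # Nested keys are joined with "." to form the path.
            child_path = f"{path}.{key}" if path else key
            triplets.extend(json_to_triplets(child, child_path))
        return triplets
    if isinstance(value, str):
        # A str value is tokenized further, giving one triplet per token.
        return [f"{path},str,{tok}" for tok in simple_tokenize(value)]
    if isinstance(value, bool):
        # Checked before numbers because bool is a subclass of int.
        # The lowercase rendering is an assumption.
        return [f"{path},bool,{str(value).lower()}"]
    if isinstance(value, (int, float)):
        return [f"{path},number,{value}"]
    if value is None:
        # The "null" value rendering is an assumption.
        return [f"{path},null,null"]
    raise TypeError(f"unsupported JSON value at {path!r}: {type(value).__name__}")


document = json.loads("""
{
  "name": "Lance",
  "legal.age": 30,
  "address": {
    "city": "San Francisco",
    "zip:us": 94102
  }
}
""")

for triplet in json_to_triplets(document):
    print(triplet)
```

Run on the example document from the patch, the sketch prints the same five tokens listed there. Note that a key containing a dot, such as `legal.age`, yields a path that cannot be distinguished from a nested key; the sketch follows the documented output and does not escape it.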