60 changes: 60 additions & 0 deletions docs/src/format/table/index/scalar/fts.md
@@ -129,6 +129,66 @@ Token filters are applied in sequence after the base tokenizer:
For stemming and stop word removal, the following languages are supported:
Arabic, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, Turkish

## Document Type
Lance supports two kinds of documents: text and JSON. Each document type has its own tokenization rules and
produces tokens in a different format.

### Text Type
The text type covers both text and lists of text. Tokens are generated by the `base_tokenizer`.

The example below shows how a text document is parsed into tokens.
```text
Tom lives in San Francisco.
```

The resulting tokens are:
```text
Tom
lives
in
San
Francisco
```
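
As a rough illustration of the split shown above, the sketch below uses a regular expression to break text on whitespace and punctuation. The function name `simple_tokenize` is hypothetical and this is not Lance's implementation; the actual `base_tokenizer` is configurable and may behave differently (for example, on hyphens or non-Latin scripts).

```python
import re


def simple_tokenize(text: str) -> list[str]:
    """Hypothetical stand-in for a base tokenizer.

    Returns runs of word characters, so whitespace and
    punctuation act as separators and are dropped.
    """
    return re.findall(r"\w+", text)


tokens = simple_tokenize("Tom lives in San Francisco.")
print(tokens)  # ['Tom', 'lives', 'in', 'San', 'Francisco']
```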

### Json Type
JSON is a nested structure, so Lance breaks a JSON document down into tokens in the triplet format `path,type,value`. The valid
types are: str, number, bool, null.

When the triplet value is a str, the text value is further tokenized using the `base_tokenizer`,
resulting in multiple triplet tokens.

During querying, the JSON tokenizer uses the triplet format instead of the JSON format, which simplifies the query
syntax.

The example below shows how a JSON document is tokenized. Assume we have the following JSON document:
```json
{
  "name": "Lance",
  "legal.age": 30,
  "address": {
    "city": "San Francisco",
    "zip:us": 94102
  }
}
```

After parsing, the document is tokenized into the following tokens:
```
name,str,Lance
legal.age,number,30
address.city,str,San
address.city,str,Francisco
address.zip:us,number,94102
```
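
The flattening described above can be sketched in a few lines of Python. This is a minimal illustration, not Lance's implementation: `triplet_tokens`, `json_type`, and `base_tokenize` are hypothetical names, the regex-based `base_tokenize` is only a stand-in for the configured `base_tokenizer`, and the rendering of bool and null values is an assumption. Arrays are left out because the text above does not describe how they are handled.

```python
import json
import re


def base_tokenize(text: str) -> list[str]:
    # Hypothetical stand-in for the configured base tokenizer.
    return re.findall(r"\w+", text)


def json_type(value) -> str:
    # bool is checked before number because bool is a subclass of int in Python.
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "str"
    raise TypeError(f"unsupported JSON value in this sketch: {value!r}")


def triplet_tokens(doc: dict, prefix: str = "") -> list[str]:
    """Flatten a JSON object into `path,type,value` tokens."""
    tokens = []
    for key, value in doc.items():
        # Nested keys are joined with "." to form the path.
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            tokens.extend(triplet_tokens(value, path))
        elif isinstance(value, str):
            # String values are tokenized further, one triplet per token.
            tokens.extend(f"{path},str,{word}" for word in base_tokenize(value))
        else:
            # Assumption: scalars are rendered with their JSON spelling.
            tokens.append(f"{path},{json_type(value)},{json.dumps(value)}")
    return tokens


doc = json.loads(
    '{"name": "Lance", "legal.age": 30,'
    ' "address": {"city": "San Francisco", "zip:us": 94102}}'
)
for token in triplet_tokens(doc):
    print(token)
```

Note that a key containing a dot, such as `legal.age`, produces the same kind of path as a nested key, so the path alone does not reveal the nesting depth.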

Full text search then operates on the triplet format. To search for "San Francisco", we can use one of the queries
below:
```
address.city:San Francisco
address.city:San
address.city:Francisco
```

## Accelerated Queries

Lance SDKs provide dedicated full text search APIs to leverage the FTS index capabilities.