feat: inverted index for contains_tokens by wojiaodoubao · Pull Request #4489 · lance-format/lance

wojiaodoubao · 2025-08-16T15:52:29Z

This is the second step of #3855

BubbleCal

LGTM, just a minor question

BubbleCal · 2025-08-17T02:15:09Z

-        ));
+        let query = query.as_any().downcast_ref::<TextQuery>().unwrap();
+        match query {
+            TextQuery::StringContains(text) => {


should we add a new variant (TextQuery::TokenContains) for this? cc @westonpace

Yes, I think that would be a good idea. Or StringContains could take a boolean parameter.

Actually, we may need a separate query entirely 🤔

Right now the logic assumes "if scalar index supports query X and we can parse query X then scalar index can be used"

So if we re-use TextQuery that will mean that an inverted index will be chosen for both StringContains and TokenContains which isn't quite right (we only want it used for TokenContains)
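The point about query types can be illustrated with a minimal sketch. It assumes a hypothetical `AnyQuery` trait and `IndexKind` enum, which are simplified stand-ins and not Lance's actual definitions. If each index only accepts the query type it downcasts to, giving `contains_tokens` its own `TokenQuery` keeps the inverted index from being chosen for `StringContains`.

```rust
use std::any::Any;

/// Hypothetical stand-in for the query trait; only `as_any` matters here.
trait AnyQuery {
    fn as_any(&self) -> &dyn Any;
}

/// Exact substring queries (intended for an ngram index).
#[derive(Debug, Clone, PartialEq)]
enum TextQuery {
    StringContains(String),
}

/// Token queries (intended for an inverted index).
#[derive(Debug, Clone, PartialEq)]
enum TokenQuery {
    TokensContains(String),
}

impl AnyQuery for TextQuery {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

impl AnyQuery for TokenQuery {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// Hypothetical index kinds, used only for this illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
enum IndexKind {
    NGram,
    Inverted,
}

/// An index "supports" a query only if the query is of its own query type.
fn supports(kind: IndexKind, query: &dyn AnyQuery) -> bool {
    match kind {
        IndexKind::NGram => query.as_any().is::<TextQuery>(),
        IndexKind::Inverted => query.as_any().is::<TokenQuery>(),
    }
}
```

With this split, `supports(IndexKind::Inverted, &TextQuery::StringContains(..))` is false, which is the behavior the comment above asks for.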

BubbleCal · 2025-08-17T02:17:16Z

-            // Create the contains_tokens UDF
            if func.name() == "contains_tokens" {
                let query = TextQuery::StringContains(scalar_str);
                Some(IndexedExpression::index_query(


it would be cool if we can add more params for this UDF in the future (e.g. AND op, fuzziness, prefix-matching)

Sorry for my late response. I think supporting the AND operator, fuzziness, and prefix-matching is an excellent idea!!! I have a few questions about this idea:

I'm not sure if my understanding is correct. The contains_tokens in #3855 (Allow FTS indices to be used in filters) is functionally the same as contains; the difference is that it leverages FTS instead of n-grams. If we support fuzziness and the other options, the query might look something like this: SELECT * FROM dataset WHERE contains_tokens(text, "+cat~2 and -dog")
I'm wondering if the semantics of contains_tokens would shift to "full-text search" rather than "whether it contains a specific substring." This makes me a bit confused.

Shall we consider adding a fts() UDF to facilitate full-text search through SQL? The fts() UDF would accept a JSON representation of a full_text_query and return a boolean indicating whether the record matches. The query might look like this.

    let full_text_query = r#"
    {
      "query": {
        "bool": {
          "must": [
            { "match": { "content": "asia" } },
            { "match": { "content": { "value": "Felidae", "fuzziness": 2 } } }
          ],
          "should": [
            { "phrase": { "title": "wild animal" } },
            { "phrase": { "title": "home pet" } }
          ]
        }
      }
    }
    "#;
    let sql = format!("SELECT author FROM dataset WHERE fts({})", full_text_query);

If a column in the full_text_query does not have an inverted index created, the query will fail instead of attempting to read the record for matching.

There are other approaches for full_text_query, such as the MATCH AGAINST syntax used by MySQL or Lucene's syntax. However, JSON has stronger expressive capabilities and can fully represent FtsQuery, so I think using JSON to express full_text_query might be a good idea.

If I understand correctly, contains_tokens is somewhat different from contains: users should expect contains_tokens to ignore punctuation and whitespace (depending on the FTS params).

For example, if the text is "hello the world", then contains_tokens("text", "hello world") should match it but contains would not. On the other hand, contains_tokens("text", "wor") can't match anything but contains would. cc @westonpace
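The difference between the two functions can be sketched in a few lines. This assumes a hypothetical `tokenize` helper that splits on any non-alphanumeric character, which only approximates a "simple" tokenizer; the real behavior depends on the FTS index params.

```rust
use std::collections::HashSet;

/// Split on anything that is not alphanumeric (an assumed "simple" tokenizer).
fn tokenize(text: &str) -> HashSet<&str> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .collect()
}

/// True if every token of `query` appears as a whole token in `text`.
/// Token order and the characters between tokens are ignored.
fn contains_tokens(text: &str, query: &str) -> bool {
    let haystack = tokenize(text);
    tokenize(query).into_iter().all(|t| haystack.contains(t))
}
```

Under these assumptions, `contains_tokens("hello the world", "hello world")` is true while `"hello the world".contains("hello world")` is false, and the reverse holds for the partial word "wor".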

Yes, we evaluated Lucene's syntax. The FtsQuery object is organized in a very similar way, and I think it's not hard to convert an FtsQuery object into JSON. In fact, I implemented this conversion logic a while ago; we are not using it because most users use Lance via the SDK, where constructing an FtsQuery object is easier and more efficient (see https://lancedb.com/docs/search/full-text-search/#boolean-queries).

But yes, it's a different story for SQL. I think it's a good idea to have the JSON syntax for SQL; we just need to parse it into the FtsQuery object.

contains_tokens is somewhat different from contains: users should expect contains_tokens to ignore punctuation and whitespace (depending on the FTS params).

Hi @BubbleCal, can I understand it this way: the functionality of contains_tokens is equivalent to an FTS MatchQuery (with fuzziness disabled, Operator::And, and the simple tokenizer)? This semantic makes sense to me. I previously implemented a version of contains_tokens (#4420), and if we ultimately decide on this semantic, the UDF will need some adjustments.

Another point is that if the functionality of contains_tokens is substring matching, then FTS might introduce false negatives. However, if it follows the version described above, there would be no ambiguity at the semantic level.

I'm also wondering: do we need to support configuring tokenizers, fuzziness, prefix-matching, and the operator? If we make them params of contains_tokens, the UDF might become too bloated. I think we can consider supporting an fts() UDF in the future, where more advanced full-text search capabilities can be covered.

Hi @westonpace, there are some different understandings regarding the semantics of the contains_tokens method. Could you help take a look?

I think there is agreement.

contains_tokens can be used for "best effort filtering" and will be accelerated by an inverted (FTS) index

The exact definition of "best effort" is a little vague

contains is for "exact filtering" and will be accelerated by ngram but not inverted index.

FTS search is a different thing (not filtering) and it would be nice to expose in the SQL but that is a different task entirely.

I've opened #4520 for discussion on FTS in SQL

BubbleCal · 2025-08-17T02:27:00Z

+                    .downcast_ref::<UInt64Array>()
+                    .unwrap();
+                let row_ids = row_ids.iter().flatten().collect_vec();
+                Ok(SearchResult::AtMost(RowIdTreeMap::from_iter(row_ids)))


We can probably do this better in the future: if the tokenizer doesn't change the original text, we can return AtLeast directly.

codecov-commenter · 2025-08-18T08:34:08Z

Codecov Report

❌ Patch coverage is 93.54839% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.19%. Comparing base (811ea9f) to head (f066ce6).
⚠️ Report is 5 commits behind head on main.

Files with missing lines	Patch %	Lines
rust/lance-index/src/scalar.rs	76.19%	5 Missing ⚠️
rust/lance-index/src/scalar/inverted/index.rs	86.11%	3 Missing and 2 partials ⚠️
rust/lance/src/dataset.rs	97.79%	1 Missing and 2 partials ⚠️
rust/lance-datafusion/src/udf.rs	93.33%	2 Missing ⚠️
rust/lance/src/index.rs	90.90%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4489      +/-   ##
==========================================
- Coverage   82.24%   81.19%   -1.06%     
==========================================
  Files         308      308              
  Lines      127678   114079   -13599     
  Branches   127678   114079   -13599     
==========================================
- Hits       105010    92623   -12387     
+ Misses      18715    18199     -516     
+ Partials     3953     3257     -696

Flag	Coverage Δ
unittests	`81.19% <93.54%> (-1.06%)`	⬇️


wojiaodoubao · 2025-08-21T09:51:50Z

Hi @BubbleCal @westonpace, could you help review this when you have time? Thanks very much!

westonpace

What is our goal?

If the goal is to approximate "string contains" with the knowledge that there may sometimes be false negatives then we should return AtMost (which this PR does) and use value.contains(scalar_str.value) (which this PR does not do).

If the goal is to create a "contains tokens" function then we shouldn't define "token" very precisely. This PR defines it as "separated by punctuation and white space" but that is not necessarily how the tokenizer is configured. It might use ngrams or apply stemming, and stop words may be removed. In this case I think we should return exact (which this PR does not do) and use the collect_tokens based filter (which this PR does do).
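The trade-off between the result kinds can be made concrete with a simplified model. The `SearchResult` enum below is a stand-in that uses plain `u64` row ids in a `BTreeSet` instead of `RowIdTreeMap`, and `resolve` is a hypothetical helper showing how much rechecking each kind would require under these assumed semantics.

```rust
use std::collections::BTreeSet;

/// Simplified model of an index search result (row ids as plain u64).
#[derive(Debug, Clone, PartialEq)]
enum SearchResult {
    /// Exactly the matching rows; no recheck needed.
    Exact(BTreeSet<u64>),
    /// A superset of the matching rows (false positives possible).
    AtMost(BTreeSet<u64>),
    /// A subset of the matching rows (other rows may also match).
    AtLeast(BTreeSet<u64>),
}

/// Turn an index result into the true matches, calling `pred` only where
/// the index result leaves the answer open.
fn resolve(
    result: &SearchResult,
    all_rows: &BTreeSet<u64>,
    pred: impl Fn(u64) -> bool,
) -> BTreeSet<u64> {
    match result {
        // Trust the index completely.
        SearchResult::Exact(ids) => ids.clone(),
        // Recheck only the candidates returned by the index.
        SearchResult::AtMost(ids) => ids.iter().copied().filter(|&id| pred(id)).collect(),
        // Keep the known matches and recheck every other row.
        SearchResult::AtLeast(ids) => all_rows
            .iter()
            .copied()
            .filter(|&id| ids.contains(&id) || pred(id))
            .collect(),
    }
}
```

In this model an empty `AtLeast` forces a recheck of every row, so it carries no information, while `AtMost` limits the recheck to the candidate set.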

westonpace · 2025-08-22T04:01:09Z

"The query results are a superset of contains_tokens" can be checked by condition below.

Ah, I missed this part, thank you!

westonpace

A few more small tasks. There is a bit of glue we need to connect the query to the index. I think we will need a test verifying the index is actually used too.

westonpace · 2025-08-22T04:06:41Z

+#[derive(Debug, Clone, PartialEq)]
+pub enum TokenQuery {
+    /// Retrieve all row ids where the text contains all tokens parsed from given string. The tokens


There might be a few pieces of glue missing. I think we need a TokenQueryParser (see https://github.com/lancedb/lance/blob/60711f360b7f8692df44a0e84c98c8fdff2897a3/rust/lance-index/src/scalar/expression.rs#L348 for the TextQueryParser).

We also need to register the token query parser here: https://github.com/lancedb/lance/blob/60711f360b7f8692df44a0e84c98c8fdff2897a3/rust/lance/src/index.rs#L1361

Can we call is_query_allowed in the registration function (scalar_index_info)? This way we can skip the scalar index entirely if it is not eligible. Returning AtLeast with zero rows might lead to bad performance (the planner will think we are doing a scalar index optimized search and make certain decisions based on that)
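A rough sketch of the suggestion, checking eligibility before the planner commits to an index. `FtsIndexInfo`, `Plan`, and `plan_contains_tokens` are hypothetical names for illustration, and the eligibility rule is a trimmed-down assumption rather than the real `is_query_allowed`.

```rust
/// Hypothetical, trimmed-down description of an FTS index on a column.
#[derive(Debug, Clone, PartialEq)]
struct FtsIndexInfo {
    name: String,
    base_tokenizer: String,
    stem: bool,
}

/// Assumed eligibility rule, standing in for `is_query_allowed`.
fn is_query_allowed(info: &FtsIndexInfo) -> bool {
    info.base_tokenizer == "simple" && !info.stem
}

/// What the planner is told for a `contains_tokens` filter.
#[derive(Debug, PartialEq)]
enum Plan<'a> {
    /// Use the named index to narrow the candidate rows.
    UseIndex(&'a str),
    /// No eligible index: plan a plain scan with the filter applied.
    FullScan,
}

/// Decide at registration time, so an ineligible index never reaches the planner.
fn plan_contains_tokens(indices: &[FtsIndexInfo]) -> Plan<'_> {
    match indices.iter().find(|i| is_query_allowed(i)) {
        Some(i) => Plan::UseIndex(&i.name),
        None => Plan::FullScan,
    }
}
```

Deciding here means the planner sees `FullScan` directly instead of an index search that returns nothing useful.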

I think we need a TokenQueryParser

We already have an FtsQueryParser which parses contains_tokens into TokenQuery::TokensContains. Actually, you implemented it (^-^). Shall we just rely on FtsQueryParser?

Can we call is_query_allowed in the registration function ...

Thanks for the nice suggestion, let me fix it.

wojiaodoubao · 2025-08-22T04:10:46Z

Sorry, I accidentally deleted the comment. I'm reposting it here.

What is our goal?

The current goal is to create a "contains tokens" function.

In this case I think we should return exact (which this PR does not do) and use the collect_tokens based filter (which this PR does do).

The current functionality of contains_tokens is equivalent to the following FTS bm25_search:

Tokenizer Configuration

InvertedIndexParams {
  base_tokenizer: "simple",
  language: English,
  with_position: false,
  max_token_length: None,
  lower_case: false,
  stem: false,
  remove_stop_words: false,
  ascii_folding: false,
  min_ngram_length: u32::MAX, // disable ngram
  max_ngram_length: u32::MAX, // disable ngram
  prefix_only: false,
}

fuzziness: None
Operator: And

Before using an InvertedIndex to optimize a query, we must check its InvertedIndexParams. If the FTS index's InvertedIndexParams can guarantee that "the query results are a superset of contains_tokens query results," then the FTS index is used and AtMost is returned. Otherwise, AtLeast with zero row ids is returned.

"The query results are a superset of contains_tokens" can be checked by condition below.

    /// Whether the query can use the current index.
    fn is_query_allowed(&self, query: &TokenQuery) -> bool {
        match query {
            TokenQuery::TokensContains(_) => {
                self.params.base_tokenizer == "simple"
                    && self.params.max_token_length.is_none()
                    && self.params.language == Language::English
                    && !self.params.stem
            }
        }
    }

The prerequisite for returning Exact is that the FTS index's InvertedIndexParams configuration is identical to the contains_tokens configuration listed above. I think this could be considered as a future optimization.
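The three outcomes described in this comment could be sketched as follows. `Params` is a trimmed, hypothetical version of `InvertedIndexParams` (the language and ngram fields are left out for brevity), and `Guarantee` and `guarantee` are illustrative names, not part of the PR.

```rust
/// Trimmed, hypothetical version of the index params discussed above.
#[derive(Debug, Clone, PartialEq)]
struct Params {
    base_tokenizer: String,
    max_token_length: Option<usize>,
    lower_case: bool,
    stem: bool,
    remove_stop_words: bool,
    ascii_folding: bool,
}

/// What the index can promise for a `contains_tokens` query.
#[derive(Debug, PartialEq)]
enum Guarantee {
    /// Index config matches the UDF exactly: results need no recheck.
    Exact,
    /// Index results are a superset: recheck the candidates.
    AtMost,
    /// No superset guarantee: do not use the index.
    NotUsable,
}

/// The configuration that `contains_tokens` itself is equivalent to.
fn reference_params() -> Params {
    Params {
        base_tokenizer: "simple".to_string(),
        max_token_length: None,
        lower_case: false,
        stem: false,
        remove_stop_words: false,
        ascii_folding: false,
    }
}

/// Mirrors the shape of the `is_query_allowed` condition quoted above.
fn superset_guaranteed(p: &Params) -> bool {
    p.base_tokenizer == "simple" && p.max_token_length.is_none() && !p.stem
}

fn guarantee(p: &Params) -> Guarantee {
    if *p == reference_params() {
        Guarantee::Exact
    } else if superset_guaranteed(p) {
        Guarantee::AtMost
    } else {
        Guarantee::NotUsable
    }
}
```

For instance, an index built with `lower_case: true` would still pass the superset check but differ from the reference configuration, so under this sketch it yields `AtMost` rather than `Exact`.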

wojiaodoubao · 2025-08-23T05:05:58Z

Hi @westonpace, this PR is ready for review. Please take a look when you have time, thanks very much!

westonpace

Thanks for bearing with the reviews!

westonpace · 2025-08-28T12:54:44Z

+                        let fts_index =
+                            lance_index::scalar::expression::ScalarIndexLoader::load_index(
+                                self,
+                                &field.name,
+                                &index.name,
+                                &NoOpMetricsCollector,
+                            )
+                            .await?;


This is a bit unique but I think it'll be fine for now. It means we will always load the FTS index on every query, even if it isn't used. Maybe once #4584 lands we can find some way to handle is_query_allowed in the plugin (perhaps based on details) so we don't have to load an index unless we know we can use it.

This is the second step of lance-format#3855 --------- Co-authored-by: lijinglun <lijinglun@bytedance.com>

github-actions Bot added the enhancement New feature or request label Aug 16, 2025

wojiaodoubao force-pushed the contains_tokens_use_index branch from f652ecd to efa3111 Compare August 16, 2025 15:53

BubbleCal reviewed Aug 17, 2025


wojiaodoubao force-pushed the contains_tokens_use_index branch from efa3111 to f8dd633 Compare August 18, 2025 07:55

jackye1995 requested a review from BubbleCal August 19, 2025 22:44

lijinglun added 3 commits August 21, 2025 11:08

feat: pushdown contains_tokens using inverted index

98f7081

update udf contains_tokens syntax

0571e12

push contains_tokens to inerted index

0fc8a1d

wojiaodoubao force-pushed the contains_tokens_use_index branch 2 times, most recently from 8138623 to d47f403 Compare August 21, 2025 08:16

BubbleCal approved these changes Aug 21, 2025


westonpace reviewed Aug 21, 2025


wojiaodoubao force-pushed the contains_tokens_use_index branch from d47f403 to 0250ece Compare August 22, 2025 03:38

fmt and clippy

80a2c66

wojiaodoubao force-pushed the contains_tokens_use_index branch from 0250ece to 80a2c66 Compare August 22, 2025 03:48

westonpace reviewed Aug 22, 2025


follow comments

f066ce6

westonpace approved these changes Aug 28, 2025


westonpace merged commit 1a48b24 into lance-format:main Aug 28, 2025
26 checks passed

jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026

feat: inverted index for contains_tokens (lance-format#4489)

4ec9d48

