feat: inverted index for contains_tokens #4489

Merged
westonpace merged 5 commits into lance-format:main from wojiaodoubao:contains_tokens_use_index on Aug 28, 2025

@wojiaodoubao
Contributor

This is the second step of #3855

@github-actions github-actions Bot added the enhancement New feature or request label Aug 16, 2025
@wojiaodoubao wojiaodoubao force-pushed the contains_tokens_use_index branch from f652ecd to efa3111 Compare August 16, 2025 15:53
Contributor

@BubbleCal BubbleCal left a comment

LGTM, just a minor question

));
let query = query.as_any().downcast_ref::<TextQuery>().unwrap();
match query {
TextQuery::StringContains(text) => {
Contributor

should we add a new variant (TextQuery::TokenContains) for this? cc @westonpace

Member

Yes, I think that would be a good idea. Or StringContains could take a boolean parameter.

Member

Actually, we may need a separate query entirely 🤔

Right now the logic assumes "if scalar index supports query X and we can parse query X then scalar index can be used"

So if we re-use TextQuery that will mean that an inverted index will be chosen for both StringContains and TokenContains which isn't quite right (we only want it used for TokenContains)
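
To make the concern concrete, here is a minimal sketch of type-based index selection. It is not Lance's actual code: `IndexKind`, `index_supports`, and the two reduced enums are hypothetical stand-ins, and the only assumption taken from the discussion is that eligibility is decided by the query's type alone.

```rust
use std::any::Any;

/// Hypothetical stand-in for the substring query family.
#[derive(Debug, Clone, PartialEq)]
enum TextQuery {
    StringContains(String),
}

/// Hypothetical stand-in for a separate token query family.
#[derive(Debug, Clone, PartialEq)]
enum TokenQuery {
    TokensContains(String),
}

/// The two kinds of index in this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum IndexKind {
    NGram,
    Inverted,
}

/// Mirrors the rule "if the index supports query X and we can parse query X,
/// the index can be used": the decision looks only at the query's type.
fn index_supports(kind: IndexKind, query: &dyn Any) -> bool {
    match kind {
        IndexKind::NGram => query.is::<TextQuery>(),
        IndexKind::Inverted => query.is::<TokenQuery>(),
    }
}
```

With a separate query type, an inverted index is never selected for `StringContains`. If both variants lived in `TextQuery`, a type-only check could not tell them apart.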

// Create the contains_tokens UDF
if func.name() == "contains_tokens" {
let query = TextQuery::StringContains(scalar_str);
Some(IndexedExpression::index_query(
Contributor

@BubbleCal BubbleCal Aug 17, 2025

it would be cool if we can add more params for this UDF in the future (e.g. AND op, fuzziness, prefix-matching)

Contributor Author

@wojiaodoubao wojiaodoubao Aug 18, 2025

Sorry for my late response. I think supporting the AND operator, fuzziness, and prefix-matching is an excellent idea!!! I have a few questions about this idea:

  1. I'm not sure my understanding is correct. As I read #3855 (Allow FTS indices to be used in filters), contains_tokens is functionally the same as contains; the difference is that it leverages FTS instead of n-grams. If we support fuzziness and the other options, the query might look something like this: SELECT * FROM dataset WHERE contains_tokens(text, "+cat~2 and -dog")
    I'm wondering whether the semantics of contains_tokens would then shift to "full-text search" rather than "whether it contains a specific substring". That confuses me a bit.

  2. Shall we consider adding a fts() UDF to facilitate full-text search through SQL? The fts() UDF would accept a JSON representation of a full_text_query and return a boolean indicating whether the record matches. The query might look like this.

let full_text_query = r#"
{
  "query": {
    "bool": {
      "must": [
        { "match": { "content": "asia" } },
        { "match": { "content": { "value": "Felidae", "fuzziness": 2 } } }
      ],
      "should": [
        { "phrase": { "title": "wild animal" } },
        { "phrase": { "title": "home pet" } }
      ]
    }
  }
}
"#;

let sql = format!("SELECT author FROM dataset WHERE fts({})", full_text_query);

If a column in the full_text_query does not have an inverted index created, the query will fail instead of attempting to read the record for matching.

There are other approaches for full_text_query, such as the MATCH AGAINST syntax used by MySQL or Lucene's syntax. However, JSON has stronger expressive capabilities and can fully represent FtsQuery, so I think using JSON to express full_text_query might be a good idea.

Contributor

@BubbleCal BubbleCal Aug 18, 2025

  1. If I understand correctly, contains_tokens is somewhat different from contains: users should expect contains_tokens to ignore punctuation and white space (depending on the FTS params).

e.g. if the text is "hello the world", then contains_tokens("text", "hello world") should match it but contains should not. On the other hand, contains_tokens("text", "wor") can't match anything but contains would. cc @westonpace

  2. Yes, we evaluated Lucene's syntax. The FtsQuery object is organized in a very similar way, and I think it's not hard to convert an FtsQuery object into JSON. In fact, I implemented this conversion logic a while ago. We are not using it because most users use Lance via the SDK, where querying by constructing an FtsQuery object is easier and more efficient (see https://lancedb.com/docs/search/full-text-search/#boolean-queries).

But yes, it's a different story for SQL. I think it's a good idea to have the JSON syntax for SQL; we just need to parse it into the FtsQuery object.
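
The difference between the two functions in point 1 can be sketched as follows. This is only an illustration under assumptions: `simple_tokens` is a hypothetical helper that splits on non-alphanumeric characters, which approximates a "simple" tokenizer but is not the tokenizer Lance uses, and both functions take plain strings rather than a column.

```rust
use std::collections::HashSet;

/// Hypothetical helper: split on anything that is not alphanumeric.
/// Real FTS tokenizers may also lowercase, stem, or drop stop words.
fn simple_tokens(text: &str) -> HashSet<&str> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .collect()
}

/// Token semantics: every token of `query` must appear as a whole token in `text`.
fn contains_tokens(text: &str, query: &str) -> bool {
    let haystack = simple_tokens(text);
    simple_tokens(query)
        .into_iter()
        .all(|t| haystack.contains(t))
}

/// Substring semantics, as in `contains`.
fn contains(text: &str, query: &str) -> bool {
    text.contains(query)
}
```

For the text "hello the world", `contains_tokens` accepts "hello world" and rejects "wor", while `contains` does the opposite.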

Contributor Author

@wojiaodoubao wojiaodoubao Aug 20, 2025

contains_tokens is somewhat different from contains: users should expect contains_tokens to ignore punctuation and white space (depending on the FTS params).

Hi @BubbleCal, can I understand it this way: the functionality of contains_tokens is equivalent to an FTS MatchQuery (with fuzziness disabled, Operator::And, and the simple tokenizer)? These semantics make sense to me. I previously implemented a version of contains_tokens (#4420), and if we ultimately decide on these semantics, the UDF will need some adjustments.

Another point is that if the functionality of contains_tokens is substring matching, then FTS might introduce false negatives. However, if it follows the version described above, there would be no ambiguity at the semantic level.

I'm also wondering: do we need to support configuring the tokenizer, fuzziness, prefix matching, and the operator? If we make them params of contains_tokens, the UDF might become too bloated. I think we can consider supporting an fts() UDF in the future, where more advanced full-text search capabilities can be covered.

Hi @westonpace, there are some different understandings regarding the syntax of the contains_tokens method. Could you help take a look?

Member

I think there is agreement.

  • contains_tokens can be used for "best effort filtering" and will be accelerated by an inverted (FTS) index
    • The exact definition of "best effort" is a little vague
  • contains is for "exact filtering" and will be accelerated by ngram but not inverted index.
  • FTS search is a different thing (not filtering) and it would be nice to expose in the SQL but that is a different task entirely.

Member

I've opened #4520 for discussion on FTS in SQL

.downcast_ref::<UInt64Array>()
.unwrap();
let row_ids = row_ids.iter().flatten().collect_vec();
Ok(SearchResult::AtMost(RowIdTreeMap::from_iter(row_ids)))
Contributor

We can probably do this better in the future: if the tokenizer doesn't change the original text, we can return AtLeast directly.
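
For readers unfamiliar with the result variants mentioned here, the following is a simplified sketch of how I understand them. It is an assumption-laden model, not Lance's implementation: `BTreeSet<u64>` replaces `RowIdTreeMap`, and `resolve` is a hypothetical helper showing what work remains for each variant.

```rust
use std::collections::BTreeSet;

/// Simplified stand-in for the index search result.
#[derive(Debug, Clone, PartialEq)]
enum SearchResult {
    /// Exactly the matching rows; no further work is needed.
    Exact(BTreeSet<u64>),
    /// A superset of the matching rows; each candidate must be rechecked.
    AtMost(BTreeSet<u64>),
    /// A subset of the matching rows; rows outside the set must still be checked.
    AtLeast(BTreeSet<u64>),
}

/// Hypothetical helper: turn a search result into the final set of matching
/// row ids, given every row id and a per-row recheck predicate.
fn resolve(
    result: &SearchResult,
    all_rows: &[u64],
    recheck: impl Fn(u64) -> bool,
) -> BTreeSet<u64> {
    match result {
        SearchResult::Exact(ids) => ids.clone(),
        SearchResult::AtMost(ids) => ids.iter().copied().filter(|&id| recheck(id)).collect(),
        SearchResult::AtLeast(ids) => all_rows
            .iter()
            .copied()
            .filter(|&id| ids.contains(&id) || recheck(id))
            .collect(),
    }
}
```

In this model `AtMost` only ever rechecks the candidates, whereas `AtLeast` with an empty set degenerates into rechecking every row.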

@wojiaodoubao wojiaodoubao force-pushed the contains_tokens_use_index branch from efa3111 to f8dd633 Compare August 18, 2025 07:55
@codecov-commenter

codecov-commenter commented Aug 18, 2025

Codecov Report

❌ Patch coverage is 93.54839% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.19%. Comparing base (811ea9f) to head (f066ce6).
⚠️ Report is 5 commits behind head on main.

Files with missing lines                        Patch %   Lines
rust/lance-index/src/scalar.rs                  76.19%    5 Missing ⚠️
rust/lance-index/src/scalar/inverted/index.rs   86.11%    3 Missing and 2 partials ⚠️
rust/lance/src/dataset.rs                       97.79%    1 Missing and 2 partials ⚠️
rust/lance-datafusion/src/udf.rs                93.33%    2 Missing ⚠️
rust/lance/src/index.rs                         90.90%    0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4489      +/-   ##
==========================================
- Coverage   82.24%   81.19%   -1.06%     
==========================================
  Files         308      308              
  Lines      127678   114079   -13599     
  Branches   127678   114079   -13599     
==========================================
- Hits       105010    92623   -12387     
+ Misses      18715    18199     -516     
+ Partials     3953     3257     -696     
Flag Coverage Δ
unittests 81.19% <93.54%> (-1.06%) ⬇️

@jackye1995 jackye1995 requested a review from BubbleCal August 19, 2025 22:44
@wojiaodoubao wojiaodoubao force-pushed the contains_tokens_use_index branch 2 times, most recently from 8138623 to d47f403 Compare August 21, 2025 08:16
@wojiaodoubao
Contributor Author

Hi @BubbleCal @westonpace, could you help review this when you have time? Thanks very much!

Member

@westonpace westonpace left a comment

What is our goal?

If the goal is to approximate "string contains" with the knowledge that there may sometimes be false negatives then we should return AtMost (which this PR does) and use value.contains(scalar_str.value) (which this PR does not do).

If the goal is to create a "contains tokens" function then we shouldn't define "token" very precisely. This PR defines it as "separated by punctuation and white space" but that is not necessarily how the tokenizer is configured. It might use ngrams or apply stemming, and stop words may be removed. In this case I think we should return exact (which this PR does not do) and use the collect_tokens based filter (which this PR does do).

@wojiaodoubao wojiaodoubao force-pushed the contains_tokens_use_index branch from d47f403 to 0250ece Compare August 22, 2025 03:38
@wojiaodoubao wojiaodoubao force-pushed the contains_tokens_use_index branch from 0250ece to 80a2c66 Compare August 22, 2025 03:48
@westonpace
Member

"The query results are a superset of contains_tokens" can be checked by condition below.

Ah, I missed this part, thank you!

Member

@westonpace westonpace left a comment

A few more small tasks. There is a bit of glue we need to connect the query to the index. I think we will need a test verifying the index is actually used too.

Comment thread rust/lance/src/dataset.rs
Comment on lines +559 to +561
#[derive(Debug, Clone, PartialEq)]
pub enum TokenQuery {
/// Retrieve all row ids where the text contains all tokens parsed from given string. The tokens
Member

There might be a few pieces of glue missing. I think we need a TokenQueryParser (see https://github.com/lancedb/lance/blob/60711f360b7f8692df44a0e84c98c8fdff2897a3/rust/lance-index/src/scalar/expression.rs#L348 for the TextQueryParser).

We also need to register the token query parser here: https://github.com/lancedb/lance/blob/60711f360b7f8692df44a0e84c98c8fdff2897a3/rust/lance/src/index.rs#L1361

Can we call is_query_allowed in the registration function (scalar_index_info)? This way we can skip the scalar index entirely if it is not eligible. Returning AtLeast with zero rows might lead to bad performance (the planner will think we are doing a scalar index optimized search and make certain decisions based on that)
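
A rough sketch of the suggestion above, filtering at registration time instead of returning an empty AtLeast at search time. Everything here is hypothetical: `IndexCandidate` and `usable_indices` are illustrative names, and `query_allowed` stands in for whatever `is_query_allowed` would report for a given index.

```rust
/// Hypothetical summary of an index found on a column.
#[derive(Debug, Clone, PartialEq)]
struct IndexCandidate {
    name: String,
    /// Stand-in for the outcome of `is_query_allowed` on this index.
    query_allowed: bool,
}

/// Registration step in this sketch: only indices that can serve the query
/// are advertised, so the planner never plans an index search that cannot
/// actually narrow the row set.
fn usable_indices(candidates: &[IndexCandidate]) -> Vec<&str> {
    candidates
        .iter()
        .filter(|c| c.query_allowed)
        .map(|c| c.name.as_str())
        .collect()
}
```

When the returned list is empty, the planner in this model would simply treat the column as unindexed for this query.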

Contributor Author

@wojiaodoubao wojiaodoubao Aug 22, 2025

I think we need a TokenQueryParser

We already have an FtsQueryParser, which parses contains_tokens into TokenQuery::TokensContains. Actually, you implemented it (^-^). Can we just rely on FtsQueryParser?

Can we call is_query_allowed in the registration function ...

Thanks for the nice suggestion, let me fix it.

@wojiaodoubao
Contributor Author

Sorry, I accidentally deleted the comment. I'm reposting it here.

What is our goal?

The current goal is to create a "contains tokens" function.

In this case I think we should return exact (which this PR does not do) and use the collect_tokens based filter (which this PR does do).

The current functionality of contains_tokens is equivalent to the following FTS bm25_search:

  • Tokenizer Configuration
InvertedIndexParams {
  base_tokenizer: "simple",
  language: English,
  with_position: false,
  max_token_length: None,
  lower_case: false,
  stem: false,
  remove_stop_words: false,
  ascii_folding: false,
  min_ngram_length: u32::MAX, // disables ngram
  max_ngram_length: u32::MAX, // disables ngram
  prefix_only: false,
}
  • fuzziness: None
  • Operator: And

Before using the InvertedIndex to optimize a query, we must check its InvertedIndexParams. If the FTS index's InvertedIndexParams can guarantee that "the query results are a superset of the contains_tokens query results", then the FTS index is used and AtMost is returned. Otherwise, AtLeast with zero row ids is returned.

"The query results are a superset of contains_tokens" can be checked by condition below.

    /// Whether the query can use the current index.
    fn is_query_allowed(&self, query: &TokenQuery) -> bool {
        match query {
            TokenQuery::TokensContains(_) => {
                self.params.base_tokenizer == "simple"
                    && self.params.max_token_length.is_none()
                    && self.params.language == Language::English
                    && !self.params.stem
            }
        }
    }

The prerequisite for returning Exact is that the FTS index's InvertedIndexParams configuration is identical to the one contains_tokens uses. I think this could be considered an optimization.
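
The Exact optimization described above could look roughly like the sketch below. It is not part of this PR: `TokenizerParams` is a reduced, hypothetical subset of InvertedIndexParams, `reference` encodes the configuration listed earlier in this comment, and treating "identical" as whole-struct equality is my assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Language {
    English,
    Other,
}

/// Reduced, hypothetical subset of the tokenizer parameters.
#[derive(Debug, Clone, PartialEq)]
struct TokenizerParams {
    base_tokenizer: String,
    language: Language,
    max_token_length: Option<usize>,
    lower_case: bool,
    stem: bool,
    remove_stop_words: bool,
    ascii_folding: bool,
}

impl TokenizerParams {
    /// The configuration that contains_tokens is described as equivalent to.
    fn reference() -> Self {
        TokenizerParams {
            base_tokenizer: "simple".to_string(),
            language: Language::English,
            max_token_length: None,
            lower_case: false,
            stem: false,
            remove_stop_words: false,
            ascii_folding: false,
        }
    }
}

#[derive(Debug, PartialEq)]
enum ResultKind {
    Exact,
    AtMost,
}

/// `None` means the index must not be used for contains_tokens at all.
fn result_kind(params: &TokenizerParams) -> Option<ResultKind> {
    // Same superset condition as is_query_allowed above.
    let superset = params.base_tokenizer == "simple"
        && params.max_token_length.is_none()
        && params.language == Language::English
        && !params.stem;
    if !superset {
        return None;
    }
    if *params == TokenizerParams::reference() {
        Some(ResultKind::Exact)
    } else {
        Some(ResultKind::AtMost)
    }
}
```

An index built with `lower_case: true` would still pass the superset check but only qualify for AtMost, while a stemming index would be rejected outright.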

@wojiaodoubao
Contributor Author

Hi @westonpace, this PR is ready for review. Please help when you have time, thanks very much!

Member

@westonpace westonpace left a comment

Thanks for bearing with the reviews!

Comment thread rust/lance/src/index.rs
Comment on lines +1378 to +1385
let fts_index =
lance_index::scalar::expression::ScalarIndexLoader::load_index(
self,
&field.name,
&index.name,
&NoOpMetricsCollector,
)
.await?;
Member

This is a bit unique but I think it'll be fine for now. It means we will always load the FTS index on every query, even if it isn't used. Maybe once #4584 is done we can find some way to handle is_query_allowed in the plugin (perhaps based on details) so we don't have to load an index unless we know we can use it.

@westonpace westonpace merged commit 1a48b24 into lance-format:main Aug 28, 2025
26 checks passed
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
This is the second step of lance-format#3855

---------

Co-authored-by: lijinglun <lijinglun@bytedance.com>
4 participants