
feat: support fuzzy query and boost query #3610

Merged
BubbleCal merged 31 commits into lance-format:main from BubbleCal:boosting-query
Mar 31, 2025

Contributor

BubbleCal commented Mar 26, 2025

This PR introduces fuzzy queries and boost queries, and changes how we build the FTS query.

It also introduces the fst library to implement fuzzy queries:

  • Generally, an fst is like an immutable Map<String, u64>, but it supports several kinds of string queries (e.g. fuzzy search, prefix match, substring, not equal).
  • When building the FTS index, we store the tokens in a HashMap because we require mutability.
  • When loading the FTS index to serve queries, we load the tokens into an fst so that we can support fuzzy queries, and probably more kinds of queries in the future.

Other impacts:

  • fst uses less memory, especially when there are many similar tokens.
  • fst is slower than HashMap for looking up a token id, but for FTS most of the time is spent searching over posting lists, so this has no visible impact on query latency.
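The build-time versus load-time split described above can be illustrated with a rough sketch. This is not the fst crate and not Lance's code: `FrozenTokens`, its methods, and the sample tokens are hypothetical, and the fuzzy lookup scans every key with a plain edit-distance function, whereas a real fst would intersect the map with a Levenshtein automaton.

```python
from bisect import bisect_left


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


class FrozenTokens:
    """Hypothetical read-only, sorted token -> id table (a stand-in for an fst map)."""

    def __init__(self, tokens: dict):
        items = sorted(tokens.items())
        self._keys = [k for k, _ in items]
        self._ids = [v for _, v in items]

    def get(self, token: str):
        """Exact lookup by binary search; returns None if the token is absent."""
        i = bisect_left(self._keys, token)
        if i < len(self._keys) and self._keys[i] == token:
            return self._ids[i]
        return None

    def fuzzy(self, token: str, max_distance: int) -> dict:
        """All tokens within max_distance edits (linear scan, for illustration only)."""
        return {
            k: v
            for k, v in zip(self._keys, self._ids)
            if levenshtein(k, token) <= max_distance
        }


# Build time: a mutable dict collects tokens as documents are indexed.
building = {}
for tok in ["lance", "dance", "lancer", "vector"]:
    building.setdefault(tok, len(building))

# Load time: freeze the tokens into the read-only structure used to serve queries.
frozen = FrozenTokens(building)
matches = frozen.fuzzy("lance", 1)
```

Here `matches` contains the ids of "dance", "lance", and "lancer", which a fuzzy query could then expand into posting-list lookups.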

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
github-actions bot added the enhancement (New feature or request) and python labels on Mar 26, 2025
…-query

BubbleCal changed the title from "feat: support boost query" to "feat: support fuzzy query and boost query" on Mar 27, 2025
Contributor Author

The FTS query types are defined in Rust, and each of them has a corresponding execution plan implementation in rust/lance/src/io/exec/fts.rs (except MultiMatch, which is a bit special: we can plan it into multiple MatchQuery).
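As a rough illustration of planning a MultiMatch into multiple MatchQuery, the sketch below expands one multi-column query into one match query per column. The dataclasses, their field names, and `plan_multi_match` are hypothetical and are not the types in the Rust code; how the per-column results are merged afterwards is not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchQuery:
    # Hypothetical fields: the column to search and the query text.
    column: str
    terms: str


@dataclass(frozen=True)
class MultiMatchQuery:
    # Hypothetical fields: several columns sharing one query text.
    columns: tuple
    terms: str


def plan_multi_match(query: MultiMatchQuery) -> list:
    """Expand a multi-column query into one MatchQuery per column."""
    return [MatchQuery(column=c, terms=query.terms) for c in query.columns]


planned = plan_multi_match(
    MultiMatchQuery(columns=("title", "body"), terms="fuzzy search")
)
```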

Comment thread python/python/lance/query.py Outdated
from typing import Optional


class Query:
Contributor Author

This is the FTS query object in Python. We organize it as a nested map and parse it back into a Query object in Rust; see python/src/utils.rs.
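To make the nested-map idea concrete, here is a minimal sketch of what such a map might look like and how a recursive parser could walk it. The key names ("boost", "match", "positive", "negative", "column", "terms", "fuzziness") are assumptions for illustration and may not match the actual encoding in python/src/utils.rs.

```python
# Hypothetical nested-map encoding; the keys are illustrative, not Lance's actual schema.
boost_query = {
    "boost": {
        "positive": {"match": {"column": "text", "terms": "lance", "fuzziness": 1}},
        "negative": {"match": {"column": "text", "terms": "deprecated"}},
    }
}


def query_kind(query: dict) -> str:
    """Return the single top-level key, which names the query type."""
    if len(query) != 1:
        raise ValueError("a query map must have exactly one top-level key")
    return next(iter(query))


def walk(query: dict, depth: int = 0) -> list:
    """List (depth, kind) pairs in the order a recursive parser would visit them."""
    kind = query_kind(query)
    visited = [(depth, kind)]
    for value in query[kind].values():
        if isinstance(value, dict):
            visited.extend(walk(value, depth + 1))
    return visited


visited = walk(boost_query)
```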

BubbleCal requested a review from Copilot on March 27, 2025 10:24
Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces fuzzy and boost full-text search (FTS) support and refactors FTS query building in both Rust and Python.

  • Introduces new FTS query parameters and functions (e.g. fuzzy query via Levenshtein automaton and boost queries)
  • Refactors query planning and execution in the Rust scanner and inverted index modules
  • Updates tests and Python APIs to support the new FTS query types

Reviewed Changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| rust/lance/src/io/exec/utils.rs | Adds new imports and helper functions for dataset prefilter building |
| rust/lance/src/dataset/scanner.rs | Refactors full text search query planning including field validation and boost query handling |
| rust/lance/src/dataset.rs | Adds a new integration test for fuzzy query functionality |
| rust/lance/examples/full_text_search.rs | Updates example usage to reflect new FTS query parameter handling |
| rust/lance-index/* | Several files refactored to support fuzzy query expansion and updated BM25 search signature |
| python/src/utils.rs, python/src/dataset.rs, python/python/* | Update Python API and tests to parse and handle new compound FTS queries |
Comments suppressed due to low confidence (2)

rust/lance/src/dataset/scanner.rs:1711

  • Using unwrap() on query.field can panic if the field is not set. Consider using a pattern match or explicit error handling to ensure the field is present.
let index = self.dataset.load_scalar_index_for_column(query.field.as_ref().unwrap()).await?

rust/lance-index/src/scalar/inverted/index.rs:145

  • [nitpick] The error message could be more descriptive by clarifying the context of why tokens are expected to be an fst map at search time. Consider including guidance on how the index should be recreated if not.
return Err(Error::Index { message: "tokens is not fst, which is not expected".to_owned(), location: location!() });

BubbleCal requested a review from westonpace on March 27, 2025 10:25
BubbleCal requested a review from wjones127 on March 27, 2025 10:25
Member

westonpace left a comment

I think the overall structure and approach are good. I have a bunch of questions and nits but no major concerns.

Comment thread python/python/lance/query.py
Comment on lines +131 to +132
query : str | list[Query]
If a string, the query string to match against.
Member

Can you explain what happens if it is list[Query]? Does each of the queries need to be a match query? Or can it be a mix of queries?

Contributor Author

Oh, I think it must be str; I will remove list[Query] there.

]
}
)
# spellchecker:<on>
Member

Looks like you just disabled one line. Is this needed?

Contributor Author

removed

Comment thread python/python/lance/query.py Outdated


class Query:
def __init__(self, query: dict):
Member

What advantage do we get from using dict instead of a union of classes that all implement some base class with a query_type method?

e.g.

query: MatchQuery | PhraseQuery | BoostQuery | MultiMatchQuery

Contributor Author

I'm using dict because ultimately all the queries need to be converted to dicts so that we can pass them into Rust.

But your approach sounds like a plan; we can convert them into dicts when calling the Rust code.
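One way the suggestion in this thread could look is sketched below: typed query classes that share a base class with a query_type method, and a conversion to dict performed only at the boundary where the Rust code is called. The class bodies, field names, and dict keys are assumptions for illustration, not the API that was eventually merged.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class FullTextQuery(ABC):
    """Hypothetical base class for all FTS query types."""

    @abstractmethod
    def query_type(self) -> str:
        """Name of the query type, used as the top-level dict key."""

    @abstractmethod
    def inner(self) -> dict:
        """Parameters of this query as a plain dict."""

    def to_dict(self) -> dict:
        # Conversion happens only here, right before handing the query to Rust.
        return {self.query_type(): self.inner()}


@dataclass
class MatchQuery(FullTextQuery):
    terms: str
    column: str

    def query_type(self) -> str:
        return "match"

    def inner(self) -> dict:
        return {"column": self.column, "terms": self.terms}


@dataclass
class BoostQuery(FullTextQuery):
    positive: FullTextQuery
    negative: FullTextQuery

    def query_type(self) -> str:
        return "boost"

    def inner(self) -> dict:
        return {
            "positive": self.positive.to_dict(),
            "negative": self.negative.to_dict(),
        }


query = BoostQuery(
    positive=MatchQuery(terms="lance", column="text"),
    negative=MatchQuery(terms="deprecated", column="text"),
)
payload = query.to_dict()
```

With this shape, callers get type checking and editor completion on the query classes, while the dict form stays an internal detail of the Python-to-Rust boundary.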

Comment thread python/python/lance/dataset.py Outdated
Parameters
----------
query : str | Query
If str, the query string to search for.
Member

If I just provide a string what kind of query am I doing? A match query? Can we clarify this?

Contributor Author

Yeah, it's a match query; I just added comments for this.

Comment thread rust/lance/src/dataset/scanner.rs Outdated
let planner = Planner::new(scan_node.schema());
let physical_refine_expr = planner.create_physical_expr(expr)?;
scan_node = Arc::new(FilterExec::try_new(physical_refine_expr, scan_node)?);
Arc::new(MatchQueryExec::new_flat(
Member

Do we have a combined search where some fragments are indexed and some fragments are flat? Right now it looks like, if any fragments are unindexed, we fall back to a flat search of all data?

Contributor Author

Fixed it: it should combine the indexed and unindexed results and then sort with fetch.
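The combine-then-sort-with-fetch step can be sketched as follows. This is only an illustration in Python, not the execution plan in scanner.rs: `combine_and_fetch` and the sample hits are hypothetical, and it assumes the scores from the indexed and flat searches are directly comparable.

```python
import heapq


def combine_and_fetch(indexed, flat, limit):
    """Merge (row_id, score) pairs from both sources and keep the top `limit` by score."""
    return heapq.nlargest(limit, list(indexed) + list(flat), key=lambda hit: hit[1])


# Hits from fragments covered by the index, and from a flat scan of unindexed fragments.
indexed_hits = [(0, 2.5), (7, 1.1)]
flat_hits = [(42, 3.0), (43, 0.4)]

top = combine_and_fetch(indexed_hits, flat_hits, limit=3)
```

The point of applying the limit after the merge is that a hit from an unindexed fragment (row 42 here) can outrank every indexed hit.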

Comment on lines +1779 to +1790
let positive_exec = Box::pin(self.plan_fts(
    &query.positive,
    &unlimited_params,
    filter_plan,
    prefilter_source,
));
let negative_exec = Box::pin(self.plan_fts(
    &query.negative,
    &unlimited_params,
    filter_plan,
    prefilter_source,
));
Member

How does an unlimited search work? Are the results still limited in some way (e.g. we will only return results that match at least one token?)

Contributor Author

It depends, but for now, yes.

The positive/negative query can be any type of FullTextQuery. The worst case is a MatchQuery, where the results must match at least one token; if it's a PhraseQuery, the results must match the phrase.
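A small sketch may help show why the positive and negative sub-queries are planned without a limit. The scoring rule used here (subtracting `negative_boost` times the negative score) and the `negative_boost` parameter itself are assumptions for illustration; this thread does not state the formula the PR uses.

```python
def top_k(scores: dict, k: int) -> dict:
    """Keep only the k highest-scoring rows."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])


def boost(positive: dict, negative: dict, negative_boost: float, limit: int) -> list:
    """Demote rows that also match the negative query, then apply the limit."""
    # Assumed rule: final score = positive score - negative_boost * negative score.
    rescored = {
        row: score - negative_boost * negative.get(row, 0.0)
        for row, score in positive.items()
    }
    ranked = sorted(rescored.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:limit]


positive = {1: 3.0, 2: 2.5, 3: 2.0}  # row id -> score from the positive query
negative = {1: 4.0}                  # row id -> score from the negative query

# Limit applied only after combining: row 1 is demoted and row 3 moves into the top 2.
unlimited_first = boost(positive, negative, negative_boost=0.5, limit=2)

# Limit applied to the positive query first: row 3 is dropped before it can move up.
limited_first = boost(top_k(positive, 2), negative, negative_boost=0.5, limit=2)
```

Under these assumptions the two orderings disagree, which is why the limit has to be applied after the positive and negative results are combined.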

));
}
};
}
Member

Should we verify self.prefilter_source is None in the else branch?

Contributor Author

added

Comment thread rust/lance/src/io/exec/fts.rs Outdated
Comment on lines +123 to +125
fn schema(&self) -> SchemaRef {
    FTS_SCHEMA.clone()
}
Member

This does not need to be overridden. The default implementation will use plan properties.

Contributor Author

nice, remove these

}

impl ExecutionPlan for FtsExec {
impl ExecutionPlan for MatchQueryExec {
Member

I wonder if it would be simpler to have MatchQueryExec and FlatMatchQueryExec instead of both behaviors in one node?

I'm ok either way. Just a thought.

Contributor Author

Yes, I was thinking this as well. It's probably better to have a FlatMatchQueryExec; I will add it.

Comment thread rust/lance/src/dataset.rs
assert_eq!(row_ids, &[0]);
}

async fn create_fts_dataset<Offset: arrow::array::OffsetSizeTrait>(
Contributor Author

All the lines below are tests moved from inverted/index.rs. Because queries are now built into execution plans, we can't query the index directly.

BubbleCal requested a review from westonpace on March 28, 2025 09:28
codecov-commenter

Codecov Report

Attention: Patch coverage is 70.52385% with 377 lines in your changes missing coverage. Please review.

Project coverage is 78.60%. Comparing base (40142fb) to head (441dd7b).
Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance/src/io/exec/fts.rs | 52.95% | 135 Missing and 24 partials ⚠️ |
| rust/lance-index/src/scalar/inverted/query.rs | 45.31% | 137 Missing and 3 partials ⚠️ |
| rust/lance-index/src/scalar/inverted/index.rs | 69.56% | 30 Missing and 5 partials ⚠️ |
| rust/lance/src/dataset/scanner.rs | 87.75% | 2 Missing and 22 partials ⚠️ |
| rust/lance-index/src/scalar.rs | 76.47% | 7 Missing and 1 partial ⚠️ |
| rust/lance/src/io/exec/utils.rs | 71.42% | 6 Missing ⚠️ |
| rust/lance-index/src/scalar/inverted/wand.rs | 64.28% | 4 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3610      +/-   ##
==========================================
- Coverage   78.71%   78.60%   -0.11%     
==========================================
  Files         258      260       +2     
  Lines       96900    97699     +799     
  Branches    96900    97699     +799     
==========================================
+ Hits        76271    76798     +527     
- Misses      17556    17803     +247     
- Partials     3073     3098      +25     
Flag Coverage Δ
unittests 78.60% <70.52%> (-0.11%) ⬇️


BubbleCal marked this pull request as ready for review on March 28, 2025 09:55
Member

westonpace left a comment

Thanks for addressing the feedback! Sorry for the delay in review.

BubbleCal merged commit 1aa9d5a into lance-format:main on Mar 31, 2025
28 checks passed