feat: support fuzzy query and boost query#3610
Conversation
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
…-query Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
There was a problem hiding this comment.
Fts Query defines in rust, each of them would have a corresponding execution plan impl in rust/lance/src/io/exec/fts.rs (except MultiMatch, it's a little bit special, we can plan it into multiple MatchQuery)
| from typing import Optional | ||
|
|
||
|
|
||
| class Query: |
There was a problem hiding this comment.
The FTS query object in python, we org it as a nested map, and parse it back to Query object in rust, see python/src/utils.rs
There was a problem hiding this comment.
Pull Request Overview
This PR introduces fuzzy and boost full-text search (FTS) support and refactors FTS query building in both Rust and Python.
- Introduces new FTS query parameters and functions (e.g. fuzzy query via Levenshtein automaton and boost queries)
- Refactors query planning and execution in the Rust scanner and inverted index modules
- Updates tests and Python APIs to support the new FTS query types
Reviewed Changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| rust/lance/src/io/exec/utils.rs | Adds new imports and helper functions for dataset prefilter building |
| rust/lance/src/dataset/scanner.rs | Refactors full text search query planning including field validation and boost query handling |
| rust/lance/src/dataset.rs | Adds a new integration test for fuzzy query functionality |
| rust/lance/examples/full_text_search.rs | Updates example usage to reflect new FTS query parameter handling |
| rust/lance-index/* | Several files refactored to support fuzzy query expansion and updated BM25 search signature |
| python/src/utils.rs, python/src/dataset.rs, python/python/* | Update Python API and tests to parse and handle new compound FTS queries |
Comments suppressed due to low confidence (2)
rust/lance/src/dataset/scanner.rs:1711
- Using unwrap() on query.field can lead to a potential panic if the field is not set. Consider using a pattern match or an explicit error handling to ensure the field is present.
let index = self.dataset.load_scalar_index_for_column(query.field.as_ref().unwrap()).await?
rust/lance-index/src/scalar/inverted/index.rs:145
- [nitpick] The error message could be more descriptive by clarifying the context of why tokens are expected to be an fst map at search time. Consider including guidance on how the index should be recreated if not.
return Err(Error::Index { message: "tokens is not fst, which is not expected".to_owned(), location: location!() });
westonpace
left a comment
There was a problem hiding this comment.
I think the overall structure and approach are good. I have a bunch of questions and nits but no major concerns.
| query : str | list[Query] | ||
| If a string, the query string to match against. |
There was a problem hiding this comment.
Can you explain what happens if it is list[Query]? Do each of the queries need to be a match query? Or can it be a mix of queries?
There was a problem hiding this comment.
oh i think it must be str, will remove list[Query] there
| ] | ||
| } | ||
| ) | ||
| # spellchecker:<on> |
There was a problem hiding this comment.
Looks like you just disabled one line. Is this needed?
|
|
||
|
|
||
| class Query: | ||
| def __init__(self, query: dict): |
There was a problem hiding this comment.
What advantage do we get from using dict instead of a union of classes that all implement some base class with a query_type method?
e.g.
query: MatchQuery | PhraseQuery | BoostQuery | MultiMatchQuery
There was a problem hiding this comment.
I'm using dict because finally all the queries need to be convert to dict so that we can pass them into rust.
But you method sounds a plan, we can convert them into dict at the time of calling rust code
| Parameters | ||
| ---------- | ||
| query : str | Query | ||
| If str, the query string to search for. |
There was a problem hiding this comment.
If I just provide a string what kind of query am I doing? A match query? Can we clarify this?
There was a problem hiding this comment.
yeah it's match query, just added comments for this
| let planner = Planner::new(scan_node.schema()); | ||
| let physical_refine_expr = planner.create_physical_expr(expr)?; | ||
| scan_node = Arc::new(FilterExec::try_new(physical_refine_expr, scan_node)?); | ||
| Arc::new(MatchQueryExec::new_flat( |
There was a problem hiding this comment.
Do we have a combined search where some fragments are indexed and some fragments are flat? Right now it looks like, if any fragments are unindexed, we fall back to a flat search of all data?
There was a problem hiding this comment.
fixed it, it should combine the indexed & unindexed results and then sort with fetch
| let positive_exec = Box::pin(self.plan_fts( | ||
| &query.positive, | ||
| &unlimited_params, | ||
| filter_plan, | ||
| prefilter_source, | ||
| )); | ||
| let negative_exec = Box::pin(self.plan_fts( | ||
| &query.negative, | ||
| &unlimited_params, | ||
| filter_plan, | ||
| prefilter_source, | ||
| )); |
There was a problem hiding this comment.
How does an unlimited search work? Are the results still limited in some way (e.g. we will only return results that match at least one token?)
There was a problem hiding this comment.
It depends, but for now, yes.
the positive/negative query can be any type of FullTextQuery, the worst case is MatchQuery then the results must match at least one token, if it's PhraseQuery then it must match the phrase
| )); | ||
| } | ||
| }; | ||
| } |
There was a problem hiding this comment.
Should we verify self.prefilter_source is None in the else branch?
| fn schema(&self) -> SchemaRef { | ||
| FTS_SCHEMA.clone() | ||
| } |
There was a problem hiding this comment.
This does not need to be overloaded. The default implementation will use plan properties.
There was a problem hiding this comment.
nice, remove these
| } | ||
|
|
||
| impl ExecutionPlan for FtsExec { | ||
| impl ExecutionPlan for MatchQueryExec { |
There was a problem hiding this comment.
I wonder if it would be simpler to have MatchQueryExec and FlatMatchQueryExec instead of both behaviors in one node?
I'm ok either way. Just a thought.
There was a problem hiding this comment.
Yes I was thinking this as well, probably better to have a FlatMatchQueryExec, will add it
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
| assert_eq!(row_ids, &[0]); | ||
| } | ||
|
|
||
| async fn create_fts_dataset<Offset: arrow::array::OffsetSizeTrait>( |
There was a problem hiding this comment.
all the lines below are tests moved from inverted/index.rs because the query now is built into execution plans, we can't directly query the index
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3610 +/- ##
==========================================
- Coverage 78.71% 78.60% -0.11%
==========================================
Files 258 260 +2
Lines 96900 97699 +799
Branches 96900 97699 +799
==========================================
+ Hits 76271 76798 +527
- Misses 17556 17803 +247
- Partials 3073 3098 +25
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
westonpace
left a comment
There was a problem hiding this comment.
Thanks for addressing the feedback! Sorry for the delay in review.
This introduces fuzzy query, boost query and changes how we build the FTS query
this introduces fst lib for implementing fuzzy query:
generally, fst is like an immutable Map<String, u64>, but supports kinds of string queries (e.g. fuzzy search, prefix-match, substring, not equal)
when building the FTS index, we stores the tokens in a HashMap because we require mutability
when loading the FTS for serving queries, we load the tokens into fst so that we can support fuzzy query, and probably more kinds of queries in the future
Another impacts:
fst uses less memory, especially there are many similar tokens
fst is slower than HashMap for getting the token id, but for FTS most time is spent on searching over posting lists so this doesn't make any visible impacts for query latency