feat: support fuzzy query and boost query by BubbleCal · Pull Request #3610 · lance-format/lance

BubbleCal · 2025-03-26T14:31:15Z

This PR introduces fuzzy queries and boost queries, and changes how we build FTS queries.

It introduces the fst library to implement fuzzy queries:

- In general, an fst behaves like an immutable Map<String, u64>, but it supports several kinds of string queries (e.g. fuzzy search, prefix match, substring, not equal).
- When building the FTS index, we store the tokens in a HashMap because we require mutability.
- When loading the FTS index to serve queries, we load the tokens into an fst so that we can support fuzzy queries, and probably more kinds of queries in the future.

Other impacts:

- fst uses less memory, especially when there are many similar tokens.
- fst is slower than HashMap for looking up a token ID, but for FTS most of the time is spent searching over posting lists, so this has no visible impact on query latency.
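
As a rough illustration of what a fuzzy lookup over the token dictionary means, the sketch below scans a plain token-to-ID map and keeps every token within a given edit distance of the query term. It is not the PR's implementation: the PR relies on the fst library, whereas this brute-force scan only shows the expected result of the expansion. The names `levenshtein`, `expand_fuzzy`, and `TOKENS` are hypothetical.

```python
from __future__ import annotations


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete a character from `a`
                    curr[j - 1] + 1,     # insert a character into `a`
                    prev[j - 1] + cost,  # substitute (or keep) a character
                )
            )
        prev = curr
    return prev[-1]


def expand_fuzzy(
    tokens: dict[str, int], term: str, max_distance: int
) -> list[tuple[str, int]]:
    """Return (token, token_id) pairs within `max_distance` edits of `term`."""
    return sorted(
        (token, token_id)
        for token, token_id in tokens.items()
        if levenshtein(token, term) <= max_distance
    )


# Hypothetical token dictionary, standing in for the index's token map.
TOKENS = {"lance": 0, "lances": 1, "dance": 2, "search": 3}
matches = expand_fuzzy(TOKENS, "lance", max_distance=1)
```

Each matched token ID would then be used to fetch its posting list, which is why a slower dictionary lookup matters little compared with the posting-list search itself.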

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

BubbleCal · 2025-03-27T10:19:36Z

The FTS query types are defined in Rust, and each of them has a corresponding execution plan implementation in rust/lance/src/io/exec/fts.rs (except MultiMatch, which is a bit special: we can plan it into multiple MatchQuery queries).

BubbleCal · 2025-03-27T10:23:39Z

+from typing import Optional
+
+
+class Query:


This is the FTS query object in Python. We organize it as a nested map and parse it back into a Query object in Rust; see python/src/utils.rs.

Copilot

Pull Request Overview

This PR introduces fuzzy and boost full-text search (FTS) support and refactors FTS query building in both Rust and Python.

Introduces new FTS query parameters and functions (e.g. fuzzy query via Levenshtein automaton and boost queries)
Refactors query planning and execution in the Rust scanner and inverted index modules
Updates tests and Python APIs to support the new FTS query types

Reviewed Changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated no comments.

Show a summary per file

| File | Description |
| --- | --- |
| rust/lance/src/io/exec/utils.rs | Adds new imports and helper functions for dataset prefilter building |
| rust/lance/src/dataset/scanner.rs | Refactors full text search query planning including field validation and boost query handling |
| rust/lance/src/dataset.rs | Adds a new integration test for fuzzy query functionality |
| rust/lance/examples/full_text_search.rs | Updates example usage to reflect new FTS query parameter handling |
| rust/lance-index/* | Several files refactored to support fuzzy query expansion and updated BM25 search signature |
| python/src/utils.rs, python/src/dataset.rs, python/python/* | Update Python API and tests to parse and handle new compound FTS queries |

Comments suppressed due to low confidence (2)

rust/lance/src/dataset/scanner.rs:1711

Using unwrap() on query.field can panic if the field is not set. Consider using a pattern match or explicit error handling to ensure the field is present.

let index = self.dataset.load_scalar_index_for_column(query.field.as_ref().unwrap()).await?

rust/lance-index/src/scalar/inverted/index.rs:145

[nitpick] The error message could be more descriptive by clarifying the context of why tokens are expected to be an fst map at search time. Consider including guidance on how the index should be recreated if not.

return Err(Error::Index { message: "tokens is not fst, which is not expected".to_owned(), location: location!() });

westonpace

I think the overall structure and approach are good. I have a bunch of questions and nits but no major concerns.

westonpace · 2025-03-27T13:30:45Z

+        query : str | list[Query]
+            If a string, the query string to match against.


Can you explain what happens if it is list[Query]? Does each of the queries need to be a match query? Or can it be a mix of queries?

Oh, I think it must be str; I will remove list[Query] there.

westonpace · 2025-03-27T13:31:12Z

+            ]
+        }
+    )
+    # spellchecker:<on>


Looks like you just disabled one line. Is this needed?

westonpace · 2025-03-27T13:37:43Z

+
+
+class Query:
+    def __init__(self, query: dict):


What advantage do we get from using dict instead of a union of classes that all implement some base class with a query_type method?

e.g.

query: MatchQuery | PhraseQuery | BoostQuery | MultiMatchQuery

I'm using dict because ultimately all the queries need to be converted to dicts so that we can pass them into Rust.

But your approach sounds like a plan; we can convert them into dicts at the time of calling the Rust code.
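
A minimal sketch of the approach discussed in this thread, assuming a base class with a `query_type` method and a conversion to a nested dict that happens only when the Rust side is called. The class names follow the ones mentioned above, but the field names (`column`, `fuzziness`, `negative_boost`), their defaults, and the dict layout are hypothetical and are not taken from the PR.

```python
from abc import ABC, abstractmethod


class FullTextQuery(ABC):
    """Hypothetical base class shared by all query types."""

    @abstractmethod
    def query_type(self) -> str:
        """Name of the query kind, used as the key of the nested dict."""

    @abstractmethod
    def _fields(self) -> dict:
        """Parameters of this query, with child queries already serialized."""

    def to_dict(self) -> dict:
        # Nest the parameters under the query type, e.g. {"match": {...}}.
        return {self.query_type(): self._fields()}


class MatchQuery(FullTextQuery):
    def __init__(self, query: str, column: str, fuzziness: int = 0):
        self.query = query
        self.column = column
        self.fuzziness = fuzziness

    def query_type(self) -> str:
        return "match"

    def _fields(self) -> dict:
        return {
            "column": self.column,
            "query": self.query,
            "fuzziness": self.fuzziness,
        }


class BoostQuery(FullTextQuery):
    def __init__(
        self,
        positive: FullTextQuery,
        negative: FullTextQuery,
        negative_boost: float = 0.5,  # assumed default, for illustration only
    ):
        self.positive = positive
        self.negative = negative
        self.negative_boost = negative_boost

    def query_type(self) -> str:
        return "boost"

    def _fields(self) -> dict:
        return {
            "positive": self.positive.to_dict(),
            "negative": self.negative.to_dict(),
            "negative_boost": self.negative_boost,
        }


boosted = BoostQuery(
    MatchQuery("lance", "text", fuzziness=1),
    MatchQuery("dance", "text"),
)
payload = boosted.to_dict()  # the nested map that would be handed to Rust
```

With this shape, callers get typed constructors and the dict form stays an internal detail of the binding layer.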

westonpace · 2025-03-27T13:38:08Z

+        Parameters
+        ----------
+        query : str | Query
+            If str, the query string to search for.


If I just provide a string what kind of query am I doing? A match query? Can we clarify this?

Yeah, it's a match query; I just added comments for this.

westonpace · 2025-03-27T14:11:17Z

-                    let planner = Planner::new(scan_node.schema());
-                    let physical_refine_expr = planner.create_physical_expr(expr)?;
-                    scan_node = Arc::new(FilterExec::try_new(physical_refine_expr, scan_node)?);
+                        Arc::new(MatchQueryExec::new_flat(


Do we have a combined search where some fragments are indexed and some fragments are flat? Right now it looks like, if any fragments are unindexed, we fall back to a flat search of all data?

Fixed it. It should now combine the indexed and unindexed results and then sort with fetch.
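
The combination step described in this reply can be sketched as follows, assuming each side yields (row_id, score) pairs and that "sort with fetch" means keeping the `k` highest scores. The function name `merge_top_k` and the sample data are hypothetical; the actual plan is built from Rust execution nodes.

```python
import heapq


def merge_top_k(indexed_hits, flat_hits, k):
    """Merge hits from indexed and unindexed fragments, keep the k best by score."""
    combined = list(indexed_hits) + list(flat_hits)
    # nlargest returns the results in descending score order.
    return heapq.nlargest(k, combined, key=lambda hit: hit[1])


# Hypothetical (row_id, score) pairs from the two sources.
indexed = [(0, 2.5), (1, 0.7)]
flat = [(10, 1.9), (11, 0.1)]
top = merge_top_k(indexed, flat, k=3)
```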

westonpace · 2025-03-27T14:12:48Z

+                let positive_exec = Box::pin(self.plan_fts(
+                    &query.positive,
+                    &unlimited_params,
+                    filter_plan,
+                    prefilter_source,
+                ));
+                let negative_exec = Box::pin(self.plan_fts(
+                    &query.negative,
+                    &unlimited_params,
+                    filter_plan,
+                    prefilter_source,
+                ));


How does an unlimited search work? Are the results still limited in some way (e.g. we will only return results that match at least one token?)

It depends, but for now, yes.

The positive/negative query can be any type of FullTextQuery. The worst case is a MatchQuery, where the results must match at least one token; if it's a PhraseQuery, the results must match the phrase.

westonpace · 2025-03-27T14:24:05Z

+                            ));
+                        }
+                    };
+                }


Should we verify self.prefilter_source is None in the else branch?

westonpace · 2025-03-27T14:24:57Z

+    fn schema(&self) -> SchemaRef {
+        FTS_SCHEMA.clone()
+    }


This does not need to be overloaded. The default implementation will use plan properties.

nice, remove these

westonpace · 2025-03-27T14:27:51Z

 }

-impl ExecutionPlan for FtsExec {
+impl ExecutionPlan for MatchQueryExec {


I wonder if it would be simpler to have MatchQueryExec and FlatMatchQueryExec instead of both behaviors in one node?

I'm ok either way. Just a thought.

Yes, I was thinking this as well. It's probably better to have a FlatMatchQueryExec; I will add it.

BubbleCal · 2025-03-28T09:14:34Z

        assert_eq!(row_ids, &[0]);
    }

+    async fn create_fts_dataset<Offset: arrow::array::OffsetSizeTrait>(


All the lines below are tests moved from inverted/index.rs. Because queries are now built into execution plans, we can't query the index directly.

codecov-commenter · 2025-03-28T09:46:51Z

Codecov Report

Attention: Patch coverage is 70.52385% with 377 lines in your changes missing coverage. Please review.

Project coverage is 78.60%. Comparing base (40142fb) to head (441dd7b).
Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance/src/io/exec/fts.rs | 52.95% | 135 Missing and 24 partials ⚠️ |
| rust/lance-index/src/scalar/inverted/query.rs | 45.31% | 137 Missing and 3 partials ⚠️ |
| rust/lance-index/src/scalar/inverted/index.rs | 69.56% | 30 Missing and 5 partials ⚠️ |
| rust/lance/src/dataset/scanner.rs | 87.75% | 2 Missing and 22 partials ⚠️ |
| rust/lance-index/src/scalar.rs | 76.47% | 7 Missing and 1 partial ⚠️ |
| rust/lance/src/io/exec/utils.rs | 71.42% | 6 Missing ⚠️ |
| rust/lance-index/src/scalar/inverted/wand.rs | 64.28% | 4 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3610      +/-   ##
==========================================
- Coverage   78.71%   78.60%   -0.11%     
==========================================
  Files         258      260       +2     
  Lines       96900    97699     +799     
  Branches    96900    97699     +799     
==========================================
+ Hits        76271    76798     +527     
- Misses      17556    17803     +247     
- Partials     3073     3098      +25

| Flag | Coverage Δ | |
| --- | --- | --- |
| unittests | `78.60% <70.52%> (-0.11%)` | ⬇️ |

westonpace

Thanks for addressing the feedback! Sorry for the delay in review.

BubbleCal added 14 commits March 19, 2025 17:59

feat: support fuzzy query for FTS

3ea4746

fix

ecf498a

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fmt

9e0cd2f

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fmt

f720614

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fmt

165ba4b

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

update Cargo.lock

45e3d70

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

support fuzzy query in python

a139b66

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

Merge branch 'main' of https://github.com/lancedb/lance into fuzzy-query

1277188

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

Merge branch 'main' of https://github.com/lancedb/lance into fuzzy-query

76585b8

use simple tokenizer for fuzzy query

3601880

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

resolve conflicts

a956ecb

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

more query types

4ef8347

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

flat

dd94203

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

impl exec plan

e756237

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

github-actions bot added the enhancement (New feature or request) and python labels Mar 26, 2025

BubbleCal added 6 commits March 27, 2025 15:12

done

0d309b9

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

remove unused imports

63e85cc

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

Merge branch 'main' of https://github.com/lancedb/lance into boosting…

14dbf79

…-query Signed-off-by: BubbleCal <bubble-cal@outlook.com>

add missing file

1132e21

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fmt

78e7a52

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fix example

89e80c0

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

BubbleCal mentioned this pull request Mar 27, 2025

feat: support fuzzy query for FTS #3567

Closed

BubbleCal changed the title ~~feat: support boost query~~ feat: support fuzzy query and boost query Mar 27, 2025

BubbleCal added 2 commits March 27, 2025 18:15

python API

74d3a45

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

add missing file

4a5abe5

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

BubbleCal commented Mar 27, 2025

BubbleCal requested a review from Copilot March 27, 2025 10:24

Copilot AI reviewed Mar 27, 2025

BubbleCal requested a review from westonpace March 27, 2025 10:25

BubbleCal requested a review from wjones127 March 27, 2025 10:25

fmt

86a1565

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

westonpace reviewed Mar 27, 2025

BubbleCal added 8 commits March 28, 2025 12:54

more tests

a26b144

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

remove unused code

597eca6

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fix

2b6fb3e

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fix

7acbff3

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

refine

cfa7c9d

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fmt

f575cae

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fix

c13cd31

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

move tests

441dd7b

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

BubbleCal commented Mar 28, 2025

BubbleCal requested a review from westonpace March 28, 2025 09:28

BubbleCal marked this pull request as ready for review March 28, 2025 09:55

westonpace approved these changes Mar 30, 2025

BubbleCal merged commit 1aa9d5a into lance-format:main Mar 31, 2025
28 checks passed

4 participants
