feat: support using FTS as a filter in vector search #4928

Merged
BubbleCal merged 1 commit into lance-format:main from wojiaodoubao:fts-as-vector-filter
Dec 15, 2025

Conversation

@wojiaodoubao
Contributor

Close #4927

@github-actions github-actions Bot added the enhancement New feature or request label Oct 10, 2025
@BubbleCal BubbleCal self-requested a review October 10, 2025 07:22
@codecov-commenter
codecov-commenter commented Oct 10, 2025

@wojiaodoubao wojiaodoubao force-pushed the fts-as-vector-filter branch 3 times, most recently from 7ebe852 to a1d7add Compare October 11, 2025 12:43
@wojiaodoubao
Contributor Author

Hi @BubbleCal, sorry to bother you. Could you help review this when you have time? Thanks very much!

Contributor

@BubbleCal BubbleCal left a comment

This feature is cool, but I think we still need to figure out the API and semantics:

This PR assumes that if both an FTS query and a vector query are provided, one of them acts as a filter for the other (depending on the prefilter param). I think it would be better to always apply filters through the filter() method; we can change that method to take an enum param that can be one of string, FTSQuery, or VectorQuery.

Comment thread rust/lance/src/dataset.rs Outdated
.unwrap();

// Case 1: search with prefilter=true
let query_vector = Float32Array::from(vec![300f32, 300f32, 300f32, 300f32]);
Contributor

@BubbleCal BubbleCal Oct 13, 2025

How would this new feature work if the query contains a filter, an FTS query, and a vector query all at once?

Contributor Author

If prefilter=true, the scanner first applies the filter, then runs FTS on the filtered rows, then runs the vector search on the FTS results.

If prefilter=false, the scanner first runs the vector search, then filters the results with FTS, then filters what remains with the filter.
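
The two orderings can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: `Row`, `passes_filter`, `matches_fts`, `nearest`, `search`, and `sample_rows` are hypothetical names invented for illustration and are not Lance APIs; the scalar filter is modeled as `x >= min_x` and FTS as a plain substring match.

```rust
// Toy in-memory model of the two orderings; none of these names are Lance APIs.
#[derive(Clone, Debug)]
struct Row {
    id: u32,
    x: i32,
    text: &'static str,
    vector: [f32; 2],
}

/// Squared L2 distance between two 2-d vectors.
fn l2(a: &[f32; 2], b: &[f32; 2]) -> f32 {
    a.iter().zip(b.iter()).map(|(p, q)| (p - q) * (p - q)).sum()
}

/// Stand-in for the SQL filter (assumption: keep rows with `x >= min_x`).
fn passes_filter(row: &Row, min_x: i32) -> bool {
    row.x >= min_x
}

/// Stand-in for FTS (assumption: a plain substring match).
fn matches_fts(row: &Row, term: &str) -> bool {
    row.text.contains(term)
}

/// Return the `k` rows closest to `query`, nearest first.
fn nearest(rows: &[Row], query: &[f32; 2], k: usize) -> Vec<Row> {
    let mut sorted = rows.to_vec();
    sorted.sort_by(|a, b| l2(&a.vector, query).total_cmp(&l2(&b.vector, query)));
    sorted.truncate(k);
    sorted
}

/// Run filter, FTS, and vector search in the order selected by `prefilter`.
fn search(
    rows: &[Row],
    min_x: i32,
    term: &str,
    query: &[f32; 2],
    k: usize,
    prefilter: bool,
) -> Vec<u32> {
    let hits: Vec<Row> = if prefilter {
        // filter -> FTS -> vector search over the surviving rows
        let candidates: Vec<Row> = rows
            .iter()
            .filter(|r| passes_filter(r, min_x))
            .filter(|r| matches_fts(r, term))
            .cloned()
            .collect();
        nearest(&candidates, query, k)
    } else {
        // vector search -> FTS -> filter over the top-k rows
        nearest(rows, query, k)
            .into_iter()
            .filter(|r| matches_fts(r, term))
            .filter(|r| passes_filter(r, min_x))
            .collect()
    };
    hits.iter().map(|r| r.id).collect()
}

/// Four sample rows used to compare the two orderings.
fn sample_rows() -> Vec<Row> {
    vec![
        Row { id: 1, x: 5, text: "apple pie", vector: [0.0, 0.0] },
        Row { id: 2, x: 5, text: "banana", vector: [0.1, 0.0] },
        Row { id: 3, x: -1, text: "apple tart", vector: [0.2, 0.0] },
        Row { id: 4, x: 5, text: "apple cake", vector: [5.0, 5.0] },
    ]
}
```

With these sample rows, `search(&sample_rows(), 0, "apple", &[0.0, 0.0], 2, true)` yields ids `[1, 4]`, while the same call with `prefilter = false` yields only `[1]`: post-filtering can return fewer than `k` rows because the filters run after the top-k cut.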

Comment thread rust/lance/src/dataset.rs Outdated
.unwrap();

let results = stream.try_collect::<Vec<_>>().await.unwrap();
let batch = concat_batches(&results[0].schema(), &results).unwrap();
Contributor

I'd expect that if the FTS is a post-filter, we should get vector search results (meaning there should be a '_distance' column). Does it work like this?

Contributor Author

For all 3 cases in the unit tests, the result schema is (id, vector, text, _distance, _score). So yes, it contains '_distance', but it shouldn't contain '_score', since FTS is used as a filter. This is related to the semantic and API issue above; let me fix them together.

@wojiaodoubao
Contributor Author

> This feature is cool, but I think we still need to figure out the API and semantics:
> This PR assumes that if both an FTS query and a vector query are provided, one of them acts as a filter for the other (depending on the prefilter param). I think it would be better to always apply filters through the filter() method; we can change that method to take an enum param that can be one of string, FTSQuery, or VectorQuery.

@BubbleCal Thanks for your nice suggestion! The current implementation does have some semantic confusion. I previously thought "using FTS as a filter for vector search" and "using vector search as a filter for FTS" were equivalent. Specifically, I assumed that a vector search with prefilter=true and FTS as the filter is equivalent to a full text search with prefilter=false and vector search as the filter.

However, they are actually different because the returned schema is not the same. The results of an FTS query should include _score instead of _distance, while the results of a vector search should include _distance instead of _score.

I like your idea "apply filter by filter() method, we can change this method to take an enum param which can be one of string, FTSQuery, VectorQuery". This approach allows us to distinguish between filter and search. Additionally, I think the enum param can be slightly modified to support cases where both string & fts or string & vector are used as filters simultaneously.

enum FilterParam {
    Sql(String),
    Fts(FullTextSearchQuery),
    Vector(Query),
    SqlAndFts(String, FullTextSearchQuery),
    SqlAndVector(String, Query)
}

Let me fix it.
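
A rough sketch of the distinction between search and filter, assuming the proposed enum: the primary search alone decides which score column is returned, and a filter never adds one. `FullTextSearchQuery` and `Query` below are placeholder stand-ins for the real Lance types, and `Search` and `output_columns` are hypothetical names used only for illustration.

```rust
// Placeholder stand-ins; the real FullTextSearchQuery and Query types live in Lance.
#[allow(dead_code)]
#[derive(Debug, Clone)]
struct FullTextSearchQuery {
    text: String,
}

#[allow(dead_code)]
#[derive(Debug, Clone)]
struct Query {
    vector: Vec<f32>,
    k: usize,
}

// The enum proposed in the comment above.
#[allow(dead_code)]
#[derive(Debug, Clone)]
enum FilterParam {
    Sql(String),
    Fts(FullTextSearchQuery),
    Vector(Query),
    SqlAndFts(String, FullTextSearchQuery),
    SqlAndVector(String, Query),
}

// Hypothetical: the primary search, as opposed to a filter.
#[allow(dead_code)]
#[derive(Debug, Clone)]
enum Search {
    Vector(Query),
    Fts(FullTextSearchQuery),
}

/// Compute the output column names. Only the primary search contributes a
/// score column; the filter (whatever its kind) contributes none.
fn output_columns(base: &[&str], search: &Search, _filter: Option<&FilterParam>) -> Vec<String> {
    let mut cols: Vec<String> = base.iter().map(|c| c.to_string()).collect();
    match search {
        Search::Vector(_) => cols.push("_distance".to_string()),
        Search::Fts(_) => cols.push("_score".to_string()),
    }
    cols
}
```

Under this model, a vector search filtered by FTS returns (id, vector, text, _distance), and an FTS search filtered by a vector query returns (id, vector, text, _score).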

@BubbleCal
Contributor

BubbleCal commented Oct 14, 2025

@wjones127 I'm under the impression that we plan to change the query builder so that
ds.filter("x>0").filter("x<10").nearest(vec) is equivalent to ds.filter("x>0 and x<10").nearest(vec).

Could you confirm? If so, I think it's a good time to implement this, and then the filter enum can be simpler:

enum Filter {
    Sql(String),
    Fts(FullTextSearchQuery),
    Vector(Query),
}

cc @wojiaodoubao
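
One way the chained form could work with the simpler enum is sketched below. It is a minimal illustration under stated assumptions, not the Lance scanner: `QueryBuilder` and `combined_sql` are hypothetical, and `FullTextSearchQuery` and `Query` are placeholder stand-ins for the real types. Each `filter()` call appends to a list, and all SQL predicates are AND-ed together.

```rust
// Placeholder stand-ins; the real FullTextSearchQuery and Query types live in Lance.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
struct FullTextSearchQuery(String);

#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
struct Query(Vec<f32>);

// The simpler enum suggested above.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Filter {
    Sql(String),
    Fts(FullTextSearchQuery),
    Vector(Query),
}

// Hypothetical builder: each filter() call appends one more filter.
#[derive(Debug, Default)]
struct QueryBuilder {
    filters: Vec<Filter>,
}

impl QueryBuilder {
    /// Record another filter; chained calls accumulate instead of overwriting.
    fn filter(mut self, f: Filter) -> Self {
        self.filters.push(f);
        self
    }

    /// AND together every SQL predicate, parenthesized to preserve precedence.
    /// Returns None when no SQL filter was given.
    fn combined_sql(&self) -> Option<String> {
        let parts: Vec<String> = self
            .filters
            .iter()
            .filter_map(|f| match f {
                Filter::Sql(s) => Some(format!("({})", s)),
                _ => None,
            })
            .collect();
        if parts.is_empty() {
            None
        } else {
            Some(parts.join(" AND "))
        }
    }
}
```

Because chained calls compose, combinations such as SQL plus FTS no longer need dedicated variants like SqlAndFts: the caller simply chains `filter(Filter::Sql(..))` and `filter(Filter::Fts(..))`.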

@BubbleCal BubbleCal requested a review from wjones127 October 14, 2025 03:10
@wojiaodoubao wojiaodoubao marked this pull request as draft October 15, 2025 15:30
@wojiaodoubao wojiaodoubao marked this pull request as ready for review October 16, 2025 07:42
@github-actions github-actions Bot added the java label Oct 16, 2025
@wojiaodoubao
Contributor Author

Hi @BubbleCal @wjones127 , please help review when you have time, thanks very much!

@wjones127
Contributor

> @wjones127 I'm under the impression that we plan to make some changes to the query builder like: ds.filter("x>0").filter("x<10").nearest(vec) is equal to ds.filter("x>0 and x<10").nearest(vec)
>
> Could you confirm?

Yeah I think we are getting to the place where we need to replace the Scanner implementation entirely. What we really need is a logical plan builder, so users can compose arbitrary plans together.

That's a larger project. I'm personally okay doing something basic here to support this, with the idea in mind that we will soon replace the scanner API entirely. But that's just my opinion, I don't know what others think.

@wojiaodoubao
Contributor Author

> ... we are getting to the place where we need to replace the Scanner implementation entirely. What we really need is a logical plan builder, so users can compose arbitrary plans together.

Hi @wjones127, great to see that Lance is going to work on the logical plan! I remember there was an issue about this: #1782. I'd be happy to help if there is anything I can do.

Using FTS/vector search as a filter is useful, so I'd like to have a basic implementation if possible.

@wojiaodoubao wojiaodoubao force-pushed the fts-as-vector-filter branch 2 times, most recently from b1001a6 to 9652531 Compare November 3, 2025 12:53
@wojiaodoubao wojiaodoubao force-pushed the fts-as-vector-filter branch 2 times, most recently from e7bf2a9 to 3f413c9 Compare December 12, 2025 08:34
@wojiaodoubao
Contributor Author

Hi @wjones127 @westonpace @BubbleCal , sorry to bother you. Could you help review this when you have time, thanks very much!

Contributor

@BubbleCal BubbleCal left a comment

Great work!
Let's create a follow-up issue for adding this to the Python API!

@BubbleCal BubbleCal merged commit 03517ec into lance-format:main Dec 15, 2025
28 checks passed
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
BubbleCal pushed a commit that referenced this pull request Feb 11, 2026
Adding python api to support using fts as a filter for vector search, or using vector_query as a filter for fts search. Related to #4928.
Labels

enhancement New feature or request java

Development

Successfully merging this pull request may close these issues.

Using FTS as a filter in vector search

4 participants