feat: support using FTS as a filter in vector search #4928
BubbleCal merged 1 commit into lance-format:main
Hi @BubbleCal, sorry to bother you. Could you help review this when you have time? Thanks very much!
This feature is cool, but I think we still need to settle the API and its semantics:
This PR assumes that if both an FTS query and a vector query are provided, one of them acts as the filter for the other (which one depends on the prefilter param). I think it would be better to always apply filters through the filter() method. We can change this method to take an enum param, which can be one of string, FTSQuery, or VectorQuery.
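To make the proposal concrete, here is a minimal sketch of what an enum-based filter() could look like. All names below (`Filter`, `FtsQuery`, `VectorQuery`, `ScanBuilder`) are hypothetical stand-ins for illustration and are not Lance's actual API.

```rust
// Hypothetical types for illustration only; not Lance's actual API.
#[derive(Debug, Clone, PartialEq)]
pub struct FtsQuery {
    pub column: String,
    pub terms: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct VectorQuery {
    pub column: String,
    pub vector: Vec<f32>,
    pub k: usize,
}

/// Anything that can narrow a result set without changing its kind.
#[derive(Debug, Clone, PartialEq)]
pub enum Filter {
    /// A plain SQL-like expression, e.g. "id > 10".
    Expr(String),
    /// A full text search query used only to keep or drop rows.
    Fts(FtsQuery),
    /// A vector query used only to keep or drop rows.
    Vector(VectorQuery),
}

// Lets callers keep passing a plain string expression to filter().
impl From<&str> for Filter {
    fn from(expr: &str) -> Self {
        Filter::Expr(expr.to_string())
    }
}

#[derive(Debug, Default)]
pub struct ScanBuilder {
    pub filter: Option<Filter>,
}

impl ScanBuilder {
    /// Filters always go through this one method, whatever their kind.
    pub fn filter(mut self, filter: impl Into<Filter>) -> Self {
        self.filter = Some(filter.into());
        self
    }
}
```

With this shape, the search itself would be set through a separate method, so the role of each query (search vs. filter) is explicit rather than inferred from the prefilter param.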
    .unwrap();

    // Case 1: search with prefilter=true
    let query_vector = Float32Array::from(vec![300f32, 300f32, 300f32, 300f32]);
How would this new feature work if the query contains a filter, an FTS query, and a vector query all at once?
If prefilter=true, the scanner first applies the filter, then runs FTS on the filtered rows, then runs the vector search on the FTS result.
If prefilter=false, the scanner first runs the vector search, then filters that result with FTS, then filters the result again with the filter.
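The two orderings above can return different rows for the same inputs. The following is a toy in-memory sketch of that behavior, not the scanner's implementation: `Row`, `sample_rows`, `top_k`, and `search` are hypothetical helpers, a substring match stands in for FTS, and `id >= min_id` stands in for the scalar filter.

```rust
// Toy model of the two stage orderings; all names here are hypothetical.
#[derive(Debug, Clone)]
struct Row {
    id: u32,
    vector: Vec<f32>,
    text: String,
}

fn sample_rows() -> Vec<Row> {
    let data = [(0, 0.0, "apple"), (1, 1.0, "banana"), (2, 2.0, "apple pie"), (3, 3.0, "apple tart")];
    data.iter()
        .map(|&(id, v, text)| Row { id, vector: vec![v, v], text: text.to_string() })
        .collect()
}

// Squared L2 distance between two vectors of equal length.
fn l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

// Keep the k rows closest to the query vector (brute force, stable sort).
fn top_k(mut rows: Vec<Row>, query: &[f32], k: usize) -> Vec<Row> {
    rows.sort_by(|a, b| l2(&a.vector, query).partial_cmp(&l2(&b.vector, query)).unwrap());
    rows.truncate(k);
    rows
}

fn search(rows: &[Row], query: &[f32], k: usize, term: &str, min_id: u32, prefilter: bool) -> Vec<u32> {
    let hits: Vec<Row> = if prefilter {
        // filter -> FTS -> vector search over the survivors
        let candidates: Vec<Row> = rows
            .iter()
            .filter(|r| r.id >= min_id)
            .filter(|r| r.text.contains(term))
            .cloned()
            .collect();
        top_k(candidates, query, k)
    } else {
        // vector search -> FTS -> filter over the top-k only
        top_k(rows.to_vec(), query, k)
            .into_iter()
            .filter(|r| r.text.contains(term))
            .filter(|r| r.id >= min_id)
            .collect()
    };
    hits.into_iter().map(|r| r.id).collect()
}
```

In this toy data set, searching near `[0.0, 0.0]` with `k = 2`, term `"apple"`, and `min_id = 0` keeps rows 0 and 2 with prefilter=true, but only row 0 with prefilter=false, because row 1 takes a top-k slot before FTS removes it.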
    .unwrap();

    let results = stream.try_collect::<Vec<_>>().await.unwrap();
    let batch = concat_batches(&results[0].schema(), &results).unwrap();
I'd expect that if FTS is a post-filter, we should get vector search results (meaning there should be a '_distance' column). Does it work like this?
For all 3 cases in the unit tests, the result schema is (id, vector, text, _distance, _score). So yes, it contains '_distance', but it shouldn't contain '_score', since FTS is used as a filter. This is related to the semantic and API issue above, so let me fix them together.
@BubbleCal Thanks for your nice suggestion! The current implementation does have some semantic confusion. I previously thought "using FTS as a filter for vector search" and "using vector search as a filter for FTS" were equivalent. Specifically, I assumed that vector search with prefilter=true and FTS as the filter is equivalent to full text search with prefilter=false and vector search as the filter. However, they are actually different because the returned schema is not the same: the results of an FTS query should include _score instead of _distance, while the results of a vector search should include _distance instead of _score. I like your idea: "apply filter by filter() method, we can change this method to take an enum param which can be one of string, FTSQuery, VectorQuery". This approach allows us to distinguish between filter and search. Additionally, I think the enum param can be slightly modified to support cases where both string & FTS or string & vector are used as filters simultaneously. Let me fix it.
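The schema rule described above can be sketched as a small function: the extra column is decided by which query is the search, never by which one is the filter. `SearchKind` and `output_columns` are hypothetical names for illustration, not part of Lance.

```rust
// Hypothetical sketch: the search kind alone decides the extra result column.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SearchKind {
    Vector,
    Fts,
}

pub fn output_columns(base: &[&str], search: SearchKind) -> Vec<String> {
    let mut cols: Vec<String> = base.iter().map(|c| c.to_string()).collect();
    // A query used as a filter contributes no column of its own.
    let extra = match search {
        SearchKind::Vector => "_distance",
        SearchKind::Fts => "_score",
    };
    cols.push(extra.to_string());
    cols
}
```

Under this rule, a vector search filtered by FTS would yield (id, vector, text, _distance), and an FTS search filtered by a vector query would yield (id, vector, text, _score).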
@wjones127 I'm under the impression that we plan to make some changes to the query builder like: Could you confirm? If so, I think it's a good time to implement this, and then we can make the filter enum simpler:
Hi @BubbleCal @wjones127, please help review when you have time, thanks very much!
Yeah, I think we are getting to the place where we need to replace the Scanner implementation entirely. What we really need is a logical plan builder, so users can compose arbitrary plans together. That's a larger project. I'm personally okay doing something basic here to support this, with the idea in mind that we will soon replace the scanner API entirely. But that's just my opinion, I don't know what others think.
Hi @wjones127, great to see that Lance is going to work on the logical plan! I remember there was an issue about this: #1782. I'd be happy to help if there is anything I can do. Using FTS or vector search as a filter is useful, so I'd like to have a basic implementation if possible.
Hi @wjones127 @westonpace @BubbleCal, sorry to bother you. Could you help review this when you have time, thanks very much!
BubbleCal left a comment
Great work!
Let's create a follow-up issue for adding this to the Python API!
Close lance-format#4927
Co-authored-by: lijinglun <lijinglun@bytedance.com>
Adding a Python API to support using FTS as a filter for vector search, or using vector_query as a filter for FTS search. Related to #4928.
Close #4927