feat: inverted index for contains_tokens #4489
Conversation
BubbleCal
left a comment
LGTM, just a minor question
));
let query = query.as_any().downcast_ref::<TextQuery>().unwrap();
match query {
    TextQuery::StringContains(text) => {
should we add a new variant (TextQuery::TokenContains) for this? cc @westonpace
Yes, I think that would be a good idea. Or StringContains could take a boolean parameter.
Actually, we may need a separate query entirely 🤔
Right now the logic assumes "if scalar index supports query X and we can parse query X then scalar index can be used"
So if we re-use TextQuery that will mean that an inverted index will be chosen for both StringContains and TokenContains which isn't quite right (we only want it used for TokenContains)
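To make that distinction concrete, here is a minimal sketch. The TokenQuery name and variants are hypothetical at this point in the thread, and the "supports" functions are toy stand-ins for Lance's real index-selection logic, not its actual API:

```rust
// Hypothetical sketch: with token queries in their own enum, the planner can
// ask "does this index support TokenQuery?" separately from TextQuery, so an
// inverted index is only chosen for contains_tokens, never for StringContains.
#[derive(Debug, Clone, PartialEq)]
enum TextQuery {
    StringContains(String), // exact substring match (ngram index territory)
}

#[derive(Debug, Clone, PartialEq)]
enum TokenQuery {
    TokensContains(String), // token-level match (inverted/FTS index territory)
}

// Toy stand-ins for the "scalar index supports query X" check.
fn inverted_index_supports_text(_q: &TextQuery) -> bool {
    false // StringContains must not be routed to the inverted index
}

fn inverted_index_supports_token(_q: &TokenQuery) -> bool {
    true
}

fn main() {
    assert!(!inverted_index_supports_text(&TextQuery::StringContains("cat".to_string())));
    assert!(inverted_index_supports_token(&TokenQuery::TokensContains("cat".to_string())));
}
```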
// Create the contains_tokens UDF
if func.name() == "contains_tokens" {
    let query = TextQuery::StringContains(scalar_str);
    Some(IndexedExpression::index_query(
it would be cool if we can add more params for this UDF in the future (e.g. AND op, fuzziness, prefix-matching)
Sorry for my late response. I think supporting the AND operator, fuzziness, and prefix-matching is an excellent idea!!! I have a few questions about this idea:
1. I'm not sure if my understanding is correct. The contains_tokens in "Allow FTS indices to be used in filters" #3855 is functionally the same as contains; the difference is that it leverages FTS instead of n-grams. If we support fuzziness and the rest, the query might look something like this: SELECT * FROM dataset WHERE contains_tokens(text, "+cat~2 and -dog"). I'm wondering if the semantics of contains_tokens would then shift to "full-text search" rather than "whether it contains a specific substring." This makes me a bit confused.
2. Shall we consider adding an fts() UDF to facilitate full-text search through SQL? The fts() UDF would accept a JSON representation of a full_text_query and return a boolean indicating whether the record matches. The query might look like this:
let full_text_query = r#"
{
    "query": {
        "bool": {
            "must": [
                { "match": { "content": "asia" } },
                { "match": { "content": { "value": "Felidae", "fuzziness": 2 } } }
            ],
            "should": [
                { "phrase": { "title": "wild animal" } },
                { "phrase": { "title": "home pet" } }
            ]
        }
    }
}
"#;
let sql = format!("SELECT author FROM dataset WHERE fts({})", full_text_query);
If a column in the full_text_query does not have an inverted index created, the query will fail instead of attempting to read the record for matching.
There are other approaches for full_text_query, such as the MATCH AGAINST syntax used by MySQL or Lucene's syntax. However, JSON has stronger expressive capabilities and can fully represent FtsQuery, so I think using JSON to express full_text_query might be a good idea.
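For illustration, the JSON above could deserialize into a Rust query tree shaped roughly like this. This is a hedged sketch: the enum name FtsQuery matches the discussion, but the variant and field names here are assumptions, not Lance's actual definitions:

```rust
// Illustrative only: a Rust shape the JSON query above could parse into.
// Variant and field names are assumptions, not Lance's actual FtsQuery.
#[derive(Debug)]
#[allow(dead_code)]
enum FtsQuery {
    Match { column: String, value: String, fuzziness: u32 },
    Phrase { column: String, value: String },
    Bool { must: Vec<FtsQuery>, should: Vec<FtsQuery> },
}

fn example_query() -> FtsQuery {
    FtsQuery::Bool {
        must: vec![
            FtsQuery::Match { column: "content".into(), value: "asia".into(), fuzziness: 0 },
            FtsQuery::Match { column: "content".into(), value: "Felidae".into(), fuzziness: 2 },
        ],
        should: vec![
            FtsQuery::Phrase { column: "title".into(), value: "wild animal".into() },
            FtsQuery::Phrase { column: "title".into(), value: "home pet".into() },
        ],
    }
}

fn main() {
    // The nested enum mirrors the JSON one-to-one, so parsing is a direct
    // recursive walk of the JSON tree.
    match example_query() {
        FtsQuery::Bool { must, should } => {
            assert_eq!(must.len(), 2);
            assert_eq!(should.len(), 2);
        }
        _ => panic!("expected a Bool query"),
    }
}
```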
- If I understand correctly, contains_tokens is somewhat different from contains: users should expect contains_tokens to ignore punctuation and white space (depending on the FTS params). E.g. if the text is "hello the world", then contains_tokens("text", "hello world") should match it but contains would not. On the other hand, contains_tokens("text", "wor") can't match anything, but contains would. cc @westonpace
- Yes, we evaluated Lucene's syntax; the FtsQuery object is organized in a very similar way, and I think it's not hard to convert an FtsQuery object into JSON. In fact, I implemented this conversion logic a while ago. We are not using it because most users use Lance via the SDK, and querying by constructing an FtsQuery object is easier and more efficient (see https://lancedb.com/docs/search/full-text-search/#boolean-queries).
But yes, it's a different story for SQL. I think it's a good idea to have the JSON syntax for SQL; we just need to parse it into the FtsQuery object.
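A toy model of the semantics discussed above, assuming a simple split-on-punctuation/whitespace tokenizer (real Lance tokenizers are configurable and may behave differently):

```rust
// Toy semantics check (assumption: a simple tokenizer that splits on anything
// non-alphanumeric and lowercases). contains_tokens matches whole tokens;
// plain contains matches raw substrings.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(str::to_lowercase)
        .collect()
}

fn contains_tokens(text: &str, query: &str) -> bool {
    let doc = tokenize(text);
    tokenize(query).iter().all(|q| doc.contains(q))
}

fn main() {
    let text = "hello the world";
    // Token matching skips over the intervening token...
    assert!(contains_tokens(text, "hello world"));
    // ...but substring matching does not.
    assert!(!text.contains("hello world"));
    // A partial token matches as a substring, not as a token.
    assert!(text.contains("wor"));
    assert!(!contains_tokens(text, "wor"));
}
```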
contains_tokens is somewhat different from contains; users should expect contains_tokens to ignore punctuation and white space (depending on the FTS params).
Hi @BubbleCal, can I understand it this way: the functionality of contains_tokens is equivalent to the FTS MatchQuery (with fuzziness disabled, Operator::And, and the simple tokenizer). This semantic makes sense to me. I previously implemented a version of contains_tokens (#4420), and if we ultimately decide on this semantic, the UDF will need some adjustments.
Another point is that if the functionality of contains_tokens is substring matching, then FTS might introduce false negatives. However, if it follows the version described above, there would be no ambiguity at the semantic level.
I'm also wondering: do we need to support configuring tokenizers, fuzziness, prefix-matching, and the operator? If we make them all params of contains_tokens, the UDF might become too bloated. I think we can consider supporting an fts() UDF in the future, where more advanced full-text search capabilities can be covered.
Hi @westonpace, there are some differing understandings regarding the semantics of the contains_tokens method. Could you help take a look?
I think there is agreement.
- contains_tokens can be used for "best effort filtering" and will be accelerated by an inverted (FTS) index
  - The exact definition of "best effort" is a little vague
- contains is for "exact filtering" and will be accelerated by an ngram index but not an inverted index.
- FTS search is a different thing (not filtering) and it would be nice to expose it in SQL, but that is a different task entirely.
.downcast_ref::<UInt64Array>()
.unwrap();
let row_ids = row_ids.iter().flatten().collect_vec();
Ok(SearchResult::AtMost(RowIdTreeMap::from_iter(row_ids)))
We can probably do this better in the future: if the tokenizer doesn't change the original texts, we can return AtLeast directly.
Codecov Report
❌ Patch coverage is … Additional details and impacted files:
@@            Coverage Diff             @@
##             main    #4489      +/-   ##
==========================================
- Coverage   82.24%   81.19%    -1.06%
==========================================
  Files         308      308
  Lines      127678   114079    -13599
  Branches   127678   114079    -13599
==========================================
- Hits       105010    92623    -12387
+ Misses      18715    18199      -516
+ Partials     3953     3257      -696
Hi @BubbleCal @westonpace, could you help review this when you have time? Thanks very much!
westonpace
left a comment
What is our goal?
If the goal is to approximate "string contains" with the knowledge that there may sometimes be false negatives, then we should return AtMost (which this PR does) and use value.contains(scalar_str.value) (which this PR does not do).
If the goal is to create a "contains tokens" function, then we shouldn't define "token" very precisely. This PR defines it as "separated by punctuation and white space", but that is not necessarily how the tokenizer is configured. It might use ngrams or apply stemming, and stop words may be removed. In this case I think we should return Exact (which this PR does not do) and use the collect_tokens based filter (which this PR does do).
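A small illustration of why the first goal can only claim AtMost: with any non-trivial tokenizer, a substring query can match raw text that the token filter misses. The tokenizer here is an assumed simple one, not Lance's actual implementation:

```rust
// Assumed simple tokenizer, to show the false-negative case: a substring
// query that spans punctuation matches the raw text but not the token set,
// so a token-based approximation of "string contains" may only claim AtMost.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(str::to_lowercase)
        .collect()
}

fn token_filter(text: &str, query: &str) -> bool {
    let doc = tokenize(text);
    tokenize(query).iter().all(|q| doc.contains(q))
}

fn main() {
    let text = "hello, world";
    // As a raw substring, "o, w" is present...
    assert!(text.contains("o, w"));
    // ...but the token filter misses it: a false negative for goal one.
    assert!(!token_filter(text, "o, w"));
    // For goal two, the token filter *is* the definition of a match, so its
    // result can be reported as Exact.
    assert!(token_filter(text, "world hello"));
}
```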
Ah, I missed this part, thank you!
westonpace
left a comment
A few more small tasks. There is a bit of glue we need to connect the query to the index. I think we will need a test verifying the index is actually used too.
#[derive(Debug, Clone, PartialEq)]
pub enum TokenQuery {
    /// Retrieve all row ids where the text contains all tokens parsed from given string. The tokens
There might be a few pieces of glue missing. I think we need a TokenQueryParser (see https://github.com/lancedb/lance/blob/60711f360b7f8692df44a0e84c98c8fdff2897a3/rust/lance-index/src/scalar/expression.rs#L348 for the TextQueryParser).
We also need to register the token query parser here: https://github.com/lancedb/lance/blob/60711f360b7f8692df44a0e84c98c8fdff2897a3/rust/lance/src/index.rs#L1361
Can we call is_query_allowed in the registration function (scalar_index_info)? This way we can skip the scalar index entirely if it is not eligible. Returning AtLeast with zero rows might lead to bad performance (the planner will think we are doing a scalar index optimized search and make certain decisions based on that)
I think we need a TokenQueryParser
We already have an FtsQueryParser which parses contains_tokens into TokenQuery::TokensContains; actually, you implemented it (^-^). Shall we just rely on FtsQueryParser?
Can we call is_query_allowed in the registration function ...
Thanks for your nice suggestion, let me fix it.
Sorry, I accidentally deleted the comment. I'm reposting it here.
The current goal is to create a "contains tokens" function.
The current functionality of contains_tokens is equivalent to the following FTS bm25_search:
Before using the InvertedIndex to optimize the query, we must check the InvertedIndexParams. If the FTS index's InvertedIndexParams can guarantee that "the query results are a superset of the contains_tokens query results," then the FTS index is used and AtMost is returned. Otherwise, AtLeast with zero row ids is returned. "The query results are a superset of contains_tokens" can be checked by the condition below. The prerequisite for returning Exact is that the FTS index's InvertedIndexParams configuration is identical to that of contains_tokens; I think this could be considered an optimization.
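A hedged sketch of the three-way decision described above. The InvertedIndexParams fields and the superset condition here are illustrative placeholders, not Lance's real configuration:

```rust
// Illustrative placeholders only; not Lance's real InvertedIndexParams.
#[derive(Debug, Clone, PartialEq)]
struct InvertedIndexParams {
    lower_case: bool,
    stemming: bool,
}

#[derive(Debug, PartialEq)]
enum IndexUse {
    Exact,  // index params identical to the contains_tokens semantics
    AtMost, // index results are a guaranteed superset; post-filter required
    Skip,   // no superset guarantee; fall back (AtLeast with zero row ids)
}

fn classify(index: &InvertedIndexParams, query: &InvertedIndexParams) -> IndexUse {
    if index == query {
        IndexUse::Exact
    } else if superset_guaranteed(index, query) {
        IndexUse::AtMost
    } else {
        IndexUse::Skip
    }
}

// Placeholder condition: the real check must be derived from the actual
// tokenizer pipeline; here we pretend extra stemming only widens results.
fn superset_guaranteed(index: &InvertedIndexParams, query: &InvertedIndexParams) -> bool {
    index.lower_case == query.lower_case
}

fn main() {
    let q = InvertedIndexParams { lower_case: true, stemming: false };
    assert_eq!(classify(&q.clone(), &q), IndexUse::Exact);
    let stemmed = InvertedIndexParams { stemming: true, ..q.clone() };
    assert_eq!(classify(&stemmed, &q), IndexUse::AtMost);
    let cased = InvertedIndexParams { lower_case: false, ..q.clone() };
    assert_eq!(classify(&cased, &q), IndexUse::Skip);
}
```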
Hi @westonpace, this PR is ready for review. Please help when you have time, thanks very much!
westonpace
left a comment
Thanks for bearing with the reviews!
let fts_index =
    lance_index::scalar::expression::ScalarIndexLoader::load_index(
        self,
        &field.name,
        &index.name,
        &NoOpMetricsCollector,
    )
    .await?;
This is a bit unique but I think it'll be fine for now. It means we will always load the FTS index on every query, even if it isn't used. Maybe once #4584 lands we can find some way to handle is_query_allowed in the plugin (perhaps based on details) so we don't have to load an index unless we know we can use it.
This is the second step of #3855

Co-authored-by: lijinglun <lijinglun@bytedance.com>