feat: add contains_tokens udf#4420
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4420 +/- ##
=======================================
Coverage 81.89% 81.89%
=======================================
Files 304 305 +1
Lines 124072 124147 +75
Branches 124072 124147 +75
=======================================
+ Hits 101610 101675 +65
- Misses 18647 18652 +5
- Partials 3815 3820 +5
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
895b703 to
f1647be
Compare
|
Hi @westonpace , could you help review this when you have time, thanks very much ! |
westonpace
left a comment
There was a problem hiding this comment.
Some minor suggestions but the UDF part looks good. I suppose we will need to document our UDFs somewhere as we start to build up a library but we can save that for a future PR.
Leaving in "request changes" until we can address the get_session_context suggestion.
|
Feel free to re-request review when ready, thanks! |
5a9ad3e to
3a50ca2
Compare
|
|
||
| pub async fn build(self) -> lance_core::Result<SqlQuery> { | ||
| let ctx = SessionContext::new(); | ||
| let ctx = new_session_context(&LanceExecutionOptions::default()); |
There was a problem hiding this comment.
We need to use a new session context to avoid `register table conflicts', so here is new_session_context, not ```get_session_context```.
There was a problem hiding this comment.
The get_session_context returns a clone of a static session context. They share the same registered tables. So if we register tables with the same name, it run into a 'register table conflicts error'.
There was a problem hiding this comment.
Ah, I see the problem. Actually, it is a bit weird that we have to register the table on the session context (which is per-process). We can get away with registering it on the state (which is per-query) if we do something like this...
pub async fn build(self) -> lance_core::Result<SqlQuery> {
let ctx = get_session_context(&LanceExecutionOptions::default());
let row_id = self.with_row_id;
let row_addr = self.with_row_addr;
let state = ctx.state();
let table_ref: TableReference = self.table_name.clone().into();
let table = table_ref.table().to_owned();
state.schema_for_ref(table_ref)?.register_table(
table,
Arc::new(LanceTableProvider::new(
self.dataset.clone(),
row_id,
row_addr,
)),
)?;
let plan = state.create_logical_plan(&self.sql).await?;
SQLOptions::new().verify_plan(&plan)?;
let df = DataFrame::new(state, plan);
Ok(SqlQuery::new(df))
}
This way we can use the default session context since we are no longer modifying it.
There was a problem hiding this comment.
@wojiaodoubao I was commenting in the other PR about we might want to rethink a bit about the whole SQL query feature in Lance given the issue you hit. But regardless I think the contains_tokens UDF is worth merging even outside the context of using it through the SQL interface. Shall we do that first in this PR?
There was a problem hiding this comment.
@jackye1995 Thanks, I change it back to new_session_context.
There was a problem hiding this comment.
oh I was saying that we can remove all the changes in sql.rs, so we can independently merge the addition of the UDF, and we use the other open PR to see what we can do for the dataset SQL experience.
There was a problem hiding this comment.
haha, I misunderstood. Fix it.
|
Hi @westonpace @jackye1995 , this pr is ready for review now, please help when you have time, thanks very much ! |
3a50ca2 to
a842889
Compare
westonpace
left a comment
There was a problem hiding this comment.
One more suggestion. Sorry to be annoying but I think it's a good cleanup opportunity to keep a single session context. We have found that this helps quite a bit for performance (creating a session context can be costly) especially for small queries.
|
|
||
| pub async fn build(self) -> lance_core::Result<SqlQuery> { | ||
| let ctx = SessionContext::new(); | ||
| let ctx = new_session_context(&LanceExecutionOptions::default()); |
There was a problem hiding this comment.
Ah, I see the problem. Actually, it is a bit weird that we have to register the table on the session context (which is per-process). We can get away with registering it on the state (which is per-query) if we do something like this...
pub async fn build(self) -> lance_core::Result<SqlQuery> {
let ctx = get_session_context(&LanceExecutionOptions::default());
let row_id = self.with_row_id;
let row_addr = self.with_row_addr;
let state = ctx.state();
let table_ref: TableReference = self.table_name.clone().into();
let table = table_ref.table().to_owned();
state.schema_for_ref(table_ref)?.register_table(
table,
Arc::new(LanceTableProvider::new(
self.dataset.clone(),
row_id,
row_addr,
)),
)?;
let plan = state.create_logical_plan(&self.sql).await?;
SQLOptions::new().verify_plan(&plan)?;
let df = DataFrame::new(state, plan);
Ok(SqlQuery::new(df))
}
This way we can use the default session context since we are no longer modifying it.
a842889 to
e623a5b
Compare
Thanks @westonpace for your comments! The single session context makes sense to me, your suggestion is definitely not annoying. I tried the 'state register table way' and still got the 'table already exists error'. I think it is because the 'registered tables' are still shared even using the session state. I think the single session context can be considered as a separate issue. There are some potential solutions we can discuss in #4464 (comment). Once it is resolved, we can come back and continue pushing this PR forward. |
091c9eb to
7a213d7
Compare
Works for me |
First step of #3855