feat: add contains_tokens udf#4420

Merged
westonpace merged 4 commits into lance-format:main from wojiaodoubao:contains-tokens-udf
Aug 15, 2025

@wojiaodoubao
Contributor

First step of #3855

@github-actions github-actions Bot added the enhancement New feature or request label Aug 10, 2025
@codecov-commenter

codecov-commenter commented Aug 10, 2025

Codecov Report

❌ Patch coverage is 96.10390% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.89%. Comparing base (63ebcbb) to head (7a213d7).
⚠️ Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-datafusion/src/udf.rs | 95.83% | 0 Missing and 3 partials ⚠️ |

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4420   +/-   ##
=======================================
  Coverage   81.89%   81.89%           
=======================================
  Files         304      305    +1     
  Lines      124072   124147   +75     
  Branches   124072   124147   +75     
=======================================
+ Hits       101610   101675   +65     
- Misses      18647    18652    +5     
- Partials     3815     3820    +5     

| Flag | Coverage Δ |
|---|---|
| unittests | 81.89% <96.10%> (+<0.01%) ⬆️ |

@wojiaodoubao force-pushed the contains-tokens-udf branch 2 times, most recently from 895b703 to f1647be, on August 10, 2025 14:06
@wojiaodoubao
Contributor Author

Hi @westonpace, could you help review this when you have time? Thanks very much!

Comment thread rust/lance/src/dataset/sql.rs Outdated
Comment thread rust/lance/src/dataset/sql.rs Outdated
Comment thread rust/lance/src/dataset/sql.rs Outdated
Member

@westonpace westonpace left a comment

Some minor suggestions but the UDF part looks good. I suppose we will need to document our UDFs somewhere as we start to build up a library but we can save that for a future PR.

Leaving in "request changes" until we can address the get_session_context suggestion.

Comment thread rust/lance/src/dataset/sql.rs Outdated
Comment thread rust/lance-index/src/scalar/inverted/index.rs
Comment thread rust/lance-index/src/scalar/inverted/index.rs Outdated
@westonpace
Member

Feel free to re-request review when ready, thanks!

@wojiaodoubao force-pushed the contains-tokens-udf branch 4 times, most recently from 5a9ad3e to 3a50ca2, on August 12, 2025 13:05
Comment thread rust/lance/src/dataset/sql.rs Outdated

     pub async fn build(self) -> lance_core::Result<SqlQuery> {
    -    let ctx = SessionContext::new();
    +    let ctx = new_session_context(&LanceExecutionOptions::default());

Contributor Author

We need a new session context to avoid register-table conflicts, so this uses `new_session_context`, not `get_session_context`.

Contributor Author

`get_session_context` returns a clone of a static session context, and the clones share the same registered tables. So if we register tables with the same name, we run into a 'register table conflicts' error.
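
Editorial sketch: a toy model of the conflict described above, using only the Rust standard library. `ToyContext` and its `register_table` method are hypothetical stand-ins, not DataFusion's or Lance's API. The only assumption carried over from the comment is that clones of the context share one table registry.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a session context.
/// Cloning copies the `Arc`, so every clone points at the same registry.
#[derive(Clone, Default)]
struct ToyContext {
    tables: Arc<Mutex<HashMap<String, String>>>,
}

impl ToyContext {
    /// Registers `name`, failing if that name is already taken.
    fn register_table(&self, name: &str, provider: &str) -> Result<(), String> {
        let mut tables = self.tables.lock().unwrap();
        if tables.contains_key(name) {
            return Err(format!("table '{name}' already exists"));
        }
        tables.insert(name.to_string(), provider.to_string());
        Ok(())
    }
}

fn main() {
    // Two "queries" obtain clones of one shared context.
    let shared = ToyContext::default();
    let query_a = shared.clone();
    let query_b = shared.clone();

    // The first registration succeeds; the second collides on the name.
    assert!(query_a.register_table("t", "dataset_a").is_ok());
    assert!(query_b.register_table("t", "dataset_b").is_err());

    // A context built from scratch has its own registry, so no collision.
    let fresh = ToyContext::default();
    assert!(fresh.register_table("t", "dataset_b").is_ok());
}
```

In this model the second registration fails only because both handles point at the same map; a context built from scratch owns a separate map and accepts the name.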

Member

Ah, I see the problem. Actually, it is a bit weird that we have to register the table on the session context (which is per-process). We can get away with registering it on the state (which is per-query) if we do something like this...

    pub async fn build(self) -> lance_core::Result<SqlQuery> {
        let ctx = get_session_context(&LanceExecutionOptions::default());
        let row_id = self.with_row_id;
        let row_addr = self.with_row_addr;
        let state = ctx.state();
        let table_ref: TableReference = self.table_name.clone().into();
        let table = table_ref.table().to_owned();
        state.schema_for_ref(table_ref)?.register_table(
            table,
            Arc::new(LanceTableProvider::new(
                self.dataset.clone(),
                row_id,
                row_addr,
            )),
        )?;
        let plan = state.create_logical_plan(&self.sql).await?;
        SQLOptions::new().verify_plan(&plan)?;

        let df = DataFrame::new(state, plan);
        Ok(SqlQuery::new(df))
    }

This way we can use the default session context since we are no longer modifying it.

Contributor

@wojiaodoubao I commented in the other PR that we might want to rethink the whole SQL query feature in Lance, given the issue you hit. Regardless, I think the contains_tokens UDF is worth merging even outside the context of using it through the SQL interface. Shall we do that first in this PR?

Contributor Author

@jackye1995 Thanks, I changed it back to new_session_context.

Contributor

Oh, I meant that we can remove all the changes in sql.rs so that the UDF addition can be merged independently, and use the other open PR to work out what we can do for the dataset SQL experience.

Contributor Author

Haha, I misunderstood. Fixed.

@wojiaodoubao
Contributor Author

Hi @westonpace @jackye1995, this PR is ready for review now. Please take a look when you have time. Thanks very much!

Member

@westonpace westonpace left a comment

One more suggestion. Sorry to be annoying but I think it's a good cleanup opportunity to keep a single session context. We have found that this helps quite a bit for performance (creating a session context can be costly) especially for small queries.
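
Editorial sketch: one way to picture the single session context being suggested, assuming the context is built lazily once per process and then handed out as a cheap shared handle. `ToyContext`, `build_context`, and `shared_context` are hypothetical names for illustration and do not reflect how Lance's `get_session_context` is actually implemented.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};

/// Counts how many times the (hypothetically expensive) setup runs.
static BUILDS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for a session context that is costly to construct.
struct ToyContext {
    id: usize,
}

/// Simulates the expensive construction step and records that it happened.
fn build_context() -> Arc<ToyContext> {
    let id = BUILDS.fetch_add(1, Ordering::SeqCst) + 1;
    Arc::new(ToyContext { id })
}

/// Returns a handle to one process-wide context, built on first use only.
fn shared_context() -> Arc<ToyContext> {
    static CTX: OnceLock<Arc<ToyContext>> = OnceLock::new();
    CTX.get_or_init(build_context).clone()
}

fn main() {
    let a = shared_context();
    let b = shared_context();

    // Both callers receive the same underlying context...
    assert!(Arc::ptr_eq(&a, &b));
    assert_eq!(a.id, b.id);
    // ...and the construction cost was paid exactly once.
    assert_eq!(BUILDS.load(Ordering::SeqCst), 1);
}
```

Each later call only clones an `Arc`, which is the kind of saving that matters most when the query itself is small.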

@wojiaodoubao
Contributor Author

> One more suggestion. Sorry to be annoying but I think it's a good cleanup opportunity to keep a single session context. We have found that this helps quite a bit for performance (creating a session context can be costly) especially for small queries.

Thanks @westonpace for your comments! The single session context makes sense to me, and your suggestion is definitely not annoying. I tried registering the table on the session state and still got the 'table already exists' error. I think this is because the registered tables are still shared even when going through the session state.

I think the single session context can be considered as a separate issue. There are some potential solutions we can discuss in #4464 (comment). Once it is resolved, we can come back and continue pushing this PR forward.

Comment thread rust/lance-datafusion/src/udf.rs
@westonpace
Member

> I think the single session context can be considered as a separate issue. There are some potential solutions we can discuss in #4464 (comment). Once it is resolved, we can come back and continue pushing this PR forward.

Works for me

@westonpace westonpace merged commit f1a85bc into lance-format:main Aug 15, 2025
27 checks passed

4 participants