feat: use single session context in dataset sql #4464
wojiaodoubao wants to merge 1 commit into lance-format:main
Conversation
Codecov Report

```
@@ Coverage Diff @@
##             main    #4464   +/-   ##
=======================================
  Coverage   81.89%   81.90%
=======================================
  Files         304      304
  Lines      124072   124128     +56
=======================================
+ Hits       101610   101662     +52
  Misses      18647    18647
- Partials     3815     3819      +4
=======================================
```
* <p>Example
*
* <pre>
* String sql = "SELECT * FROM {{DATASET}} WHERE age > 20";
hmm this experience seems very cumbersome... I am starting to wonder if this is not the right place to expose SQL. Maybe we should do that at namespace level, so we can use the actual table name of the dataset to run SQL.
Maybe we should do that at namespace level...
Do you mean https://lancedb.github.io/lance-namespace/spec/operations/query-table/ ? This interface makes sense to me. The only thing to note is that we cannot use table names like ns0.ns1.table as-is, since the `.` will be parsed by DataFusion. We might need to use a name like ns0_ns1_table, or force users to quote "ns0.ns1.table" in their SQL. This can be considered further when we actually start implementing it.
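As a rough illustration of the quoting option (a sketch only; `quote_table` is a hypothetical helper, not part of the Lance API):

```python
# Hypothetical helper: quote a multi-level table name so a SQL parser such as
# DataFusion's does not split it on the dots.
def quote_table(name: str) -> str:
    # Double any embedded double quotes, then wrap the whole name in quotes.
    return '"' + name.replace('"', '""') + '"'

sql = f"SELECT * FROM {quote_table('ns0.ns1.table')} WHERE age > 20"
print(sql)  # SELECT * FROM "ns0.ns1.table" WHERE age > 20
```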
I am starting to wonder if this is not the right place to expose SQL.
I am slightly inclined to keep the SQL interface. It has the advantage of making it easier to use SQL based on the dataset api. Directly using the dataset is still an important scenario.
Hi @yanghua do you have more input ?
I also think using a placeholder as a table name is not experience-friendly.
@wojiaodoubao Can you name the table with the pattern {user-specified}-{uuid}? IMO, we can find a way to solve this.
I am slightly inclined to keep the SQL interface. It has the advantage of making it easier to use SQL based on the dataset api. Directly using the dataset is still an important scenario.
+1, we have used this API in our production. The user experience and performance (for fewer than hundreds of millions of rows) are both very good. It is also very lightweight (no dependency on a catalog or a distributed SQL engine).
Do you mean https://lancedb.github.io/lance-namespace/spec/operations/query-table/ ?
No, what I am thinking is just to assign the table its actual name under a context where that makes sense. If it is in a directory, then it can just use the table's folder name; there is no need for any placeholder or UUID.
For example, consider if you have folder:
- tmp
- t1.lance
- t2.lance
```python
import lance_namespace as ln

ns = ln.connect("dir", root="/tmp")
ns.sql("select t1.* from t1, t2 where t1.c1 = t2.c2")
```
So it's basically the same interface and almost the same implementation, just lifted to a higher level. You can do SQL with the actual table name, just try to register the table provider with the table's actual name/id, and if it already exists then it is guaranteed to be the same table. It also unlocks features like join. Seems pretty nice to me, and I don't think there is any perf tradeoff we are making here.
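The "register with the actual name, and an existing registration is guaranteed to be the same table" idea can be sketched as a toy session-level registry (illustrative only; not DataFusion's real catalog API):

```python
# Toy sketch of register-by-actual-name semantics: registering the same name
# twice is a no-op as long as it points at the same table, so reuse across
# queries in a shared session never conflicts.
class TableRegistry:
    def __init__(self) -> None:
        self._tables: dict[str, str] = {}

    def register(self, name: str, table_uri: str) -> None:
        existing = self._tables.get(name)
        if existing is None:
            self._tables[name] = table_uri
        elif existing != table_uri:
            raise ValueError(f"{name} is already registered to a different table")

reg = TableRegistry()
reg.register("t1", "/tmp/t1.lance")
reg.register("t1", "/tmp/t1.lance")  # same table: no conflict
```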
For other more complicated namespace implementations, I think DataFusion can support up to 3 levels of TableReference. It should work for most namespaces. If an implementation somehow has more levels (e.g. Gravitino), you can also always define it to connect to the third level from the bottom.
@jackye1995 Thanks for your explanation. Agreed, a namespace-level SQL API is a good idea, much better than a placeholder.
Although, of the available approaches, if we are going to stay in pylance, I like the {{DATASET}} approach best.
that will push the feature into lancedb, not lance
We can add the impl in lance_namespace, and bring it to lancedb through that layer. I am doing some work right now regarding it.
Another approach someone mentioned, instead of a bunch of allow_row_id, allow_row_addr, etc. flags, is to just have a single include_meta_cols flag. If that is set, the full schema (with all metadata columns) is made available to DataFusion.
I think the problem is more about this: if we have registered a table without the metadata columns for one query, and then the same table is reused in another query that needs metadata columns, how do we do that? Do we always have to check and re-register the table? I am not super familiar with all DF features, @westonpace is there any good way?
We can add the impl in lance_namespace, and bring it to lancedb through that layer. I am doing some work right now regarding it.
That works.
I think the problem is more about this: if we have registered a table without the metadata columns for one query, and then the same table is reused in another query that needs metadata columns, how do we do that? Do we always have to check and re-register the table? I am not super familiar with all DF features, @westonpace is there any good way?
It's a lance feature, DF has no real concept of metadata columns that I'm aware of. The problem boils down to the fact that DF expects tables to return a schema with all available columns. So these flags just control whether or not those columns show up.
So one simple approach is to just always include these special columns. Users can always choose whether they want those columns included in the query or not with project (e.g. select). In other words, turning metadata columns on does not guarantee they are on the output, it only guarantees they are in the schema.
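The tradeoff can be shown with a toy star-expansion sketch (illustrative only; `_rowid`/`_rowaddr` stand in for Lance's metadata columns, and this is not the actual planner logic): if the metadata columns are always in the registered schema, `SELECT *` expands to include them, while an explicit projection leaves them out.

```python
# Toy sketch: with metadata columns always present in the registered schema,
# `*` expands to every column, while an explicit select list excludes them.
def expand_star(select_list: list[str], schema: list[str]) -> list[str]:
    if select_list == ["*"]:
        return list(schema)
    return select_list

schema = ["id", "age", "_rowid", "_rowaddr"]  # _rowid/_rowaddr: metadata columns
print(expand_star(["*"], schema))        # includes the metadata columns
print(expand_star(["id", "age"], schema))  # explicit projection excludes them
```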
Users can always choose whether they want those columns included in the query or not with project (e.g. select)
If we do that, I think if a user does a SELECT *, it will basically always expand to include the metadata columns? Maybe that is fine. Ideally I think we want an experience like PG's ctid, so that users have to explicitly write SELECT ctid, * to get that value.
For another reference, in postgres these are "system columns" and they don't show up in the schema but you can still select them. I like this behavior but we'd need changes to Datafusion to enable it.
Thank you for your contribution. This PR has been inactive for a while, so we're closing it to free up bandwidth. Feel free to reopen it if you still find it useful.
This is related to #4420 (review). We should use single session context for better performance.
The single session context causes a table-registration conflict error. Regardless of whether we use the session context or the session state to register tables, we encounter this error because registered tables are shared. To address this problem, I introduced a table placeholder {{DATASET}}. In most cases, users should not set the table name manually but instead use the placeholder in their SQL queries, e.g. 'SELECT * FROM {{DATASET}}'. The SqlQueryBuilder will generate a unique table name for each query and deregister the table after the query is done. However, if users can confidently ensure that the table name will not conflict, they can still set the table name manually.
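The placeholder flow can be sketched as follows (names are illustrative, not the actual SqlQueryBuilder internals):

```python
import uuid

# Sketch of the {{DATASET}} placeholder flow: generate a unique table name per
# query so registrations never conflict in the shared session context.
def resolve_placeholder(sql: str) -> tuple[str, str]:
    table = f"dataset_{uuid.uuid4().hex}"
    return sql.replace("{{DATASET}}", table), table

query, table = resolve_placeholder("SELECT * FROM {{DATASET}} WHERE age > 20")
# ...register `table`, execute `query`, then deregister `table`.
```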
There is another approach to solve this issue, which allows users to freely set table names without requiring the use of placeholders. In this approach, we first parse the SQL into an AST, traverse the AST to find all user-defined table names, and replace them with table names that include a UUID suffix. Then, we generate a logical plan from the modified AST and execute it, followed by deregistering the table. What do you think? @westonpace @jackye1995 @yanghua
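The AST-rewrite alternative could look roughly like this (a toy sketch: a real implementation would parse the SQL into an AST, e.g. with the SQL parser DataFusion uses on the Rust side, and rewrite table nodes; the regex stand-in below only handles simple FROM/JOIN references and does not touch qualifiers or aliases):

```python
import re
import uuid

# Toy sketch: map each user-written table name after FROM/JOIN to a unique
# suffixed name, so the rewritten query can be registered without conflicts.
def rewrite_table_names(sql: str) -> tuple[str, dict[str, str]]:
    mapping: dict[str, str] = {}

    def repl(match: re.Match) -> str:
        keyword, name = match.group(1), match.group(2)
        unique = mapping.setdefault(name, f"{name}_{uuid.uuid4().hex[:8]}")
        return f"{keyword} {unique}"

    rewritten = re.sub(r"\b(FROM|JOIN)\s+(\w+)", repl, sql, flags=re.IGNORECASE)
    return rewritten, mapping
```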