
feat: use single session context in dataset sql#4464

Closed

wojiaodoubao wants to merge 1 commit into lance-format:main from wojiaodoubao:sql-session-context

Conversation

Contributor

@wojiaodoubao wojiaodoubao commented Aug 14, 2025

This is related to #4420 (review). We should use a single session context for better performance.

The single session context causes table registration conflicts: whether we register tables through the session context or the session state, registered tables are shared, so name collisions raise an error. To address this, I introduced a table placeholder, {{DATASET}}. In most cases users should not set the table name manually but instead use the placeholder in their SQL queries, something like 'SELECT * FROM {{DATASET}}'. The SqlQueryBuilder generates a unique table name for each query and deregisters the table after the query is done. However, if users can confidently ensure that the table name will not conflict, they can still set it manually.
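A minimal Python sketch of the placeholder mechanism described above. The helper name `render_sql` is hypothetical (it is not the actual SqlQueryBuilder API); only the `{{DATASET}}` placeholder comes from the PR:

```python
import uuid

PLACEHOLDER = "{{DATASET}}"

def render_sql(sql: str) -> tuple[str, str]:
    """Replace the {{DATASET}} placeholder with a unique table name.

    Returns the rewritten SQL and the generated name, so the caller can
    register the dataset under that name before execution and deregister
    it afterwards. (render_sql is a hypothetical helper, not Lance API.)
    """
    table_name = f"dataset_{uuid.uuid4().hex}"
    return sql.replace(PLACEHOLDER, table_name), table_name

rewritten, name = render_sql("SELECT * FROM {{DATASET}} WHERE age > 20")
```

Because every query gets a fresh name, concurrent queries against the shared session never collide on registration.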

There is another approach that lets users set table names freely without placeholders: first parse the SQL into an AST, traverse it to find all user-defined table names, and replace them with names that include a UUID suffix; then generate a logical plan from the modified AST, execute it, and deregister the tables. What do you think? @westonpace @jackye1995 @yanghua
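To illustrate the rename-then-deregister flow, here is a deliberately naive stdlib sketch. A real implementation would parse the SQL into an AST (e.g. with the sqlparser crate DataFusion uses) and rewrite the table-reference nodes; the regex below only catches names directly after FROM/JOIN and is not meant as production code:

```python
import re
import uuid

def rewrite_table_names(sql: str) -> tuple[str, dict[str, str]]:
    """Toy sketch: suffix every table name after FROM/JOIN with a UUID.

    Returns the rewritten SQL and a mapping from original to unique names,
    so the caller can register each dataset under its unique name and
    deregister it after execution. Note the limitation: qualified column
    references (t1.id) are untouched here; a proper AST rewrite would
    handle them too.
    """
    mapping: dict[str, str] = {}

    def repl(m: re.Match) -> str:
        original = m.group(2)
        unique = mapping.setdefault(original, f"{original}_{uuid.uuid4().hex}")
        return f"{m.group(1)}{unique}"

    rewritten = re.sub(r"(\b(?:FROM|JOIN)\s+)(\w+)", repl, sql,
                       flags=re.IGNORECASE)
    return rewritten, mapping

sql, names = rewrite_table_names("SELECT * FROM t1 JOIN t2 ON t1.id = t2.id")
```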

@github-actions github-actions Bot added the enhancement New feature or request label Aug 14, 2025
@github-actions github-actions Bot added the java label Aug 14, 2025
@wojiaodoubao wojiaodoubao force-pushed the sql-session-context branch 2 times, most recently from 80fd167 to d4e6ab0 Compare August 14, 2025 12:45
@codecov-commenter

codecov-commenter commented Aug 14, 2025

Codecov Report

❌ Patch coverage is 96.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.90%. Comparing base (63ebcbb) to head (3c6f167).
⚠️ Report is 1 commit behind head on main.

Files with missing lines | Patch % | Lines
rust/lance/src/dataset/sql.rs | 96.00% | 0 Missing and 3 partials ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4464   +/-   ##
=======================================
  Coverage   81.89%   81.90%           
=======================================
  Files         304      304           
  Lines      124072   124128   +56     
  Branches   124072   124128   +56     
=======================================
+ Hits       101610   101662   +52     
  Misses      18647    18647           
- Partials     3815     3819    +4     
Flag Coverage Δ
unittests 81.90% <96.00%> (+<0.01%) ⬆️


* <p>Example
*
* <pre>
* String sql = "SELECT * FROM {{DATASET}} WHERE age > 20";
Contributor

hmm this experience seems very cumbersome... I am starting to wonder if this is not the right place to expose SQL. Maybe we should do that at namespace level, so we can use the actual table name of the dataset to run SQL.

Contributor Author

@wojiaodoubao wojiaodoubao Aug 15, 2025

Maybe we should do that at namespace level...

Do you mean https://lancedb.github.io/lance-namespace/spec/operations/query-table/ ? This interface makes sense to me. The only thing to note is that we cannot use table names like ns0.ns1.table, because the . will be parsed by DataFusion as a qualifier. We might need to use a name like ns0_ns1_table, or force users to quote "ns0.ns1.table" in their SQL. This can be considered further when we actually start implementing it.
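The quoting option above can be sketched in a few lines. The helper name is hypothetical; the escaping rule follows standard SQL double-quote doubling, which is what stops a parser from splitting the name on its dots:

```python
def quote_table_name(name: str) -> str:
    """Quote a namespace-qualified table name so a SQL parser treats it
    as a single identifier instead of splitting on the dots.

    (Hypothetical helper; standard SQL escaping doubles any embedded
    double quote.)
    """
    escaped = name.replace('"', '""')
    return f'"{escaped}"'
```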

I am starting to wonder if this is not the right place to expose SQL.

I am slightly inclined to keep the SQL interface. It has the advantage of making it easier to use SQL based on the dataset API. Using the dataset directly is still an important scenario.

Hi @yanghua, do you have more input?

Collaborator

I also think using a placeholder as the table name is not a friendly experience.

@wojiaodoubao Can you name the table with the pattern {user-specified}-{uuid}? IMO, we can find a way to solve this.

I am slightly inclined to keep the SQL interface. It has the advantage of making it easier to use SQL based on the dataset api. Directly using the dataset is still an important scenario.

+1, we have used this API in our production. The user experience and performance (for fewer than hundreds of millions of rows) are both very good. And it is very lightweight (no dependency on a catalog or a distributed SQL engine).

Contributor

Do you mean https://lancedb.github.io/lance-namespace/spec/operations/query-table/ ?

No, what I am thinking is just to assign the table its actual name under the context that makes sense. If it is in a directory, it can just use the table's folder name; there is no need for any placeholder or UUID.

For example, consider this folder:

- tmp
  - t1.lance
  - t2.lance
import lance_namespace as ln

ns = ln.connect("dir", root="/tmp")

ns.sql("select t1.* from t1, t2 where t1.c1 = t2.c2") 

So it's basically the same interface and almost the same implementation, just lifted to a higher level. You can do SQL with the actual table name: just try to register the table provider under the table's actual name/id, and if it already exists, it is guaranteed to be the same table. It also unlocks features like joins. Seems pretty nice to me, and I don't think there is any perf tradeoff we are making here.

For other more complicated namespace implementations, I think DataFusion can support up to 3 levels of TableReference. It should work for most namespaces. If an implementation somehow has more levels (e.g. Gravitino), you can also always define it to connect to the third level from the bottom.
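The key property of the namespace-level idea above is that registration is idempotent: the same name in the same namespace is guaranteed to refer to the same dataset, so a second query can simply reuse an existing registration. A toy sketch under assumed names (NamespaceSqlContext and register_if_absent are hypothetical, not the lance_namespace API):

```python
class NamespaceSqlContext:
    """Toy sketch of namespace-level SQL: tables are registered under
    their real names, and registration is idempotent because the same
    name in the same namespace always means the same dataset folder.
    (Class and method names here are hypothetical.)
    """

    def __init__(self, root: str):
        self.root = root
        self._registered: dict[str, str] = {}

    def register_if_absent(self, table: str) -> str:
        # If the name is already registered it must refer to the same
        # dataset folder, so the existing registration is safely reused.
        return self._registered.setdefault(table, f"{self.root}/{table}.lance")

ns = NamespaceSqlContext("/tmp")
first = ns.register_if_absent("t1")
second = ns.register_if_absent("t1")
```

No placeholder or UUID is needed, and two queries touching t1 share a single registration.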

Contributor Author

@jackye1995 Thanks for your explanation. I agree the namespace-level SQL API is a good idea, much better than the placeholder.

Member

Although, of the available approaches, if we are going to stay in pylance, I like the {{DATASET}} approach best.

Contributor

that will push the feature into lancedb, not lance

We can add the impl in lance_namespace, and bring it to lancedb through that layer. I am doing some work right now regarding it.

Another approach someone mentioned, instead of a bunch of allow_row_id, allow_row_addr, etc. flags, is to just have a single include_meta_cols flag. If that is set, the full schema (with all metadata columns) is made available to DataFusion.

I think the problem is more about: if we have registered a table without the metadata columns for one query, and the same table is then reused in another query that needs metadata columns, how do we handle that? Do we always have to check and re-register the table? I am not super familiar with all DF features; @westonpace, is there any good way?
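The check-and-re-register question above can be made concrete with a small sketch: track the column set each table was registered with, and widen (forcing a re-registration) only when a query needs columns the current registration lacks. All names here are hypothetical, not DataFusion API:

```python
def ensure_registered(registry: dict[str, set[str]], table: str,
                      needed_cols: set[str]) -> bool:
    """Toy sketch of the re-registration question: if a table was
    registered without metadata columns and a later query needs them,
    widen the recorded schema and tell the caller to re-register the
    table provider. (ensure_registered is a hypothetical helper.)
    """
    current = registry.get(table, set())
    if needed_cols <= current:
        return False  # existing registration already covers the query
    registry[table] = current | needed_cols
    return True  # caller must re-register the table provider

reg: dict[str, set[str]] = {}
ensure_registered(reg, "t1", {"id", "name"})   # first registration
ensure_registered(reg, "t1", {"id"})           # no re-registration needed
ensure_registered(reg, "t1", {"id", "_rowid"}) # metadata column forces widening
```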

Member

We can add the impl in lance_namespace, and bring it to lancedb through that layer. I am doing some work right now regarding it.

That works.

I think the problem is more about if we have registered a table without the metadata columns for one query, and then the same table is reused in another query that need metadata columns, how do we do that. Do we always have to check and re-register the table? I am not super familiar with all DF features, @westonpace is there any good way?

It's a lance feature; DF has no real concept of metadata columns that I'm aware of. The problem boils down to the fact that DF expects tables to return a schema with all available columns. So these flags just control whether or not those columns show up.

So one simple approach is to just always include these special columns. Users can always choose whether they want those columns included in the query or not with project (e.g. select). In other words, turning metadata columns on does not guarantee they are in the output; it only guarantees they are in the schema.
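The schema-vs-output distinction above in a few lines. Column names are illustrative (this is not the Lance implementation): the registered schema always carries the metadata columns, but they only reach the output when the query projects them.

```python
# Toy sketch: the registered schema always includes the metadata
# columns, while projection decides what actually comes back.
# (_rowid/_rowaddr are illustrative names, not guaranteed Lance columns.)
DATA_COLS = ["id", "name"]
META_COLS = ["_rowid", "_rowaddr"]
SCHEMA = DATA_COLS + META_COLS

def project(requested: list[str]) -> list[str]:
    # "*" expands against the full schema, metadata columns included.
    if requested == ["*"]:
        return list(SCHEMA)
    return [c for c in requested if c in SCHEMA]
```

Note the consequence raised in the follow-up comment: with this model a plain SELECT * exposes the metadata columns too, since * expands against the full schema.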

Contributor

Users can always choose whether they want those columns included in the query or not with project (e.g. select)

If we do that, I think a SELECT * will basically always expand to include the metadata columns? Maybe that is fine. Ideally I think we want an experience like PG's ctid, so that users have to explicitly write SELECT ctid, * to get that value.

Member

For another reference: in Postgres these are "system columns"; they don't show up in the schema, but you can still select them. I like this behavior, but we'd need changes to DataFusion to enable it.

@github-actions github-actions Bot added the Stale label Nov 14, 2025
@github-actions

Thank you for your contribution. This PR has been inactive for a while, so we're closing it to free up bandwidth. Feel free to reopen it if you still find it useful.

@github-actions github-actions Bot closed this Nov 24, 2025
