feat: use single session context in dataset sql #4464
wojiaodoubao wants to merge 1 commit into lance-format:main
Conversation
Codecov Report

```
@@ Coverage Diff @@
##             main    #4464   +/-   ##
=======================================
  Coverage   81.89%   81.90%
=======================================
  Files         304      304
  Lines      124072   124128     +56
=======================================
+ Hits       101610   101662     +52
  Misses      18647    18647
- Partials     3815     3819      +4
=======================================
```
* <p>Example
*
* <pre>
* String sql = "SELECT * FROM {{DATASET}} WHERE age > 20";
hmm this experience seems very cumbersome... I am starting to wonder if this is not the right place to expose SQL. Maybe we should do that at namespace level, so we can use the actual table name of the dataset to run SQL.
Maybe we should do that at namespace level...
Do you mean https://lancedb.github.io/lance-namespace/spec/operations/query-table/ ? This interface makes sense to me. The only thing to note is that we cannot use table names like ns0.ns1.table as-is, since the `.` will be parsed by DataFusion. We might need to use a name like ns0_ns1_table, or force users to quote "ns0.ns1.table" in their SQL. This can be considered further when we actually start implementing it.
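As a rough illustration of the quoting option (a sketch only; `quote_table` is a hypothetical helper, not part of the Lance API):

```python
# Hypothetical helper: quote a multi-level table name so a SQL parser such as
# DataFusion's does not split it on the dots.
def quote_table(name: str) -> str:
    # Double any embedded double quotes, then wrap the whole name in quotes.
    return '"' + name.replace('"', '""') + '"'

sql = f"SELECT * FROM {quote_table('ns0.ns1.table')} WHERE age > 20"
print(sql)  # SELECT * FROM "ns0.ns1.table" WHERE age > 20
```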
I am starting to wonder if this is not the right place to expose SQL.
I am slightly inclined to keep the SQL interface. It has the advantage of making it easier to use SQL based on the dataset api. Directly using the dataset is still an important scenario.
Hi @yanghua do you have more input ?
I also think using a placeholder as a table name is not experience-friendly.
@wojiaodoubao Can you name the table with the pattern {user-specified}-{uuid}? IMO, we can find a way to solve this.
I am slightly inclined to keep the SQL interface. It has the advantage of making it easier to use SQL based on the dataset api. Directly using the dataset is still an important scenario.
+1, we have used this API in our production. The user experience and performance (for fewer than hundreds of millions of rows) are both very good. It is also very lightweight (no dependency on a catalog or a distributed SQL engine).
Do you mean https://lancedb.github.io/lance-namespace/spec/operations/query-table/ ?
No, what I am thinking is just to assign the table its actual name under a context where that makes sense. If it is in a directory, then it can just use the table's folder name; there is no need for any placeholder or UUID.
For example, consider if you have folder:
- tmp
- t1.lance
- t2.lance
```python
import lance_namespace as ln

ns = ln.connect("dir", root="/tmp")
ns.sql("select t1.* from t1, t2 where t1.c1 = t2.c2")
```
So it's basically the same interface and almost the same implementation, just lifted to a higher level. You can do SQL with the actual table name, just try to register the table provider with the table's actual name/id, and if it already exists then it is guaranteed to be the same table. It also unlocks features like join. Seems pretty nice to me, and I don't think there is any perf tradeoff we are making here.
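The "register with the actual name, and an existing registration is guaranteed to be the same table" idea can be sketched as a toy session-level registry (illustrative only; not DataFusion's real catalog API):

```python
# Toy sketch of register-by-actual-name semantics: registering the same name
# twice is a no-op as long as it points at the same table, so reuse across
# queries in a shared session never conflicts.
class TableRegistry:
    def __init__(self) -> None:
        self._tables: dict[str, str] = {}

    def register(self, name: str, table_uri: str) -> None:
        existing = self._tables.get(name)
        if existing is None:
            self._tables[name] = table_uri
        elif existing != table_uri:
            raise ValueError(f"{name} is already registered to a different table")

reg = TableRegistry()
reg.register("t1", "/tmp/t1.lance")
reg.register("t1", "/tmp/t1.lance")  # same table: no conflict
```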
For other more complicated namespace implementations, I think DataFusion can support up to 3 levels of TableReference. It should work for most namespaces. If an implementation somehow has more levels (e.g. Gravitino), you can also always define it to connect to the third level from the bottom.
@jackye1995 Thanks for your explanation. Agreed, a namespace-level SQL API is a good idea, much better than a placeholder.
Although, of the available approaches, if we are going to stay in pylance, I like the {{DATASET}} approach best.
that will push the feature into lancedb, not lance
We can add the impl in lance_namespace, and bring it to lancedb through that layer. I am doing some work right now regarding it.
Another approach someone mentioned, instead of a bunch of allow_row_id, allow_row_addr, etc. flags, is to just have a single include_meta_cols flag. If that is set, the full schema (with all metadata columns) is made available to DataFusion.
I think the problem is more about this: if we have registered a table without the metadata columns for one query, and then the same table is reused in another query that needs metadata columns, how do we do that? Do we always have to check and re-register the table? I am not super familiar with all DF features, @westonpace is there any good way?
We can add the impl in lance_namespace, and bring it to lancedb through that layer. I am doing some work right now regarding it.
That works.
I think the problem is more about this: if we have registered a table without the metadata columns for one query, and then the same table is reused in another query that needs metadata columns, how do we do that? Do we always have to check and re-register the table? I am not super familiar with all DF features, @westonpace is there any good way?
It's a lance feature, DF has no real concept of metadata columns that I'm aware of. The problem boils down to the fact that DF expects tables to return a schema with all available columns. So these flags just control whether or not those columns show up.
So one simple approach is to just always include these special columns. Users can always choose whether they want those columns included in the query or not with project (e.g. select). In other words, turning metadata columns on does not guarantee they are on the output, it only guarantees they are in the schema.
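The tradeoff can be shown with a toy star-expansion sketch (illustrative only; `_rowid`/`_rowaddr` stand in for Lance's metadata columns, and this is not the actual planner logic): if the metadata columns are always in the registered schema, `SELECT *` expands to include them, while an explicit projection leaves them out.

```python
# Toy sketch: with metadata columns always present in the registered schema,
# `*` expands to every column, while an explicit select list excludes them.
def expand_star(select_list: list[str], schema: list[str]) -> list[str]:
    if select_list == ["*"]:
        return list(schema)
    return select_list

schema = ["id", "age", "_rowid", "_rowaddr"]  # _rowid/_rowaddr: metadata columns
print(expand_star(["*"], schema))        # includes the metadata columns
print(expand_star(["id", "age"], schema))  # explicit projection excludes them
```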
Users can always choose whether they want those columns included in the query or not with project (e.g. select)
If we do that, I think if a user does a SELECT *, it will basically always expand to include the metadata columns? Maybe that is fine. Ideally I think we want an experience like PG's ctid, so that users have to explicitly write SELECT ctid, * to get that value.
For another reference, in postgres these are "system columns" and they don't show up in the schema but you can still select them. I like this behavior but we'd need changes to Datafusion to enable it.
Thank you for your contribution. This PR has been inactive for a while, so we're closing it to free up bandwidth. Feel free to reopen it if you still find it useful.
This is related to #4420 (review). We should use single session context for better performance.
The single session context causes a table-registration conflict error. Regardless of whether we use the session context or the session state to register tables, we encounter this error because registered tables are shared. To address this problem, I introduced a table placeholder {{DATASET}}. In most cases, users should not set the table name manually but instead use the placeholder in their SQL queries, e.g. 'SELECT * FROM {{DATASET}}'. The SqlQueryBuilder will generate a unique table name for each query and deregister the table after the query is done. However, if users can confidently ensure that the table name will not conflict, they can still set the table name manually.
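The placeholder flow can be sketched as follows (names are illustrative, not the actual SqlQueryBuilder internals):

```python
import uuid

# Sketch of the {{DATASET}} placeholder flow: generate a unique table name per
# query so registrations never conflict in the shared session context.
def resolve_placeholder(sql: str) -> tuple[str, str]:
    table = f"dataset_{uuid.uuid4().hex}"
    return sql.replace("{{DATASET}}", table), table

query, table = resolve_placeholder("SELECT * FROM {{DATASET}} WHERE age > 20")
# ...register `table`, execute `query`, then deregister `table`.
```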
There is another approach to solve this issue, which allows users to freely set table names without requiring the use of placeholders. In this approach, we first parse the SQL into an AST, traverse the AST to find all user-defined table names, and replace them with table names that include a UUID suffix. Then, we generate a logical plan from the modified AST and execute it, followed by deregistering the table. What do you think? @westonpace @jackye1995 @yanghua
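The AST-rewrite alternative could look roughly like this (a toy sketch: a real implementation would parse the SQL into an AST, e.g. with the SQL parser DataFusion uses on the Rust side, and rewrite table nodes; the regex stand-in below only handles simple FROM/JOIN references and does not touch qualifiers or aliases):

```python
import re
import uuid

# Toy sketch: map each user-written table name after FROM/JOIN to a unique
# suffixed name, so the rewritten query can be registered without conflicts.
def rewrite_table_names(sql: str) -> tuple[str, dict[str, str]]:
    mapping: dict[str, str] = {}

    def repl(match: re.Match) -> str:
        keyword, name = match.group(1), match.group(2)
        unique = mapping.setdefault(name, f"{name}_{uuid.uuid4().hex[:8]}")
        return f"{keyword} {unique}"

    rewritten = re.sub(r"\b(FROM|JOIN)\s+(\w+)", repl, sql, flags=re.IGNORECASE)
    return rewritten, mapping
```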