[SPARK-30001][SQL] ResolveRelations should handle both V1 and V2 tables. #26684
Conversation
Test build #114480 has finished for PR 26684 at commit

Test build #114482 has finished for PR 26684 at commit
```scala
case i @ InsertIntoStatement(
    u @ UnresolvedRelation(CatalogObjectIdentifier(catalog, ident)), _, _, _, _)
    if i.query.resolved && CatalogV2Util.isSessionCatalog(catalog) =>
  val relation = ResolveTempViews(u) match {
```
Why do we need this? Temp views should always be resolved first. If we reach here, it's not a temp view.
`u` is inside `InsertIntoStatement` (and not its children), so it is not resolved when we reach here.
Good catch! Can we resolve temp views inside `InsertIntoStatement` in `ResolveTempViews` as well?
If we resolve temp views for `InsertIntoStatement.table` in `ResolveTempViews`, we need an additional rule here to match `SubqueryAlias`. Is that what you were suggesting?
We have `EliminateSubqueryAliases` here, so `SubqueryAlias` should be fine?
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala#L87 is one example where temp view resolution is required. Maybe the confusion is that `InsertIntoStatement` is used for INSERT OVERWRITE, INSERT INTO, etc.?
It seems to me that it should insert into the table default.t1 because it doesn't make sense to insert into the temp view.
I think it should still resolve to temp view (for consistent lookup behavior), but fails during analysis check, which is the current behavior.
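The precedence described here (temp views shadow tables, so lookup stays consistent and the invalid INSERT fails later in an analysis check) can be sketched as follows. The classes and helper below are hypothetical stand-ins, not Spark's real API:

```scala
// Hypothetical sketch (not Spark's real classes): for single-part names,
// temp views shadow catalog tables, so SELECT and INSERT resolve the same
// identifier to the same relation; an invalid INSERT target is rejected
// later by an analysis check rather than by lookup.
sealed trait Relation
case class TempView(name: String) extends Relation
case class CatalogTable(name: String) extends Relation

def resolve(
    name: String,
    tempViews: Map[String, TempView],
    tables: Map[String, CatalogTable]): Option[Relation] =
  tempViews.get(name).orElse(tables.get(name))

val tempViews = Map("t" -> TempView("t"))
val tables = Map("t" -> CatalogTable("t"))

// Both SELECT and INSERT lookups see the temp view,
// even though a table "t" also exists.
val r = resolve("t", tempViews, tables)
```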
Okay, how does the analysis check catch the problem? Are we confident that always works?
We have the following check for `InsertIntoStatement` (sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala, line 494 at commit 322ec0b):

```scala
object PreWriteCheck extends (LogicalPlan => Unit) {
```
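A minimal sketch of what such a pre-write check does, using hypothetical plan classes standing in for Spark's logical plans (the real `PreWriteCheck` lives in the file referenced above):

```scala
// Hypothetical plan nodes standing in for Spark's logical plans.
sealed trait Plan
case class TempViewPlan(name: String) extends Plan
case class TablePlan(name: String) extends Plan
case class InsertInto(target: Plan) extends Plan

// A pre-write check rejects inserts whose target is not an insertable table.
def preWriteCheck(plan: Plan): Either[String, Unit] = plan match {
  case InsertInto(TempViewPlan(n)) =>
    Left(s"Inserting into a view is not allowed: $n")
  case _ =>
    Right(())
}
```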
Thanks! I think I would rather implement the check a different way:
- Don't resolve temp tables in `ResolveRelations`.
- In `CheckAnalysis` (in catalyst), check for `InsertInto(Unresolved(...))`; if the unresolved relation is a temp table, state that it is a temp table and can't be resolved.
It would be good to know what other databases do in this case because my suggestion to not resolve the identifier as a temp table would allow matching a table here if there is one that conflicts. Probably good to consider this case in the broader clean up of temp table handling.
> It would be good to know what other databases do in this case because my suggestion to not resolve the identifier as a temp table would allow matching a table here if there is one that conflicts.

In Postgres, it is resolved to the temp view:
```
postgres=# create schema s1;
CREATE SCHEMA
postgres=# SET search_path TO s1;
SET
postgres=# create table s1.t (i int);
CREATE TABLE
postgres=# insert into s1.t values (1);
INSERT 0 1

-- resolves to table 't'
postgres=# select * from t;
 i
---
 1
(1 row)

postgres=# create temp view t as select 2 as i;
CREATE VIEW

-- resolves to temp view 't'
postgres=# select * from t;
 i
---
 2
(1 row)

-- resolves to temp view 't'
postgres=# insert into t values (1);
2019-12-05 21:40:47.229 EST [5451] ERROR:  cannot insert into view "t"
2019-12-05 21:40:47.229 EST [5451] DETAIL:  Views that do not select from a single table or view are not automatically updatable.
```
Test build #114496 has finished for PR 26684 at commit
This seems like a hard problem. What we need is:
There are two things conflicting:
To do these 2 things together with one Hive metastore access, we have 3 options:
I think option 2 is the easiest to do at the current stage.
Test build #114502 has finished for PR 26684 at commit
```scala
private def lookupRelation(
    catalog: CatalogPlugin,
    ident: Identifier,
    recurse: Boolean): Option[LogicalPlan] = {
```
If `ResolveRelations` is going to be completely rewritten before 3.0, then we should fix it to separate view resolution from table resolution and to use multiple executions instead of recursion. The only reason why I don't think we should do that is to avoid too many changes to `ResolveRelations` just before a release.
Got it. I will also explore removing the recursion separately.
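The "multiple executions instead of recursion" idea can be sketched like this: instead of one rule invocation recursing until a nested view bottoms out, run a single-step rule repeatedly to a fixed point. This is a toy model with strings as plans, not Spark code:

```scala
// Toy model: a "plan" is just a name; resolving replaces a view name with
// what it points to. One call resolves exactly one layer.
def resolveOnce(plan: String, views: Map[String, String]): String =
  views.getOrElse(plan, plan)

// Fixed-point execution: re-run the single-step rule until nothing changes,
// instead of recursing inside the rule itself.
def fixedPoint(plan: String, views: Map[String, String], maxIter: Int = 100): String = {
  var current = plan
  var changed = true
  var i = 0
  while (changed && i < maxIter) {
    val next = resolveOnce(current, views)
    changed = next != current
    current = next
    i += 1
  }
  current
}

// v1 is a view over v2, which is a view over table t:
val views = Map("v1" -> "v2", "v2" -> "t")
val resolved = fixedPoint("v1", views)
```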
I think that this goes too far by adding more V2 resolution into

It is helpful to consider how resolution works for other nodes to understand the problem with

This shows that there are 2 decision points that are missing from

In addition,

As I've said earlier, I don't think it is a good idea to make major changes to

If we make those changes, then we no longer need to run

We will also need to follow up with a fix for views. Views that are defined with the session catalog as the current catalog are okay because
Thanks @cloud-fan and @rdblue for the detailed explanation and suggestions! There are a few things we need to follow up on:
I can address this in this PR.
I agree, and I can do a follow up PR to disallow this scenario.
The current implementation of
This will be fixed by #25651. We can leave it here.
This is also what I suggested with option 2. Also agree that we should fix view creation later.
Test build #114747 has finished for PR 26684 at commit
```scala
CatalogV2Util.loadTable(catalog, newIdent) match {
  case Some(v1Table: V1Table) =>
    val tableIdent = TableIdentifier(newIdent.name, newIdent.namespace.headOption)
    if (!isRunningDirectlyOnFiles(tableIdent)) {
```
Do we need this check? If we find a v1 table, we should read that table instead of treating the table name as a path and reading files directly.
Good point! For `isRunningDirectlyOnFiles` to be true, the table must not exist. If `CatalogV2Util.loadTable` returned a v1 table, the table exists, so this will always be false.
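The invariant being described can be sketched as follows, with hypothetical helper implementations (not the actual Spark code): once the catalog lookup has succeeded, the "run directly on files" check can never pass.

```scala
// Hypothetical sketch: "running directly on files" only applies when no
// table with that name exists, so after a successful catalog lookup the
// check is necessarily false.
case class Table(name: String)

def loadTable(existing: Set[String], name: String): Option[Table] =
  if (existing.contains(name)) Some(Table(name)) else None

def isRunningDirectlyOnFiles(existing: Set[String], name: String): Boolean =
  !existing.contains(name)

val existing = Set("t")
// Only reachable when loadTable returned Some, i.e. the table exists:
val check = loadTable(existing, "t").map(t => isRunningDirectlyOnFiles(existing, t.name))
```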
```diff
 private def lookupV2Relation(identifier: Seq[String]): Option[DataSourceV2Relation] =
   identifier match {
-    case CatalogObjectIdentifier(catalog, ident) if !CatalogV2Util.isSessionCatalog(catalog) =>
+    case NonSessionCatalogAndIdentifier(catalog, ident) =>
```
Shall we also respect the current namespace here?
Yes, but I was planning to do it as a separate PR. Would that be OK?
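A sketch of what "respecting the current namespace" would mean for lookup (hypothetical helper, not the actual follow-up implementation): a single-part identifier is qualified with the catalog's current namespace before the table is loaded, while multi-part identifiers are taken as-is.

```scala
// Hypothetical sketch: qualify a single-part identifier with the current
// namespace; multi-part identifiers already carry their namespace.
def qualify(currentNamespace: Seq[String], ident: Seq[String]): Seq[String] =
  if (ident.length == 1) currentNamespace ++ ident else ident

val single = qualify(Seq("ns1"), Seq("tbl"))        // gets qualified
val multi  = qualify(Seq("ns1"), Seq("ns2", "tbl")) // left unchanged
```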
Test build #114887 has finished for PR 26684 at commit
+1 I'm okay with resolving temp views in
@rdblue there is some history here. Spark supports CREATE TEMP VIEW USING, which creates a special temp view that points to a data source table (e.g. a Parquet table, a JDBC table), so INSERT INTO needs to support temp views. Maybe we can remove CREATE TEMP VIEW USING too, but that needs more discussion.
Test build #114924 has finished for PR 26684 at commit
The last commit just renames a method and has passed compilation; the previous commit passed all tests. I'm merging to master, thanks!
Test build #114927 has finished for PR 26684 at commit
### What changes were proposed in this pull request?

This PR makes `Analyzer.ResolveRelations` responsible for looking up both v1 and v2 tables from the session catalog and creating an appropriate relation.

### Why are the changes needed?

Currently there are two issues:

1. As described in [SPARK-29966](https://issues.apache.org/jira/browse/SPARK-29966), the logic for resolving a relation can load a table twice, which is a perf regression (e.g., the Hive metastore can be accessed twice).
2. As described in [SPARK-30001](https://issues.apache.org/jira/browse/SPARK-30001), if a catalog name is specified for v1 tables, the query fails:

```
scala> sql("create table t using csv as select 1 as i")
res2: org.apache.spark.sql.DataFrame = []

scala> sql("select * from t").show
+---+
|  i|
+---+
|  1|
+---+

scala> sql("select * from spark_catalog.t").show
org.apache.spark.sql.AnalysisException: Table or view not found: spark_catalog.t; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation [spark_catalog, t]
```

### Does this PR introduce any user-facing change?

Yes. Now the catalog name is resolved correctly:

```
scala> sql("create table t using csv as select 1 as i")
res0: org.apache.spark.sql.DataFrame = []

scala> sql("select * from t").show
+---+
|  i|
+---+
|  1|
+---+

scala> sql("select * from spark_catalog.t").show
+---+
|  i|
+---+
|  1|
+---+
```

### How was this patch tested?

Added new tests.

Closes apache#26684 from imback82/resolve_relation.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
```scala
CatalogV2Util.loadTable(catalog, newIdent) match {
  case Some(v1Table: V1Table) =>
    val tableIdent = TableIdentifier(newIdent.name, newIdent.namespace.headOption)
```
`tableIdent` is not used at all.