-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-29558][SQL] ResolveTables and ResolveRelations should be order-insensitive #26214
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| == SubqueryAlias("tbl1", tempTable1)) | ||
| // Then, if that does not exist, look up the relation in the current database | ||
| catalog.dropTable(TableIdentifier("tbl1"), ignoreIfNotExists = false, purge = false) | ||
| assert(catalog.lookupRelation(TableIdentifier("tbl1")).children.head |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lookupRelation is no longer there, so I removed the related tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to have tests for lookup* and createV1Relation?
|
cc @brkyvz |
|
Test build #112476 has finished for PR 26214 at commit
|
| CatalogV2Util.loadTable(tableCatalog, name.asIdentifier).map { | ||
| case v1Table: V1Table => v1SessionCatalog.createV1Relation(v1Table.v1Table) | ||
| case v2Table => DataSourceV2Relation.create(v2Table) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If tableCatalog is session catalog, loadTable only returns V1Table, isn't?
override def loadTable(ident: Identifier): Table = {
...
V1Table(catalogTable)
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but it will return v2 table after #25651
The code was already written in a future-proof way, so I just keep it.
|
|
||
| case u: UnresolvedV2Relation => | ||
| CatalogV2Util.loadTable(u.catalog, u.tableName).map { table => | ||
| DataSourceV2Relation.create(table) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ResolveTables now only resolves UnresolvedV2Relation. Does it still make sense to have it as separate rule?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will add more in #25955
| // Note this is compatible with the views defined by older versions of Spark(before 2.2), which | ||
| // have empty defaultDatabase and all the relations in viewText have database part defined. | ||
| def resolveRelation(plan: LogicalPlan): LogicalPlan = plan match { | ||
| case u @ UnresolvedRelation(AsTemporaryViewIdentifier(ident)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be simpler to just call ResolveTables.apply(plan) match { here than to embed all the logic within lookupTableFromCatalog?
|
While code unification is nice to have, I think we've been purposefully trying to keep v1 code paths and v2 code paths separate, to make it a lot easier in the future to potentially delete the v1 parts. I think this change could make that a bit harder? What do you think? If we're worried about ordering of rules in the Analyzer, we can ensure that |
| case v2Table => DataSourceV2Relation.create(v2Table) | ||
| } | ||
| } else { | ||
| CatalogV2Util.loadTable(tableCatalog, name.asIdentifier).map(DataSourceV2Relation.create) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you can move CatalogV2Util.loadTable(tableCatalog, name.asIdentifier) out of if-else block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to follow the previous behavior. We only deal with V1Table if it's from v2 session catalog.
| == SubqueryAlias("tbl1", tempTable1)) | ||
| // Then, if that does not exist, look up the relation in the current database | ||
| catalog.dropTable(TableIdentifier("tbl1"), ignoreIfNotExists = false, purge = false) | ||
| assert(catalog.lookupRelation(TableIdentifier("tbl1")).children.head |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to have tests for lookup* and createV1Relation?
| } | ||
|
|
||
| def createV1Relation(metadata: CatalogTable): LogicalPlan = { | ||
| val db = formatDatabaseName(metadata.identifier.database.get) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will database be always Some? Or database.getOrElse(currentDb)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the metadata is looked up from catalog, so it must have the database set. Let me add some comments.
|
|
||
| def lookupGlobalTempView(db: String, table: String): Option[SubqueryAlias] = { | ||
| val formattedDB = formatDatabaseName(db) | ||
| val formattedTable = formatTableName(table) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you can move this to inside if block
| * | ||
| * If the relation is a view, we generate a [[View]] operator from the view description, and | ||
| * wrap the logical plan in a [[SubqueryAlias]] which will track the name of the view. | ||
| * [[SubqueryAlias]] will also keep track of the name and database(optional) of the table/view |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some of these comments are useful?
|
@brkyvz if you look at the updated code, there is no v1 catalog call:
step 3 is the only v1 part and it's easy to delete it when we get rid of v1 in the future. |
| val expandedNameParts = defaultDatabase.toSeq ++ nameParts | ||
| if (expandedNameParts.length == 1) { | ||
| val tblName = expandedNameParts.head | ||
| v1SessionCatalog.lookupTempView(tblName).orElse { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that, for now we use SessionCatalog to lookup temp view, but temp view is not a v1 stuff. Even if we get rid of v1 in the future, temp view must be still there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @brkyvz
|
I don't think that the approach of this PR is a good idea. There are a few guiding principles that we should follow:
I think that merging v2 table resolution into the v1 rule is the wrong direction. I like the approach @brkyvz suggested to apply the The approach in #25955 is another way to go. Only session catalog tables are matched by |
|
Let me explain more about
The approach from @brkyvz does make things better. It enforces the order so that we won't mess it up by accident in the future. However, this is more like a workaround. It's hard for other people to understand the table lookup logic now:
This PR simplifies it quite a bit:
In I'm OK to take the approach from @brkyvz to fix the order dependent problem first. But We do need to clean up the table lookup logic a little bit. I'm glad to see more ideas about it. |
|
I think we're clear on how resolution in v1 works. Where we disagree is how resolution should work. Because we want to be careful not to break any existing behavior, we avoid making significant changes to Longer term, I think it's a bad idea to keep |
This is hard to discuss as we don't know how the view API would look like in DS v2. The rationale for why we resolve them together before: Hive metastore table entries can be either table or view, and we must get the table entry first, then return either a table or view plan depending on the table entry type. Anyway I get your point. I'll hold it off until we figure out the view API in DS v2. But I still think it's better to resolve relation in one place, to ensure
|
1fbc5ac to
252b32f
Compare
|
Test build #113267 has finished for PR 26214 at commit
|
|
Sorry, I didn't realize it had been updated since I was out most of last week. I'll have another look in the next day or two. |
| * Resolve relations to temp views. This is not an actual rule, and is only called by | ||
| * [[ResolveTables]]. | ||
| */ | ||
| object ResolveTempViews extends Rule[LogicalPlan] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that this is refactored into a separate rule. Can we move it to an earlier batch? If metastore views can't contain temp views, then there's no reason to do this in the same batch as table and view resolution from catalogs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as discussed in the sync, we decide to keep it in the current batch for safety, in case some user-supplied analyzers rules need Spark to resolve unresolved temp views.
| def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp { | ||
| case u @ UnresolvedRelation(Seq(part1)) => | ||
| v1SessionCatalog.lookupTempView(part1).getOrElse(u) | ||
| case u @ UnresolvedRelation(Seq(part1, part2)) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to check whether part1 is a known catalog. If it is a catalog, then it isn't a temp view reference because catalog resolution happens first.
Not needing to remember that is the purpose of the extractors. I think it would be better to continue using the extractor:
case u @ UnresolvedRelation(AsTemporaryTableIdentifier(ident)) =>
ident.database match {
case Some(db) =>
v1SessionCatalog.lookupGlobalTempView(db, ident.table).getOrElse(u)
case None =>
v1SessionCatalog.lookupTempView(ident.table).getOrElse(u)
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as discussed in the sync, we decide to treat the global temp view name prefix global_temp as a special catalog, so that it won't be masked by user-supplied catalog.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is intended to only match global_temp then it should match Seq(GLOBAL_TEMP_NAMESPACE, name) instead of matching part1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unfortunately, global_temp is not a constant, it's a static SQL conf that users can set before starting a Spark application(not at runtime)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's why I went ahead with the merge. I think the code is currently correct.
Still, it would be nice to use the runtime setting here in the matcher instead.
| } | ||
| } | ||
|
|
||
| def lookupGlobalTempView(db: String, table: String): Option[SubqueryAlias] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is safe and I do prefer these methods to a combined resolveRelation, but I'm curious why you decided not to use the existing method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea here is to clearly separate the resolution of temp view and v1/v2 table, so I'd like to avoid using resolveRelation which mixes things together.
These 2 methods mostly copy-paste code from resolveRelation. We can update resolveRelation to only resolve v1 tables, but I'd like to do it later as there are many tests calling resolveRelation and we need to update them as well.
|
@cloud-fan, thanks for updating this. If possible, I'd like to move temporary view resolution into its own batch. I don't think there's a reason why we can't do that. I'd also like to continue using the extractors to avoid mistakes handling identifiers, like the conflict between the global view database and a v2 catalog name. |
| } | ||
| } | ||
|
|
||
| test("global temp view should not be masked by v2 catalog") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is okay for now, but I think it is a little confusing that the catalog is completely ignored. I think this should result in an error instead, but we can do that in a follow-up.
|
Test build #113783 has finished for PR 26214 at commit
|
|
retest this please |
|
Test build #114177 has finished for PR 26214 at commit
|
|
Looks reasonable to me. Please continue the work and fix the test failures 👍 |
|
Test build #114221 has finished for PR 26214 at commit
|
|
+1 Merging to master. |
|
@rdblue thanks for reviewing and merging! |
What changes were proposed in this pull request?
Make
ResolveRelationscallResolveTablesat the beginning, and makeResolveTablescallResolveTempViews(newly added) at the beginning, to ensure the relation resolution priority.Why are the changes needed?
To resolve an
UnresolvedRelation, the general process is:Currently, this process is done by 2 rules:
ResolveTablesandResolveRelations. To avoid rule conflicts, we add a lot of checks:ResolveTablesonly resolvesUnresolvedRelationif it's not a temp view and the resolved table is not v1.ResolveRelationsonly resolvesUnresolvedRelationif the table name has less than 2 parts.This requires to run
ResolveTablesbeforeResolveRelations, otherwise we may resolve a v2 table to a v1 relation.To clearly guarantee the resolution priority, and avoid massive changes, this PR proposes to call one rule in another rule to ensure the rule execution order. Now the process is simple:
ResolveTempViews, see if we can resolve relation to temp viewResolveTables, see if we can resolve relation to v2 tables.ResolveRelations, see if we can resolve relation to v1 tables.Does this PR introduce any user-facing change?
no
How was this patch tested?
existing tests