Conversation

@cloud-fan
Contributor

@cloud-fan cloud-fan commented Oct 22, 2019

What changes were proposed in this pull request?

Make ResolveRelations call ResolveTables at the beginning, and make ResolveTables call the newly added ResolveTempViews at the beginning, to enforce the relation resolution priority.

Why are the changes needed?

To resolve an UnresolvedRelation, the general process is:

  1. Try to resolve it to a (global) temp view first. If it's not a temp view, move on.
  2. If the table name specifies a catalog, look up the table in that catalog. Otherwise, look up the table in the current catalog.
  3. When looking up a table in the session catalog, return a v1 relation if the table provider is v1.

Currently, this process is done by 2 rules: ResolveTables and ResolveRelations. To avoid rule conflicts, we add a lot of checks:

  1. ResolveTables only resolves an UnresolvedRelation if it's not a temp view and the resolved table is not v1.
  2. ResolveRelations only resolves an UnresolvedRelation if the table name has fewer than 2 parts.

This requires running ResolveTables before ResolveRelations; otherwise we may resolve a v2 table to a v1 relation.

To clearly guarantee the resolution priority, and to avoid massive changes, this PR proposes to call one rule from another to ensure the rule execution order. Now the process is simple:

  1. First run ResolveTempViews, to see if the relation resolves to a temp view.
  2. Then run ResolveTables, to see if the relation resolves to a v2 table.
  3. Finally run ResolveRelations, to see if the relation resolves to a v1 table.
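The intended priority can be sketched as a chain of lookups, each one trying the higher-priority resolver first (an illustrative Python sketch with made-up table maps and plan strings, not Spark's actual rule classes):

```python
def resolve_temp_views(name, temp_views):
    """Step 1: temp views win over everything else."""
    return temp_views.get(name)

def resolve_tables(name, temp_views, v2_tables):
    """Step 2: try temp views first, then v2 tables."""
    resolved = resolve_temp_views(name, temp_views)
    if resolved is not None:
        return resolved
    return v2_tables.get(name)

def resolve_relations(name, temp_views, v2_tables, v1_tables):
    """Step 3: try the earlier resolvers first, then fall back to v1 tables."""
    resolved = resolve_tables(name, temp_views, v2_tables)
    if resolved is not None:
        return resolved
    return v1_tables.get(name)

temp_views = {"t": "temp_view_plan"}
v2_tables = {"t": "v2_plan", "u": "v2_plan"}
v1_tables = {"u": "v1_plan", "w": "v1_plan"}

# A temp view masks a table with the same name.
assert resolve_relations("t", temp_views, v2_tables, v1_tables) == "temp_view_plan"
# A v2 table is preferred over a v1 table with the same name.
assert resolve_relations("u", temp_views, v2_tables, v1_tables) == "v2_plan"
# Only v1 knows about "w", so it falls through to the v1 lookup.
assert resolve_relations("w", temp_views, v2_tables, v1_tables) == "v1_plan"
```

Because each resolver calls the higher-priority one before doing its own lookup, the ordering holds no matter which rule the analyzer happens to invoke.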

Does this PR introduce any user-facing change?

no

How was this patch tested?

existing tests

== SubqueryAlias("tbl1", tempTable1))
// Then, if that does not exist, look up the relation in the current database
catalog.dropTable(TableIdentifier("tbl1"), ignoreIfNotExists = false, purge = false)
assert(catalog.lookupRelation(TableIdentifier("tbl1")).children.head
Contributor Author

lookupRelation is no longer there, so I removed the related tests.

Contributor

do we need to have tests for lookup* and createV1Relation?

@cloud-fan
Contributor Author

@rdblue
Contributor

rdblue commented Oct 22, 2019

cc @brkyvz

@SparkQA

SparkQA commented Oct 22, 2019

Test build #112476 has finished for PR 26214 at commit 45ab182.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment on lines 794 to 796
CatalogV2Util.loadTable(tableCatalog, name.asIdentifier).map {
case v1Table: V1Table => v1SessionCatalog.createV1Relation(v1Table.v1Table)
case v2Table => DataSourceV2Relation.create(v2Table)
Member

If tableCatalog is the session catalog, loadTable only returns V1Table, doesn't it?

override def loadTable(ident: Identifier): Table = {
  ...
  V1Table(catalogTable)
}

Contributor Author

Yes, but it will return a v2 table after #25651.

The code was already written in a future-proof way, so I just kept it.


case u: UnresolvedV2Relation =>
CatalogV2Util.loadTable(u.catalog, u.tableName).map { table =>
DataSourceV2Relation.create(table)
Member

ResolveTables now only resolves UnresolvedV2Relation. Does it still make sense to keep it as a separate rule?

Contributor Author

We will add more in #25955

// Note this is compatible with the views defined by older versions of Spark(before 2.2), which
// have empty defaultDatabase and all the relations in viewText have database part defined.
def resolveRelation(plan: LogicalPlan): LogicalPlan = plan match {
case u @ UnresolvedRelation(AsTemporaryViewIdentifier(ident))
Contributor

Wouldn't it be simpler to just call ResolveTables.apply(plan) match { here than to embed all the logic within lookupTableFromCatalog?

@brkyvz
Contributor

brkyvz commented Oct 23, 2019

While code unification is nice to have, I think we've been purposefully trying to keep v1 code paths and v2 code paths separate, to make it a lot easier in the future to potentially delete the v1 parts. I think this change could make that a bit harder? What do you think?

If we're worried about ordering of rules in the Analyzer, we can ensure that ResolveRelations always calls ResolveTables first, by just calling the ResolveTables.apply method within ResolveRelations, and remove ResolveTables from the resolution rules in the Analyzer. But it would still maintain that clean separation. Do you think that's possible?

case v2Table => DataSourceV2Relation.create(v2Table)
}
} else {
CatalogV2Util.loadTable(tableCatalog, name.asIdentifier).map(DataSourceV2Relation.create)
Contributor

nit: you can move CatalogV2Util.loadTable(tableCatalog, name.asIdentifier) out of the if-else block.

Contributor Author

This is to follow the previous behavior. We only deal with V1Table if it comes from the v2 session catalog.

}

def createV1Relation(metadata: CatalogTable): LogicalPlan = {
val db = formatDatabaseName(metadata.identifier.database.get)
Contributor

Will database always be Some? Or should this be database.getOrElse(currentDb)?

Contributor Author

@cloud-fan cloud-fan Oct 24, 2019

The metadata is looked up from the catalog, so it must have the database set. Let me add some comments.


def lookupGlobalTempView(db: String, table: String): Option[SubqueryAlias] = {
val formattedDB = formatDatabaseName(db)
val formattedTable = formatTableName(table)
Contributor

nit: you can move this inside the if block

*
* If the relation is a view, we generate a [[View]] operator from the view description, and
* wrap the logical plan in a [[SubqueryAlias]] which will track the name of the view.
* [[SubqueryAlias]] will also keep track of the name and database(optional) of the table/view
Contributor

Are some of these comments still useful?

@cloud-fan
Contributor Author

cloud-fan commented Oct 24, 2019

@brkyvz if you look at the updated code, there is no v1 catalog call:

  1. It first resolves temp views, which is neither a v1 nor a v2 path.
  2. It determines the v2 catalog (which can be V2SessionCatalog) and looks up the table from it. This is completely a v2 path.
  3. If the v2 catalog is the session catalog, it returns a v1 relation if the table is a V1Table. This is an extra step.

Step 3 is the only v1 part, and it's easy to delete when we get rid of v1 in the future.
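The claim that step 3 is an isolated extra step can be illustrated with a hypothetical sketch (made-up names and catalog contents; "V1Table" is just a string tag, not Spark's class): the v1 handling is a single branch that can be deleted without touching the rest of the lookup.

```python
SESSION_CATALOG = "spark_catalog"

def lookup_table(catalog, table, catalogs):
    # Step 2: a pure v2 lookup against the chosen catalog.
    entry = catalogs.get(catalog, {}).get(table)
    if entry is None:
        return None
    # Step 3 (the only v1 part): deleting these two lines removes v1 support
    # without affecting any other path.
    if catalog == SESSION_CATALOG and entry == "V1Table":
        return ("v1_relation", table)
    return ("v2_relation", table)

catalogs = {
    "spark_catalog": {"a": "V1Table", "b": "V2Table"},
    "cat": {"a": "V2Table"},
}
assert lookup_table("cat", "a", catalogs) == ("v2_relation", "a")
assert lookup_table("spark_catalog", "a", catalogs) == ("v1_relation", "a")
assert lookup_table("spark_catalog", "b", catalogs) == ("v2_relation", "b")
```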

val expandedNameParts = defaultDatabase.toSeq ++ nameParts
if (expandedNameParts.length == 1) {
val tblName = expandedNameParts.head
v1SessionCatalog.lookupTempView(tblName).orElse {
Contributor Author

Note that, for now, we use SessionCatalog to look up temp views, but temp views are not a v1 concept. Even if we get rid of v1 in the future, temp views will still be there.

Contributor Author

cc @brkyvz

@rdblue
Contributor

rdblue commented Oct 24, 2019

I don't think that the approach of this PR is a good idea.

There are a few guiding principles that we should follow:

  1. Modify v1 as little as possible: we need to avoid changing any v1 behavior. That's why we've been very careful to avoid modifying how the existing read and write paths work. I think it introduces too much unnecessary risk to rewrite the v1 resolution rule just before a release.
  2. Keep v2 separate: we don't want to require rewriting these rules again to remove v1 in the future. And more importantly, we don't want to use rules like ResolveRelations in v2. It is over-complicated (uses recursion instead of multiple runs), mixes several concerns together, and doesn't fit the design of the analyzer (assumes it is the only resolution rule).
  3. Make incremental changes: we want to avoid completely rewriting v2 resolution to fix a given problem.

I think that merging v2 table resolution into the v1 rule is the wrong direction. I like the approach @brkyvz suggested to apply the ResolveTables rule as part of ResolveRelations, so it is maintained independently and so that we need fewer changes to ResolveRelations.

The approach in #25955 is another way to go. Only session catalog tables are matched by ResolveRelations, so I think it is fine to convert those to UnresolvedCatalogTable and then to a v2 relation in FindDataSourceTables.

@cloud-fan
Contributor Author

Let me explain more about ResolveRelations to make sure we are on the same page. It resolves temp views, and it resolves view text (that's why it recursively calls the analyzer). Temp views are managed by Spark, and view text can refer to v2 tables (e.g. CREATE VIEW v AS SELECT * FROM myCatalog.tbl), so I don't agree we can simply say ResolveRelations is a v1 rule. It's the rule to resolve relations, including temp views, views and tables. I don't think this rule is badly designed. It does only one thing: resolve relations. It would be very tricky to have multiple rules resolving relations: a lot of effort is needed to make sure the resolution order is correct.

ResolveTables is kind of a patch to ResolveRelations, adding multi-catalog support and returning v2 relations for v2 tables (including v2 tables from the session catalog). And the patch must run before ResolveRelations.

The approach from @brkyvz does make things better. It enforces the order so that we won't mess it up by accident in the future. However, it is more like a workaround. It's hard for other people to understand the table lookup logic now:

  1. Begin ResolveTables. Check if it's a temp view. If it is, skip and wait for ResolveRelations to run.
  2. Determine the catalog.
  3. If the catalog is not the session catalog, look up the table from it using the v2 API and return a v2 plan.
  4. If the catalog is the session catalog, look up the table from it using the v2 API. If the table is a v1 table, skip and wait for ResolveRelations to run. Otherwise, return a v2 plan.
  5. Begin ResolveRelations. Check if it's a temp view. If it is, return the temp view plan.
  6. Look up the table from the session catalog using the v1 API, and return a v1 plan.
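The six steps above can be sketched as two cooperating resolvers, where "skip and wait" means leaving the node unresolved for the other rule (an illustrative Python model with made-up names; UNRESOLVED stands in for an UnresolvedRelation node):

```python
UNRESOLVED = object()
SESSION = "spark_catalog"

def resolve_tables(name_parts, temp_views, catalogs):
    # Step 1: skip temp views and wait for ResolveRelations.
    if len(name_parts) == 1 and name_parts[0] in temp_views:
        return UNRESOLVED
    # Step 2: determine the catalog.
    if len(name_parts) == 2 and name_parts[0] in catalogs:
        catalog, table = name_parts
    else:
        catalog, table = SESSION, name_parts[-1]
    entry = catalogs.get(catalog, {}).get(table)
    # Steps 3-4: return a v2 plan, but skip session-catalog v1 tables.
    if entry is None or (catalog == SESSION and entry == "V1Table"):
        return UNRESOLVED
    return ("v2_relation", table)

def resolve_relations(name_parts, temp_views, catalogs):
    # Step 5: temp views resolve here.
    if len(name_parts) == 1 and name_parts[0] in temp_views:
        return ("temp_view", name_parts[0])
    # Step 6: fall back to the v1 session catalog.
    entry = catalogs.get(SESSION, {}).get(name_parts[-1])
    return ("v1_relation", name_parts[-1]) if entry == "V1Table" else UNRESOLVED

def analyze(name_parts, temp_views, catalogs):
    # ResolveTables must run first; otherwise a v2 table in the session
    # catalog would be wrongly resolved to a v1 relation.
    plan = resolve_tables(name_parts, temp_views, catalogs)
    if plan is not UNRESOLVED:
        return plan
    return resolve_relations(name_parts, temp_views, catalogs)

catalogs = {"spark_catalog": {"t1": "V1Table", "t2": "V2Table"}}
assert analyze(["v"], {"v"}, catalogs) == ("temp_view", "v")
assert analyze(["t1"], set(), catalogs) == ("v1_relation", "t1")
assert analyze(["t2"], set(), catalogs) == ("v2_relation", "t2")
```

The fragility is visible in the model: the correct result depends on analyze calling the two resolvers in exactly this order.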

This PR simplifies it quite a bit:

  1. Begin ResolveRelations. Check if it's a temp view. If it is, return the temp view plan.
  2. Determine the catalog.
  3. If the catalog is not the session catalog, look up the table from it using the v2 API and return a v2 plan.
  4. If the catalog is the session catalog, look up the table from it using the v2 API. If the table is a v1 table, return a v1 plan. Otherwise, return a v2 plan.

In V2SessionCatalog, the loadTable method simply calls the v1 catalog API to get the table and wraps it in V1Table, so I don't think it's risky to load a v1 table with the v2 catalog API.

I'm OK with taking the approach from @brkyvz to fix the order-dependence problem first. But we do need to clean up the table lookup logic a little bit. I'm glad to see more ideas about it.

@rdblue
Contributor

rdblue commented Oct 25, 2019

I think we're clear on how resolution in v1 works. Where we disagree is how resolution should work.

Because we want to be careful not to break any existing behavior, we avoid making significant changes to ResolveRelations if possible. The solution to this problem I added to #25955 doesn't modify the rule, so we have options here. I think it's a good idea to consider a solution that calls ResolveTables from ResolveRelations, but I think it is the wrong approach to make significant changes and mix the two rules together by rewriting ResolveRelations as this PR originally did.

Longer term, I think it's a bad idea to keep ResolveRelations around. I understand how it works, and I think that resolution should resolve tables and views separately. View resolution is delicate because you have to use the context of the view (like the current database) and I think it makes sense to keep it separate. Also, using recursive functions instead of running the same rule multiple times makes this over-complicated and harder to maintain. I think it's worth fixing this when we add views from v2 catalogs, and until then making as few changes to ResolveRelations as possible.

@cloud-fan
Contributor Author

and I think that resolution should resolve tables and views separately

This is hard to discuss, as we don't know what the view API will look like in DS v2. The rationale for resolving them together before: Hive metastore table entries can be either a table or a view, so we must get the table entry first, then return either a table or a view plan depending on the entry type.

Anyway, I get your point. I'll hold off until we figure out the view API in DS v2. But I still think it's better to resolve relations in one place, to ensure

  1. temp views have higher priority
  2. tables and views share the same namespace

@cloud-fan cloud-fan force-pushed the resolve branch 2 times, most recently from 1fbc5ac to 252b32f Compare November 5, 2019 11:20
@cloud-fan
Contributor Author

I've updated the PR following @brkyvz 's idea. Please take another look @brkyvz @rdblue

@SparkQA

SparkQA commented Nov 5, 2019

Test build #113267 has finished for PR 26214 at commit 252b32f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

@brkyvz @rdblue any more thoughts?

@rdblue
Contributor

rdblue commented Nov 12, 2019

Sorry, I didn't realize it had been updated since I was out most of last week. I'll have another look in the next day or two.

* Resolve relations to temp views. This is not an actual rule, and is only called by
* [[ResolveTables]].
*/
object ResolveTempViews extends Rule[LogicalPlan] {
Contributor

I like that this is refactored into a separate rule. Can we move it to an earlier batch? If metastore views can't contain temp views, then there's no reason to do this in the same batch as table and view resolution from catalogs.

Contributor Author

As discussed in the sync, we decided to keep it in the current batch for safety, in case some user-supplied analyzer rules need Spark to resolve unresolved temp views.

def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp {
case u @ UnresolvedRelation(Seq(part1)) =>
v1SessionCatalog.lookupTempView(part1).getOrElse(u)
case u @ UnresolvedRelation(Seq(part1, part2)) =>
Contributor

This needs to check whether part1 is a known catalog. If it is a catalog, then it isn't a temp view reference because catalog resolution happens first.

Not needing to remember that is the purpose of the extractors. I think it would be better to continue using the extractor:

  case u @ UnresolvedRelation(AsTemporaryTableIdentifier(ident)) =>
    ident.database match {
      case Some(db) =>
        v1SessionCatalog.lookupGlobalTempView(db, ident.table).getOrElse(u)
      case None =>
        v1SessionCatalog.lookupTempView(ident.table).getOrElse(u)
    }

Contributor Author

As discussed in the sync, we decided to treat the global temp view name prefix global_temp as a special catalog, so that it won't be masked by a user-supplied catalog.

Contributor

If this is intended to only match global_temp then it should match Seq(GLOBAL_TEMP_NAMESPACE, name) instead of matching part1.

Contributor Author

Unfortunately, global_temp is not a constant; it's a static SQL conf that users can set before starting a Spark application (not at runtime).
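Because the database name comes from a conf rather than a constant, the matcher has to compare against the runtime value instead of a literal pattern. A minimal hypothetical sketch (the conf key spark.sql.globalTempDatabase is Spark's actual static conf; the matcher function and conf dict are made up for illustration):

```python
# The global temp database name is configurable at application startup,
# so the matcher reads the conf value instead of a hard-coded constant.
conf = {"spark.sql.globalTempDatabase": "global_temp"}

def match_global_temp_view(name_parts):
    """Return the view name if name_parts is [global_temp_db, view], else None."""
    global_db = conf["spark.sql.globalTempDatabase"]
    if len(name_parts) == 2 and name_parts[0].lower() == global_db.lower():
        return name_parts[1]
    return None

assert match_global_temp_view(["global_temp", "v1"]) == "v1"
assert match_global_temp_view(["my_catalog", "v1"]) is None
assert match_global_temp_view(["v1"]) is None
```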

Contributor

Yeah, that's why I went ahead with the merge. I think the code is currently correct.

Still, it would be nice to use the runtime setting here in the matcher instead.

}
}

def lookupGlobalTempView(db: String, table: String): Option[SubqueryAlias] = {
Contributor

I think this is safe, and I do prefer these methods to a combined resolveRelation, but I'm curious why you decided not to use the existing method.

Contributor Author

@cloud-fan cloud-fan Nov 14, 2019

The idea here is to clearly separate the resolution of temp views and v1/v2 tables, so I'd like to avoid using resolveRelation, which mixes things together.

These two methods mostly copy code from resolveRelation. We could update resolveRelation to only resolve v1 tables, but I'd like to do that later, as many tests call resolveRelation and we would need to update them as well.

@rdblue
Contributor

rdblue commented Nov 14, 2019

@cloud-fan, thanks for updating this. If possible, I'd like to move temporary view resolution into its own batch. I don't think there's a reason why we can't do that.

I'd also like to continue using the extractors to avoid mistakes handling identifiers, like the conflict between the global view database and a v2 catalog name.

}
}

test("global temp view should not be masked by v2 catalog") {
Contributor Author

does the behavior make sense to you? @rdblue @brkyvz

Contributor

This is okay for now, but I think it is a little confusing that the catalog is completely ignored. I think this should result in an error instead, but we can do that in a follow-up.

@SparkQA

SparkQA commented Nov 14, 2019

Test build #113783 has finished for PR 26214 at commit eff3980.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Nov 20, 2019

Test build #114177 has finished for PR 26214 at commit eff3980.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member

Looks reasonable to me. Please continue the work and fix the test failures 👍

@SparkQA

SparkQA commented Nov 21, 2019

Test build #114221 has finished for PR 26214 at commit c10f288.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor

rdblue commented Nov 21, 2019

+1

Merging to master.

@cloud-fan
Contributor Author

@rdblue thanks for reviewing and merging!
