Conversation

@cloud-fan
Contributor

@cloud-fan cloud-fan commented Oct 22, 2019

What changes were proposed in this pull request?

Make ResolveRelations call ResolveTables at the beginning, and make ResolveTables call the newly added ResolveTempViews at the beginning, to enforce the relation resolution priority.

Why are the changes needed?

To resolve an UnresolvedRelation, the general process is:

  1. Try to resolve it to a (global) temp view first. If it's not a temp view, move on.
  2. If the table name specifies a catalog, look up the table in that catalog. Otherwise, look up the table in the current catalog.
  3. When looking up a table in the session catalog, return a v1 relation if the table provider is v1.

Currently, this process is done by 2 rules: ResolveTables and ResolveRelations. To avoid rule conflicts, we add a lot of checks:

  1. ResolveTables only resolves an UnresolvedRelation if it's not a temp view and the resolved table is not v1.
  2. ResolveRelations only resolves an UnresolvedRelation if the table name has fewer than 2 parts.

This requires running ResolveTables before ResolveRelations; otherwise we may resolve a v2 table to a v1 relation.

To clearly guarantee the resolution priority, and to avoid massive changes, this PR proposes to call one rule from another to ensure the rule execution order. Now the process is simple:

  1. First run ResolveTempViews, to see if the relation resolves to a temp view.
  2. Then run ResolveTables, to see if the relation resolves to a v2 table.
  3. Finally run ResolveRelations, to see if the relation resolves to a v1 table.
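The intended priority can be sketched as a chain of lookups, each one trying the higher-priority resolver first (an illustrative Python sketch with made-up table maps and plan strings, not Spark's actual rule classes):

```python
def resolve_temp_views(name, temp_views):
    """Step 1: temp views win over everything else."""
    return temp_views.get(name)

def resolve_tables(name, temp_views, v2_tables):
    """Step 2: try temp views first, then v2 tables."""
    resolved = resolve_temp_views(name, temp_views)
    if resolved is not None:
        return resolved
    return v2_tables.get(name)

def resolve_relations(name, temp_views, v2_tables, v1_tables):
    """Step 3: try the earlier resolvers first, then fall back to v1 tables."""
    resolved = resolve_tables(name, temp_views, v2_tables)
    if resolved is not None:
        return resolved
    return v1_tables.get(name)

temp_views = {"t": "temp_view_plan"}
v2_tables = {"t": "v2_plan", "u": "v2_plan"}
v1_tables = {"u": "v1_plan", "w": "v1_plan"}

# A temp view masks a table with the same name.
assert resolve_relations("t", temp_views, v2_tables, v1_tables) == "temp_view_plan"
# A v2 table is preferred over a v1 table with the same name.
assert resolve_relations("u", temp_views, v2_tables, v1_tables) == "v2_plan"
# Only v1 knows about "w", so it falls through to the v1 lookup.
assert resolve_relations("w", temp_views, v2_tables, v1_tables) == "v1_plan"
```

Because each resolver calls the higher-priority one before doing its own lookup, the ordering holds no matter which rule the analyzer happens to invoke.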

Does this PR introduce any user-facing change?

no

How was this patch tested?

existing tests

== SubqueryAlias("tbl1", tempTable1))
// Then, if that does not exist, look up the relation in the current database
catalog.dropTable(TableIdentifier("tbl1"), ignoreIfNotExists = false, purge = false)
assert(catalog.lookupRelation(TableIdentifier("tbl1")).children.head
Contributor Author

lookupRelation is no longer there, so I removed the related tests.

Contributor

do we need to have tests for lookup* and createV1Relation?

@cloud-fan
Contributor Author

@rdblue
Contributor

rdblue commented Oct 22, 2019

cc @brkyvz

@SparkQA

SparkQA commented Oct 22, 2019

Test build #112476 has finished for PR 26214 at commit 45ab182.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment on lines 794 to 796
CatalogV2Util.loadTable(tableCatalog, name.asIdentifier).map {
case v1Table: V1Table => v1SessionCatalog.createV1Relation(v1Table.v1Table)
case v2Table => DataSourceV2Relation.create(v2Table)
Member

If tableCatalog is the session catalog, loadTable only returns V1Table, doesn't it?

override def loadTable(ident: Identifier): Table = {
  ...
  V1Table(catalogTable)
}

Contributor Author

Yes, but it will return a v2 table after #25651.

The code was already written in a future-proof way, so I just kept it.


case u: UnresolvedV2Relation =>
CatalogV2Util.loadTable(u.catalog, u.tableName).map { table =>
DataSourceV2Relation.create(table)
Member

ResolveTables now only resolves UnresolvedV2Relation. Does it still make sense to keep it as a separate rule?

Contributor Author

We will add more in #25955

// Note this is compatible with the views defined by older versions of Spark(before 2.2), which
// have empty defaultDatabase and all the relations in viewText have database part defined.
def resolveRelation(plan: LogicalPlan): LogicalPlan = plan match {
case u @ UnresolvedRelation(AsTemporaryViewIdentifier(ident))
Contributor

Wouldn't it be simpler to just call ResolveTables.apply(plan) match { here than to embed all the logic within lookupTableFromCatalog?

@brkyvz
Contributor

brkyvz commented Oct 23, 2019

While code unification is nice to have, I think we've been purposefully trying to keep v1 code paths and v2 code paths separate, to make it a lot easier in the future to potentially delete the v1 parts. I think this change could make that a bit harder? What do you think?

If we're worried about ordering of rules in the Analyzer, we can ensure that ResolveRelations always calls ResolveTables first, by just calling the ResolveTables.apply method within ResolveRelations, and remove ResolveTables from the resolution rules in the Analyzer. But it would still maintain that clean separation. Do you think that's possible?

case v2Table => DataSourceV2Relation.create(v2Table)
}
} else {
CatalogV2Util.loadTable(tableCatalog, name.asIdentifier).map(DataSourceV2Relation.create)
Contributor

nit: you can move CatalogV2Util.loadTable(tableCatalog, name.asIdentifier) out of the if-else block.

Contributor Author

This is to follow the previous behavior. We only deal with V1Table if it comes from the v2 session catalog.

}

def createV1Relation(metadata: CatalogTable): LogicalPlan = {
val db = formatDatabaseName(metadata.identifier.database.get)
Contributor

Will database always be Some? Or should this be database.getOrElse(currentDb)?

Contributor Author

@cloud-fan cloud-fan Oct 24, 2019

The metadata is looked up from the catalog, so it must have the database set. Let me add some comments.


def lookupGlobalTempView(db: String, table: String): Option[SubqueryAlias] = {
val formattedDB = formatDatabaseName(db)
val formattedTable = formatTableName(table)
Contributor

nit: you can move this inside the if block

*
* If the relation is a view, we generate a [[View]] operator from the view description, and
* wrap the logical plan in a [[SubqueryAlias]] which will track the name of the view.
* [[SubqueryAlias]] will also keep track of the name and database(optional) of the table/view
Contributor

Are some of these comments still useful?

@cloud-fan
Contributor Author

cloud-fan commented Oct 24, 2019

@brkyvz if you look at the updated code, there is no v1 catalog call:

  1. It first resolves temp views, which is neither a v1 nor a v2 path.
  2. It determines the v2 catalog (which can be V2SessionCatalog) and looks up the table from it. This is completely a v2 path.
  3. If the v2 catalog is the session catalog, it returns a v1 relation if the table is a V1Table. This is an extra step.

Step 3 is the only v1 part, and it's easy to delete when we get rid of v1 in the future.
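The claim that step 3 is an isolated extra step can be illustrated with a hypothetical sketch (made-up names and catalog contents; "V1Table" is just a string tag, not Spark's class): the v1 handling is a single branch that can be deleted without touching the rest of the lookup.

```python
SESSION_CATALOG = "spark_catalog"

def lookup_table(catalog, table, catalogs):
    # Step 2: a pure v2 lookup against the chosen catalog.
    entry = catalogs.get(catalog, {}).get(table)
    if entry is None:
        return None
    # Step 3 (the only v1 part): deleting these two lines removes v1 support
    # without affecting any other path.
    if catalog == SESSION_CATALOG and entry == "V1Table":
        return ("v1_relation", table)
    return ("v2_relation", table)

catalogs = {
    "spark_catalog": {"a": "V1Table", "b": "V2Table"},
    "cat": {"a": "V2Table"},
}
assert lookup_table("cat", "a", catalogs) == ("v2_relation", "a")
assert lookup_table("spark_catalog", "a", catalogs) == ("v1_relation", "a")
assert lookup_table("spark_catalog", "b", catalogs) == ("v2_relation", "b")
```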

val expandedNameParts = defaultDatabase.toSeq ++ nameParts
if (expandedNameParts.length == 1) {
val tblName = expandedNameParts.head
v1SessionCatalog.lookupTempView(tblName).orElse {
Contributor Author

Note that, for now, we use SessionCatalog to look up temp views, but temp views are not a v1 concept. Even if we get rid of v1 in the future, temp views will still be there.

Contributor Author

cc @brkyvz

@rdblue
Contributor

rdblue commented Oct 24, 2019

I don't think that the approach of this PR is a good idea.

There are a few guiding principles that we should follow:

  1. Modify v1 as little as possible: we need to avoid changing any v1 behavior. That's why we've been very careful to avoid modifying how the existing read and write paths work. I think it introduces too much unnecessary risk to rewrite the v1 resolution rule just before a release.
  2. Keep v2 separate: we don't want to require rewriting these rules again to remove v1 in the future. And more importantly, we don't want to use rules like ResolveRelations in v2. It is over-complicated (uses recursion instead of multiple runs), mixes several concerns together, and doesn't fit the design of the analyzer (assumes it is the only resolution rule).
  3. Make incremental changes: we want to avoid completely rewriting v2 resolution to fix a given problem.

I think that merging v2 table resolution into the v1 rule is the wrong direction. I like the approach @brkyvz suggested to apply the ResolveTables rule as part of ResolveRelations, so it is maintained independently and so that we need fewer changes to ResolveRelations.

The approach in #25955 is another way to go. Only session catalog tables are matched by ResolveRelations, so I think it is fine to convert those to UnresolvedCatalogTable and then to a v2 relation in FindDataSourceTables.

@cloud-fan
Contributor Author

Let me explain more about ResolveRelations to make sure we are on the same page. It resolves temp views, and it resolves view text (that's why it recursively calls the analyzer). Temp views are managed by Spark, and view text can refer to v2 tables (e.g. CREATE VIEW v AS SELECT * FROM myCatalog.tbl), so I don't agree we can simply say ResolveRelations is a v1 rule. It's the rule to resolve relations, including temp views, views and tables. I don't think this rule is badly designed. It does only one thing: resolve relations. It would be very tricky to have multiple rules resolving relations: a lot of effort is needed to make sure the resolution order is correct.

ResolveTables is kind of a patch to ResolveRelations, adding multi-catalog support and returning v2 relations for v2 tables (including v2 tables from the session catalog). And the patch must run before ResolveRelations.

The approach from @brkyvz does make things better. It enforces the order so that we won't mess it up by accident in the future. However, it is more like a workaround. It's hard for other people to understand the table lookup logic now:

  1. Begin ResolveTables. Check if it's a temp view. If it is, skip and wait for ResolveRelations to run.
  2. Determine the catalog.
  3. If the catalog is not the session catalog, look up the table from it using the v2 API and return a v2 plan.
  4. If the catalog is the session catalog, look up the table from it using the v2 API. If the table is a v1 table, skip and wait for ResolveRelations to run. Otherwise, return a v2 plan.
  5. Begin ResolveRelations. Check if it's a temp view. If it is, return the temp view plan.
  6. Look up the table from the session catalog using the v1 API, and return a v1 plan.
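The six steps above can be sketched as two cooperating resolvers, where "skip and wait" means leaving the node unresolved for the other rule (an illustrative Python model with made-up names; UNRESOLVED stands in for an UnresolvedRelation node):

```python
UNRESOLVED = object()
SESSION = "spark_catalog"

def resolve_tables(name_parts, temp_views, catalogs):
    # Step 1: skip temp views and wait for ResolveRelations.
    if len(name_parts) == 1 and name_parts[0] in temp_views:
        return UNRESOLVED
    # Step 2: determine the catalog.
    if len(name_parts) == 2 and name_parts[0] in catalogs:
        catalog, table = name_parts
    else:
        catalog, table = SESSION, name_parts[-1]
    entry = catalogs.get(catalog, {}).get(table)
    # Steps 3-4: return a v2 plan, but skip session-catalog v1 tables.
    if entry is None or (catalog == SESSION and entry == "V1Table"):
        return UNRESOLVED
    return ("v2_relation", table)

def resolve_relations(name_parts, temp_views, catalogs):
    # Step 5: temp views resolve here.
    if len(name_parts) == 1 and name_parts[0] in temp_views:
        return ("temp_view", name_parts[0])
    # Step 6: fall back to the v1 session catalog.
    entry = catalogs.get(SESSION, {}).get(name_parts[-1])
    return ("v1_relation", name_parts[-1]) if entry == "V1Table" else UNRESOLVED

def analyze(name_parts, temp_views, catalogs):
    # ResolveTables must run first; otherwise a v2 table in the session
    # catalog would be wrongly resolved to a v1 relation.
    plan = resolve_tables(name_parts, temp_views, catalogs)
    if plan is not UNRESOLVED:
        return plan
    return resolve_relations(name_parts, temp_views, catalogs)

catalogs = {"spark_catalog": {"t1": "V1Table", "t2": "V2Table"}}
assert analyze(["v"], {"v"}, catalogs) == ("temp_view", "v")
assert analyze(["t1"], set(), catalogs) == ("v1_relation", "t1")
assert analyze(["t2"], set(), catalogs) == ("v2_relation", "t2")
```

The fragility is visible in the model: the correct result depends on analyze calling the two resolvers in exactly this order.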

This PR simplifies it quite a bit:

  1. Begin ResolveRelations. Check if it's a temp view. If it is, return the temp view plan.
  2. Determine the catalog.
  3. If the catalog is not the session catalog, look up the table from it using the v2 API and return a v2 plan.
  4. If the catalog is the session catalog, look up the table from it using the v2 API. If the table is a v1 table, return a v1 plan. Otherwise, return a v2 plan.

In V2SessionCatalog, the loadTable method simply calls the v1 catalog API to get the table and wraps it in V1Table, so I don't think it's risky to load a v1 table with the v2 catalog API.

I'm OK with taking the approach from @brkyvz to fix the order-dependence problem first. But we do need to clean up the table lookup logic a little bit. I'm glad to see more ideas about it.

@rdblue
Contributor

rdblue commented Oct 25, 2019

I think we're clear on how resolution in v1 works. Where we disagree is how resolution should work.

Because we want to be careful not to break any existing behavior, we avoid making significant changes to ResolveRelations if possible. The solution to this problem I added to #25955 doesn't modify the rule, so we have options here. I think it's a good idea to consider a solution that calls ResolveTables from ResolveRelations, but I think it is the wrong approach to make significant changes and mix the two rules together by rewriting ResolveRelations as this PR originally did.

Longer term, I think it's a bad idea to keep ResolveRelations around. I understand how it works, and I think that resolution should resolve tables and views separately. View resolution is delicate because you have to use the context of the view (like the current database) and I think it makes sense to keep it separate. Also, using recursive functions instead of running the same rule multiple times makes this over-complicated and harder to maintain. I think it's worth fixing this when we add views from v2 catalogs, and until then making as few changes to ResolveRelations as possible.

@cloud-fan
Contributor Author

and I think that resolution should resolve tables and views separately

This is hard to discuss, as we don't know what the view API will look like in DS v2. The rationale for resolving them together before: Hive metastore table entries can be either a table or a view, so we must get the table entry first, then return either a table or a view plan depending on the entry type.

Anyway, I get your point. I'll hold off until we figure out the view API in DS v2. But I still think it's better to resolve relations in one place, to ensure

  1. temp views have higher priority
  2. tables and views share the same namespace

@cloud-fan cloud-fan force-pushed the resolve branch 2 times, most recently from 1fbc5ac to 252b32f Compare November 5, 2019 11:20
@cloud-fan
Contributor Author

I've updated the PR following @brkyvz 's idea. Please take another look @brkyvz @rdblue

@SparkQA

SparkQA commented Nov 5, 2019

Test build #113267 has finished for PR 26214 at commit 252b32f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

@brkyvz @rdblue any more thoughts?

@rdblue
Contributor

rdblue commented Nov 12, 2019

Sorry, I didn't realize it had been updated since I was out most of last week. I'll have another look in the next day or two.

* Resolve relations to temp views. This is not an actual rule, and is only called by
* [[ResolveTables]].
*/
object ResolveTempViews extends Rule[LogicalPlan] {
Contributor

I like that this is refactored into a separate rule. Can we move it to an earlier batch? If metastore views can't contain temp views, then there's no reason to do this in the same batch as table and view resolution from catalogs.

Contributor Author

As discussed in the sync, we decided to keep it in the current batch for safety, in case some user-supplied analyzer rules need Spark to resolve unresolved temp views.

def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp {
case u @ UnresolvedRelation(Seq(part1)) =>
v1SessionCatalog.lookupTempView(part1).getOrElse(u)
case u @ UnresolvedRelation(Seq(part1, part2)) =>
Contributor

This needs to check whether part1 is a known catalog. If it is a catalog, then it isn't a temp view reference because catalog resolution happens first.

Not needing to remember that is the purpose of the extractors. I think it would be better to continue using the extractor:

  case u @ UnresolvedRelation(AsTemporaryTableIdentifier(ident)) =>
    ident.database match {
      case Some(db) =>
        v1SessionCatalog.lookupGlobalTempView(db, ident.table).getOrElse(u)
      case None =>
        v1SessionCatalog.lookupTempView(ident.table).getOrElse(u)
    }

Contributor Author

As discussed in the sync, we decided to treat the global temp view name prefix global_temp as a special catalog, so that it won't be masked by a user-supplied catalog.

Contributor

If this is intended to only match global_temp then it should match Seq(GLOBAL_TEMP_NAMESPACE, name) instead of matching part1.

Contributor Author

Unfortunately, global_temp is not a constant; it's a static SQL conf that users can set before starting a Spark application (not at runtime).
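Because the database name comes from a conf rather than a constant, the matcher has to compare against the runtime value instead of a literal pattern. A minimal hypothetical sketch (the conf key spark.sql.globalTempDatabase is Spark's actual static conf; the matcher function and conf dict are made up for illustration):

```python
# The global temp database name is configurable at application startup,
# so the matcher reads the conf value instead of a hard-coded constant.
conf = {"spark.sql.globalTempDatabase": "global_temp"}

def match_global_temp_view(name_parts):
    """Return the view name if name_parts is [global_temp_db, view], else None."""
    global_db = conf["spark.sql.globalTempDatabase"]
    if len(name_parts) == 2 and name_parts[0].lower() == global_db.lower():
        return name_parts[1]
    return None

assert match_global_temp_view(["global_temp", "v1"]) == "v1"
assert match_global_temp_view(["my_catalog", "v1"]) is None
assert match_global_temp_view(["v1"]) is None
```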

Contributor

Yeah, that's why I went ahead with the merge. I think the code is currently correct.

Still, it would be nice to use the runtime setting here in the matcher instead.

}
}

def lookupGlobalTempView(db: String, table: String): Option[SubqueryAlias] = {
Contributor

I think this is safe, and I do prefer these methods to a combined resolveRelation, but I'm curious why you decided not to use the existing method.

Contributor Author

@cloud-fan cloud-fan Nov 14, 2019

The idea here is to clearly separate the resolution of temp views and v1/v2 tables, so I'd like to avoid using resolveRelation, which mixes things together.

These two methods mostly copy code from resolveRelation. We could update resolveRelation to only resolve v1 tables, but I'd like to do that later, as many tests call resolveRelation and we would need to update them as well.

@rdblue
Contributor

rdblue commented Nov 14, 2019

@cloud-fan, thanks for updating this. If possible, I'd like to move temporary view resolution into its own batch. I don't think there's a reason why we can't do that.

I'd also like to continue using the extractors to avoid mistakes handling identifiers, like the conflict between the global view database and a v2 catalog name.

}
}

test("global temp view should not be masked by v2 catalog") {
Contributor Author

does the behavior make sense to you? @rdblue @brkyvz

Contributor

This is okay for now, but I think it is a little confusing that the catalog is completely ignored. I think this should result in an error instead, but we can do that in a follow-up.

@SparkQA

SparkQA commented Nov 14, 2019

Test build #113783 has finished for PR 26214 at commit eff3980.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Nov 20, 2019

Test build #114177 has finished for PR 26214 at commit eff3980.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member

Looks reasonable to me. Please continue the work and fix the test failures 👍

@SparkQA

SparkQA commented Nov 21, 2019

Test build #114221 has finished for PR 26214 at commit c10f288.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor

rdblue commented Nov 21, 2019

+1

Merging to master.

@cloud-fan
Contributor Author

@rdblue thanks for reviewing and merging!
