[SPARK-12725] [SQL] Resolving Name Conflicts in SQL Generation and Name Ambiguity Caused by Internally Generated Expressions #11050

gatorsmile · 2016-02-03T14:34:22Z

Some analysis rules generate aliases or auxiliary attribute references with the same name but different expression IDs. For example, ResolveAggregateFunctions introduces havingCondition and aggOrder, and DistinctAggregationRewriter introduces gid.

This is fine for normal query execution, because these attribute references are distinguished by their expression IDs. It becomes a problem when resolved query plans are converted back to SQL query strings, because the expression IDs are erased and only the duplicate names remain.

Here's an example Spark 1.6.0 snippet for illustration:

sqlContext.range(10).select('id as 'a, 'id as 'b).registerTempTable("t")
sqlContext.sql("SELECT SUM(a) FROM t GROUP BY a, b ORDER BY COUNT(a), COUNT(b)").explain(true)

The above code produces the following resolved plan:

== Analyzed Logical Plan ==
_c0: bigint
Project [_c0#101L]
+- Sort [aggOrder#102L ASC,aggOrder#103L ASC], true
   +- Aggregate [a#47L,b#48L], [(sum(a#47L),mode=Complete,isDistinct=false) AS _c0#101L,(count(a#47L),mode=Complete,isDistinct=false) AS aggOrder#102L,(count(b#48L),mode=Complete,isDistinct=false) AS aggOrder#103L]
      +- Subquery t
         +- Project [id#46L AS a#47L,id#46L AS b#48L]
            +- LogicalRDD [id#46L], MapPartitionsRDD[44] at range at <console>:26

Here we can see that both aggregate expressions in ORDER BY are extracted into an Aggregate operator, and both of them are named aggOrder with different expression IDs.

The solution is for SQL Generation to automatically append the expression ID to the name of every Alias and AttributeReference that was generated by the Analyzer.
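A minimal sketch of the naming idea, using a simplified stand-in rather than the real Catalyst classes. `Attr`, `NamingSketch`, and `sqlName` are hypothetical names introduced only for illustration; the expression IDs are taken from the plan above.

```scala
// Simplified stand-in for a Catalyst named expression; not the real API.
final case class Attr(name: String, exprId: Long, isGenerated: Boolean = false)

object NamingSketch {
  // Name used when a plan is rendered back to SQL: generated attributes
  // get their expression ID appended so that they stay distinct.
  def sqlName(a: Attr): String =
    if (a.isGenerated) s"${a.name}#${a.exprId}" else a.name

  def main(args: Array[String]): Unit = {
    val output = Seq(
      Attr("a", 47),
      Attr("aggOrder", 102, isGenerated = true),
      Attr("aggOrder", 103, isGenerated = true))

    // Dropping expression IDs leaves two columns called aggOrder.
    println(output.map(_.name).mkString(", "))
    // Appending the ID to generated names removes the clash.
    println(output.map(sqlName).mkString(", "))
  }
}
```

User-chosen names are left untouched in this sketch, so only analyzer-generated columns change in the emitted SQL.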

This PR also resolves another issue: users can choose names that coincide with internally generated ones. Such duplicate names should not cause name ambiguity. When resolving a column, Catalyst should not pick a column that was internally generated.

Could you review the solution? @marmbrus @liancheng

I did not set the newly added flag for every alias and attribute reference generated by the Analyzer. Please let me know if I should. Thank you!

SparkQA · 2016-02-03T16:40:12Z

Test build #50662 has finished for PR 11050 at commit 82bb46f.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Alias(child: Expression, name: String, isGenerated: Boolean = false)(

marmbrus · 2016-02-03T18:06:44Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala

Why not put isGenerated into the second list with the other metadata so that we don't have to rewrite every match statement in the code base?

uh, I see. Will do it. : )
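To illustrate the suggestion, here is a small sketch of how a second parameter list keeps existing pattern matches intact. The `Alias` below is a hypothetical, heavily simplified class (its child is a plain string), not the Catalyst definition.

```scala
// Hypothetical, simplified Alias; the real Catalyst class has more fields.
final case class Alias(child: String, name: String)(val isGenerated: Boolean = false)

object CurriedSketch {
  // Existing two-argument patterns keep compiling, because the generated
  // extractor only exposes the first parameter list.
  def describe(a: Alias): String = a match {
    case Alias(child, name) => s"$child AS $name"
  }

  def main(args: Array[String]): Unit = {
    val user = Alias("id", "a")()
    val generated = Alias("count(a)", "aggOrder")(isGenerated = true)
    println(describe(user))
    println(describe(generated))
    println(generated.isGenerated)
  }
}
```

One trade-off of this layout in Scala is that fields in the second parameter list do not take part in the compiler-generated `equals`, `hashCode`, or `copy` defaults, so code that copies such a node has to pass the flag along explicitly.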

marmbrus · 2016-02-03T18:08:21Z

We should also modify the resolve function to filter out attributes where isGenerated = true. There have been bugs in the past where this creates ambiguity and we can finally fix that.

gatorsmile · 2016-02-03T18:22:59Z

@marmbrus Do you still remember the JIRA number or the case? If so, I can add it into the test cases and verify if the fix works well.

In this PR, isGenerated is added to AttributeReference and Alias. AttributeReference is always resolved. Alias is not resolved unless the child has been resolved, and thus, I think we are unable to skip resolving its child even if the Alias is added by Analyzer. Please correct me if my understanding is wrong. Thank you!

marmbrus · 2016-02-03T18:28:36Z

I'd have to search for it, as it was a while ago. You should be able to trigger it by forcing the analyzer to generate a column while referencing a column with the same name.

I wasn't referring to the resolved property of an Expression, but instead the resolve function on a LogicalPlan. This should probably never return attributes that are generated.
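A rough sketch of what such filtering could look like, under the assumption that name lookup works over a plan's output attributes. `Attribute`, `ResolveSketch`, and this `resolve` signature are hypothetical simplifications, not the actual LogicalPlan API.

```scala
// Hypothetical, simplified attribute; not the real AttributeReference.
final case class Attribute(name: String, exprId: Long, isGenerated: Boolean = false)

object ResolveSketch {
  // Look up a column by name, ignoring analyzer-generated attributes.
  // Left carries an error message, Right the single matching attribute.
  def resolve(name: String, output: Seq[Attribute]): Either[String, Attribute] =
    output.filter(a => !a.isGenerated && a.name == name) match {
      case Seq(single) => Right(single)
      case Seq()       => Left(s"cannot resolve '$name'")
      case _           => Left(s"reference '$name' is ambiguous")
    }

  def main(args: Array[String]): Unit = {
    val output = Seq(
      Attribute("a", 1),
      Attribute("havingCondition", 2),                     // user column
      Attribute("havingCondition", 3, isGenerated = true)) // added by the analyzer
    println(resolve("havingCondition", output))
    println(resolve("b", output))
  }
}
```

With the generated attribute filtered out, a user column that happens to share its name resolves without ambiguity, and a generated name with no user counterpart is simply not found.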

gatorsmile · 2016-02-03T18:32:56Z

@marmbrus : ) Now I understand it. Will do it. Thank you!

gatorsmile · 2016-02-03T23:27:09Z

I guess the PR you mentioned is #8231

I think we should prevent users from accidentally referencing the internally generated names. I am therefore adding the following test case to verify that the fix works. In this case, the table does not have a column named havingCondition, so Spark SQL reports an error saying that it cannot be resolved.

errorTest(
   "unresolved attributes with a generated name",
   testRelation2.groupBy('a)(max('b))
     .where(sum('b) > 0)
     .orderBy('havingCondition.asc),
   "cannot resolve" :: "havingCondition" :: Nil)

marmbrus · 2016-02-03T23:34:36Z

Yeah, that test looks great.

SparkQA · 2016-02-04T01:04:33Z

Test build #50695 has finished for PR 11050 at commit 22cf88a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-04T01:30:33Z

Test build #50704 has finished for PR 11050 at commit cf89603.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

liancheng · 2016-02-05T04:51:24Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

              UnresolvedAlias(child = f.copy(children = newChildren)) :: Nil
-            case Alias(f @ UnresolvedFunction(_, args, _), name) if containsStar(args) =>
+            case a @ Alias(f @ UnresolvedFunction(_, args, _), name)
+                if containsStar(args) =>


Nit: This change can be reverted now.

SparkQA · 2016-02-06T19:52:08Z

Test build #50875 has finished for PR 11050 at commit 4e00349.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-02-06T22:55:36Z

retest this please

SparkQA · 2016-02-07T00:42:33Z

Test build #50882 has finished for PR 11050 at commit 4e00349.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-02-07T08:02:43Z

retest this please

SparkQA · 2016-02-07T10:30:09Z

Test build #50893 has finished for PR 11050 at commit 4e00349.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

liancheng · 2016-02-08T03:06:40Z

retest this please

SparkQA · 2016-02-08T07:30:05Z

Test build #50908 has finished for PR 11050 at commit 4e00349.

This patch fails from timeout after a configured wait of 250m.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-02-08T15:52:17Z

retest this please

SparkQA · 2016-02-08T18:03:11Z

Test build #50926 has finished for PR 11050 at commit 4e00349.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

liancheng · 2016-02-11T02:14:53Z

LGTM, merging to master. Thanks!

cloud-fan · 2016-03-12T02:44:50Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala

    val qualifiersString =
      if (qualifiers.isEmpty) "" else qualifiers.map("`" + _ + "`").mkString("", ".", ".")
-    s"${child.sql} AS $qualifiersString`$name`"
+    val aliasName = if (isGenerated) s"$name#${exprId.id}" else s"$name"


Why do we need to change the sql according to isGenerated?

Ah, I see. This is to work around an issue in SQLBuilder.

Yeah. You can get more background from the discussion in the JIRA: https://issues.apache.org/jira/browse/SPARK-12725

…rom Alias and AttributeReference

## What changes were proposed in this pull request?
`isTableSample` and `isGenerated` were introduced for SQL Generation respectively by #11148 and #11050

Since SQL Generation is removed, we do not need to keep `isTableSample`.

## How was this patch tested?
The existing test cases

Author: Xiao Li <gatorsmile@gmail.com>

Closes #18379 from gatorsmile/CleanSample.

gatorsmile added 2 commits February 1, 2016 23:35

turn on the test.

7937d2b

added a flag isGenerated to Alias and AttributeReference

82bb46f

marmbrus reviewed Feb 3, 2016

gatorsmile added 3 commits February 3, 2016 11:42

changed isGenerated to the second list in attributeReferences.

808dc8a

changed isGenerated to the second list in Alias.

eae6e9f

internally generated expression names are not resolvable

22cf88a

gatorsmile changed the title ~~[SPARK-12725] [SQL] Resolving Name Conflicts in SQL Generation by Adding a flag isGenerated to Alias and AttributeReference~~ [SPARK-12725] [SQL] Resolving Name Conflicts in SQL Generation and Name Ambiguity Caused by Internally Generated Expressions Feb 3, 2016

style fix.

cf89603

groupingIdName is marked as 'generated=Some(true)'.

dae231b

grouping__id is selectable.

4c03809

liancheng reviewed Feb 5, 2016

gatorsmile added 3 commits February 5, 2016 19:23

address comments.

fcb022d

added a type

2b33715

style fix

4e00349

asfgit closed this in 663cc40 Feb 11, 2016

cloud-fan reviewed Mar 12, 2016

gatorsmile deleted the namingConflicts branch March 12, 2016 03:13

gatorsmile mentioned this pull request Jun 23, 2017

[SPARK-21164] [SQL] Remove isTableSample from Sample and isGenerated from Alias and AttributeReference #18379

Closed
