@gatorsmile
Member

Some analysis rules generate aliases or auxiliary attribute references with the same name but different expression IDs. For example, ResolveAggregateFunctions introduces havingCondition and aggOrder, and DistinctAggregationRewriter introduces gid.

This is OK for normal query execution, since these attribute references are still distinguished by their expression IDs. However, it's troublesome when converting resolved query plans back to SQL query strings, since expression IDs are erased in the generated SQL and only the duplicate names remain.

Here's an example Spark 1.6.0 snippet for illustration:

sqlContext.range(10).select('id as 'a, 'id as 'b).registerTempTable("t")
sqlContext.sql("SELECT SUM(a) FROM t GROUP BY a, b ORDER BY COUNT(a), COUNT(b)").explain(true)

The above code produces the following resolved plan:

== Analyzed Logical Plan ==
_c0: bigint
Project [_c0#101L]
+- Sort [aggOrder#102L ASC,aggOrder#103L ASC], true
   +- Aggregate [a#47L,b#48L], [(sum(a#47L),mode=Complete,isDistinct=false) AS _c0#101L,(count(a#47L),mode=Complete,isDistinct=false) AS aggOrder#102L,(count(b#48L),mode=Complete,isDistinct=false) AS aggOrder#103L]
      +- Subquery t
         +- Project [id#46L AS a#47L,id#46L AS b#48L]
            +- LogicalRDD [id#46L], MapPartitionsRDD[44] at range at <console>:26

Here we can see that both aggregate expressions in ORDER BY are extracted into an Aggregate operator, and both of them are named aggOrder with different expression IDs.

The solution is to automatically append the expression ID to the attribute name of every Alias and AttributeReference generated by the Analyzer when generating SQL.
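As a rough illustration of the naming scheme, here is a minimal standalone sketch. `SimpleAlias` and `ExprId` below are hypothetical simplified stand-ins, not the real Catalyst classes; only the `name#id` pattern for generated names is taken from this PR.

```scala
object GeneratedNameSketch {
  // Hypothetical stand-in for Catalyst's expression ID.
  final case class ExprId(id: Long)

  // Hypothetical stand-in for Alias. `isGenerated` marks names invented by
  // the analyzer, such as aggOrder, havingCondition, or gid.
  final case class SimpleAlias(
      childSql: String,
      name: String,
      exprId: ExprId,
      isGenerated: Boolean = false) {

    // Generated names get the expression ID appended, so two aliases that
    // share a name stay distinct in the SQL string. User-chosen names are
    // emitted unchanged.
    def sql: String = {
      val aliasName = if (isGenerated) s"$name#${exprId.id}" else name
      s"$childSql AS `$aliasName`"
    }
  }
}
```

Applied to the two `aggOrder` aliases in the plan above, this would yield `aggOrder#102` and `aggOrder#103`, which no longer clash.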

This PR also resolves another issue: users can choose a name that is identical to an internally generated one. Such duplicate names should not cause name ambiguity. When resolving a column, Catalyst should not pick a column that is internally generated.

Could you review the solution? @marmbrus @liancheng

I did not set the newly added flag on every Alias and AttributeReference generated by the Analyzer. Please let me know if I should. Thank you!

@SparkQA

SparkQA commented Feb 3, 2016

Test build #50662 has finished for PR 11050 at commit 82bb46f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Alias(child: Expression, name: String, isGenerated: Boolean = false)(

Contributor

Why not put isGenerated into the second list with the other metadata so that we don't have to rewrite every match statement in the code base?
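A minimal sketch of why the second parameter list helps, using a hypothetical simplified `SimpleAlias` rather than the real Catalyst class: Scala derives a case class's extractor and equality from the first parameter list only, so a flag added to the second list leaves existing two-argument patterns untouched.

```scala
object SecondParamListSketch {
  // Hypothetical simplified node. Only `child` and `name` take part in the
  // generated unapply and equals; `isGenerated` rides along as metadata.
  final case class SimpleAlias(child: String, name: String)(
      val isGenerated: Boolean = false)

  // An existing two-argument pattern still compiles and matches after the
  // flag is added to the second parameter list.
  def describe(a: SimpleAlias): String = a match {
    case SimpleAlias(child, name) => s"$child AS $name"
  }
}
```

Call sites that do not care about the flag write `SimpleAlias(child, name)()`, while the analyzer rules that generate names pass `isGenerated = true` explicitly.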

Member Author

uh, I see. Will do it. : )

@marmbrus
Contributor

marmbrus commented Feb 3, 2016

We should also modify the resolve function to filter out attributes where isGenerated = true. There have been bugs in the past where this creates ambiguity and we can finally fix that.

@gatorsmile
Member Author

@marmbrus Do you still remember the JIRA number or the case? If so, I can add it into the test cases and verify if the fix works well.

In this PR, isGenerated is added to AttributeReference and Alias. AttributeReference is always resolved. Alias is not resolved unless the child has been resolved, and thus, I think we are unable to skip resolving its child even if the Alias is added by Analyzer. Please correct me if my understanding is wrong. Thank you!

@marmbrus
Contributor

marmbrus commented Feb 3, 2016

I'd have to search for it, as it was a while ago. You should be able to trigger it by forcing the analyzer to generate a column while referencing a column with the same name.

I wasn't referring to the resolved property of an Expression, but instead the resolve function on a LogicalPlan. This should probably never return attributes that are generated.
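A rough sketch of that idea, assuming a simplified name lookup over a plan's output. `Attr` and this `resolve` are hypothetical stand-ins for illustration, not the actual `LogicalPlan.resolve` implementation.

```scala
object ResolveSketch {
  // Hypothetical stand-in for AttributeReference.
  final case class Attr(name: String, exprId: Long, isGenerated: Boolean = false)

  // Looks up `name` among the output attributes, skipping anything the
  // analyzer generated. A user reference can therefore never bind to an
  // internal column, and internal columns cannot make a name ambiguous.
  def resolve(name: String, output: Seq[Attr]): Option[Attr] = {
    val candidates = output.filter(a => !a.isGenerated && a.name == name)
    candidates match {
      case Seq(single) => Some(single)
      case Seq()       => None
      case many =>
        throw new IllegalArgumentException(
          s"Reference '$name' is ambiguous, could be: ${many.mkString(", ")}")
    }
  }
}
```

With this filter in place, a user column that happens to share its name with a generated one resolves to the user column, and a name that exists only as a generated column is reported as unresolved.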

@gatorsmile
Member Author

@marmbrus : ) Now I understand it. Will do it. Thank you!

@gatorsmile changed the title from "[SPARK-12725] [SQL] Resolving Name Conflicts in SQL Generation by Adding a flag isGenerated to Alias and AttributeReference" to "[SPARK-12725] [SQL] Resolving Name Conflicts in SQL Generation and Name Ambiguity Caused by Internally Generated Expressions" on Feb 3, 2016
@gatorsmile
Member Author

I guess the PR you mentioned is #8231

I think we should prevent users from accidentally referencing internally generated names. I am therefore adding the following test case to verify that the fix works. In this case, the table does not have a column named havingCondition, so Spark SQL reports an error saying the attribute cannot be resolved.

errorTest(
   "unresolved attributes with a generated name",
   testRelation2.groupBy('a)(max('b))
     .where(sum('b) > 0)
     .orderBy('havingCondition.asc),
   "cannot resolve" :: "havingCondition" :: Nil)

@marmbrus
Contributor

marmbrus commented Feb 3, 2016

Yeah, that test looks great.

@SparkQA

SparkQA commented Feb 4, 2016

Test build #50695 has finished for PR 11050 at commit 22cf88a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 4, 2016

Test build #50704 has finished for PR 11050 at commit cf89603.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  UnresolvedAlias(child = f.copy(children = newChildren)) :: Nil
- case Alias(f @ UnresolvedFunction(_, args, _), name) if containsStar(args) =>
+ case a @ Alias(f @ UnresolvedFunction(_, args, _), name)
+   if containsStar(args) =>
Contributor

Nit: This change can be reverted now.

@SparkQA

SparkQA commented Feb 6, 2016

Test build #50875 has finished for PR 11050 at commit 4e00349.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

retest this please

@SparkQA

SparkQA commented Feb 7, 2016

Test build #50882 has finished for PR 11050 at commit 4e00349.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

retest this please

@SparkQA

SparkQA commented Feb 7, 2016

Test build #50893 has finished for PR 11050 at commit 4e00349.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

retest this please

@SparkQA

SparkQA commented Feb 8, 2016

Test build #50908 has finished for PR 11050 at commit 4e00349.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

retest this please

@SparkQA

SparkQA commented Feb 8, 2016

Test build #50926 has finished for PR 11050 at commit 4e00349.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

LGTM, merging to master. Thanks!

@asfgit asfgit closed this in 663cc40 Feb 11, 2016
val qualifiersString =
if (qualifiers.isEmpty) "" else qualifiers.map("`" + _ + "`").mkString("", ".", ".")
s"${child.sql} AS $qualifiersString`$name`"
val aliasName = if (isGenerated) s"$name#${exprId.id}" else s"$name"
Contributor

Why do we need to change the sql according to isGenerated?

Contributor

ah, I see. This is to work around an issue in SQLBuilder

Member Author

Yeah. You can get more background from the discussion in the JIRA: https://issues.apache.org/jira/browse/SPARK-12725

@gatorsmile gatorsmile deleted the namingConflicts branch March 12, 2016 03:13
asfgit pushed a commit that referenced this pull request Jun 23, 2017
…rom Alias and AttributeReference

## What changes were proposed in this pull request?
`isTableSample` and `isGenerated` were introduced for SQL Generation by #11148 and #11050, respectively.

Since SQL Generation is removed, we do not need to keep `isTableSample`.

## How was this patch tested?
The existing test cases

Author: Xiao Li <gatorsmile@gmail.com>

Closes #18379 from gatorsmile/CleanSample.
robert3005 pushed a commit to palantir/spark that referenced this pull request Jun 29, 2017