[SPARK-6376][SQL] Avoid eliminating subqueries until optimization #5160

marmbrus · 2015-03-24T08:00:54Z

Previously it was okay to throw away subqueries after analysis, as we would never try to use that tree for resolution again. However, with eager analysis in DataFrames this can cause errors for queries such as:

val df = Seq(1,2,3).map(i => (i, i.toString)).toDF("int", "str")
df.as('x).join(df.as('y), $"x.str" === $"y.str").groupBy("x.str").count()

As a result, in this PR we defer the elimination of subqueries until the optimization phase.
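To illustrate why the alias wrappers have to survive analysis, here is a minimal sketch using a toy plan tree. It is not Catalyst's real code: `Plan`, `Relation`, `Subquery`, `Join`, `canResolve`, and `eliminateSubqueries` are all hypothetical names invented for this example. It only shows that a qualified name such as `x.str` can be looked up while the wrapper is present, and can no longer be looked up once the wrappers have been stripped.

```scala
// Toy model with hypothetical names; these are NOT Spark's Catalyst classes.
sealed trait Plan
case class Relation(columns: Seq[String]) extends Plan
case class Subquery(alias: String, child: Plan) extends Plan // alias wrapper
case class Join(left: Plan, right: Plan) extends Plan

object SubqueryDemo {
  // Column names produced by a plan, ignoring qualifiers.
  def output(plan: Plan): Seq[String] = plan match {
    case Relation(cols)     => cols
    case Subquery(_, child) => output(child)
    case Join(l, r)         => output(l) ++ output(r)
  }

  // Strip every alias wrapper from the tree.
  def eliminateSubqueries(plan: Plan): Plan = plan match {
    case Subquery(_, child) => eliminateSubqueries(child)
    case Join(l, r)         => Join(eliminateSubqueries(l), eliminateSubqueries(r))
    case r: Relation        => r
  }

  // Can a qualified name of the form "alias.column" be found in this plan?
  def canResolve(plan: Plan, name: String): Boolean = {
    val Array(qualifier, column) = name.split('.')
    def go(p: Plan): Boolean = p match {
      case Subquery(alias, child) =>
        (alias == qualifier && output(child).contains(column)) || go(child)
      case Join(l, r)  => go(l) || go(r)
      case Relation(_) => false // no alias left, so the qualifier cannot match
    }
    go(plan)
  }

  def main(args: Array[String]): Unit = {
    val df = Relation(Seq("int", "str"))
    val joined = Join(Subquery("x", df), Subquery("y", df))
    println(canResolve(joined, "x.str"))                      // true
    println(canResolve(eliminateSubqueries(joined), "x.str")) // false
  }
}
```

In this toy model, a later `groupBy("x.str")` corresponds to calling `canResolve` on a tree that has already been analyzed, which is why stripping the wrappers too early breaks the lookup.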

rxin · 2015-03-24T08:10:42Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala

nitpick here - can we put an explicit type?

SparkQA · 2015-03-24T08:44:04Z

Test build #29066 has finished for PR 5160 at commit 81cd597.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-03-24T09:06:42Z

Test build #29071 has finished for PR 5160 at commit 9137e03.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class TaskCommitDenied(jobID: Int, partitionID: Int, attemptID: Int) extends TaskFailedReason
- class ExecutorSource(threadPool: ThreadPoolExecutor, executorId: String) extends Source
- class MLPairRDDFunctions[K: ClassTag, V: ClassTag](self: RDD[(K, V)]) extends Serializable
- class NaiveBayesModel(Saveable, Loader):
- class SqlParser extends AbstractSparkSQLParser with DataTypeParser
- case class CombineSum(child: Expression) extends AggregateExpression
- case class CombineSumFunction(expr: Expression, base: AggregateExpression)
- protected[sql] class DataTypeException(message: String) extends Exception(message)

yhuai · 2015-03-24T19:47:13Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

Could you add a comment here to let others know that the first step in the Optimizer is to remove SubQueries (which are helper wrappers for query analysis)?
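As a rough sketch of the kind of comment and ordering being asked for, the toy optimizer below runs a wrapper-removal step before any other rule. Everything here is an assumption made for illustration (`Node`, `Table`, `Alias`, `Filter`, `removeAliases`, `combineFilters`, and `ToyOptimizer` are hypothetical and do not mirror Spark's `Optimizer.scala`). It also shows one plausible benefit of the ordering: later rules can match adjacent operators without having to look through analysis-only wrappers.

```scala
// Toy model with hypothetical names; this is NOT Spark's Optimizer.
sealed trait Node
case class Table(name: String) extends Node
case class Alias(name: String, child: Node) extends Node // analysis-only wrapper
case class Filter(conds: List[String], child: Node) extends Node

object ToyOptimizer {
  type Rule = Node => Node

  // Drop every Alias wrapper; they only matter while names are being resolved.
  def removeAliases(n: Node): Node = n match {
    case Alias(_, child)  => removeAliases(child)
    case Filter(c, child) => Filter(c, removeAliases(child))
    case t: Table         => t
  }

  // Merge directly adjacent filters into one.
  def combineFilters(n: Node): Node = n match {
    case Filter(a, child) =>
      combineFilters(child) match {
        case Filter(b, grandchild) => Filter(a ++ b, grandchild)
        case other                 => Filter(a, other)
      }
    case Alias(name, child) => Alias(name, combineFilters(child))
    case t: Table           => t
  }

  // The first step removes analysis-only wrappers, so that the rules that
  // follow see plain operators only.
  val rules: Seq[Rule] = Seq[Rule](removeAliases, combineFilters)

  def optimize(n: Node): Node = rules.foldLeft(n)((plan, rule) => rule(plan))
}
```

With the plan `Filter(List("a > 1"), Alias("x", Filter(List("b < 2"), Table("t"))))`, `combineFilters` alone leaves the tree unchanged because the `Alias` sits between the two filters, while `ToyOptimizer.optimize` yields a single `Filter` over `Table("t")`.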

SparkQA · 2015-03-24T20:48:15Z

Test build #29100 has finished for PR 5160 at commit 27d25bf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yhuai · 2015-03-24T21:07:59Z

LGTM

Previously it was okay to throw away subqueries after analysis, as we would never try to use that tree for resolution again. However, with eager analysis in `DataFrame`s this can cause errors for queries such as:

```scala
val df = Seq(1,2,3).map(i => (i, i.toString)).toDF("int", "str")
df.as('x).join(df.as('y), $"x.str" === $"y.str").groupBy("x.str").count()
```

As a result, in this PR we defer the elimination of subqueries until the optimization phase.

Author: Michael Armbrust <michael@databricks.com>

Closes #5160 from marmbrus/subqueriesInDfs and squashes the following commits:

a9bb262 [Michael Armbrust] Update Optimizer.scala
27d25bf [Michael Armbrust] fix hive tests
9137e03 [Michael Armbrust] add type
81cd597 [Michael Armbrust] Avoid eliminating subqueries until optimization

(cherry picked from commit cbeaf9e)
Signed-off-by: Michael Armbrust <michael@databricks.com>

SparkQA · 2015-03-24T21:54:15Z

Test build #29104 has finished for PR 5160 at commit a9bb262.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Avoid eliminating subqueries until optimization

81cd597

add type

9137e03

fix hive tests

27d25bf

Update Optimizer.scala

a9bb262

asfgit closed this in cbeaf9e Mar 24, 2015

marmbrus deleted the subqueriesInDfs branch August 3, 2015 22:55
