Skip to content

Conversation

@beliefer
Copy link
Contributor

@beliefer beliefer commented May 12, 2021

What changes were proposed in this pull request?

Currently, Spark eagerly executes commands on the caller side of QueryExecution, which is a bit hacky as QueryExecution is not aware of it and leads to confusion.

For example, if you run sql("show tables").collect(), you will see two queries with identical query plans in the web UI.
image
image
image

The first query is triggered at Dataset.logicalPlan, which eagerly executes the command.
The second query is triggered at Dataset.collect, which is the normal query execution.

From the web UI, it's hard to tell that these two queries are caused by eager command execution.

This PR proposes to move the eager command execution to QueryExecution, and turn the command plan to CommandResult to indicate that command has been executed already. Now sql("show tables").collect() still triggers two queries, but the quey plans are not identical. The second query becomes:
image

In addition to the UI improvements, this PR also has other benefits:

  1. Simplifies code as caller side no need to worry about eager command execution. QueryExecution takes care of it.
  2. It helps [SPARK-35283][SQL] Support query some DDL with CTES #32442 , where there can be more plan nodes above commands, and we need to replace commands with something like local relation that produces unsafe rows.

Why are the changes needed?

Explained above.

Does this PR introduce any user-facing change?

No

How was this patch tested?

existing tests

@github-actions github-actions bot added the SQL label May 12, 2021
@SparkQA
Copy link

SparkQA commented May 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42941/

@SparkQA
Copy link

SparkQA commented May 12, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42941/

@SparkQA
Copy link

SparkQA commented May 12, 2021

Test build #138416 has finished for PR 32513 at commit 40cea6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Copy link
Contributor Author

ping @cloud-fan @wangyum @maropu @viirya

// We can't clone `logical` here, which will reset the `_analyzed` flag.
sparkSession.sessionState.analyzer.executeAndCheck(logical, tracker)
sparkSession.sessionState.analyzer.executeAndCheck(logical, tracker) match {
case c: Command => c
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we leave out the root node Command? Can we remove the command execution logic in Dataset.logicalPlan and execute all the commands here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to maintain the Command here and ensure its behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Let's unify the behavior eagerly execute the commands.

@beliefer beliefer changed the title [SPARK-35378][SQL] Convert LeafRunnableCommand to LocalRelation when query with CTE [SPARK-35378][SQL] Eagerly execute LeafRunnableCommand so that query command with CTE May 14, 2021
@SparkQA
Copy link

SparkQA commented May 14, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43064/

@SparkQA
Copy link

SparkQA commented May 14, 2021

Test build #138543 has finished for PR 32513 at commit a82ed76.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

sparkSession.sessionState.analyzer.executeAndCheck(logical, tracker)
sparkSession.sessionState.analyzer.executeAndCheck(logical, tracker) transform {
// SPARK-35378: Eagerly execute LeafRunnableCommand so that query command with CTE
case r: LeafRunnableCommand =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should run all Commands, not just LeafRunnableCommand

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@beliefer beliefer changed the title [SPARK-35378][SQL] Eagerly execute LeafRunnableCommand so that query command with CTE [SPARK-35378][SQL] Eagerly execute Command so that query command with CTE May 14, 2021
@SparkQA
Copy link

SparkQA commented May 17, 2021

Test build #138624 has finished for PR 32513 at commit 4c9b3cf.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class AgeExample(birthday: Expression, child: Expression) extends RuntimeReplaceable
  • class SessionExtensionsWithLoader extends SparkSessionExtensionsProvider
  • class SessionExtensionsWithoutLoader extends SparkSessionExtensionsProvider
  • case class AvroWrite(
  • case class KafkaWrite(
  • case class TryEval(child: Expression) extends UnaryExpression with NullIntolerant
  • case class TryAdd(left: Expression, right: Expression, child: Expression)
  • case class TryDivide(left: Expression, right: Expression, child: Expression)
  • new RuntimeException(s\"Failed to convert value $value (class of $cls) \" +
  • trait FileWrite extends Write
  • case class CSVWrite(
  • case class JsonWrite(
  • case class OrcWrite(
  • case class ParquetWrite(
  • case class TextWrite(
  • class ConsoleWrite(schema: StructType, options: CaseInsensitiveStringMap)
  • class ForeachWrite[T](
  • class MemoryWrite(sink: MemorySink, schema: StructType, needTruncate: Boolean) extends Write

@SparkQA
Copy link

SparkQA commented May 17, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43144/

@SparkQA
Copy link

SparkQA commented May 17, 2021

Test build #138626 has finished for PR 32513 at commit e3b8454.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 17, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43146/

@SparkQA
Copy link

SparkQA commented May 17, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43148/

case c: Command =>
val subQueryExecution = sparkSession.sessionState.executePlan(c)
LocalRelation(c.output,
subQueryExecution.executedPlan.executeCollect(), false, Some(subQueryExecution.id))
Copy link
Contributor

@cloud-fan cloud-fan May 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we create new query plan nodes: CommandResult and CommandResultExec

case class CommandResult(qe: QueryExecution) extends LeadNode {
  def innerChildren = Seq(qe.analyzedPlan)
  def output = qe.logicalPlan.output
}
case class CommandResultExec(qe: QueryExecution) extends LeafNodeExec {
  def innerChildren = Seq(qe.executedPlan)
  def output = qe.logicalPlan.output
}

then both UI and EXPLAIN can have pretty output.

@SparkQA
Copy link

SparkQA commented May 17, 2021

Test build #138628 has finished for PR 32513 at commit 6516cc4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 8, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43985/

@SparkQA
Copy link

SparkQA commented Jun 8, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43985/

@SparkQA
Copy link

SparkQA commented Jun 8, 2021

Test build #139462 has finished for PR 32513 at commit 83d2710.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* callback functions.
*/
private def runCommand(name: String)(command: LogicalPlan): Unit = {
private def runCommand()(command: LogicalPlan): Unit = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: shall this just be private def runCommand(command: LogicalPlan)?

@SparkQA
Copy link

SparkQA commented Jun 8, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44018/

@SparkQA
Copy link

SparkQA commented Jun 8, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44018/

@SparkQA
Copy link

SparkQA commented Jun 8, 2021

Test build #139494 has finished for PR 32513 at commit d15e166.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 8013f98 Jun 9, 2021
@beliefer
Copy link
Contributor Author

beliefer commented Jun 9, 2021

@cloud-fan Thanks for your hard work! @viirya @yaooqinn Thanks for review too.

* limitations under the License.
*/

package org.apache.spark.sql.expressions
Copy link
Contributor

@cloud-fan cloud-fan Jun 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a public package which makes CommandResult a public API. This is unexpected. We should move this class to org.apache.spark.sql.catalyst.plans.logical. @beliefer can you help to make this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

cloud-fan added a commit that referenced this pull request Jun 17, 2021
…ataFrameWriterV2

### What changes were proposed in this pull request?

This is a followup of #32513

It's hard to keep the command execution name for `DataFrameWriter`, as the command logical plan is a bit messy (DS v1, file source and hive and different command logical plans) and sometimes it's hard to distinguish "insert" and "save".

However, `DataFrameWriterV2` only produce v2 commands which are pretty clean. It's easy to keep the command execution name for them.

### Why are the changes needed?

less breaking changes.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #32919 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
dongjoon-hyun pushed a commit that referenced this pull request Jun 17, 2021
…ical

### What changes were proposed in this pull request?
#32513 added the case class `CommandResult` in package `org.apache.spark.sql.expression`. It is not suitable, so this PR move `CommandResult` from `org.apache.spark.sql.expression` to `org.apache.spark.sql.catalyst.plans.logical`.

### Why are the changes needed?
Make `CommandResult` in suitable package.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
No need.

Closes #32942 from beliefer/SPARK-35378-followup.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
cloud-fan pushed a commit that referenced this pull request Jun 22, 2021
### What changes were proposed in this pull request?
#32513 added the case class `CommandResult` so as we can eagerly execute command locally. But we forgot to update
`isLocal` of `Dataset`.

### Why are the changes needed?
`Dataset.isLocal` should consider `CommandResult`.

### Does this PR introduce _any_ user-facing change?
Yes. If the SQL plan is `CommandResult`, `Dataset.isLocal` must return true.

### How was this patch tested?
No test.

Closes #32963 from beliefer/SPARK-35378-followup2.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request May 23, 2022
…ultExec.executeCollect()

### What changes were proposed in this pull request?

This PR is a follow-up for #32513 and fixes an issue introduced by that patch.

CommandResultExec is supposed to return `UnsafeRow` records in all of the `executeXYZ` methods but `executeCollect` was left out which causes issues like this one:
```
Error in SQL statement: ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericInternalRow
cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeRow
```

We need to return `unsafeRows` instead of `rows` in `executeCollect` similar to other methods in the class.

### Why are the changes needed?

Fixes a bug in CommandResultExec.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I added a unit test to check the return type of all commands.

Closes #36632 from sadikovi/fix-command-exec.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request May 23, 2022
…ultExec.executeCollect()

### What changes were proposed in this pull request?

This PR is a follow-up for #32513 and fixes an issue introduced by that patch.

CommandResultExec is supposed to return `UnsafeRow` records in all of the `executeXYZ` methods but `executeCollect` was left out which causes issues like this one:
```
Error in SQL statement: ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericInternalRow
cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeRow
```

We need to return `unsafeRows` instead of `rows` in `executeCollect` similar to other methods in the class.

### Why are the changes needed?

Fixes a bug in CommandResultExec.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I added a unit test to check the return type of all commands.

Closes #36632 from sadikovi/fix-command-exec.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit a0decfc)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request May 23, 2022
…ultExec.executeCollect()

### What changes were proposed in this pull request?

This PR is a follow-up for #32513 and fixes an issue introduced by that patch.

CommandResultExec is supposed to return `UnsafeRow` records in all of the `executeXYZ` methods but `executeCollect` was left out which causes issues like this one:
```
Error in SQL statement: ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericInternalRow
cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeRow
```

We need to return `unsafeRows` instead of `rows` in `executeCollect` similar to other methods in the class.

### Why are the changes needed?

Fixes a bug in CommandResultExec.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I added a unit test to check the return type of all commands.

Closes #36632 from sadikovi/fix-command-exec.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit a0decfc)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
kazuyukitanimura pushed a commit to kazuyukitanimura/spark that referenced this pull request Aug 10, 2022
…ultExec.executeCollect()

### What changes were proposed in this pull request?

This PR is a follow-up for apache#32513 and fixes an issue introduced by that patch.

CommandResultExec is supposed to return `UnsafeRow` records in all of the `executeXYZ` methods but `executeCollect` was left out which causes issues like this one:
```
Error in SQL statement: ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericInternalRow
cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeRow
```

We need to return `unsafeRows` instead of `rows` in `executeCollect` similar to other methods in the class.

### Why are the changes needed?

Fixes a bug in CommandResultExec.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I added a unit test to check the return type of all commands.

Closes apache#36632 from sadikovi/fix-command-exec.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit a0decfc)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
kazuyukitanimura pushed a commit to kazuyukitanimura/spark that referenced this pull request Aug 10, 2022
…ultExec.executeCollect()

### What changes were proposed in this pull request?

This PR is a follow-up for apache#32513 and fixes an issue introduced by that patch.

CommandResultExec is supposed to return `UnsafeRow` records in all of the `executeXYZ` methods but `executeCollect` was left out which causes issues like this one:
```
Error in SQL statement: ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericInternalRow
cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeRow
```

We need to return `unsafeRows` instead of `rows` in `executeCollect` similar to other methods in the class.

### Why are the changes needed?

Fixes a bug in CommandResultExec.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I added a unit test to check the return type of all commands.

Closes apache#36632 from sadikovi/fix-command-exec.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit a0decfc)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants