[SPARK-35283][SQL] Support query some DDL with CTES #32442

beliefer · 2021-05-05T12:20:09Z

What changes were proposed in this pull request?

Some commands are used to display metadata, such as SHOW TABLES, SHOW TABLE EXTENDED, SHOW TBLPROPERTIES, and so on.
If the output has more rows than the screen height, it is very unfriendly to developers.
So we should have a way to filter the output, for example:
WITH s AS (SHOW NAMESPACES) SELECT * FROM s WHERE namespace = 'query_ddl_namespace';

Why are the changes needed?

This PR provides a better way to display DDL output when it has more rows than the screen height.

Does this PR introduce any user-facing change?

'Yes'. A new syntax.

How was this patch tested?

New tests.

SparkQA · 2021-05-05T13:48:35Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42695/

SparkQA · 2021-05-05T13:48:36Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42695/

SparkQA · 2021-05-05T15:31:34Z

Test build #138174 has finished for PR 32442 at commit aca1fcc.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-06T04:59:52Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42711/

SparkQA · 2021-05-06T04:59:54Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42711/

SparkQA · 2021-05-06T08:41:11Z

Test build #138190 has finished for PR 32442 at commit 09a04b0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-06T10:05:33Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42726/

SparkQA · 2021-05-06T10:08:19Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42727/

SparkQA · 2021-05-06T10:10:18Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42726/

SparkQA · 2021-05-06T11:30:30Z

Test build #138205 has finished for PR 32442 at commit 0235cb4.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-07T03:48:38Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42751/

SparkQA · 2021-05-07T07:23:24Z

Test build #138229 has finished for PR 32442 at commit 4f8b782.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

beliefer · 2021-05-07T08:04:37Z

ping @cloud-fan

cloud-fan · 2021-05-07T08:38:14Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

  override def visitNamedQuery(ctx: NamedQueryContext): SubqueryAlias = withOrigin(ctx) {
-    val subQuery: LogicalPlan = plan(ctx.query).optionalMap(ctx.columnAliases)(
+    val logicalPlan = Option(ctx.query).map(plan).orElse(
+      Option(ctx.ddlStatementForQuery).map(visitDdlStatementForQuery)).get


nit: we can call visitDdlQuery and don't need to create visitDdlStatementForQuery
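
The `Option(...).map(...).orElse(...).get` chain in the diff above relies on the parser leaving the alternative that did not match as `null`. A minimal standalone sketch of that pattern follows. `NamedQueryCtx`, `planQuery`, and `planShow` are hypothetical stand-ins invented for illustration, not the ANTLR-generated context class or the real `AstBuilder` methods:

```scala
// Hypothetical stand-in for a parser context: exactly one of the two fields
// is expected to be non-null, mirroring an `a | b` grammar rule.
object NamedQuerySketch {
  final case class NamedQueryCtx(query: String, ddl: String)

  // Hypothetical planners; real code would build LogicalPlan nodes instead.
  def planQuery(q: String): String = s"Query($q)"
  def planShow(d: String): String = s"Command($d)"

  def toPlan(ctx: NamedQueryCtx): String = {
    // Option(null) is None, so the second alternative is tried only when the
    // first is absent; .get throws NoSuchElementException if both are null.
    Option(ctx.query).map(planQuery)
      .orElse(Option(ctx.ddl).map(planShow))
      .get
  }
}
```

For example, `NamedQuerySketch.toPlan(NamedQuerySketch.NamedQueryCtx(null, "SHOW TABLES"))` takes the second branch.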

cloud-fan · 2021-05-07T08:38:38Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

+      case columns: ShowColumnsContext => visitShowColumns(columns)
+      case views: ShowViewsContext => visitShowViews(views)
+      case functions: ShowFunctionsContext => visitShowFunctions(functions)
+      case _ => throw QueryParsingErrors.unsupportedDdlStatementForQueryError(ctx)


This can't happen; it is an assert-like error.

Let's remove it.
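
To illustrate why a catch-all branch is dead code when the set of alternatives is closed, here is a minimal sketch using a sealed hierarchy. The names (`InfoQuery`, `visit`) are hypothetical. Note also that the real ANTLR context classes are not sealed, so in `AstBuilder` the guarantee comes from the grammar rather than from the compiler:

```scala
// Hypothetical model: a sealed hierarchy plays the role of the grammar rule,
// which only admits a fixed set of alternatives.
object InformationQuerySketch {
  sealed trait InfoQuery
  case object ShowColumns extends InfoQuery
  case object ShowViews extends InfoQuery
  case object ShowFunctions extends InfoQuery

  // No `case _ =>` branch: every alternative is listed, so a fallback that
  // throws a parsing error could never be reached.
  def visit(q: InfoQuery): String = q match {
    case ShowColumns   => "visitShowColumns"
    case ShowViews     => "visitShowViews"
    case ShowFunctions => "visitShowFunctions"
  }
}
```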

cloud-fan · 2021-05-07T08:39:20Z

sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4

    : WITH namedQuery (',' namedQuery)*
    ;

+ddlStatementForQuery


how about informationQueries?

cloud-fan · 2021-05-07T08:39:58Z

...erver/src/test/scala/org/apache/spark/sql/hive/thriftserver/ThriftServerQueryTestSuite.scala

    // SPARK-28620
    "postgreSQL/float4.sql",
+    // SPARK-35283
+    "cte-ddl.sql",


Why doesn't it work in thriftserver?

Because the output schema of Hive is different from that of Spark.

Because Hive doesn't support this syntax?

How about adding a comment here to explain?

OK

Because the output schema of some DDL in Hive differs from that in Spark SQL, we exclude it.
For example, the output schema of SHOW TABLES is (namespace, tableName, isTemporary) in Spark SQL, but (tableName) in Hive.

cloud-fan · 2021-05-07T08:40:26Z

The syntax looks good, cc @yaooqinn @wangyum @viirya @maropu

wangyum · 2021-05-07T08:58:25Z

+1. This syntax looks good.

maropu · 2021-05-07T12:08:40Z

We accept SHOW XXX DDLs in common table exprs but we don't accept them in a FROM clause like SELECT * FROM (SHOW NAMESPACES)? This new feature itself looks good. Btw, is this PR related to #31548? It seems the motivation is the same, but the approaches/the jira numbers are different?

viirya · 2021-05-07T17:48:45Z

A CTE looks better if you want to use a self join:
WITH s AS (SHOW TABLES) SELECT ... FROM s left JOIN s right ON left.xxx = right.xxx

Okay, sounds good. BTW, after #31548, we can also put it in FROM clause, right?

viirya · 2021-05-07T17:57:21Z

Does this PR introduce any user-facing change?
'No'. Just a new syntax.

I think this is a user-facing change.

viirya

Looks okay. We may also need to update sql syntax docs, e.g. docs/sql-ref-syntax-aux-show-tables.md .

maropu · 2021-05-07T23:49:46Z

Yea, a CTE is useful when it is referenced more than once like self-joins. But, in simple cases (e.g., filtering SHOW output rows as the description says), SELECT * FROM (SHOW XXX) WHERE ... seems easier, I think.

beliefer · 2021-05-08T02:28:59Z

We accept SHOW XXX DDLs in common table exprs but we don't accept them in a FROM clause like SELECT * FROM (SHOW NAMESPACES)? This new feature itself looks good. Btw, is this PR related to #31548? It seems the motivation is the same, but the approaches/the jira numbers are different?

Yes, this PR is related to #31548. See the discussion at #31548 (comment).

beliefer · 2021-05-08T02:29:39Z

It would be nice to update the SQL doc, too.

I updated docs/sql-ref-syntax-qry-select-cte.md

beliefer · 2021-05-08T04:05:12Z

Looks okay. We may also need to update sql syntax docs, e.g. docs/sql-ref-syntax-aux-show-tables.md .

I updated docs/sql-ref-syntax-qry-select-cte.md

SparkQA · 2021-05-08T04:54:21Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42797/

SparkQA · 2021-05-08T04:54:22Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42797/

SparkQA · 2021-05-08T09:00:14Z

Test build #138275 has finished for PR 32442 at commit afed495.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wangyum · 2021-05-10T00:46:24Z

sql/core/src/test/resources/sql-tests/inputs/cte-ddl.sql

+DROP TABLE test_show_table_properties;
+DROP TABLE test_show_tables;
+USE default;
+DROP NAMESPACE query_ddl_namespace;


Please add a newline character.

wangyum · 2021-05-10T00:49:40Z

sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4


+informationQuery
+    : SHOW (DATABASES | NAMESPACES) ((FROM | IN) multipartIdentifier)? (LIKE? pattern=STRING)?              #showNamespaces
+    | SHOW TABLES ((FROM | IN) multipartIdentifier)? (LIKE? pattern=STRING)?                                #showTables


Why not support SHOW TABLE EXTENDED?

wangyum · 2021-05-10T00:54:42Z

sql/core/src/test/resources/sql-tests/inputs/cte-ddl.sql

+SHOW TABLES;
+WITH s AS (SHOW TABLES) SELECT * FROM s;
+WITH s AS (SHOW TABLES) SELECT * FROM s WHERE tableName = 'test_show_tables';
+WITH s(ns, tn, t) AS (SHOW TABLES) SELECT * FROM s WHERE tn = 'test_show_tables';


Could we add more tests? For example:

WITH s(ns, tn, t) AS (SHOW TABLES) SELECT tn FROM s;
WITH s(ns, tn, t) AS (SHOW TABLES) SELECT tn FROM s ORDER BY rn;

beliefer · 2021-05-10T08:44:00Z

sql/core/src/test/resources/sql-tests/results/cte-ddl.sql.out

+struct<>
+-- !query output
+java.lang.ClassCastException
+org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeRow


I don't know the reason yet. It seems we should fix this issue in another PR?
cc @cloud-fan @wangyum @maropu @viirya

SparkQA · 2021-05-10T10:12:34Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42851/

SparkQA · 2021-05-10T10:12:35Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42851/

SparkQA · 2021-05-10T10:28:09Z

Test build #138330 has finished for PR 32442 at commit 171c6ce.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

beliefer · 2021-05-10T10:43:18Z

retest this please

SparkQA · 2021-05-10T12:06:18Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42854/

SparkQA · 2021-05-10T12:47:29Z

Test build #138333 has finished for PR 32442 at commit 171c6ce.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

… of caller sides

### What changes were proposed in this pull request?

Currently, Spark eagerly executes commands on the caller side of `QueryExecution`, which is a bit hacky as `QueryExecution` is not aware of it and leads to confusion. For example, if you run `sql("show tables").collect()`, you will see two queries with identical query plans in the web UI.

![image](https://user-images.githubusercontent.com/3182036/121193729-a72d0480-c8a0-11eb-8b12-379019607ad5.png)
![image](https://user-images.githubusercontent.com/3182036/121193822-bc099800-c8a0-11eb-9d2a-34ab1329e2f7.png)
![image](https://user-images.githubusercontent.com/3182036/121193845-c0ce4c00-c8a0-11eb-96d0-ef604a4dfab0.png)

The first query is triggered at `Dataset.logicalPlan`, which eagerly executes the command. The second query is triggered at `Dataset.collect`, which is the normal query execution. From the web UI, it's hard to tell that these two queries are caused by eager command execution.

This PR proposes to move the eager command execution to `QueryExecution`, and turn the command plan into `CommandResult` to indicate that the command has already been executed. Now `sql("show tables").collect()` still triggers two queries, but the query plans are not identical. The second query becomes:

![image](https://user-images.githubusercontent.com/3182036/121194850-b3659180-c8a1-11eb-9abf-2980f84f089d.png)

In addition to the UI improvements, this PR also has other benefits:

1. It simplifies code, as the caller side no longer needs to worry about eager command execution. `QueryExecution` takes care of it.
2. It helps #32442, where there can be more plan nodes above commands, and we need to replace commands with something like a local relation that produces unsafe rows.

### Why are the changes needed?

Explained above.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes #32513 from beliefer/SPARK-35378.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
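
The idea in the commit message above, running a command once and replacing it with a result node so that plan nodes above it can consume its rows, can be sketched on a toy plan tree. This is only an illustrative model under stated assumptions: `Plan`, `Command`, `Filter`, `executeCommands`, and `collect` are hypothetical and are not Spark's classes. Only the name `CommandResult` is taken from the text, and its shape here is invented:

```scala
object EagerCommandSketch {
  sealed trait Plan
  // A command with a side effect, e.g. SHOW TABLES (hypothetical model).
  final case class Command(name: String, run: () => Seq[String]) extends Plan
  // Marker node holding the rows of a command that has already been executed.
  final case class CommandResult(name: String, rows: Seq[String]) extends Plan
  // An ordinary plan node sitting above a command, as a CTE filter would.
  final case class Filter(pred: String => Boolean, child: Plan) extends Plan

  // Replace every Command in the tree with its result, running it exactly once.
  def executeCommands(plan: Plan): Plan = plan match {
    case Command(name, run)  => CommandResult(name, run())
    case Filter(pred, child) => Filter(pred, executeCommands(child))
    case done: CommandResult => done
  }

  // Evaluate a plan; after executeCommands no Command nodes remain, so
  // collecting repeatedly does not re-run the command.
  def collect(plan: Plan): Seq[String] = plan match {
    case CommandResult(_, rows) => rows
    case Filter(pred, child)    => collect(child).filter(pred)
    case Command(_, run)        => run()
  }
}
```

In this model, calling `collect` several times on the output of `executeCommands` reuses the stored rows instead of running the command's side effect again.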

beliefer added 2 commits April 30, 2021 18:33

Support query some DDL with CTES

99575cc

Update code

aca1fcc

github-actions bot added the SQL label May 5, 2021

beliefer added 2 commits May 6, 2021 11:02

Merge branch 'master' into SPARK-35283

36b36bf

Update code

09a04b0

beliefer added 2 commits May 6, 2021 17:09

Optimize code

2d50c1b

Update code

0235cb4

beliefer added 2 commits May 7, 2021 10:19

Update code

3031831

Update code

4f8b782

cloud-fan reviewed May 7, 2021

Optimize code

851f9cc

viirya reviewed May 7, 2021

Update code

afed495

github-actions bot added the DOCS label May 8, 2021

wangyum reviewed May 10, 2021

beliefer added 3 commits May 10, 2021 16:37

Update code

e7e67c0

Update code

6a75a99

Update code

171c6ce

beliefer commented May 10, 2021

Optimize code

3312ac0

beliefer closed this May 17, 2021

beliefer mentioned this pull request May 26, 2021

[SPARK-35378][SQL] Eagerly execute commands in QueryExecution instead of caller sides #32513

Closed
