[SPARK-35283][SQL] Support query some DDL with CTES #32442
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #138174 has finished for PR 32442 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #138190 has finished for PR 32442 at commit
Kubernetes integration test starting
Kubernetes integration test unable to build dist. exiting with code: 1
Kubernetes integration test status failure
Test build #138205 has finished for PR 32442 at commit
Kubernetes integration test unable to build dist. exiting with code: 1
Test build #138229 has finished for PR 32442 at commit
ping @cloud-fan
override def visitNamedQuery(ctx: NamedQueryContext): SubqueryAlias = withOrigin(ctx) {
  val subQuery: LogicalPlan = plan(ctx.query).optionalMap(ctx.columnAliases)(
  val logicalPlan = Option(ctx.query).map(plan).orElse(
    Option(ctx.ddlStatementForQuery).map(visitDdlStatementForQuery)).get
nit: we can call visitDdlQuery and don't need to create visitDdlStatementForQuery
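The `Option(...).map(...).orElse(...)` chain in the snippet above picks whichever parse-tree child is non-null. A minimal, self-contained sketch of that pattern follows; `NamedQueryCtx`, `QueryPlan`, `InfoPlan` and `planFor` are hypothetical stand-ins for illustration, not Spark or ANTLR classes, and the sketch throws a descriptive error instead of calling a bare `.get`.

```scala
// Hypothetical stand-ins: none of these names exist in Spark.
object NamedQueryFallback {
  sealed trait Plan
  final case class QueryPlan(sql: String) extends Plan
  final case class InfoPlan(command: String) extends Plan

  // Mimics an ANTLR context, where the alternative that did not match is null.
  final case class NamedQueryCtx(query: String, infoQuery: String)

  def planFor(ctx: NamedQueryCtx): Plan =
    Option(ctx.query).map[Plan](QueryPlan(_))               // Option(null) is None
      .orElse(Option(ctx.infoQuery).map[Plan](InfoPlan(_))) // evaluated only if the first is None
      .getOrElse(throw new IllegalStateException("named query has no body"))

  def main(args: Array[String]): Unit = {
    println(planFor(NamedQueryCtx("SELECT 1", null)))
    println(planFor(NamedQueryCtx(null, "SHOW TABLES")))
  }
}
```

Because `orElse` takes its argument by name, the second alternative is only built when the first child is absent.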
case columns: ShowColumnsContext => visitShowColumns(columns)
case views: ShowViewsContext => visitShowViews(views)
case functions: ShowFunctionsContext => visitShowFunctions(functions)
case _ => throw QueryParsingErrors.unsupportedDdlStatementForQueryError(ctx)
This can't happen; it is an assert-like error.
Let's remove it.
    : WITH namedQuery (',' namedQuery)*
    ;

ddlStatementForQuery
how about informationQueries?
// SPARK-28620
"postgreSQL/float4.sql",
// SPARK-35283
"cte-ddl.sql",
Why doesn't it work in the thriftserver?
Because the output schema of Hive is different from that of Spark.
because Hive doesn't support this syntax?
how about adding a comment here to explain?
OK
Because the output schema of some DDL statements in Hive differs from that in Spark SQL, we exclude it.
For example, the output schema of SHOW TABLES is (namespace, tableName, isTemporary) in Hive, but (tableName) in Spark SQL.
+1. This syntax looks good.
We accept
Okay, sounds good. BTW, after #31548, we can also put it in FROM clause, right?
I think this is a user-facing change.
viirya
left a comment
Looks okay. We may also need to update the SQL syntax docs, e.g. docs/sql-ref-syntax-aux-show-tables.md.
Yea, a CTE is useful when it is referenced more than once like self-joins. But, in simple cases (e.g., filtering
Yes, this PR is related to #31548. See the discussion at #31548 (comment)
I updated docs/sql-ref-syntax-qry-select-cte.md
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #138275 has finished for PR 32442 at commit
DROP TABLE test_show_table_properties;
DROP TABLE test_show_tables;
USE default;
DROP NAMESPACE query_ddl_namespace; No newline at end of file
Please add a newline character.
informationQuery
    : SHOW (DATABASES | NAMESPACES) ((FROM | IN) multipartIdentifier)? (LIKE? pattern=STRING)? #showNamespaces
    | SHOW TABLES ((FROM | IN) multipartIdentifier)? (LIKE? pattern=STRING)? #showTables
Why not support SHOW TABLE EXTENDED?
OK
SHOW TABLES;
WITH s AS (SHOW TABLES) SELECT * FROM s;
WITH s AS (SHOW TABLES) SELECT * FROM s WHERE tableName = 'test_show_tables';
WITH s(ns, tn, t) AS (SHOW TABLES) SELECT * FROM s WHERE tn = 'test_show_tables';
Could we add more tests? For example:
WITH s(ns, tn, t) AS (SHOW TABLES) SELECT tn FROM s;
WITH s(ns, tn, t) AS (SHOW TABLES) SELECT tn FROM s ORDER BY tn;
OK
struct<>
-- !query output
java.lang.ClassCastException
org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeRow
I don't know the reason yet. It seems we should fix this issue in another PR?
cc @cloud-fan @wangyum @maropu @viirya
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #138330 has finished for PR 32442 at commit
retest this please
Kubernetes integration test unable to build dist. exiting with code: 1
Test build #138333 has finished for PR 32442 at commit
… of caller sides
### What changes were proposed in this pull request?
Currently, Spark eagerly executes commands on the caller side of `QueryExecution`, which is a bit hacky as `QueryExecution` is not aware of it and leads to confusion.
For example, if you run `sql("show tables").collect()`, you will see two queries with identical query plans in the web UI.



The first query is triggered at `Dataset.logicalPlan`, which eagerly executes the command.
The second query is triggered at `Dataset.collect`, which is the normal query execution.
From the web UI, it's hard to tell that these two queries are caused by eager command execution.
This PR proposes to move the eager command execution to `QueryExecution`, and to turn the command plan into `CommandResult` to indicate that the command has already been executed. Now `sql("show tables").collect()` still triggers two queries, but the query plans are not identical. The second query becomes:

In addition to the UI improvements, this PR also has other benefits:
1. It simplifies the code, as the caller side no longer needs to worry about eager command execution. `QueryExecution` takes care of it.
2. It helps #32442, where there can be more plan nodes above commands, and we need to replace commands with something like a local relation that produces unsafe rows.
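The second benefit can be illustrated with a toy model, assuming a drastically simplified plan tree. Everything below is hypothetical: the `Plan`, `Command`, `CommandResult` and `Filter` classes only borrow names from the description above and are not Spark's implementations, and `runCommand` returns made-up rows. The sketch shows a command nested under another node being executed once and replaced by a node that carries its rows.

```scala
// Toy model only: these classes are illustrative, not Spark's plan nodes.
object EagerCommandSketch {
  sealed trait Plan
  final case class Command(name: String) extends Plan
  final case class CommandResult(rows: Seq[String]) extends Plan
  final case class Filter(child: Plan, keep: String => Boolean) extends Plan

  // Hypothetical command execution with made-up output rows.
  def runCommand(c: Command): Seq[String] = c.name match {
    case "SHOW TABLES" => Seq("t1", "t2", "test_show_tables")
    case _             => Seq.empty
  }

  // Replace every Command in the tree, however deep, with its executed result.
  def executeCommands(plan: Plan): Plan = plan match {
    case c: Command          => CommandResult(runCommand(c))
    case Filter(child, keep) => Filter(executeCommands(child), keep)
    case r: CommandResult    => r
  }

  // Normal evaluation never runs a command; it only reads materialized rows.
  def evaluate(plan: Plan): Seq[String] = plan match {
    case CommandResult(rows) => rows
    case Filter(child, keep) => evaluate(child).filter(keep)
    case c: Command =>
      throw new IllegalStateException(s"unexecuted command: ${c.name}")
  }

  def main(args: Array[String]): Unit = {
    val plan = Filter(Command("SHOW TABLES"), _ == "test_show_tables")
    println(evaluate(executeCommands(plan)))
  }
}
```

Keeping the rewrite in one place means nodes above the command, such as the filter here, see an ordinary row-producing child.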
### Why are the changes needed?
Explained above.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing tests
Closes #32513 from beliefer/SPARK-35378.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Some commands are used to display metadata, such as `SHOW TABLES`, `SHOW TABLE EXTENDED`, `SHOW TBLPROPERTIES`, and so on. If the output has more rows than the screen height, it is very unfriendly to developers.
So we should have a way to filter the output like the behavior of
`WITH s AS (SHOW NAMESPACES) SELECT * FROM s WHERE namespace = 'query_ddl_namespace';`
### Why are the changes needed?
This PR provides a better way to display DDL output when it has more rows than the screen height.
### Does this PR introduce _any_ user-facing change?
'Yes'. A new syntax.
### How was this patch tested?
New tests.