Merged
24 changes: 24 additions & 0 deletions datafusion/sql/src/parser.rs
@@ -43,6 +43,30 @@ fn parse_file_type(s: &str) -> Result<String, ParserError> {
}

/// DataFusion extension DDL for `CREATE EXTERNAL TABLE`
///
/// Syntax:
///
/// ```text
/// CREATE EXTERNAL TABLE
/// [ IF NOT EXISTS ]
/// <TABLE_NAME>[ (<column_definition>) ]
/// STORED AS <file_type>
/// [ WITH HEADER ROW ]
/// [ DELIMITER <char> ]
/// [ COMPRESSION TYPE <GZIP | BZIP2 | XZ | ZSTD> ]
/// [ PARTITIONED BY (<column_list>) ]
/// [ WITH ORDER (<ordered_column_list>) ]
/// [ OPTIONS (<key_value_list>) ]
/// LOCATION <literal>
///
/// <column_definition> := (<column_name> <data_type>, ...)
///
/// <column_list> := (<column_name>, ...)
///
/// <ordered_column_list> := (<column_name> <sort_clause>, ...)
///
/// <key_value_list> := (<literal> <literal>, <literal> <literal>, ...)
/// ```
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CreateExternalTable {
/// Table name
58 changes: 51 additions & 7 deletions docs/source/user-guide/sql/ddl.md
@@ -47,17 +47,50 @@ CREATE SCHEMA cat.emu;

## CREATE EXTERNAL TABLE

The `CREATE EXTERNAL TABLE` SQL statement registers a location on a local
file system or remote object store as a named table that can be queried.

The supported syntax is:

```
CREATE EXTERNAL TABLE
[ IF NOT EXISTS ]
<TABLE_NAME>[ (<column_definition>) ]
STORED AS <file_type>
[ WITH HEADER ROW ]
[ DELIMITER <char> ]
[ COMPRESSION TYPE <GZIP | BZIP2 | XZ | ZSTD> ]
[ PARTITIONED BY (<column_list>) ]
[ WITH ORDER (<ordered_column_list>) ]
[ OPTIONS (<key_value_list>) ]
LOCATION <literal>

<column_definition> := (<column_name> <data_type>, ...)

<column_list> := (<column_name>, ...)

<ordered_column_list> := (<column_name> <sort_clause>, ...)

<key_value_list> := (<literal> <literal>, <literal> <literal>, ...)
```

Contributor:

Is there a collection of available options? If someone were to implement a mechanism based on options, how should it be documented?

Contributor Author:

I don't think any option is (yet) handled by datafusion core.

I believe datafusion-cli handles some: https://arrow.apache.org/datafusion/user-guide/cli.html#registering-s3-data-sources

@r4ntix maybe knows more.

Contributor Author:

> If someone were to implement a mechanism based on options, how should it be documented?

I recommend that anything meant to be used by users of datafusion be explicit in the CREATE TABLE syntax (e.g. #6248).

If we do want to do something with options, perhaps it could be documented in https://arrow.apache.org/datafusion/user-guide/sql/ddl.html

@r4ntix (Contributor), May 5, 2023:

> I don't think any option is (yet) handled by datafusion core
>
> I believe datafusion-cli handles some: https://arrow.apache.org/datafusion/user-guide/cli.html#registering-s3-data-sources
>
> @r4ntix maybe knows more

@alamb Yes, not all options are supported in datafusion-core. There is currently no actual support for `[ OPTIONS (<key_value_list>) ]` in datafusion-core.

@metesynnada Do you mean that we need to be more detailed in the documentation for all the `[ ... ]` options? 🤔️

`file_type` is one of `CSV`, `PARQUET`, `AVRO` or `JSON`.

`LOCATION <literal>` specifies where to find the data. It can be a path
to a single file or to a directory of partitioned files, either locally
or on an object store.

Parquet data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement such as the following. It is not necessary to
provide schema information for Parquet files.

```sql
CREATE EXTERNAL TABLE taxi
STORED AS PARQUET
LOCATION '/mnt/nyctaxi/tripdata.parquet';
```

CSV data sources can also be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. The schema will be inferred based on
scanning a subset of the file.

```sql
CREATE EXTERNAL TABLE test
@@ -89,9 +122,20 @@ WITH HEADER ROW
LOCATION '/path/to/aggregate_test_100.csv';
```
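
The optional clauses in the syntax summary can be combined. The following is a minimal sketch, assuming a pipe-delimited, gzip-compressed CSV file with a header row. The table name `events` and the path are hypothetical, and the clauses are written in the order the syntax summary lists them.

```sql
-- Hypothetical table name and path, for illustration only.
-- Clause order follows the syntax summary above.
CREATE EXTERNAL TABLE IF NOT EXISTS events
STORED AS CSV
WITH HEADER ROW
DELIMITER '|'
COMPRESSION TYPE GZIP
LOCATION '/path/to/events.csv.gz';
```

Because no column definition is given, the schema would be inferred from the file, as described for CSV sources above.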

It is also possible to specify a directory that contains a partitioned
table (multiple files with the same schema):

```sql
CREATE EXTERNAL TABLE test
STORED AS CSV
WITH HEADER ROW
LOCATION '/path/to/directory/of/files';
```
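
If the directory is split into subdirectories by column value, the `PARTITIONED BY` clause from the syntax summary names the partition columns. The sketch below assumes a Hive-style layout such as `/path/to/sales/region=emea/part-0.parquet`. The table name, the `region` column, and the layout are all hypothetical, and how partition values are resolved may depend on the DataFusion version in use.

```sql
-- Hypothetical layout (an assumption, not taken from the docs above):
--   /path/to/sales/region=emea/part-0.parquet
--   /path/to/sales/region=apac/part-0.parquet
CREATE EXTERNAL TABLE sales
STORED AS PARQUET
PARTITIONED BY (region)
LOCATION '/path/to/sales/';
```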

If the data source is already sorted by an expression, you can declare
that order up front with the `WITH ORDER` clause. This works even when
the sort expression is complex.

Here's an example of how to use the `WITH ORDER` clause.
