
Implement COPY ... TO statement  #5654

@alamb

Description


Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I would like to copy data from one format or layout to another, for example to see the effects of page pruning (#4085) or of different sort orderings on compression and other properties.

arrow-rs and DataFusion have all the parts we need (reading from files, sorting, writing to files); we just need to put them together.

We do have a very specialized version in the tpch benchmark driver:
https://github.com/apache/arrow-datafusion/blob/26e1b20ea3362ea62cb713004a0636b8af6a16d7/benchmarks/src/tpch.rs#L332-L400

Describe the solution you'd like
I would like DataFusion to support DuckDB-style `COPY` SQL statements.

For example:

-- export the table `t` to data.parquet
COPY t TO 'data.parquet' (FORMAT PARQUET);
-- export as parquet, compressed with ZSTD, with a row_group_size of 100000
COPY t TO 'data.parquet' (FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 100000);
-- export the output of the query `SELECT * FROM tbl ORDER BY time`
COPY (SELECT * FROM tbl ORDER BY time) TO 'data.parquet' (FORMAT PARQUET);

Reference:

  1. https://duckdb.org/docs/sql/statements/copy
  2. https://duckdb.org/docs/sql/statements/export

Describe alternatives you've considered
@metesynnada is working on `INSERT INTO`-style syntax in #5130.

Bonus points for CSV support (ideally the code structure will allow CSV support in the long term, even if it is not part of the initial PR):

-- export as CSV with the given options
COPY t TO 'data.csv' (FORMAT CSV, DELIMITER '|', HEADER);

Additional context

#5130 (comment)

Labels: enhancement (New feature or request)