Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I would like to convert parquet data from one format to another, for example to see the effects of page pruning (#4085) or of different orderings on compression and other properties.
arrow-rs and DataFusion have all the parts we need (reading from files, sorting, writing to files); we just need to put them together.
We do have a very specialized version in the tpch benchmark driver
https://github.com/apache/arrow-datafusion/blob/26e1b20ea3362ea62cb713004a0636b8af6a16d7/benchmarks/src/tpch.rs#L332-L400
Describe the solution you'd like
I would like DataFusion to support duckdb style COPY sql statements
For example:
-- export the table `t` to data.parquet
COPY t TO 'data.parquet' (FORMAT PARQUET);
-- export as parquet, compressed with ZSTD, with a row_group_size of 100000
COPY t TO 'data.parquet' (FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 100000);
-- export the output of a query `SELECT * FROM tbl`
COPY (SELECT * FROM tbl ORDER BY time) TO 'data.parquet' (FORMAT PARQUET);
Reference:
Describe alternatives you've considered
@metesynnada is working on INSERT INTO style syntax in #5130
Bonus points for CSV support (ideally the code structure will allow support in the long term but not as part of the initial PR)
-- export as CSV with the given options
COPY t TO 'data.csv' (FORMAT CSV, DELIMITER '|', HEADER);
Additional context
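To make the option lists in the COPY examples above more concrete, here is a minimal sketch of how the text between the parentheses (for example `FORMAT CSV, DELIMITER '|', HEADER`) might be broken into key/value pairs. It is an illustration only: `CopyOption`, `split_top_level`, and `parse_copy_options` are hypothetical names, not part of DataFusion or its SQL parser, and a real implementation would more likely work on the parser's token stream than on a raw string.

```rust
/// One `KEY [VALUE]` entry from a COPY option list.
/// Hypothetical type used only for this sketch.
#[derive(Debug, PartialEq)]
struct CopyOption {
    key: String,
    value: Option<String>,
}

/// Split `input` on commas that are not inside single quotes,
/// so that an option such as `DELIMITER ','` stays in one piece.
fn split_top_level(input: &str) -> Vec<String> {
    let mut parts = Vec::new();
    let mut current = String::new();
    let mut in_quotes = false;
    for ch in input.chars() {
        match ch {
            '\'' => {
                in_quotes = !in_quotes;
                current.push(ch);
            }
            ',' if !in_quotes => {
                parts.push(current.trim().to_string());
                current.clear();
            }
            _ => current.push(ch),
        }
    }
    if !current.trim().is_empty() {
        parts.push(current.trim().to_string());
    }
    parts
}

/// Parse the text between the parentheses of a COPY statement,
/// e.g. `FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 100000`.
fn parse_copy_options(input: &str) -> Vec<CopyOption> {
    split_top_level(input)
        .into_iter()
        .filter(|part| !part.is_empty())
        .map(|part| {
            // The key is everything up to the first whitespace; the rest,
            // if any, is the value. A bare key such as HEADER has no value.
            let (key, value) = match part.split_once(char::is_whitespace) {
                Some((k, v)) => (k, Some(v.trim().trim_matches('\'').to_string())),
                None => (part.as_str(), None),
            };
            CopyOption {
                key: key.to_ascii_uppercase(),
                value,
            }
        })
        .collect()
}

fn main() {
    // Option list taken from the CSV example in this issue.
    for opt in parse_copy_options("FORMAT CSV, DELIMITER '|', HEADER") {
        println!("{:?}", opt);
    }
}
```

Keeping the options as generic key/value pairs, rather than a Parquet-specific struct, is one way the same code path could later serve CSV as well; each format would then validate and interpret only the keys it understands.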