[Python] Support row_group_size/chunk_size keyword in pq.write_to_dataset with use_legacy_dataset=False #31636

@asfimport

Description

The legacy implementation of pq.write_to_dataset supports the row_group_size/chunk_size keyword, which sets the row group size of the written Parquet files.

Now that use_legacy_dataset=False is the default, this keyword no longer works.

This is because dataset.write_dataset(..) does not accept the Parquet row_group_size keyword, and the ParquetFileWriteOptions class does not expose it either.

On the Parquet side, this is also the only keyword that is passed to the actual write_table call rather than to the ParquetWriter init (and thus to Parquet's WriterProperties or ArrowWriterProperties). In C++ this can be seen in the signatures below, where chunk_size is an argument of WriteTable rather than of Open:

    static ::arrow::Status Open(const ::arrow::Schema& schema, MemoryPool* pool,
                                std::shared_ptr<::arrow::io::OutputStream> sink,
                                std::shared_ptr<WriterProperties> properties,
                                std::shared_ptr<ArrowWriterProperties> arrow_properties,
                                std::unique_ptr<FileWriter>* writer);

    virtual std::shared_ptr<::arrow::Schema> schema() const = 0;

    /// \brief Write a Table to Parquet.
    virtual ::arrow::Status WriteTable(const ::arrow::Table& table, int64_t chunk_size) = 0;

See discussion: #12811 (comment)

Reporter: Alenka Frim / @AlenkaF
Assignee: Alenka Frim / @AlenkaF

Note: This issue was originally created as ARROW-16240. Please see the migration documentation for further details.
