[Python] Support row_group_size/chunk_size keyword in pq.write_to_dataset with use_legacy_dataset=False

The legacy implementation of `pq.write_to_dataset` supports the `row_group_size`/`chunk_size` keyword, which specifies the row group size of the written Parquet files.

Now that `use_legacy_dataset=False` is the default, this keyword no longer works.

This is because `dataset.write_dataset(..)` doesn't support the Parquet `row_group_size` keyword, and the `ParquetFileWriteOptions` class doesn't support it either.

On the parquet side, this is also the only keyword that is not passed to the `ParquetWriter` init (and thus to parquet's `WriterProperties` or `ArrowWriterProperties`), but to the actual `write_table` call. In C++ this can be seen at https://github.com/apache/arrow/blob/76d064c729f5e2287bf2a2d5e02d1fb192ae5738/cpp/src/parquet/arrow/writer.h#L62-L71


See discussion: <https://github.com/apache/arrow/pull/12811#discussion_r845304218>

**Reporter**: [Alenka Frim](https://issues.apache.org/jira/browse/ARROW-16240) / @AlenkaF
**Assignee**: [Alenka Frim](https://issues.apache.org/jira/browse/ARROW-16240) / @AlenkaF
#### PRs and other links:
- [GitHub Pull Request #12955](https://github.com/apache/arrow/pull/12955)

<sub>**Note**: *This issue was originally created as [ARROW-16240](https://issues.apache.org/jira/browse/ARROW-16240). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*</sub>

	static ::arrow::Status Open(const ::arrow::Schema& schema, MemoryPool* pool,
	                            std::shared_ptr<::arrow::io::OutputStream> sink,
	                            std::shared_ptr<WriterProperties> properties,
	                            std::shared_ptr<ArrowWriterProperties> arrow_properties,
	                            std::unique_ptr<FileWriter>* writer);

	virtual std::shared_ptr<::arrow::Schema> schema() const = 0;

	/// \brief Write a Table to Parquet.
	virtual ::arrow::Status WriteTable(const ::arrow::Table& table, int64_t chunk_size) = 0;

[Python] Support row_group_size/chunk_size keyword in pq.write_to_dataset with use_legacy_dataset=False #31636