
[Python][Dataset] Writing dataset from python iterator of record batches #26817


Description


At the moment, you can write a dataset from Python with ds.write_dataset, for example starting from a list of record batches.

However, this currently needs to be an actual list (or gets converted to one), so an iterator or generator is fully consumed (potentially bringing all record batches into memory) before writing starts.

We should also be able to use the Python iterator itself to back a RecordBatchIterator-like object that can be consumed while the batches are written.

We already have an arrow::py::PyRecordBatchReader that might be useful here.

Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: David Li / @lidavidm

Note: This issue was originally created as ARROW-10882. Please see the migration documentation for further details.
