[Python] pyarrow.dataset.dataset does not accept RecordBatchReader as source #38012

@sugibuchi

Description

Describe the bug, including details regarding any error messages, version, and platform.

The documentation of pyarrow.dataset.dataset says this function accepts a RecordBatchReader as source:

(List of) batches or tables, iterable of batches, or RecordBatchReader:
Create an InMemoryDataset. If an iterable or empty list is given, a schema must also be given. If an iterable or RecordBatchReader is given, the resulting dataset can only be scanned once; further attempts will raise an error.
https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow-dataset-dataset

However, pyarrow.dataset.dataset raises a TypeError when it is called with a RecordBatchReader as source.

Environment

  • OS: Ubuntu 22.04
  • Python: Python 3.9.18
  • PyArrow: 13.0.0

POC

import pyarrow as pa
import pyarrow.dataset as ds

table = pa.Table.from_pydict({
    "col_1": list(range(0, 10000)),
    "col_2": [f"v{v}" for v in range(0, 10000)]
})

batches = table.to_batches(max_chunksize=100)

ds.dataset(batches) # -> <pyarrow._dataset.InMemoryDataset at ...>

batch_reader = pa.RecordBatchReader.from_batches(table.schema, batches)
ds.dataset(batch_reader) # -> Fail!

The last line fails with the following error.

File /opt/conda/lib/python3.9/site-packages/pyarrow/dataset.py:793, in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
    791     return _in_memory_dataset(source, **kwargs)
    792 else:
--> 793     raise TypeError(
    794         'Expected a path-like, list of path-likes or a list of Datasets '
    795         'instead of the given type: {}'.format(type(source).__name__)
    796     )

TypeError: Expected a path-like, list of path-likes or a list of Datasets instead of the given type: RecordBatchReader

Component(s)

Python
