Open
Description
Describe the bug, including details regarding any error messages, version, and platform.
The documentation for pyarrow.dataset.dataset says that this function accepts a RecordBatchReader as source:
(List of) batches or tables, iterable of batches, or RecordBatchReader:
Create an InMemoryDataset. If an iterable or empty list is given, a schema must also be given. If an iterable or RecordBatchReader is given, the resulting dataset can only be scanned once; further attempts will raise an error.
https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow-dataset-dataset
However, pyarrow.dataset.dataset raises a TypeError when it is called with a RecordBatchReader as source.
Environment
- OS: Ubuntu 22.04
- Python: Python 3.9.18
- PyArrow: 13.0.0
POC
import pyarrow as pa
import pyarrow.dataset as ds
table = pa.Table.from_pydict({
    "col_1": list(range(0, 10000)),
    "col_2": [f"v{v}" for v in range(0, 10000)]
})
batches = table.to_batches(max_chunksize=100)
ds.dataset(batches) # -> <pyarrow._dataset.InMemoryDataset at ...>
batch_reader = pa.RecordBatchReader.from_batches(table.schema, batches)
ds.dataset(batch_reader) # -> Fail!
The last line fails with the following error.
File /opt/conda/lib/python3.9/site-packages/pyarrow/dataset.py:793, in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
791 return _in_memory_dataset(source, **kwargs)
792 else:
--> 793 raise TypeError(
794 'Expected a path-like, list of path-likes or a list of Datasets '
795 'instead of the given type: {}'.format(type(source).__name__)
796 )
TypeError: Expected a path-like, list of path-likes or a list of Datasets instead of the given type: RecordBatchReader
Component(s)
Python