Description
The business logic of the Python implementation for reading partitioned Parquet datasets in `pyarrow.parquet.ParquetDataset` has been ported to C++ (ARROW-3764), and it can also optionally be enabled in `ParquetDataset(..)` by passing `use_legacy_dataset=False` (ARROW-8039).
But the question still is: what do we do with this class long term?
So for users who now do:

```python
dataset = pq.ParquetDataset(...)
dataset.metadata
table = dataset.read()
```

what should they do in the future?
Do we keep a class like this (but backed by the `pyarrow.dataset` implementation), or do we deprecate the class entirely, pointing users to `dataset = ds.dataset(..., format="parquet")`?
In any case, we should strive to entirely delete the current custom Python implementation, but we could keep a `ParquetDataset` class that wraps or inherits from `pyarrow.dataset.FileSystemDataset` and adds some parquet specifics to it (e.g. access to the parquet schema, the common metadata, exposing the parquet-specific constructor keywords more easily, ...).
Features the ParquetDataset currently has that are not exactly covered by pyarrow.dataset:
- Partitioning information (the `.partitions` attribute)
- Access to common metadata (the `.metadata_path`, `.common_metadata_path` and `.metadata` attributes)
- ParquetSchema of the dataset
Reporter: Joris Van den Bossche / @jorisvandenbossche
Related issues:
- [Python] Deprecate the legacy ParquetDataset custom python-based implementation (is a parent of)
- [Python] Legacy dataset can't roundtrip Int64 with nulls if partitioned (relates to)
- [Python][Documentation] Document migration from ParquetDataset to pyarrow.datasets (relates to)
- [Python] Remove the legacy ParquetDataset custom python-based implementation (relates to)
Note: This issue was originally created as ARROW-9720. Please see the migration documentation for further details.