Description
The business logic of the Python implementation for reading partitioned Parquet datasets in `pyarrow.parquet.ParquetDataset` has been ported to C++ (ARROW-3764), and it can also optionally be enabled in `ParquetDataset(..)` by passing `use_legacy_dataset=False` (ARROW-8039).
But the question still is: what do we do with this class long term?
So for users who now do:

```python
dataset = pq.ParquetDataset(...)
dataset.metadata
table = dataset.read()
```

what should they do in the future?
Do we keep a class like this (but backed by the `pyarrow.dataset` implementation), or do we deprecate the class entirely, pointing users to `dataset = ds.dataset(..., format="parquet")`?
In any case, we should strive to entirely delete the current custom Python implementation, but we could keep a `ParquetDataset` class that wraps or inherits from `pyarrow.dataset.FileSystemDataset` and adds some parquet specifics to it (e.g. access to the parquet schema, the common metadata, exposing the parquet-specific constructor keywords more easily, ...).
Features the ParquetDataset currently has that are not exactly covered by pyarrow.dataset:
- Partitioning information (the `.partitions` attribute)
- Access to common metadata (the `.metadata_path`, `.common_metadata_path` and `.metadata` attributes)
- ParquetSchema of the dataset
Reporter: Joris Van den Bossche / @jorisvandenbossche
Related issues:
- [Python] Deprecate the legacy ParquetDataset custom python-based implementation (is a parent of)
- [Python] Legacy dataset can't roundtrip Int64 with nulls if partitioned (relates to)
- [Python][Documentation] Document migration from ParquetDataset to pyarrow.datasets (relates to)
- [Python] Remove the legacy ParquetDataset custom python-based implementation (relates to)
Note: This issue was originally created as ARROW-9720. Please see the migration documentation for further details.