Description
Currently, when saving a ParquetDataset from Pandas, we don't get consistent schemas, even if the source was a single DataFrame. This is because object columns such as strings can be entirely empty in some partitions, and then the resulting Arrow schema differs: in the central metadata the column is stored as pa.string, whereas in the partition file with the empty column it is stored as pa.null.
The two schemas are still a valid match in terms of schema evolution, and we should respect that in
arrow/python/pyarrow/parquet.py
Line 754 in 79a2207
| def validate_schemas(self): |
Instead of using pa.Schema.equals in
arrow/python/pyarrow/parquet.py
Line 778 in 79a2207
| if not dataset_schema.equals(file_schema, check_metadata=False): |
we should implement a new method pa.Schema.can_evolve_to that is more graceful and returns True if a dataset piece has a null column where the main metadata states a nullable column of any type.
Related issues:
- [Python][Parquet][C++] Null values in a single partition of Parquet dataset, results in invalid schema on read (is related to)
- [Python][C++][Parquet] Support reading Parquet files having a permutation of column order (is related to)
- [Python][Dataset] Support using dataset API in pyarrow.parquet with a minimal ParquetDataset shim (depends upon)
- [C++][Dataset] Support null -> other type promotion in Dataset scanning (depends upon)
Note: This issue was originally created as ARROW-2659. Please see the migration documentation for further details.