
[Python] More graceful reading of empty String columns in ParquetDataset #19053

@asfimport

Description


Currently, when saving a ParquetDataset from Pandas, we don't get consistent schemas, even if the source was a single DataFrame. This happens because in some partitions an object column, such as a string column, can become empty, and the resulting Arrow schema for that partition then differs. In the central metadata, this column is stored as pa.string, whereas in the partition file with the empty column, it is stored as pa.null.

The two schemas are still a valid match in terms of schema evolution, and we should respect that in

def validate_schemas(self):

Instead of calling pa.Schema.equals in

if not dataset_schema.equals(file_schema, check_metadata=False):

we should introduce a new method, pa.Schema.can_evolve_to, that is more lenient and returns True if a dataset piece has a null column where the main metadata declares a nullable column of any type.

Reporter: Uwe Korn / @xhochy


Note: This issue was originally created as ARROW-2659. Please see the migration documentation for further details.
