
[Python] Schema Evolution - Add new Field #25970

Description

We are trying to leverage the new Dataset implementation and specifically rely on its schema evolution feature. However, when a new field is added in a later Parquet file, the schemas do not seem to be merged and the new field is not available.

Simple example:

import pandas as pd
from pyarrow import parquet as pq
from pyarrow import dataset as ds
import pyarrow as pa

path = "data/sample/"

df1 = pd.DataFrame({"field1": ["a", "b", "c"]})
df2 = pd.DataFrame({"field1": ["d", "e", "f"],
                    "field2": ["x", "y", "z"]})

df1.to_parquet(path + "df1.parquet", coerce_timestamps=None, version="2.0", index=False)
df2.to_parquet(path + "df2.parquet", coerce_timestamps=None, version="2.0", index=False)

# read via pandas
df = pd.read_parquet(path)
print(df.head())
print(df.info())

Output:


  field1
0      a
1      b
2      c
3      d
4      e
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 1 columns):
#   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   field1  6 non-null      object
dtypes: object(1)
memory usage: 176.0+ bytes
None

My expectation was to get field2 as well, based on my understanding of the new Dataset implementation from ARROW-8039.

When using the Dataset API with a schema created from the second dataframe, I am able to read field2:

# write metadata
schema = pa.Schema.from_pandas(df2, preserve_index=False)
pq.write_metadata(schema, path + "_common_metadata", version="2.0", coerce_timestamps=None)

# read with new dataset and schema
schema = pq.read_schema(path + "_common_metadata")
df = ds.dataset(path, schema=schema, format="parquet").to_table().to_pandas()
print(df.head())
print(df.info())

Output:


  field1 field2
0      a   None
1      b   None
2      c   None
3      d      x
4      e      y
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
#   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   field1  6 non-null      object
 1   field2  3 non-null      object
dtypes: object(2)
memory usage: 224.0+ bytes
None

This works, but I would like to avoid writing a _common_metadata file if possible. Is there a way to get the schema merge without passing an explicit schema, or is this yet to be implemented?

Environment: pandas==1.1.1
pyarrow==1.0.0
Reporter: Daniel Figus

Note: This issue was originally created as ARROW-9942. Please see the migration documentation for further details.
