[Python] Schema Evolution - Add new Field

We are trying to leverage the new Dataset implementation and specifically rely on the schema evolution feature there. However when adding a new field in a later parquet file, the schemas don't seem to be merged and the new field is not available. 

Simple example:
```python

import pandas as pd
from pyarrow import parquet as pq
from pyarrow import dataset as ds
import pyarrow as pa

path = "data/sample/"

df1 = pd.DataFrame({"field1": ["a", "b", "c"]})
df2 = pd.DataFrame({"field1": ["d", "e", "f"],
                    "field2": ["x", "y", "z"]})

df1.to_parquet(path + "df1.parquet", coerce_timestamps=None, version="2.0", index=False)
df2.to_parquet(path + "df2.parquet", coerce_timestamps=None, version="2.0", index=False)

# read via pandas
df = pd.read_parquet(path)
print(df.head())
print(df.info())
```
Output:
```

  field1
0      a
1      b
2      c
3      d
4      e
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 1 columns):
#   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   field1  6 non-null      object
dtypes: object(1)
memory usage: 176.0+ bytes
None
```
My expectation was to get the field2 as well based on what I have understood with the new Dataset implementation from ARROW-8039.

When using the Dataset API with a schema created from the second dataframe I'm able to read the field2:
```python

# write metadata
schema = pa.Schema.from_pandas(df2, preserve_index=False)
pq.write_metadata(schema, path + "_common_metadata", version="2.0", coerce_timestamps=None)

# read with new dataset and schema
schema = pq.read_schema(path + "_common_metadata")
df = ds.dataset(path, schema=schema, format="parquet").to_table().to_pandas()
print(df.head())
print(df.info())
```
Output:
```

  field1 field2
0      a   None
1      b   None
2      c   None
3      d      x
4      e      y
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
#   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   field1  6 non-null      object
 1   field2  3 non-null      object
dtypes: object(2)
memory usage: 224.0+ bytes
None
```
This works, however I want to avoid to write a `_common_metadata` file if possible. Is there a way to get the schema merge without passing an explicit schema? Or is this this yet to be implemented?

**Environment**: pandas==1.1.1
pyarrow==1.0.0
**Reporter**: [Daniel Figus](https://issues.apache.org/jira/browse/ARROW-9942)
#### Related issues:
- [[C++][Dataset] Schema evolution in Dataset scanning](https://github.com/apache/arrow/issues/26923) (is related to)

<sub>**Note**: *This issue was originally created as [ARROW-9942](https://issues.apache.org/jira/browse/ARROW-9942). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*</sub>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Python] Schema Evolution - Add new Field #25970

Related issues:

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Python] Schema Evolution - Add new Field #25970

Description

Related issues:

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions