Description
We are trying to leverage the new Dataset implementation and specifically rely on the schema evolution feature there. However, when a later Parquet file adds a new field, the schemas don't seem to be merged and the new field is not available.
Simple example:
import pandas as pd
from pyarrow import parquet as pq
from pyarrow import dataset as ds
import pyarrow as pa
path = "data/sample/"
df1 = pd.DataFrame({"field1": ["a", "b", "c"]})
df2 = pd.DataFrame({"field1": ["d", "e", "f"],
                    "field2": ["x", "y", "z"]})
df1.to_parquet(path + "df1.parquet", coerce_timestamps=None, version="2.0", index=False)
df2.to_parquet(path + "df2.parquet", coerce_timestamps=None, version="2.0", index=False)
# read via pandas
df = pd.read_parquet(path)
print(df.head())
print(df.info())
Output:
field1
0 a
1 b
2 c
3 d
4 e
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 field1 6 non-null object
dtypes: object(1)
memory usage: 176.0+ bytes
None
My expectation was to get field2 as well, based on what I understood about the new Dataset implementation from ARROW-8039.
When using the Dataset API with a schema created from the second dataframe I'm able to read the field2:
# write metadata
schema = pa.Schema.from_pandas(df2, preserve_index=False)
pq.write_metadata(schema, path + "_common_metadata", version="2.0", coerce_timestamps=None)
# read with new dataset and schema
schema = pq.read_schema(path + "_common_metadata")
df = ds.dataset(path, schema=schema, format="parquet").to_table().to_pandas()
print(df.head())
print(df.info())
Output:
field1 field2
0 a None
1 b None
2 c None
3 d x
4 e y
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 field1 6 non-null object
1 field2 3 non-null object
dtypes: object(2)
memory usage: 224.0+ bytes
None
This works; however, I want to avoid writing a _common_metadata file if possible. Is there a way to get the schema merge without passing an explicit schema? Or is this yet to be implemented?
Environment: pandas==1.1.1
pyarrow==1.0.0
Reporter: Daniel Figus
Related issues:
- [C++][Dataset] Schema evolution in Dataset scanning (is related to)
Note: This issue was originally created as ARROW-9942. Please see the migration documentation for further details.