## Description
from datetime import datetime, timedelta, timezone

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
def generate_data(event_type, event_id, offset=0):
    """Build one synthetic event record for the partitioning repro.

    Args:
        event_type: Integer event category (1, 2, or 3); controls which
            optional fields are populated.
        event_id: Identifier copied into the record; used later as the
            DataFrame index.
        offset: Seconds added to the current UTC time before taking the
            event date, so rows can be spread across date partitions.

    Returns:
        dict with keys 'event_type', 'event_id', 'event_date', 'foo',
        'bar', and 'different'. For types 1 and 2 the 'different' value
        is None, which is what triggers the null-vs-string schema
        mismatch this script reproduces.
    """
    # Timezone-aware replacement for the deprecated datetime.utcnow();
    # the resulting UTC calendar date is identical either way.
    now = datetime.now(timezone.utc) + timedelta(seconds=offset)
    obj = {
        'event_type': event_type,
        'event_id': event_id,
        'event_date': now.date(),
        'foo': None,
        'bar': u'hello',
    }
    if event_type == 2:
        obj['foo'] = 1
        obj['bar'] = u'world'
    if event_type == 3:
        obj['different'] = u'data'
        obj['bar'] = u'event type 3'
    else:
        # Note: this 'else' pairs with the event_type == 3 test only, so
        # BOTH type 1 and type 2 land here with a null 'different' column.
        obj['different'] = None
    return obj
# Two rows per event type, 72 hours apart, so every type is written into
# two distinct event_date partitions.
data = [
    generate_data(1, 1, 1),
    generate_data(1, 1, 3600 * 72),
    generate_data(2, 1, 1),
    generate_data(2, 1, 3600 * 72),
    generate_data(3, 1, 1),
    generate_data(3, 1, 3600 * 72),
]
# 'event_id' becomes the frame index; the remaining columns define the
# Arrow schema inferred per partition on write.
df = pd.DataFrame.from_records(data, index='event_id')
table = pa.Table.from_pandas(df)
# Each (event_type, event_date) partition gets its own file/schema:
# partitions where 'different' is all-null end up with a null-typed
# column, while the event_type=3 partitions get a string column.
pq.write_to_dataset(table, root_path='/tmp/events', partition_cols=['event_type', 'event_date'])
# Reading the dataset back validates the per-partition schemas and is
# where the ValueError below is raised (ARROW-2860).
dataset = pq.ParquetDataset('/tmp/events')
table = dataset.read()
print(table.num_rows)

Expected output:
6

Actual:
python example_failure.py
Traceback (most recent call last):
File "example_failure.py", line 43, in <module>
dataset = pq.ParquetDataset('/tmp/events')
File "/Users/sam/.virtualenvs/test-parquet/lib/python2.7/site-packages/pyarrow/parquet.py", line 745, in __init__
self.validate_schemas()
File "/Users/sam/.virtualenvs/test-parquet/lib/python2.7/site-packages/pyarrow/parquet.py", line 775, in validate_schemas
dataset_schema))
ValueError: Schema in partition[event_type=2, event_date=0] /tmp/events/event_type=3/event_date=2018-07-16 00:00:00/be001bf576674d09825539f20e99ebe5.parquet was different.
bar: string
different: string
foo: double
event_id: int64
metadata
--------
{'pandas': '{"pandas_version": "0.23.3", "index_columns": ["event_id"], "columns": [{"metadata": null, "field_name": "bar", "name": "bar", "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, "field_name": "different", "name": "different", "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, "field_name": "foo", "name": "foo", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "event_id", "name": "event_id", "numpy_type": "int64", "pandas_type": "int64"}], "column_indexes": [{"metadata": null, "field_name": null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}'}
vs
bar: string
different: null
foo: double
event_id: int64
metadata
--------
{'pandas': '{"pandas_version": "0.23.3", "index_columns": ["event_id"], "columns": [{"metadata": null, "field_name": "bar", "name": "bar", "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, "field_name": "different", "name": "different", "numpy_type": "object", "pandas_type": "empty"}, {"metadata": null, "field_name": "foo", "name": "foo", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "event_id", "name": "event_id", "numpy_type": "int64", "pandas_type": "int64"}], "column_indexes": [{"metadata": null, "field_name": null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}'}

Apparently what is happening is that pyarrow infers the schema of each partition individually: the partitions for event_type=3 / event_date=* both have string values in the `different` column, whereas the partitions for the other event types contain only nulls there. That discrepancy causes the all-None `different` column of the other partitions to be labeled with pandas_type "empty" instead of "unicode", so schema validation fails.
Reporter: Sam Oluwalana
Related issues:
- [Python] More graceful reading of empty String columns in ParquetDataset (relates to)
- [Python][Dataset] Support using dataset API in pyarrow.parquet with a minimal ParquetDataset shim (depends upon)
- [C++][Dataset] Support null -> other type promotion in Dataset scanning (depends upon)
Note: This issue was originally created as ARROW-2860. Please see the migration documentation for further details.