Description
When reading Parquet data with timestamps stored as INT96, pyarrow assumes the timestamp type should be nanoseconds. If the Parquet column stores values that are out of bounds for nanosecond resolution, the conversion to an Arrow table silently overflows.
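For reference, a signed 64-bit nanosecond timestamp can only represent dates between roughly 1677 and 2262, so the years 1000 and 3000 used in the example below cannot be stored at nanosecond resolution without wrapping around. The bounds are easy to check (pandas shown here purely for illustration):

import pandas as pd

# Limits of an int64 nanosecond-resolution timestamp
pd.Timestamp.min  # Timestamp('1677-09-21 00:12:43.145224193')
pd.Timestamp.max  # Timestamp('2262-04-11 23:47:16.854775807')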
# Round Trip Example
import datetime

import pandas as pd
import pyarrow as pa
from pyarrow import parquet as pq

df = pd.DataFrame({"a": [datetime.datetime(1000, 1, 1), datetime.datetime(2000, 1, 1), datetime.datetime(3000, 1, 1)]})

a_df = pa.Table.from_pandas(df)
a_df.schema  # a: timestamp[us]

# Write with INT96 timestamps to match the format of the original file
pq.write_table(a_df, "test_round_trip.parquet", use_deprecated_int96_timestamps=True, version="1.0")

pfile = pq.ParquetFile("test_round_trip.parquet")
pfile.schema_arrow  # a: timestamp[ns] -- the reader assumes nanoseconds

pq.read_table("test_round_trip.parquet").to_pandas()
# Results in values:
# 2169-02-08 23:09:07.419103232
# 2000-01-01 00:00:00
# 1830-11-23 00:50:52.580896768

The above example only exists to demonstrate the bug by getting pyarrow to write a Parquet file into a state similar to the original file where the bug was discovered. The bug was originally found when trying to read Parquet output from Amazon Athena with pyarrow, where we have no control over the format of the Parquet files being produced.
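The garbled first and third values are consistent with plain 64-bit integer wrap-around: converting the stored timestamps to nanoseconds overflows int64, and the wrapped value lands back inside the representable range. A quick sketch of the arithmetic (illustration only; this is not what pyarrow's reader literally executes):

import datetime

EPOCH = datetime.datetime(1970, 1, 1)

def wrapped_ns(dt):
    # True nanoseconds since the epoch, as an arbitrary-precision Python int
    delta = dt - EPOCH
    ns = (delta.days * 86400 + delta.seconds) * 10**9
    # Emulate int64 overflow by wrapping into [-2**63, 2**63)
    return (ns + 2**63) % 2**64 - 2**63

for dt in (datetime.datetime(1000, 1, 1), datetime.datetime(3000, 1, 1)):
    ns = wrapped_ns(dt)
    # timedelta only has microsecond resolution, so the ns fraction is truncated
    print(dt, "->", EPOCH + datetime.timedelta(microseconds=ns // 1000))

# 1000-01-01 00:00:00 -> 2169-02-08 23:09:07.419103
# 3000-01-01 00:00:00 -> 1830-11-23 00:50:52.580896

These match the wrapped values returned by read_table above.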
Context:
I found some existing issues that might also be related:
- ARROW-10444
- ARROW-6779 (this shows a similar response, although testing it on pyarrow v3 raises an out-of-bounds error instead)
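The C++ reader already carries a property for this, ArrowReaderProperties::coerce_int96_timestamp_unit_, and the related issues at the bottom of this report track exposing it to Python and R. Assuming a pyarrow release where read_table accepts a coerce_int96_timestamp_unit argument, the file could be read losslessly by forcing an explicit microsecond unit; a minimal sketch:

from pyarrow import parquet as pq

# Assumed keyword on a pyarrow build that exposes the INT96 coercion option;
# "us" tells the reader to interpret INT96 timestamps as microseconds
# instead of assuming nanoseconds.
table = pq.read_table("test_round_trip.parquet", coerce_int96_timestamp_unit="us")
table.schema       # a: timestamp[us]
table.column("a")  # 1000-01-01, 2000-01-01, 3000-01-01 preserved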
Environment: macOS Mojave 10.14.6
Python 3.8.3
pyarrow 3.0.0
pandas 1.2.3
Reporter: Karik Isichei / @isichei
Assignee: Karik Isichei / @isichei
Related issues:
- [Python] Expose Parquet ArrowReaderProperties::coerce_int96_timestamp_unit_ (is required by)
- [R] Expose Parquet ArrowReaderProperties::coerce_int96_timestamp_unit_ (is required by)
PRs and other links:
Note: This issue was originally created as ARROW-12096. Please see the migration documentation for further details.