[Python] Difference in timezone-awareness of result when calling to_pandas between unnested and nested timestamp arrays #41162

@amoeba

Describe the bug, including details regarding any error messages, version, and platform.

When you call .to_pandas() on a timestamp array, you get timezone-aware values. When you call .to_pandas() on a nested timestamp array, you get timezone-naive values. For example:

import pandas as pd
import pyarrow as pa

# The string carries a UTC offset; tz= converts it to Europe/Paris
ts = pd.Timestamp('2024-01-01 12:00:00+0000', tz='Europe/Paris')

# unnested, we get a timezone-aware result
pa.Array.from_pandas([ts]).to_pandas()[0]
# => Timestamp('2024-01-01 13:00:00+0100', tz='Europe/Paris')

# nested, we get a timezone-naive result
pa.Array.from_pandas([[ts]]).to_pandas()[0][0]
# => numpy.datetime64('2024-01-01T12:00:00.000000')

While both values refer to the same instant (which is good), the unnested case is timezone-aware while the nested case is timezone-naive and expressed in UTC. This difference may be surprising to users and would require extra steps on their part to reconstruct a timezone-aware result if that was their goal.
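To illustrate the kind of extra step a user would need, here is a minimal sketch that turns the naive value from the nested case back into a timezone-aware one. It assumes the naive datetime64 holds the UTC instant, as the output above suggests. `restore_timezone` is a hypothetical helper name, not part of pandas or PyArrow.

```python
import numpy as np
import pandas as pd


def restore_timezone(naive_value, tz):
    """Interpret a naive datetime64 as UTC and convert it to `tz`.

    Hypothetical helper for illustration only; it assumes the naive
    value stores the UTC instant, as in the nested output above.
    """
    return pd.Timestamp(naive_value).tz_localize("UTC").tz_convert(tz)


# The value returned in the nested case above
naive = np.datetime64("2024-01-01T12:00:00.000000")

restored = restore_timezone(naive, "Europe/Paris")
print(restored)
```

The user has to know the original timezone (for example from the Arrow type of the nested field) to do this, which is the extra bookkeeping I'd like to avoid.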

Another difference I notice in the above output is that the unnested version is returned as a pandas Timestamp while the nested version is returned as a numpy datetime64. It's my understanding that numpy's datetimes aren't timezone-aware (ref), so it seems possible PyArrow is inheriting that behavior. The pandas docs point to the arrays.DatetimeArray extension type, which I don't think PyArrow is making use of in the nested case.
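For reference, a small pandas-only sketch of the distinction I mean: a DatetimeArray keeps its timezone in the dtype, while converting the same data to a numpy datetime64 array drops it. This only shows the pandas and numpy behavior as I understand it, and says nothing about what PyArrow does internally.

```python
import numpy as np
import pandas as pd

# Build a timezone-aware pandas extension array
index = pd.to_datetime(["2024-01-01 12:00:00"]).tz_localize("UTC")
aware = pd.array(index.tz_convert("Europe/Paris"))

print(type(aware).__name__)  # the pandas extension array class
print(aware.dtype)           # the dtype carries the timezone
print(aware[0])              # a timezone-aware pandas Timestamp

# Converting to a numpy datetime64 array keeps the instant (in UTC)
# but loses the timezone
as_numpy = np.asarray(aware, dtype="datetime64[ns]")
print(as_numpy.dtype)
print(as_numpy[0])
```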

Is it possible to have a consistent result with respect to timezone-awareness in this case?

Component(s)

Python
