Description
Describe the bug, including details regarding any error messages, version, and platform.
When you call .to_pandas() on a timestamp array, you get timezone-aware values. When you call .to_pandas() on a nested timestamp array, you get timezone-naive values. For example:
import pandas as pd
import pyarrow as pa
myts = pd.Timestamp('2024-01-01 12:00:00+0000', tz='Europe/Paris')
# unnested, we get a timezone-aware result
pa.Array.from_pandas([myts]).to_pandas()[0]
# => Timestamp('2024-01-01 13:00:00+0100', tz='Europe/Paris')
# nested, we get a timezone-naive result
pa.Array.from_pandas([[myts]]).to_pandas()[0][0]
# => numpy.datetime64('2024-01-01T12:00:00.000000')
While the values appear correct (which is good), the unnested case is timezone-aware while the nested case is timezone-naive. This difference may be surprising to users and would require extra steps on their part to reconstruct a timezone-aware result if that was their goal.
Another difference I notice in the above output is that the unnested version is returned as a pandas Timestamp while the nested version is returned as a NumPy datetime64. It's my understanding that NumPy's datetimes aren't timezone-aware (ref), so it seems possible PyArrow is inheriting that behavior. The pandas docs point to the arrays.DatetimeArray extension type, which I don't think PyArrow is making use of.
Is it possible to have a consistent result with respect to timezone-awareness in this case?
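For reference, the extra steps mentioned above might look roughly like the sketch below. It assumes the naive datetime64 values coming out of the nested case are UTC instants (which matches the output shown earlier) and that the target timezone is known separately, for example from the Arrow timestamp type. The helper name restore_timezone is hypothetical and not part of PyArrow or pandas.

```python
import numpy as np
import pandas as pd


def restore_timezone(naive_utc_values, tz):
    """Re-attach a timezone to naive datetime64 values.

    Assumption: the naive values are UTC instants, as in the nested
    output shown in this report.
    """
    # Mark the naive values as UTC, then convert to the target zone.
    return pd.DatetimeIndex(naive_utc_values).tz_localize("UTC").tz_convert(tz)


# Stand-in for the inner values returned by the nested .to_pandas() call.
naive = np.array(["2024-01-01T12:00:00.000000"], dtype="datetime64[us]")
aware = restore_timezone(naive, "Europe/Paris")
print(aware[0])
```

This only covers a single inner array; a real nested column would need the same step applied to each element, which is the kind of overhead a consistent timezone-aware result would avoid.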
Component(s)
Python