Description
Currently, we can't parse "our own" string representation of a timestamp array with the timestamp parser strptime:
import datetime
import pyarrow as pa
import pyarrow.compute as pc
>>> pa.array([datetime.datetime(2022, 3, 5, 9)])
<pyarrow.lib.TimestampArray object at 0x7f00c1d53dc0>
[
2022-03-05 09:00:00.000000
]
# trying to parse the above representation as string
>>> pc.strptime(["2022-03-05 09:00:00.000000"], format="%Y-%m-%d %H:%M:%S", unit="us")
...
ArrowInvalid: Failed to parse string: '2022-03-05 09:00:00.000000' as a scalar of type timestamp[us]
The reason for this is the fractional second part, so the following works:
>>> pc.strptime(["2022-03-05 09:00:00"], format="%Y-%m-%d %H:%M:%S", unit="us")
<pyarrow.lib.TimestampArray object at 0x7f00c1d6f940>
[
2022-03-05 09:00:00.000000
]
Now, I think this fails because strptime only supports parsing seconds as an integer (https://man7.org/linux/man-pages/man3/strptime.3.html).
But this creates a strange situation where the timestamp parser cannot parse the representation we use for timestamps.
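The integer-seconds limitation can be illustrated with Python's standard library alone. This is only an analogy, not Arrow's code path: it uses `datetime.datetime.strptime`, where `%S` likewise matches whole seconds and `%f` is an extension beyond the C standard set of directives. Whether `pc.strptime` accepts `%f` is not claimed here.

```python
import datetime

text = "2022-03-05 09:00:00.000000"

# "%S" matches whole seconds only, so the trailing ".000000" is left unparsed
# and Python raises ValueError.
try:
    datetime.datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    whole_seconds_ok = True
except ValueError:
    whole_seconds_ok = False

# "%f" (fractional seconds) is an extension to the C standard directives,
# which is why a C-strptime-based parser has no equivalent for it.
parsed = datetime.datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")
print(whole_seconds_ok, parsed)
```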
In addition, for CSV we have a custom ISO parser (used by default), so when parsing the strings while reading a CSV file, the same string with fractional seconds does work:
s = b"""a
2022-03-05 09:00:00.000000"""
import io
from pyarrow import csv
>>> csv.read_csv(io.BytesIO(s))
pyarrow.Table
a: timestamp[ns]
----
a: [[2022-03-05 09:00:00.000000000]]
I realize that you can use the generic "cast" for doing this string parsing:
>>> pc.cast(["2022-03-05 09:00:00.000000"], pa.timestamp("us"))
<pyarrow.lib.TimestampArray object at 0x7f00c1d53d60>
[
2022-03-05 09:00:00.000000
]
But this was not the first approach I thought of (I think it is quite typical to reach for strptime first, and it is confusing that it doesn't work; the error message is also not helpful).
cc @pitrou @rok
Reporter: Joris Van den Bossche / @jorisvandenbossche
Watchers: Rok Mihevc / @rok
Related issues:
- [C++] Strptime issues umbrella (is a child of)
- [Python] Failed to parse string into timestamp (duplicates)
- [C++][Python] strptime fails to parse subsecond timestamps (duplicates)
Note: This issue was originally created as ARROW-15883. Please see the migration documentation for further details.