-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-9528: [Python] Honor tzinfo when converting from datetime #7805
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@github-actions crossbow submit test-conda-python-3.7-spark-master |
|
@github-actions crossbow submit test-conda-python-3.7-spark-master |
| } else if (PyDateTime_Check(obj)) { | ||
| ++timestamp_micro_count_; | ||
| OwnedRef tzinfo(PyObject_GetAttrString(obj, "tzinfo")); | ||
| if (tzinfo.obj() != nullptr && tzinfo.obj() != Py_None && timezone_.empty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
timezone_ should be first here.
| def as_py(self): | ||
| """ | ||
| Return this value as a Python datetime.datetime instance. | ||
| Return this value as a Pandas Timestamp instance (if available), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs to be reverted.
|
@github-actions crossbow submit test-conda-python-3.8-spark-master |
|
Revision: a5b2a51 Submitted crossbow builds: ursa-labs/crossbow @ actions-433
|
|
We can use task listed only in https://github.com/apache/arrow/blob/master/dev/tasks/tasks.yml#L1930 for If we need to run with Python 3.7, we can add a task to |
|
@kou 3.8 should be fine. Thank you!. I copied the command from the previous PR related to datetimes, I guess CI has changed since then. |
| ++int_count_; | ||
| } else if (PyDateTime_Check(obj)) { | ||
| ++timestamp_micro_count_; | ||
| OwnedRef tzinfo(PyObject_GetAttrString(obj, "tzinfo")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If null is returned, it means Python raised an error (for example the attribute doesn't exist, which is unlikely). You want either to return that error, or to ignore it (using PyErr_Clear).
| def test_nested_with_timestamp_tz_round_trip(): | ||
| ts = pd.Timestamp.now() | ||
| ts_dt = ts.to_pydatetime() | ||
| arr = pa.array([ts_dt], type=pa.timestamp('us', tz='America/New_York')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this test presume that ts itself was produced in "America/New York" timezone? It's not clear to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe so. Pretty sure my machine uses it. Local time, I'll double check by setting a different tz
| if (tzinfo.obj() != nullptr && tzinfo.obj() != Py_None && timezone_.empty()) { | ||
| // From public docs on array construction | ||
| // "Localized timestamps will currently be returned as UTC " | ||
| // representation). " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean Arrow simply stores an erroneous value? We don't do timezone conversion in Arrow, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually not sure what it was intended to mean. This was my best guess. Not accounting for timezones info seems like a bug? Would you prefer I try to capture the first time zone encountered as a string? Or is the preference not to have the logic in this PR in arrow in the first place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I reuse the nomenclature from the test, I get without this PR (it's 14h21 UTC currently):
>>> now_utc
datetime.datetime(2020, 7, 20, 14, 21, 42, 96119, tzinfo=<UTC>)
>>> arr = pa.array([now_utc])
>>> arr
<pyarrow.lib.TimestampArray object at 0x7f44b0bfcc90>
[
2020-07-20 14:21:42.096119
]
>>> arr.type.tz
>>> arr.to_pylist()
[datetime.datetime(2020, 7, 20, 14, 21, 42, 96119)]
>>> arr.to_pandas()
0 2020-07-20 14:21:42.096119
dtype: datetime64[ns]
>>> now_with_tz
datetime.datetime(2020, 7, 20, 10, 21, 42, 96119, tzinfo=<DstTzInfo 'US/Eastern' EDT-1 day, 20:00:00 DST>)
>>> arr = pa.array([now_with_tz])
>>> arr.type.tz
>>> arr.to_pylist()
[datetime.datetime(2020, 7, 20, 10, 21, 42, 96119)]
>>> arr.to_pylist()[0].tzinfo
>>> arr.to_pandas()
0 2020-07-20 10:21:42.096119
dtype: datetime64[ns]
So without the PR, there's the problem that timestamps lose the timezone information from Python. It seems to get worse with this PR because non-UTC timestamps get tagged as UTC without being corrected for the timezone's offset, which is misleading. At least originally you may be alerted by the absence of a timezone on the type (in Python terms, it's a "naive" timestamp).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to get worse with this PR because non-UTC timestamps get tagged as UTC without being corrected for the timezone's offset, which is misleading.
That is not the intent of the PR, right now everything gets corrected to UTC. As an example:
This correctly keeps the times logically the same. I can make the change to try to keep the original timezones in place and changes US/Eastern to the correct time in UTC>
>>> now_with_tz = datetime.datetime(2020, 7, 20, 10, 21, 42, 96119, tzinfo=pytz.timezone('US/Eastern'))
>>> arr = pa.array([now_with_tz])
>>> arr.type.tz
'UTC'
>>> arr.to_pylist()
[datetime.datetime(2020, 7, 20, 15, 17, 42, 96119, tzinfo=<UTC>)]
>>> arr.to_pylist()[0].tzinfo
<UTC>
>>> arr.to_pandas()
0 2020-07-20 15:17:42.096119+00:00
dtype: datetime64[ns, UTC]
| struct = pa.StructArray.from_arrays([arr, arr], ['start', 'stop']) | ||
|
|
||
| result = struct.to_pandas() | ||
| # N.B. we test with Panaas because the Arrow types are not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Pandas" :-)
| result = struct.to_pandas() | ||
| # N.B. we test with Panaas because the Arrow types are not | ||
| # actually equal. All Timezone aware times are currently normalized | ||
| # to "UTC" as the timesetamp type.but since this conversion results |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"timestamp"
| # N.B. we test with Panaas because the Arrow types are not | ||
| # actually equal. All Timezone aware times are currently normalized | ||
| # to "UTC" as the timesetamp type.but since this conversion results | ||
| # in object dtypes, and tzinfo is preserrved the comparison should |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"preserved"
| def test_array_from_scalar(): | ||
| today = datetime.date.today() | ||
| now = datetime.datetime.now() | ||
| now_utc = now.replace(tzinfo=pytz.utc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on my experimentations, you should write:
now_utc = datetime.datetime.now(tz=pytz.utc)
(simply calling .replace(tzinfo=pytz.utc) doesn't adjust the recorded time for the timezone change, so you get the local time tagged with a UTC timezone)
And, yes, this probably doesn't matter for the test's correctness :-)
|
|
||
|
|
||
| def test_nested_with_timestamp_tz_round_trip(): | ||
| ts = pd.Timestamp.now() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What timezone does this timestamp have? Is it a naive timestamp? Would be nice explaining it in comments.
|
My main concern with this solution is while it resolves the pandas roundtrip, the intermediate array values are different. Running the following snippet on three different revisions: 0.17.1MasterThis patchWhile the current master works for this example and the spark patch fixes the spark integration test, it breaks the nested roundtrip example discussed in the ML thread. @emkornfield @BryanCutler thoughts? |
|
@kszucs Breaking users is a concern, I'll add an environment variable for both this change and the previous one that can keep the old buggy behavior. Just to clarify: was actually ~5PM Budapest time when you ran these test (i.e. this patch looks like it fixes a bug?) I thought the unit test I added for Pandas captured the intent of the ML example? Let me try to run the example by hand in python to see the results. |
Right, this patch produces the right result.
It fixes that as well, just doesn't keep the old buggy behavior. I was considering to just apply the spark patch on the current master to keep the old buggy behavior, but there is still the nested issue. |
I've run out of time this morning to work on this PR. I'll update it tonight with an environment variable flag that can retain the old buggy behavior. |
|
@emkornfield Thanks for working on this! In the meantime I'm going to apply the reversion patch and cut RC2 since it is going to take at least 6-8 hours to build and three additional days for voting, so we'll have enough time to sort this issue out and decide to either cut RC3 including this patch or keep RC2. |
|
Just to clarify things, is the main concern with this patch over keeping the previous buggy behavior? Besides that are these changes producing correct results and passing roundtrip tests? |
|
Correct me if I'm wrong, but IIUC there are doubts about a few things:
This patch may be the right solution, but I fear that we haven't adequately thought through (and tested) all of the implications and upgrade paths. And two of the people with the strongest opinion's about pyarrow's API (@wesm and @pitrou) just left for vacation and have expressed a preference for reverting the initial change for the 1.0 release. At this stage of the 1.0 release, I'd rather pyarrow continue to be wrong in the expected way (i.e. revert and not merge this yet) than be right in an unexpected way and possibly wrong in other unknown ways. |
|
@nealrichardson I think we should discuss this on the mailing list. The prior patch has been reverted and I'll use this one to have an end-to-end solution which probably won't make it into 1.0 |
|
Sure. I thought the mailing list discussion said to discuss here 🤷 |
|
too many communication channels. |
|
@emkornfield I think we can close this in favor of #7816 |
|
yes. |
Follow up of: - ARROW-9223: [Python] Propagate timezone information in pandas conversion - ARROW-9528: [Python] Honor tzinfo when converting from datetime (#7805) TODOs: - [x] Store all Timestamp values normalized to UTC - [x] Infer timezone from the array values if no explicit type was given - [x] Testing (especially pandas object roundtrip) - [x] Testing of timezone-naive roundtrips - [x] Testing mixed pandas and datetime objects Closes #7816 from kszucs/tz Lead-authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com> Co-authored-by: Micah Kornfield <emkornfield@gmail.com> Signed-off-by: Wes McKinney <wesm@apache.org>
Follow up of: - ARROW-9223: [Python] Propagate timezone information in pandas conversion - ARROW-9528: [Python] Honor tzinfo when converting from datetime (apache/arrow#7805) TODOs: - [x] Store all Timestamp values normalized to UTC - [x] Infer timezone from the array values if no explicit type was given - [x] Testing (especially pandas object roundtrip) - [x] Testing of timezone-naive roundtrips - [x] Testing mixed pandas and datetime objects Closes #7816 from kszucs/tz Lead-authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com> Co-authored-by: Micah Kornfield <emkornfield@gmail.com> Signed-off-by: Wes McKinney <wesm@apache.org>
Draft PR to enable round trip to TZ info to hopefully solve spark issues.