ARROW-9223: [Python] Propagate timezone information in pandas conversion #7604
Conversation
- Ports string_to_timezone to C++.
- Causes nested timestamp columns within structs to use the conversion-to-object path.
- Copies the timezone on the to_object path.

Open to other suggestions on how to solve this.
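Roughly, the behaviour this is aiming at (a hedged sketch, not code from the PR; the field name and values are illustrative):

import datetime
import pyarrow as pa

# A struct column whose field is a tz-aware timestamp.
ts = datetime.datetime(2020, 1, 1, 12, tzinfo=datetime.timezone.utc)
arr = pa.array([{"when": ts}],
               type=pa.struct([("when", pa.timestamp("us", tz="UTC"))]))

df = pa.table({"col": arr}).to_pandas()
# With this change the struct column goes through the object conversion path,
# so the nested value should come back as a tz-aware datetime.datetime
# instead of silently dropping the timezone.
print(df["col"][0]["when"].tzinfo)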
Not sure if doing this correctly will break Spark in unintentional ways.
(previously datetime64 was used). Do correct timezone adjustment.
If the approach is agreeable, it would be nice to get this in before the next release. This potentially breaks List[Timestamp] because that now returns datetimes as well, but I think that is better for consistency.
One other concern: for timezone-naive timestamps, I'm not sure if we should be adjusting the datetime to reflect UTC converted to the system time zone. Thoughts?
I don't think we should do that (it would also be a big break). Tz-naive timestamps have no timezone and should therefore be left as is, IMO (applications on top of Arrow can choose to use the "naive means system time zone" behaviour, but I would say that Arrow itself should never deal with the system time zone).
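As a minimal sketch of what "left as is" means here (assuming pyarrow; nothing below depends on the machine's time zone):

import datetime
import pyarrow as pa

naive = datetime.datetime(2020, 1, 1, 12, 30)
arr = pa.array([naive], type=pa.timestamp("us"))

# The tz-naive value round-trips unchanged; Arrow does not localize it
# to the system time zone.
assert arr.to_pylist() == [naive]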
jorisvandenbossche
left a comment
The approach looks good to me (it's a bit unfortunate to have to write the Python code that deals with the timezone in C++ ;), but nothing to do about that).
And this actually also solves the fact that right now, to_pandas(timestamp_as_object=True) drops the timezone information as well (I clearly forgot that case in the recent PR that added that keyword).
@github-actions crossbow submit test-conda-python-3.7-spark-master
Revision: d321a8d
Submitted crossbow builds: ursa-labs/crossbow @ actions-373
There is one failure (which involves datetimes with timezones, but I can't directly tell whether it also involves structs):
wesm
left a comment
I'm definitely in favor of undoing some of the hacks around this. Unfortunately, we may need to write a manual parser for tz offsets to work around the std::regex issue.
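For illustration only, the shape of such a regex-free offset parse, sketched in Python rather than the actual C++ (the function name and error handling are made up here):

import datetime

def parse_fixed_offset(tz_string):
    # Accept strings of the form "+HH:MM" / "-HH:MM"; return None for
    # anything else (e.g. an IANA name like "America/New_York").
    if len(tz_string) != 6 or tz_string[0] not in "+-" or tz_string[3] != ":":
        return None
    try:
        hours = int(tz_string[1:3])
        minutes = int(tz_string[4:6])
    except ValueError:
        return None
    sign = 1 if tz_string[0] == "+" else -1
    return datetime.timezone(sign * datetime.timedelta(hours=hours, minutes=minutes))

# parse_fixed_offset("+05:30") -> UTC+05:30; parse_fixed_offset("Europe/Paris") -> None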
You've probably seen this, but this is what our docs say about timezone-naive data: http://arrow.apache.org/docs/cpp/api/datatype.html#_CPPv4N5arrow13TimestampTypeE (Side note: do our doc URLs have to be so ugly?) I agree with Joris that at this level, we should not do any localizing of timezone-naive data.
Please see reply: #7604 (comment). I haven't looked at the code deeply, but this might be an overall bug, or at least an idiosyncrasy, with our conversion to/from datetime (not accounting for the system timezone). We can handle this as a separate issue, though.
Moving the discussion from #7604 (comment) out of the inline thread to here (which makes it easier to find). It's still not really clear to me what the actual issue is that we are discussing. What was the problem that started the discussion?
But that is only for the … In this PR (and also already on master), we are using …
I think this is the potentially confusing part (it is quite possible it is just me). But if the timestamp is meant for display purposes, I would expect the tz-naive datetime value to reflect the local timezone, since it displays a "calendar"-like object. This isn't a clear-cut case; I'm just stating my expectations. I think this might be a conversation to be had on the ML. To finish up the conversation: the reason for not always using the to_object path is that I don't want to potentially change the functionality of pandas conversion to datetime.
OK, that's indeed what was causing the confusion ;) So my view on this: the datetime.datetime object is also a tz-naive object, so we should "just" convert to the exact same tz-naive timestamp value. This is not only for display, but an actual value to work with when converting the Struct type to native Python objects (e.g. dicts of datetime.datetime values). We already do such conversions right now, and we always preserve the actual timestamp value (and never adjust for the system timezone). And for display: datetime.datetime does not adjust the displayed value for the system timezone. Personally I like this behaviour, while you would prefer that it used the system timezone. But once we convert to datetime.datetime, it's their display that we use, so we don't really have control over it, although you may not like that.
Pandas also never uses the system timezone for conversions or display, so by using the to_object path we wouldn't change any functionality, AFAIK. E.g. right now we have the following behaviour, where the tz-naive timestamps are converted to the same tz-naive datetime.datetime in the to_pandas conversion (without any adjustment for the system timezone):
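(The example in the original comment appears to have been lost in formatting; a minimal sketch of the behaviour being described, assuming pyarrow and pandas:)

import datetime
import pyarrow as pa

naive = datetime.datetime(2020, 1, 1, 12, 30)
arr = pa.array([{"when": naive}],
               type=pa.struct([("when", pa.timestamp("us"))]))

result = pa.table({"col": arr}).to_pandas()["col"]
# Object column of dicts; the tz-naive value comes back exactly as stored,
# with no adjustment for the system timezone.
assert result[0]["when"] == naive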
jorisvandenbossche
left a comment
Thanks for the update!
Added two suggestions for clarifying comments.
Further, this seems to expose a bug in Array.to_pandas(timestamp_as_object=True) with tz-aware data (before, that silently dropped the timezone; now it errors). Fix + test:
diff --git a/python/pyarrow/array.pxi b/python/pyarrow/array.pxi
index 6abf4b70c..f36ba67a1 100644
--- a/python/pyarrow/array.pxi
+++ b/python/pyarrow/array.pxi
@@ -1174,7 +1174,9 @@ cdef _array_like_to_pandas(obj, options):
result = pandas_api.series(arr, dtype=dtype, name=name)
if (isinstance(original_type, TimestampType) and
- original_type.tz is not None):
+ original_type.tz is not None and
+ # can be object dtype for non-ns and timestamp_as_object=True
+ result.dtype.kind == "M"):
from pyarrow.pandas_compat import make_tz_aware
result = make_tz_aware(result, original_type.tz)
diff --git a/python/pyarrow/tests/test_pandas.py b/python/pyarrow/tests/test_pandas.py
index 392a32210..4ca21dbe5 100644
--- a/python/pyarrow/tests/test_pandas.py
+++ b/python/pyarrow/tests/test_pandas.py
@@ -4010,19 +4010,23 @@ def test_timestamp_as_object_out_of_range():
@pytest.mark.parametrize("resolution", ["s", "ms", "us"])
+@pytest.mark.parametrize("tz", [None, "America/New_York"])
# One datetime outside nanosecond range, one inside nanosecond range:
@pytest.mark.parametrize("dt", [datetime(1553, 1, 1), datetime(2020, 1, 1)])
-def test_timestamp_as_object_non_nanosecond(resolution, dt):
+def test_timestamp_as_object_non_nanosecond(resolution, tz, dt):
# Timestamps can be converted Arrow and reloaded into Pandas with no loss
# of information if the timestamp_as_object option is True.
- arr = pa.array([dt], type=pa.timestamp(resolution))
- result = arr.to_pandas(timestamp_as_object=True)
- assert result.dtype == object
- assert isinstance(result[0], datetime)
- assert result[0] == dt
-
+ arr = pa.array([dt], type=pa.timestamp(resolution, tz=tz))
table = pa.table({'a': arr})
- result = table.to_pandas(timestamp_as_object=True)['a']
- assert result.dtype == object
- assert isinstance(result[0], datetime)
- assert result[0] == dt
+
+ for result in [
+ arr.to_pandas(timestamp_as_object=True),
+ table.to_pandas(timestamp_as_object=True)['a']
+ ]:
+ assert result.dtype == object
+ assert isinstance(result[0], datetime)
+ if tz:
+ assert result[0].tzinfo is not None
+ else:
+ assert result[0].tzinfo is None
+ assert result[0] == dt

But I can also push that change to your branch, if that's fine with you.
@jorisvandenbossche Yes, please feel free to push to this branch.
@github-actions crossbow submit test-conda-python-3.7-spark-master
Revision: a550a76
Submitted crossbow builds: ursa-labs/crossbow @ actions-380
Looks like the Spark test fails with …
Which I think is as expected? (At least if the data in the test are tz-aware.) And thus the test can be updated on the Spark side?
Yeah, it seems possible to fix on the Spark side, but I haven't been able to take a close look at this yet. I'll try to do that as soon as possible and report back.
If we agree we can "break" Spark's tests like this, @emkornfield can you maybe skip the specific test in the Spark integration tests, so that we notice if other tests start failing?
Ping @BryanCutler, it would be good to merge this for the release if possible.
@jorisvandenbossche Is there a way to skip specific tests? I thought all of them live in the Spark code.
The script selects all Spark tests that use Arrow. I'm pretty sure all of them are being run and there was just the one failure. The problem on the Spark side is that it isn't handling nested timestamps correctly either, so fixing that will be a step in the right direction. The user will end up seeing different results for nested timestamps if upgrading, but they weren't officially supported anyway. As for the Spark tests, I should be able to fix this for master and branch-3.0; not sure about branch-2.4, but maybe. +1 on this change from me.
With pytest you can deselect a single test by name (that's also what we do with the dask integration tests), but pyspark seems to use a custom test runner (https://github.com/apache/spark/blob/master/python/run-tests.py, called from arrow/ci/scripts/integration_spark.sh, lines 56 to 67 at 2596b34).
There is no way to deselect specific tests, AFAIK. You would have to select all the other tests individually from that module, for example …
My preference, if it is OK with everyone reviewing this, would be to avoid skipping the test and instead try to get something merged into Spark master quickly (I can own watching the nightly Spark jobs until that is done, to verify there are no more failures). @BryanCutler, if you don't have the bandwidth to make the PR to Spark, I can devote a little bit of effort to submitting one. Thoughts?
Lately we have been running integration tests for both Spark master and Spark's 3.0 branch. But yes, @BryanCutler indicated that a fix for those branches could be possible, so then it's certainly fine for me to merge as is without skipping any tests.
wesm
left a comment
+1, I'm making the executive decision to merge this and hopefully that will be a forcing function to address any fallout =)
@emkornfield Not sure if I'll have the time to get to it in the next couple of days, so if you're able to take it on, that would be great. I made https://issues.apache.org/jira/browse/SPARK-32285 for the issue - just leave a comment if you start work on it. BTW, the Spark integration tests are currently failing due to recent Arrow Java changes, and a patch will have to be made to fix that first. I should be able to get that done.
- Ports string_to_timezone to C++
- Causes nested timestamp columns within structs that have timezones to go through object conversion
- Copy timezone on to_object path.

Open to other suggestions on how to solve this. It still seems like this is inconsistent with nested timestamps in lists, which isn't great.

Closes apache#7604 from emkornfield/datetime

Lead-authored-by: Micah Kornfield <emkornfield@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Wes McKinney <wesm@apache.org>