
Conversation

@emkornfield
Contributor

@emkornfield emkornfield commented Jul 1, 2020

  • Ports string_to_timezone to C++
  • Causes nested timestamp columns within structs that have timezones to go through
    object conversion
  • Copy timezone on to_object path.

Open to other suggestions on how to solve this. It still seems like this is inconsistent with nested timestamps in lists, which isn't great.
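For illustration, here is a rough Python sketch of what the object-conversion path does for a single timestamp value (the helper name and signature are made up for this example; the real logic lives in the C++ conversion code):

from datetime import datetime, timedelta

import pytz  # assumed available; pyarrow already relies on it for named zones


def timestamp_to_py_object(value_us, tz_string=None):
    """Hypothetical helper mirroring the object-conversion behaviour."""
    # Arrow stores timestamps as integer counts since the epoch; for
    # tz-aware columns that count is interpreted as UTC.
    naive = datetime(1970, 1, 1) + timedelta(microseconds=value_us)
    if tz_string is None:
        # tz-naive column: hand the value back unchanged, with no tzinfo.
        return naive
    # tz-aware column: treat the stored value as UTC and convert it to the
    # column's timezone, producing an aware datetime.
    # (fixed-offset strings like '+01:00' are ignored here for brevity)
    return pytz.timezone(tz_string).fromutc(naive)

For example, timestamp_to_py_object(0, 'America/New_York') gives an aware datetime for 1969-12-31 19:00 in that zone, while timestamp_to_py_object(0) gives the tz-naive 1970-01-01 00:00.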

@emkornfield
Contributor Author

CC @wesm @BryanCutler

Not sure if doing this correctly will break Spark in unintended ways.

@github-actions

github-actions bot commented Jul 1, 2020

(previously datetime64 was used).

Do correct timezone adjustment.
@emkornfield
Contributor Author

If the approach is agreeable, it would be nice to get this in before the next release. This potentially breaks List[Timestamp] because that now returns datetimes as well, but I think that is better for consistency.

@emkornfield
Contributor Author

emkornfield commented Jul 2, 2020

One other concern: for timezone-naive timestamps, I'm not sure if we should be adjusting the datetime to reflect UTC converted to the system time zone. Thoughts?

@jorisvandenbossche
Member

For timezone-naive timestamps, I'm not sure if we should be adjusting the datetime to reflect UTC converted to the system time zone. Thoughts?

I don't think we should do that (it would also be a big break). Tz-naive timestamps have no timezone and should therefore be left as is IMO (applications on top of Arrow can choose to use the "naive means system time zone" behaviour, but I would say that Arrow itself should never deal with the system time zone)

Member

@jorisvandenbossche jorisvandenbossche left a comment

The approach looks good to me (it's a bit unfortunate to have to write the python code that deals with the timezone in C++ ;), but nothing to do about that)

And this actually also solves the fact that right now, to_pandas(timestamp_as_object=True) drops the timezone information as well (I clearly forgot that case in the recent PR that added that keyword).

@jorisvandenbossche
Member

@github-actions crossbow submit test-conda-python-3.7-spark-master

@github-actions

github-actions bot commented Jul 2, 2020

Revision: d321a8d

Submitted crossbow builds: ursa-labs/crossbow @ actions-373

Task Status
test-conda-python-3.7-spark-master Github Actions

@jorisvandenbossche
Member

jorisvandenbossche commented Jul 2, 2020

There is one failure (which involves datetimes with timezones, but I can't directly tell whether it also involves structs):

======================================================================
FAIL: test_grouped_over_window_with_key (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 588, in test_grouped_over_window_with_key
    self.assertTrue(all([r[0] for r in result]))
AssertionError: False is not true

----------------------------------------------------------------------
Ran 21 tests in 141.194s

FAILED (failures=1)

https://github.com/apache/spark/blob/7fda184f0fc39613fb68e912c189c54b93c638e6/python/pyspark/sql/tests/test_pandas_grouped_map.py#L546-L574

@wesm wesm changed the title from "ARROW-9223: [Python] Propagate Timzone information in pandas conversion" to "ARROW-9223: [Python] Propagate timezone information in pandas conversion" on Jul 2, 2020
Member

@wesm wesm left a comment

I'm definitely in favor of undoing some of the hacks around this. Unfortunately we may need to write a manual parser for tz offsets to work around the std::regex issue.
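For reference, a minimal Python sketch of the kind of fixed-offset parsing such a manual parser would need to cover (illustrative only; the real implementation would be C++, and named zones like 'America/New_York' would still go through a timezone database lookup):

from datetime import timedelta


def parse_fixed_offset(tz_string):
    # Hypothetical example: handles only '+HH:MM' / '-HH:MM' strings and
    # does no validation.
    sign = 1 if tz_string[0] == '+' else -1
    hours, minutes = tz_string[1:].split(':')
    return sign * timedelta(hours=int(hours), minutes=int(minutes))

# e.g. parse_fixed_offset('-05:00') yields an offset of minus five hours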

@nealrichardson
Member

For timezone-naive timestamps, I'm not sure if we should be adjusting the datetime to reflect UTC converted to the system time zone. Thoughts?

I don't think we should do that (it would also be a big break). Tz-naive timestamps have no timezone and should therefore be left as is IMO (applications on top of Arrow can choose to use the "naive means system time zone" behaviour, but I would say that Arrow itself should never deal with the system time zone)

You've probably seen this, but this is what our docs say about timezone-naive data: http://arrow.apache.org/docs/cpp/api/datatype.html#_CPPv4N5arrow13TimestampTypeE

(Side note: do our doc URLs have to be so ugly?)

I agree with Joris that at this level, we should not do any localizing of timezone-naive data.

@emkornfield
Contributor Author

emkornfield commented Jul 2, 2020

For timezone-naive timestamps, I'm not sure if we should be adjusting the datetime to reflect UTC converted to the system time zone. Thoughts?

I don't think we should do that (it would also be a big break). Tz-naive timestamps have no timezone and should therefore be left as is IMO (applications on top of Arrow can choose to use the "naive means system time zone" behaviour, but I would say that Arrow itself should never deal with the system time zone)

You've probably seen this, but this is what our docs say about timezone-naive data: http://arrow.apache.org/docs/cpp/api/datatype.html#_CPPv4N5arrow13TimestampTypeE

(Side note: do our doc URLs have to be so ugly?)

I agree with Joris that at this level, we should not do any localizing of timezone-naive data.

Please see reply: #7604 (comment)

I haven't looked at the code deeply, but this might be an overall bug, or at least an idiosyncrasy, in our conversion to/from datetime (not accounting for the system timezone).

We can handle this as a separate issue though.

@jorisvandenbossche
Member

Moving the discussion at #7604 (comment) outside the inline thread here (which makes it easier to find).

It's still not really clear to me what the actual issue is at hand that we are discussing. What was the problem that started the discussion?

timezone naive datetimes reflect system timezone

Who does this?

Python's datetime.astimezone assumes this as of 3.6. Since datetime essentially becomes the "display" I would think most people would assume system timezone values.

But that is only for the astimezone method, which we don't use? (in this PR you now only use it once in the tests). And the display of datetime.datetime is not influenced by system timezone, AFAIK?

In this PR (and also already on master), we are using fromutc, which seems correct to me, since internally in Arrow tz-aware timestamps are stored as UTC.
And for tz-naive we don't need to handle any timezone issue, and can just convert it to a datetime.datetime object, which by default is also tz-naive and can thus be used as is.
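To make that distinction concrete, a small illustration (not code from this PR; 'Europe/Brussels' is just an example zone):

from datetime import datetime

import pytz

naive = datetime(2020, 7, 2, 12, 0)  # a stored-as-UTC value coming out of Arrow

# fromutc treats the naive value as UTC and converts it to the column's
# timezone, independent of the machine it runs on.
aware = pytz.timezone('Europe/Brussels').fromutc(naive)  # 2020-07-02 14:00:00+02:00

# astimezone() with no argument (Python 3.6+) instead assumes the naive value
# is in the *system* timezone, so the result depends on the local machine.
system_dependent = naive.astimezone()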

@emkornfield
Contributor Author

And for tz-naive we don't need to handle any timezone issue, and can just convert it to a datetime.datetime object, which by default is also tz-naive and can thus be used as is.

I think this is the potentially confusing part; it is quite possible it is just me. But if the timestamp is meant for display purposes, I would expect the tz-naive datetime value to reflect the local timezone, since it displays a "calendar"-like object. This isn't a clear-cut case; I'm just stating my expectations.

I think this might be a conversation to be had on the ML. To finish up the conversation: the reason for not always using the to_object path is that I don't want to potentially change the functionality of the pandas conversion to datetime.

@jorisvandenbossche
Member

But if the timestamp is meant for display purposes, I would expect the tz-naive datetime value to reflect the local timezone, since it displays a "calendar"-like object. This isn't a clear-cut case; I'm just stating my expectations.

OK, that's indeed what was causing the confusion ;)

So my view on this: the datetime.datetime object is also a tz-naive object, so we should "just" convert to the exact same tz-naive timestamp value. This is not only for display, but also an actual value to work with when converting the Struct type to native Python objects (e.g. dicts of datetime.datetime values). We already do such conversions right now, and we always preserve the actual timestamp value (and never adjust for the system timezone).

And for display: datetime.datetime does not adjust the displayed value based on the system timezone. Personally I like this behaviour, while you would prefer that it used the system timezone. But once we convert to datetime.datetime, it is its display that we use, so we don't really have control over it, although you may not like that.

@jorisvandenbossche
Member

the reason for not always using the to_object path is that I don't want to potentially change the functionality of the pandas conversion to datetime.

Pandas also never uses the system timezone for conversions or display, so by using the to_object path, we wouldn't change any functionality AFAIK.

Eg right now we have the following behaviour:

In [49]: arr = pa.array([0], type=pa.timestamp('us'))
    ...: arr2 = pa.StructArray.from_arrays([arr, arr], ['start', 'stop'])

In [50]: arr2   
Out[50]: 
<pyarrow.lib.StructArray object at 0x7f10edf7d7c8>
-- is_valid: all not null
-- child 0 type: timestamp[us]
  [
    1970-01-01 00:00:00.000000
  ]
-- child 1 type: timestamp[us]
  [
    1970-01-01 00:00:00.000000
  ]

In [52]: arr2.to_pandas()[0]    
Out[52]: 
{'start': datetime.datetime(1970, 1, 1, 0, 0),
 'stop': datetime.datetime(1970, 1, 1, 0, 0)}

where the tz-naive timestamps are converted to the same tz-naive datetime.datetime in the to_pandas conversion (without any adjustment for system timezone).
(and IMO we need to keep this behaviour)
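For contrast, with this PR a tz-aware nested timestamp would come back with its timezone attached, roughly like the following (an illustrative sketch, not verbatim output):

import pyarrow as pa

arr = pa.array([0], type=pa.timestamp('us', tz='UTC'))
arr2 = pa.StructArray.from_arrays([arr, arr], ['start', 'stop'])

arr2.to_pandas()[0]
# roughly: {'start': datetime.datetime(1970, 1, 1, 0, 0, tzinfo=<UTC>),
#           'stop': datetime.datetime(1970, 1, 1, 0, 0, tzinfo=<UTC>)}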

Member

@jorisvandenbossche jorisvandenbossche left a comment

Thanks for the update!
Added two suggestions for clarifying comments.

Further, this seems to expose a bug in Array.to_pandas(timestamp_as_object=True) when having tz-aware data (before, it silently dropped the timezone; now it errors). Fix + test:

diff --git a/python/pyarrow/array.pxi b/python/pyarrow/array.pxi
index 6abf4b70c..f36ba67a1 100644
--- a/python/pyarrow/array.pxi
+++ b/python/pyarrow/array.pxi
@@ -1174,7 +1174,9 @@ cdef _array_like_to_pandas(obj, options):
     result = pandas_api.series(arr, dtype=dtype, name=name)
 
     if (isinstance(original_type, TimestampType) and
-            original_type.tz is not None):
+            original_type.tz is not None and
+            # can be object dtype for non-ns and timestamp_as_object=True
+            result.dtype.kind == "M"):
         from pyarrow.pandas_compat import make_tz_aware
         result = make_tz_aware(result, original_type.tz)
 
diff --git a/python/pyarrow/tests/test_pandas.py b/python/pyarrow/tests/test_pandas.py
index 392a32210..4ca21dbe5 100644
--- a/python/pyarrow/tests/test_pandas.py
+++ b/python/pyarrow/tests/test_pandas.py
@@ -4010,19 +4010,23 @@ def test_timestamp_as_object_out_of_range():
 
 
 @pytest.mark.parametrize("resolution", ["s", "ms", "us"])
+@pytest.mark.parametrize("tz", [None, "America/New_York"])
 # One datetime outside nanosecond range, one inside nanosecond range:
 @pytest.mark.parametrize("dt", [datetime(1553, 1, 1), datetime(2020, 1, 1)])
-def test_timestamp_as_object_non_nanosecond(resolution, dt):
+def test_timestamp_as_object_non_nanosecond(resolution, tz, dt):
     # Timestamps can be converted Arrow and reloaded into Pandas with no loss
     # of information if the timestamp_as_object option is True.
-    arr = pa.array([dt], type=pa.timestamp(resolution))
-    result = arr.to_pandas(timestamp_as_object=True)
-    assert result.dtype == object
-    assert isinstance(result[0], datetime)
-    assert result[0] == dt
-
+    arr = pa.array([dt], type=pa.timestamp(resolution, tz=tz))
     table = pa.table({'a': arr})
-    result = table.to_pandas(timestamp_as_object=True)['a']
-    assert result.dtype == object
-    assert isinstance(result[0], datetime)
-    assert result[0] == dt
+
+    for result in [
+        arr.to_pandas(timestamp_as_object=True),
+        table.to_pandas(timestamp_as_object=True)['a']
+    ]:
+        assert result.dtype == object
+        assert isinstance(result[0], datetime)
+        if tz:
+            assert result[0].tzinfo is not None
+        else:
+            assert result[0].tzinfo is None
+        assert result[0] == dt

But I can also push that change to your branch, if that's fine with you.

@emkornfield
Contributor Author

But I can also push that change to your branch, if that's fine with you.

@jorisvandenbossche Yes please feel free to push to this branch.

@jorisvandenbossche
Member

@github-actions crossbow submit test-conda-python-3.7-spark-master

@github-actions

github-actions bot commented Jul 6, 2020

Revision: a550a76

Submitted crossbow builds: ursa-labs/crossbow @ actions-380

Task Status
test-conda-python-3.7-spark-master Github Actions

@BryanCutler
Member

BryanCutler commented Jul 6, 2020

Looks like the Spark test fails with

   File "/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 593, in f
    "{} != {}".format(expected_key[i][1], window_range)
AssertionError: 
{'start': datetime.datetime(2018, 3, 10, 0, 0), 'end': datetime.datetime(2018, 3, 15, 0, 0)} != 
{'start': datetime.datetime(2018, 3, 10, 0, 0, tzinfo=<StaticTzInfo 'Etc/UTC'>), 'end': datetime.datetime(2018, 3, 15, 0, 0, tzinfo=<StaticTzInfo 'Etc/UTC'>)}

@jorisvandenbossche
Member

Which I think is as expected (at least if the data in the test are tz-aware)? And thus the test can be updated on the Spark side?
Since we now actually return datetimes with tzinfo set, instead of losing the timezone.
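One possible shape of the Spark-side adjustment (purely hypothetical; the actual fix would live in Spark's test suite, and the zone name is taken from the failure output above):

from datetime import datetime

import pytz

# Build the expected window boundaries as tz-aware values so they match what
# pyarrow now returns for tz-aware nested timestamps.
utc = pytz.timezone('Etc/UTC')
expected_window = {
    'start': utc.localize(datetime(2018, 3, 10, 0, 0)),
    'end': utc.localize(datetime(2018, 3, 15, 0, 0)),
}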

@BryanCutler
Member

Yeah, it seems possible to fix on the Spark side, but I haven't been able to take a close look at this yet. I'll try to do that as soon as possible and report back.

@jorisvandenbossche
Member

If we agree we can "break" Spark's tests like this, @emkornfield can you maybe skip the specific test in the Spark integration tests, so that we notice if other tests start failing?

@wesm
Member

wesm commented Jul 9, 2020

ping @BryanCutler, would be good to merge this for the release if possible

@emkornfield
Contributor Author

@jorisvandenbossche Is there a way to skip specific tests? I thought all of them live in Spark code.

@BryanCutler
Member

@jorisvandenbossche Is there a way to skip specific tests? I thought all of them live in Spark code.

The script selects all Spark tests that use Arrow. I'm pretty sure all of them are being run and there was just the one failure.

The problem on the Spark side is that it isn't handling nested timestamps correctly either, so fixing that will be a step in the right direction. The user will end up seeing different results for nested timestamps, if upgrading, but they weren't officially supported anyway. As for Spark tests, I should be able to fix this for master and branch-3.0, not sure about branch-2.4 but maybe.

+1 on this change for me.

@jorisvandenbossche
Member

jorisvandenbossche commented Jul 10, 2020

Is there a way to skip specific tests? I thought all of them live in Spark code.

With pytest you can deselect a single test by name (that's also what we do with the dask integration tests), but PySpark seems to use a custom test runner (https://github.com/apache/spark/blob/master/python/run-tests.py), which is called from

spark_python_tests=(
"pyspark.sql.tests.test_arrow"
"pyspark.sql.tests.test_pandas_map"
"pyspark.sql.tests.test_pandas_cogrouped_map"
"pyspark.sql.tests.test_pandas_grouped_map"
"pyspark.sql.tests.test_pandas_udf"
"pyspark.sql.tests.test_pandas_udf_scalar"
"pyspark.sql.tests.test_pandas_udf_grouped_agg"
"pyspark.sql.tests.test_pandas_udf_window")
(echo "Testing PySpark:"; IFS=$'\n'; echo "${spark_python_tests[*]}")
python/run-tests --testnames "$(IFS=,; echo "${spark_python_tests[*]}")" --python-executables python
so no idea then. It doesn't seem to allow deselecting tests, only selecting them (https://github.com/apache/spark/blob/master/python/run-tests.py#L176-L216).

@BryanCutler
Member

There is no way to deselect specific tests, AFAIK. You would have to select all the other tests individually from that module, for example:

spark_python_tests=( 
   ...
   "pyspark.sql.tests.test_pandas_grouped_map GroupedMapInPandasTests.test_supported_types"  
   "pyspark.sql.tests.test_pandas_grouped_map GroupedMapInPandasTests.test_array_type_correct"
   etc...
)

@emkornfield
Contributor Author

There is no way to deselect specific tests, afaik. You would have select all the other tests individually from that module, for example

My preference, if it is OK with everyone reviewing this, would be to avoid that and try to get something merged into Spark master quickly (I can own watching the nightly Spark jobs until that is done, to verify there are no more failures).

@BryanCutler if you don't have the bandwidth to make the PR to spark, I can devote a little bit of effort to submitting one.

Thoughts?

@jorisvandenbossche
Member

We have been running integration tests for both Spark master and Spark's 3.0 branch lately. But yes, @BryanCutler indicated that a fix for those branches could be possible, so it's certainly fine with me to merge as is without skipping any tests.

Member

@wesm wesm left a comment

+1, I'm making the executive decision to merge this and hopefully that will be a forcing function to address any fallout =)

@wesm wesm closed this in f86c4db Jul 11, 2020
@BryanCutler
Member

@BryanCutler if you don't have the bandwidth to make the PR to spark, I can devote a little bit of effort to submitting one.

@emkornfield not sure if I'll have the time to get to it in the next couple of days, so if you're able to take it on, that would be great. I made https://issues.apache.org/jira/browse/SPARK-32285 for the issue - just leave a comment if you start working on it. BTW, the Spark integration tests are currently failing due to recent Arrow Java changes, and a patch will have to be made to fix that first. I should be able to get that done.

kszucs pushed a commit to kszucs/arrow that referenced this pull request Aug 10, 2020

Closes apache#7604 from emkornfield/datetime

Lead-authored-by: Micah Kornfield <emkornfield@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Wes McKinney <wesm@apache.org>