
Conversation

@emkornfield
Contributor

@emkornfield emkornfield commented Jul 1, 2020

  • Ports string_to_timezone to C++
  • Causes nested timestamp columns within structs that have timezones to go through
    object conversion
  • Copy timezone on to_object path.

Open to other suggestions on how to solve this. It still seems like this is inconsistent with nested timestamps in lists, which isn't great.
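For illustration, here is a rough Python sketch of what the object-conversion path does for a single timestamp value (the helper name and signature are made up for this example; the real logic lives in the C++ conversion code):

from datetime import datetime, timedelta

import pytz  # assumed available; pyarrow already relies on it for named zones


def timestamp_to_py_object(value_us, tz_string=None):
    """Hypothetical helper mirroring the object-conversion behaviour."""
    # Arrow stores timestamps as integer counts since the epoch; for
    # tz-aware columns that count is interpreted as UTC.
    naive = datetime(1970, 1, 1) + timedelta(microseconds=value_us)
    if tz_string is None:
        # tz-naive column: hand the value back unchanged, with no tzinfo.
        return naive
    # tz-aware column: treat the stored value as UTC and convert it to the
    # column's timezone, producing an aware datetime.
    # (fixed-offset strings like '+01:00' are ignored here for brevity)
    return pytz.timezone(tz_string).fromutc(naive)

For example, timestamp_to_py_object(0, 'America/New_York') gives an aware datetime for 1969-12-31 19:00 in that zone, while timestamp_to_py_object(0) gives the tz-naive 1970-01-01 00:00.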

@emkornfield
Contributor Author

CC @wesm @BryanCutler

Not sure if doing this correctly will break Spark in unintended ways.

@github-actions

github-actions bot commented Jul 1, 2020

(previously datetime64 was used).

Do correct timezone adjustment.
@emkornfield
Contributor Author

If the approach is agreeable, it would be nice to get this in before the next release. This potentially breaks List[Timestamp] because that now returns datetimes as well, but I think that is better for consistency.

@emkornfield
Contributor Author

emkornfield commented Jul 2, 2020

One other concern: for timezone-naive timestamps, I'm not sure if we should be adjusting the datetime to reflect UTC converted to the system time zone. Thoughts?

@jorisvandenbossche
Member

For timezone-naive timestamps, I'm not sure if we should be adjusting the datetime to reflect UTC converted to the system time zone. Thoughts?

I don't think we should do that (it would also be a big break). Tz-naive timestamps have no timezone and should therefore be left as is IMO (applications on top of Arrow can choose to use the "naive means system time zone" behaviour, but I would say that Arrow itself should never deal with the system time zone)

Member

@jorisvandenbossche jorisvandenbossche left a comment

The approach looks good to me (it's a bit unfortunate to have to write the python code that deals with the timezone in C++ ;), but nothing to do about that)

And this actually also solves the fact that right now, to_pandas(timestamp_as_object=True) drops the timezone information as well (I clearly forgot that case in the recent PR that added that keyword).

@jorisvandenbossche
Member

@github-actions crossbow submit test-conda-python-3.7-spark-master

@github-actions

github-actions bot commented Jul 2, 2020

Revision: d321a8d

Submitted crossbow builds: ursa-labs/crossbow @ actions-373

Task Status
test-conda-python-3.7-spark-master Github Actions

@jorisvandenbossche
Member

jorisvandenbossche commented Jul 2, 2020

There is one failure (which involves datetimes with timezones, but I can't directly tell whether it also involves structs):

======================================================================
FAIL: test_grouped_over_window_with_key (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 588, in test_grouped_over_window_with_key
    self.assertTrue(all([r[0] for r in result]))
AssertionError: False is not true

----------------------------------------------------------------------
Ran 21 tests in 141.194s

FAILED (failures=1)

https://github.com/apache/spark/blob/7fda184f0fc39613fb68e912c189c54b93c638e6/python/pyspark/sql/tests/test_pandas_grouped_map.py#L546-L574

@wesm wesm changed the title from "ARROW-9223: [Python] Propagate Timzone information in pandas conversion" to "ARROW-9223: [Python] Propagate timezone information in pandas conversion" on Jul 2, 2020
Member

@wesm wesm left a comment

I'm definitely in favor of undoing some of the hacks around this. Unfortunately we may need to write a manual parser for tz offsets to work around the std::regex issue.
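For reference, a minimal Python sketch of the kind of fixed-offset parsing such a manual parser would need to cover (illustrative only; the real implementation would be C++, and named zones like 'America/New_York' would still go through a timezone database lookup):

from datetime import timedelta


def parse_fixed_offset(tz_string):
    # Hypothetical example: handles only '+HH:MM' / '-HH:MM' strings and
    # does no validation.
    sign = 1 if tz_string[0] == '+' else -1
    hours, minutes = tz_string[1:].split(':')
    return sign * timedelta(hours=int(hours), minutes=int(minutes))

# e.g. parse_fixed_offset('-05:00') yields an offset of minus five hours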

@nealrichardson
Member

For timezone-naive timestamps, I'm not sure if we should be adjusting the datetime to reflect UTC converted to the system time zone. Thoughts?

I don't think we should do that (it would also be a big break). Tz-naive timestamps have no timezone and should therefore be left as is IMO (applications on top of Arrow can choose to use the "naive means system time zone" behaviour, but I would say that Arrow itself should never deal with the system time zone)

You've probably seen this, but this is what our docs say about timezone-naive data: http://arrow.apache.org/docs/cpp/api/datatype.html#_CPPv4N5arrow13TimestampTypeE

(Side note: do our doc URLs have to be so ugly?)

I agree with Joris that at this level, we should not do any localizing of timezone-naive data.

@emkornfield
Contributor Author

emkornfield commented Jul 2, 2020

For timezone-naive timestamps, I'm not sure if we should be adjusting the datetime to reflect UTC converted to the system time zone. Thoughts?

I don't think we should do that (it would also be a big break). Tz-naive timestamps have no timezone and should therefore be left as is IMO (applications on top of Arrow can choose to use the "naive means system time zone" behaviour, but I would say that Arrow itself should never deal with the system time zone)

You've probably seen this, but this is what our docs say about timezone-naive data: http://arrow.apache.org/docs/cpp/api/datatype.html#_CPPv4N5arrow13TimestampTypeE

(Side note: do our doc URLs have to be so ugly?)

I agree with Joris that at this level, we should not do any localizing of timezone-naive data.

Please see reply: #7604 (comment)

I haven't looked at the code deeply, but this might be an overall bug, or at least an idiosyncrasy, in our conversion to/from datetime (not accounting for the system timezone).

We can handle this as a separate issue though.

@jorisvandenbossche
Member

Moving the discussion at #7604 (comment) outside the inline thread here (which makes it easier to find).

It's still not really clear to me what the actual issue is at hand that we are discussing. What was the problem that started the discussion?

timezone naive datetimes reflect system timezone

Who does this?

Python's datetime.astimezone assumes this as of 3.6. Since datetime essentially becomes the "display" I would think most people would assume system timezone values.

But that is only for the astimezone method, which we don't use? (in this PR you now only use it once in the tests). And the display of datetime.datetime is not influenced by system timezone, AFAIK?

In this PR (and also already on master), we are using fromutc, which seems correct to me, since internally in Arrow tz-aware timestamps are stored as UTC.
And for tz-naive we don't need to handle any timezone issue, and can just convert it to a datetime.datetime object, which by default is also tz-naive and can thus be used as is.
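To make that distinction concrete, a small illustration (not code from this PR; 'Europe/Brussels' is just an example zone):

from datetime import datetime

import pytz

naive = datetime(2020, 7, 2, 12, 0)  # a stored-as-UTC value coming out of Arrow

# fromutc treats the naive value as UTC and converts it to the column's
# timezone, independent of the machine it runs on.
aware = pytz.timezone('Europe/Brussels').fromutc(naive)  # 2020-07-02 14:00:00+02:00

# astimezone() with no argument (Python 3.6+) instead assumes the naive value
# is in the *system* timezone, so the result depends on the local machine.
system_dependent = naive.astimezone()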

@emkornfield
Contributor Author

And for tz-naive we don't need to handle any timezone issue, and can just convert it to a datetime.datetime object, which by default is also tz-naive and can thus be used as is.

I think this is the potentially confusing part; it is quite possible it is just me. But if the timestamp is meant for display purposes, I would expect the tz-naive datetime value to reflect the local timezone, since it displays a "calendar"-like object. This isn't a clear-cut case; I'm just stating my expectations.

I think this might be a conversation to be had on the ML. To finish up the conversation: the reason for not always using the to_object path is that I don't want to potentially change the functionality of the pandas conversion to datetime.

@jorisvandenbossche
Member

But if the timestamp is meant for display purposes, I would expect the tz-naive datetime value to reflect the local timezone, since it displays a "calendar"-like object. This isn't a clear-cut case; I'm just stating my expectations.

OK, that's indeed what was causing the confusion ;)

So my view on this: the datetime.datetime object is also a tz-naive object, so we should "just" convert to the exact same tz-naive timestamp value. This is not only for display, but also an actual value to work with when converting the Struct type to native Python objects (e.g. dicts of datetime.datetime values). We already do such conversions right now, and we always preserve the actual timestamp value (and never adjust for the system timezone).

And for display: datetime.datetime does not adjust the displayed value based on the system timezone. Personally I like this behaviour, while you would prefer that it used the system timezone. But once we convert to datetime.datetime, it is its display that we use, so we don't really have control over it, although you may not like that.

@jorisvandenbossche
Member

the reason for not always using the to_object path is that I don't want to potentially change the functionality of the pandas conversion to datetime.

Pandas also never uses the system timezone for conversions or display, so by using the to_object path, we wouldn't change any functionality AFAIK.

Eg right now we have the following behaviour:

In [49]: arr = pa.array([0], type=pa.timestamp('us'))
    ...: arr2 = pa.StructArray.from_arrays([arr, arr], ['start', 'stop'])

In [50]: arr2   
Out[50]: 
<pyarrow.lib.StructArray object at 0x7f10edf7d7c8>
-- is_valid: all not null
-- child 0 type: timestamp[us]
  [
    1970-01-01 00:00:00.000000
  ]
-- child 1 type: timestamp[us]
  [
    1970-01-01 00:00:00.000000
  ]

In [52]: arr2.to_pandas()[0]    
Out[52]: 
{'start': datetime.datetime(1970, 1, 1, 0, 0),
 'stop': datetime.datetime(1970, 1, 1, 0, 0)}

where the tz-naive timestamps are converted to the same tz-naive datetime.datetime in the to_pandas conversion (without any adjustment for system timezone).
(and IMO we need to keep this behaviour)
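For contrast, with this PR a tz-aware nested timestamp would come back with its timezone attached, roughly like the following (an illustrative sketch, not verbatim output):

import pyarrow as pa

arr = pa.array([0], type=pa.timestamp('us', tz='UTC'))
arr2 = pa.StructArray.from_arrays([arr, arr], ['start', 'stop'])

arr2.to_pandas()[0]
# roughly: {'start': datetime.datetime(1970, 1, 1, 0, 0, tzinfo=<UTC>),
#           'stop': datetime.datetime(1970, 1, 1, 0, 0, tzinfo=<UTC>)}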

Member

@jorisvandenbossche jorisvandenbossche left a comment

Thanks for the update!
Added two suggestions for clarifying comments.

Further, this seems to expose a bug in Array.to_pandas(timestamp_as_object=True) when having tz-aware data (before, it silently dropped the timezone; now it errors). Fix + test:

diff --git a/python/pyarrow/array.pxi b/python/pyarrow/array.pxi
index 6abf4b70c..f36ba67a1 100644
--- a/python/pyarrow/array.pxi
+++ b/python/pyarrow/array.pxi
@@ -1174,7 +1174,9 @@ cdef _array_like_to_pandas(obj, options):
     result = pandas_api.series(arr, dtype=dtype, name=name)
 
     if (isinstance(original_type, TimestampType) and
-            original_type.tz is not None):
+            original_type.tz is not None and
+            # can be object dtype for non-ns and timestamp_as_object=True
+            result.dtype.kind == "M"):
         from pyarrow.pandas_compat import make_tz_aware
         result = make_tz_aware(result, original_type.tz)
 
diff --git a/python/pyarrow/tests/test_pandas.py b/python/pyarrow/tests/test_pandas.py
index 392a32210..4ca21dbe5 100644
--- a/python/pyarrow/tests/test_pandas.py
+++ b/python/pyarrow/tests/test_pandas.py
@@ -4010,19 +4010,23 @@ def test_timestamp_as_object_out_of_range():
 
 
 @pytest.mark.parametrize("resolution", ["s", "ms", "us"])
+@pytest.mark.parametrize("tz", [None, "America/New_York"])
 # One datetime outside nanosecond range, one inside nanosecond range:
 @pytest.mark.parametrize("dt", [datetime(1553, 1, 1), datetime(2020, 1, 1)])
-def test_timestamp_as_object_non_nanosecond(resolution, dt):
+def test_timestamp_as_object_non_nanosecond(resolution, tz, dt):
     # Timestamps can be converted Arrow and reloaded into Pandas with no loss
     # of information if the timestamp_as_object option is True.
-    arr = pa.array([dt], type=pa.timestamp(resolution))
-    result = arr.to_pandas(timestamp_as_object=True)
-    assert result.dtype == object
-    assert isinstance(result[0], datetime)
-    assert result[0] == dt
-
+    arr = pa.array([dt], type=pa.timestamp(resolution, tz=tz))
     table = pa.table({'a': arr})
-    result = table.to_pandas(timestamp_as_object=True)['a']
-    assert result.dtype == object
-    assert isinstance(result[0], datetime)
-    assert result[0] == dt
+
+    for result in [
+        arr.to_pandas(timestamp_as_object=True),
+        table.to_pandas(timestamp_as_object=True)['a']
+    ]:
+        assert result.dtype == object
+        assert isinstance(result[0], datetime)
+        if tz:
+            assert result[0].tzinfo is not None
+        else:
+            assert result[0].tzinfo is None
+        assert result[0] == dt

But I can also push that change to your branch, if that's fine with you.

@emkornfield
Contributor Author

But I can also push that change to your branch, if that's fine with you.

@jorisvandenbossche Yes please feel free to push to this branch.

@jorisvandenbossche
Member

@github-actions crossbow submit test-conda-python-3.7-spark-master

@github-actions

github-actions bot commented Jul 6, 2020

Revision: a550a76

Submitted crossbow builds: ursa-labs/crossbow @ actions-380

Task Status
test-conda-python-3.7-spark-master Github Actions

@BryanCutler
Member

BryanCutler commented Jul 6, 2020

Looks like the Spark test fails with

   File "/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 593, in f
    "{} != {}".format(expected_key[i][1], window_range)
AssertionError: 
{'start': datetime.datetime(2018, 3, 10, 0, 0), 'end': datetime.datetime(2018, 3, 15, 0, 0)} != 
{'start': datetime.datetime(2018, 3, 10, 0, 0, tzinfo=<StaticTzInfo 'Etc/UTC'>), 'end': datetime.datetime(2018, 3, 15, 0, 0, tzinfo=<StaticTzInfo 'Etc/UTC'>)}

@jorisvandenbossche
Member

Which I think is as expected (at least if the data in the test are tz-aware)? And thus the test can be updated on the Spark side?
Since we now actually return datetimes with tzinfo set, instead of losing the timezone.
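One possible shape of the Spark-side adjustment (purely hypothetical; the actual fix would live in Spark's test suite, and the zone name is taken from the failure output above):

from datetime import datetime

import pytz

# Build the expected window boundaries as tz-aware values so they match what
# pyarrow now returns for tz-aware nested timestamps.
utc = pytz.timezone('Etc/UTC')
expected_window = {
    'start': utc.localize(datetime(2018, 3, 10, 0, 0)),
    'end': utc.localize(datetime(2018, 3, 15, 0, 0)),
}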

@BryanCutler
Member

Yeah, it seems possible to fix on the Spark side, but I haven't been able to take a close look at this yet. I'll try to do that as soon as possible and report back.

@jorisvandenbossche
Member

If we agree we can "break" Spark's tests like this, @emkornfield can you maybe skip the specific test in the Spark integration tests, so that we notice if other tests start failing?

@wesm
Member

wesm commented Jul 9, 2020

ping @BryanCutler, would be good to merge this for the release if possible

@emkornfield
Contributor Author

@jorisvandenbossche Is there a way to skip specific tests? I thought all of them live in Spark code.

@BryanCutler
Member

@jorisvandenbossche Is there a way to skip specific tests? I thought all of them live in Spark code.

The script selects all Spark tests that use Arrow. I'm pretty sure all of them are being run and there was just the one failure.

The problem on the Spark side is that it isn't handling nested timestamps correctly either, so fixing that will be a step in the right direction. The user will end up seeing different results for nested timestamps, if upgrading, but they weren't officially supported anyway. As for Spark tests, I should be able to fix this for master and branch-3.0, not sure about branch-2.4 but maybe.

+1 on this change for me.

@jorisvandenbossche
Member

jorisvandenbossche commented Jul 10, 2020

Is there a way to skip specific tests? I thought all of them live in Spark code.

With pytest you can deselect a single test by name (that's also what we do with the dask integration tests), but PySpark seems to use a custom test runner (https://github.com/apache/spark/blob/master/python/run-tests.py), which is called from

spark_python_tests=(
"pyspark.sql.tests.test_arrow"
"pyspark.sql.tests.test_pandas_map"
"pyspark.sql.tests.test_pandas_cogrouped_map"
"pyspark.sql.tests.test_pandas_grouped_map"
"pyspark.sql.tests.test_pandas_udf"
"pyspark.sql.tests.test_pandas_udf_scalar"
"pyspark.sql.tests.test_pandas_udf_grouped_agg"
"pyspark.sql.tests.test_pandas_udf_window")
(echo "Testing PySpark:"; IFS=$'\n'; echo "${spark_python_tests[*]}")
python/run-tests --testnames "$(IFS=,; echo "${spark_python_tests[*]}")" --python-executables python
so no idea then. It doesn't seem to allow deselecting tests, only selecting them (https://github.com/apache/spark/blob/master/python/run-tests.py#L176-L216).

@BryanCutler
Member

There is no way to deselect specific tests, AFAIK. You would have to select all the other tests individually from that module, for example:

spark_python_tests=( 
   ...
   "pyspark.sql.tests.test_pandas_grouped_map GroupedMapInPandasTests.test_supported_types"  
   "pyspark.sql.tests.test_pandas_grouped_map GroupedMapInPandasTests.test_array_type_correct"
   etc...
)

@emkornfield
Contributor Author

There is no way to deselect specific tests, afaik. You would have select all the other tests individually from that module, for example

My preference, if it is OK with everyone reviewing this, would be to avoid that and try to get something merged into Spark master quickly (I can own watching the nightly Spark jobs until that is done, to verify there are no more failures).

@BryanCutler if you don't have the bandwidth to make the PR to spark, I can devote a little bit of effort to submitting one.

Thoughts?

@jorisvandenbossche
Member

We have been running integration tests for both Spark master and Spark's 3.0 branch lately. But yes, @BryanCutler indicated that a fix for those branches could be possible, so it's certainly fine with me to merge as is without skipping any tests.

Member

@wesm wesm left a comment

+1, I'm making the executive decision to merge this and hopefully that will be a forcing function to address any fallout =)

@wesm wesm closed this in f86c4db Jul 11, 2020
@BryanCutler
Member

@BryanCutler if you don't have the bandwidth to make the PR to spark, I can devote a little bit of effort to submitting one.

@emkornfield not sure if I'll have the time to get to it in the next couple of days, so if you're able to take it on, that would be great. I made https://issues.apache.org/jira/browse/SPARK-32285 for the issue - just leave a comment if you start working on it. BTW, the Spark integration tests are currently failing due to recent Arrow Java changes, and a patch will have to be made to fix that first. I should be able to get that done.

kszucs pushed a commit to kszucs/arrow that referenced this pull request Aug 10, 2020

Closes apache#7604 from emkornfield/datetime

Lead-authored-by: Micah Kornfield <emkornfield@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Wes McKinney <wesm@apache.org>