Conversation

@emkornfield
Contributor

Draft PR to enable round-tripping of tz info, to hopefully solve the Spark issues.

@emkornfield
Contributor Author

@github-actions crossbow submit test-conda-python-3.7-spark-master

@emkornfield emkornfield changed the title Honor tzinfo when converting from datetime ARROW-9528: [Python] Honor tzinfo when converting from datetime Jul 20, 2020
@emkornfield
Contributor Author

@github-actions crossbow submit test-conda-python-3.7-spark-master

@github-actions

} else if (PyDateTime_Check(obj)) {
++timestamp_micro_count_;
OwnedRef tzinfo(PyObject_GetAttrString(obj, "tzinfo"));
if (tzinfo.obj() != nullptr && tzinfo.obj() != Py_None && timezone_.empty()) {
Contributor Author

timezone_ should be first here.

def as_py(self):
"""
Return this value as a Python datetime.datetime instance.
Return this value as a Pandas Timestamp instance (if available),
Contributor Author

this needs to be reverted.

@kou
Member

kou commented Jul 20, 2020

@github-actions crossbow submit test-conda-python-3.8-spark-master

@github-actions

Revision: a5b2a51

Submitted crossbow builds: ursa-labs/crossbow @ actions-433

Task Status
test-conda-python-3.8-spark-master Github Actions

@kou
Member

kou commented Jul 20, 2020

We can only use tasks listed in https://github.com/apache/arrow/blob/master/dev/tasks/tasks.yml#L1930 for crossbow submit.
We don't have a ...-3.7-... task; we only have a ...-3.8-... task for Spark master.

If we need to run with Python 3.7, we can add a task to dev/tasks/tasks.yml.

@emkornfield
Contributor Author

@kou 3.8 should be fine. Thank you! I copied the command from the previous PR related to datetimes; I guess CI has changed since then.

++int_count_;
} else if (PyDateTime_Check(obj)) {
++timestamp_micro_count_;
OwnedRef tzinfo(PyObject_GetAttrString(obj, "tzinfo"));
Member

If null is returned, it means Python raised an error (for example the attribute doesn't exist, which is unlikely). You want either to return that error, or to ignore it (using PyErr_Clear).

def test_nested_with_timestamp_tz_round_trip():
ts = pd.Timestamp.now()
ts_dt = ts.to_pydatetime()
arr = pa.array([ts_dt], type=pa.timestamp('us', tz='America/New_York'))
Member

Does this test presume that ts itself was produced in "America/New York" timezone? It's not clear to me.

Contributor Author

I don't believe so. I'm pretty sure my machine uses local time. I'll double-check by setting a different tz.

if (tzinfo.obj() != nullptr && tzinfo.obj() != Py_None && timezone_.empty()) {
// From public docs on array construction
// "Localized timestamps will currently be returned as UTC "
// representation). "
Member

Does this mean Arrow simply stores an erroneous value? We don't do timezone conversion in Arrow, right?

Contributor Author

I'm actually not sure what it was intended to mean. This was my best guess. Not accounting for timezone info seems like a bug? Would you prefer I try to capture the first time zone encountered as a string? Or is the preference not to have the logic in this PR in Arrow in the first place?

Member

@pitrou pitrou Jul 20, 2020

If I reuse the nomenclature from the test, here's what I get without this PR (it's 14:21 UTC currently):

>>> now_utc                                                                                                     
datetime.datetime(2020, 7, 20, 14, 21, 42, 96119, tzinfo=<UTC>)
>>> arr = pa.array([now_utc])                                                                                   
>>> arr                                                                                                         
<pyarrow.lib.TimestampArray object at 0x7f44b0bfcc90>
[
  2020-07-20 14:21:42.096119
]
>>> arr.type.tz                                                                                                 
>>> arr.to_pylist()                                                                                             
[datetime.datetime(2020, 7, 20, 14, 21, 42, 96119)]
>>> arr.to_pandas()                                                                                             
0   2020-07-20 14:21:42.096119
dtype: datetime64[ns]
>>> now_with_tz                                                                                                           
datetime.datetime(2020, 7, 20, 10, 21, 42, 96119, tzinfo=<DstTzInfo 'US/Eastern' EDT-1 day, 20:00:00 DST>)
>>> arr = pa.array([now_with_tz])                                                                                         
>>> arr.type.tz                                                                                                           
>>> arr.to_pylist()                                                                                                       
[datetime.datetime(2020, 7, 20, 10, 21, 42, 96119)]
>>> arr.to_pylist()[0].tzinfo                                                                                             
>>> arr.to_pandas()                                                                                                       
0   2020-07-20 10:21:42.096119
dtype: datetime64[ns]

So without the PR, there's the problem that timestamps lose the timezone information from Python. It seems to get worse with this PR because non-UTC timestamps get tagged as UTC without being corrected for the timezone's offset, which is misleading. At least originally you may be alerted by the absence of a timezone on the type (in Python terms, it's a "naive" timestamp).
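For reference, the naive/aware distinction mentioned above is a plain Python concept; here is a minimal stdlib-only sketch (not pyarrow-specific):

```python
from datetime import datetime, timezone

# A "naive" datetime carries no timezone; an "aware" one does.
naive = datetime(2020, 7, 20, 14, 21, 42)
aware = datetime(2020, 7, 20, 14, 21, 42, tzinfo=timezone.utc)

assert naive.tzinfo is None           # naive: no timezone attached
assert aware.utcoffset() is not None  # aware: carries a UTC offset
assert naive != aware                 # equality across the two is always False
```

Ordering comparisons (`<`, `>`) between naive and aware datetimes raise `TypeError`, which is one way the absence of a timezone surfaces.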

Contributor Author

@emkornfield emkornfield Jul 20, 2020

It seems to get worse with this PR because non-UTC timestamps get tagged as UTC without being corrected for the timezone's offset, which is misleading.

That is not the intent of the PR; right now everything gets corrected to UTC. The example below correctly keeps the times logically the same: US/Eastern is converted to the corresponding time in UTC. I can make the change to try to keep the original timezones in place.

>>> now_with_tz = datetime.datetime(2020, 7, 20, 10, 21, 42, 96119, tzinfo=pytz.timezone('US/Eastern'))
>>> arr = pa.array([now_with_tz]) 
>>> arr.type.tz  
'UTC'
>>> arr.to_pylist() 
[datetime.datetime(2020, 7, 20, 15, 17, 42, 96119, tzinfo=<UTC>)]
>>> arr.to_pylist()[0].tzinfo
<UTC>
>>> arr.to_pandas()
0   2020-07-20 15:17:42.096119+00:00
dtype: datetime64[ns, UTC]
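A rough Python-level sketch of the normalization described above (the helper name `normalize_to_utc` is hypothetical, not part of the patch, and a fixed -04:00 offset stands in for US/Eastern):

```python
from datetime import datetime, timezone, timedelta

def normalize_to_utc(dt):
    # Timezone-aware values are converted to the same instant in UTC;
    # naive values are passed through unchanged.
    if dt.tzinfo is not None:
        return dt.astimezone(timezone.utc)
    return dt

eastern = timezone(timedelta(hours=-4))  # EDT as a fixed offset, for the sketch
aware = datetime(2020, 7, 20, 10, 21, 42, tzinfo=eastern)
assert normalize_to_utc(aware) == datetime(2020, 7, 20, 14, 21, 42, tzinfo=timezone.utc)
```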

struct = pa.StructArray.from_arrays([arr, arr], ['start', 'stop'])

result = struct.to_pandas()
# N.B. we test with Panaas because the Arrow types are not
Member

"Pandas" :-)

result = struct.to_pandas()
# N.B. we test with Panaas because the Arrow types are not
# actually equal. All Timezone aware times are currently normalized
# to "UTC" as the timesetamp type.but since this conversion results
Member

"timestamp"

# N.B. we test with Panaas because the Arrow types are not
# actually equal. All Timezone aware times are currently normalized
# to "UTC" as the timesetamp type.but since this conversion results
# in object dtypes, and tzinfo is preserrved the comparison should
Member

"preserved"

def test_array_from_scalar():
today = datetime.date.today()
now = datetime.datetime.now()
now_utc = now.replace(tzinfo=pytz.utc)
Member

Based on my experiments, you should write:

now_utc = datetime.datetime.now(tz=pytz.utc)

(simply calling .replace(tzinfo=pytz.utc) doesn't adjust the recorded time for the timezone change, so you get the local time tagged with a UTC timezone)

And, yes, this probably doesn't matter for the test's correctness :-)
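The difference can be seen with stdlib timezones alone (a fixed -04:00 offset stands in for the local zone here):

```python
from datetime import datetime, timezone, timedelta

naive = datetime(2020, 7, 20, 10, 0, 0)

# .replace() merely tags the value; the clock reading is unchanged.
tagged = naive.replace(tzinfo=timezone.utc)

# astimezone() on an aware value actually converts the instant.
eastern = timezone(timedelta(hours=-4))
converted = naive.replace(tzinfo=eastern).astimezone(timezone.utc)

assert tagged.hour == 10      # same wall-clock time, now labelled UTC
assert converted.hour == 14   # shifted by the -04:00 offset
```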



def test_nested_with_timestamp_tz_round_trip():
ts = pd.Timestamp.now()
Member

What timezone does this timestamp have? Is it a naive timestamp? Would be nice explaining it in comments.

@kszucs
Member

kszucs commented Jul 20, 2020

My main concern with this solution is that while it resolves the pandas roundtrip, the intermediate array values are different.
People may "rely" on the previous buggy behavior, and I'm afraid it'll cause more post-release trouble than we expect.

Running the following snippet on three different revisions:

import pytz
from datetime import datetime

import pyarrow as pa

now_at_budapest = datetime.now(pytz.timezone('Europe/Budapest'))
arr = pa.array([now_at_budapest], type=pa.timestamp('s', tz='Europe/Budapest'))

try:
    pa.show_versions()
except AttributeError:
    print("Arrow version: {}".format(pa.__version__))

print(arr)
print(arr.to_pandas())

0.17.1

Arrow version: 0.17.1
[
    2020-07-20 17:01:11
]
0   2020-07-20 19:01:11+02:00
dtype: datetime64[ns, Europe/Budapest]

Master

pyarrow version info
--------------------
Package kind: not indicated
Arrow C++ library version: 1.0.0-SNAPSHOT
Arrow C++ compiler: AppleClang 11.0.3.11030032
Arrow C++ compiler flags:  -Qunused-arguments -fcolor-diagnostics -ggdb -O0
Arrow C++ git revision: 210d3609f027ef9ed83911c2d1132cb9cbb2dc06
Arrow C++ git description: apache-arrow-0.17.0-756-g210d3609f
[
    2020-07-20 17:10:11
]
0   2020-07-20 19:10:11+02:00
dtype: datetime64[ns, Europe/Budapest]

This patch

pyarrow version info
--------------------
Package kind: not indicated                                                 
Arrow C++ library version: 1.0.0-SNAPSHOT                                   
Arrow C++ compiler: AppleClang 11.0.3.11030032                              
Arrow C++ compiler flags:  -Qunused-arguments -fcolor-diagnostics -ggdb -O0 
Arrow C++ git revision: a5b2a51665ab1383fb371ecd76bb3c20c4bf8726            
Arrow C++ git description: apache-arrow-0.17.0-761-ga5b2a5166               
[                                                                           
  2020-07-20 15:01:12                                                       
]                                                                           
0   2020-07-20 17:01:12+02:00                                               
dtype: datetime64[ns, Europe/Budapest]                                      

While the current master works for this example and the Spark patch fixes the Spark integration test, it breaks the nested roundtrip example discussed in the ML thread.

@emkornfield @BryanCutler thoughts?

@emkornfield
Contributor Author

@kszucs Breaking users is a concern; I'll add an environment variable for both this change and the previous one that can keep the old buggy behavior. Just to clarify: it was actually ~5 PM Budapest time when you ran these tests (i.e. this patch looks like it fixes a bug)?

I thought the unit test I added for Pandas captured the intent of the ML example? Let me try to run the example by hand in Python to see the results.

@kszucs
Member

kszucs commented Jul 20, 2020

@kszucs Breaking users is a concern; I'll add an environment variable for both this change and the previous one that can keep the old buggy behavior. Just to clarify: it was actually ~5 PM Budapest time when you ran these tests (i.e. this patch looks like it fixes a bug)?

Right, this patch produces the right result.

I thought the unit test I added for Pandas captured the intent of the ML example? Let me try to run the example by hand in Python to see the results.

It fixes that as well, it just doesn't keep the old buggy behavior. I was considering just applying the Spark patch on the current master to keep the old buggy behavior, but there is still the nested issue.

@emkornfield
Contributor Author

It fixes that as well, it just doesn't keep the old buggy behavior. I was considering just applying the Spark patch on the current master to keep the old buggy behavior, but there is still the nested issue.

I've run out of time this morning to work on this PR. I'll update it tonight with an environment variable flag that can retain the old buggy behavior.

@kszucs
Member

kszucs commented Jul 20, 2020

@emkornfield Thanks for working on this!

In the meantime I'm going to apply the reversion patch and cut RC2. Since it's going to take at least 6-8 hours to build plus three additional days for voting, we'll have enough time to sort this issue out and decide whether to cut RC3 including this patch or keep RC2.

@BryanCutler
Member

Just to clarify things, is the main concern with this patch over keeping the previous buggy behavior? Besides that, are these changes producing correct results and passing roundtrip tests?
It does seem a little late in the game to be making these kinds of changes, so unless others view this as extremely low risk, I'm in favor of reverting and fixing this later when it's not so rushed.

@nealrichardson
Member

Correct me if I'm wrong, but IIUC there are doubts about a few things:

  1. the correctness of the behavior on master (prior to reverting the initial change)
  2. the correctness of the behavior on this patch
  3. how much people have built expectations around the wrong behavior that is the status quo (0.17.1), in Spark and in general
  4. how to give people a safe upgrade path that doesn't break production code, even if the changes are more "correct".

This patch may be the right solution, but I fear that we haven't adequately thought through (and tested) all of the implications and upgrade paths. And two of the people with the strongest opinions about pyarrow's API (@wesm and @pitrou) just left for vacation and have expressed a preference for reverting the initial change for the 1.0 release. At this stage of the 1.0 release, I'd rather pyarrow continue to be wrong in the expected way (i.e. revert and not merge this yet) than be right in an unexpected way and possibly wrong in other unknown ways.

@emkornfield
Contributor Author

@nealrichardson I think we should discuss this on the mailing list. The prior patch has been reverted, and I'll use this one to have an end-to-end solution, which probably won't make it into 1.0.

@nealrichardson
Member

Sure. I thought the mailing list discussion said to discuss here 🤷

@emkornfield
Contributor Author

too many communication channels.

@kszucs
Member

kszucs commented Jul 22, 2020

@emkornfield I think we can close this in favor of #7816

@emkornfield
Contributor Author

yes.

wesm pushed a commit that referenced this pull request Aug 16, 2020
Follow up of:
- ARROW-9223: [Python] Propagate timezone information in pandas conversion
- ARROW-9528: [Python] Honor tzinfo when converting from datetime (#7805)

TODOs:
- [x] Store all Timestamp values normalized to UTC
- [x] Infer timezone from the array values if no explicit type was given
- [x] Testing (especially pandas object roundtrip)
- [x] Testing of timezone-naive roundtrips
- [x] Testing mixed pandas and datetime objects

Closes #7816 from kszucs/tz

Lead-authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Co-authored-by: Micah Kornfield <emkornfield@gmail.com>
Signed-off-by: Wes McKinney <wesm@apache.org>
alamb pushed a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021