ARROW-14822: [C++] Implement floor/ceil/round for temporal objects #11818
c986aeb to
5b1fe4b
Compare
0988d48 to
0814df6
Compare
This is getting review-ready; a couple of issues remain:
@lidavidm @jorisvandenbossche could you take a quick glance to see if something needs to be done fundamentally differently here? If not, I hope to have this review-ready tomorrow.
Would we want to eventually support rounding in a timezone? (That can be deferred since I can see that getting tricky.) E.g. rounding to the hour may differ in a timezone that is xx:30 offset from UTC.
I've been thinking about that exact situation (xx:30) and I'm not sure yet how to tackle it, but I'd like to in this ticket.
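To make the xx:30 concern concrete, here is a minimal sketch in plain Python (not the Arrow kernel). The helper names `floor_to` and `floor_local` are hypothetical, and the +05:30 offset and sample instant are chosen only for illustration. It floors the same instant to the hour once on the UTC timeline and once on the local wall clock:

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
HOUR = timedelta(hours=1)


def floor_to(dt, unit):
    """Floor a naive datetime to a multiple of `unit` counted from EPOCH."""
    return EPOCH + ((dt - EPOCH) // unit) * unit


def floor_local(utc_dt, unit, offset):
    """Shift to local wall time, floor there, then shift back to UTC."""
    return floor_to(utc_dt + offset, unit) - offset


utc = datetime(2021, 12, 22, 10, 45)      # 16:15 local at +05:30
offset = timedelta(hours=5, minutes=30)   # illustrative fixed offset

floored_utc = floor_to(utc, HOUR)               # hour boundary in UTC
floored_local = floor_local(utc, HOUR, offset)  # hour boundary in local time

print(floored_utc, floored_local)
```

The two results differ by 30 minutes: flooring in UTC lands on 15:30 local time, which is not an hour boundary on the local wall clock. A fixed offset sidesteps DST, so this only illustrates the offset problem, not ambiguous or nonexistent local times.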
I only took a quick scan but I don't see anything fundamentally wrong.
I had it originally but it felt redundant as it's just
Ah, that is true - we could just document it.
@lidavidm - I think this is ready for first round of review.
@pitrou any thoughts on this?
lidavidm left a comment
Overall looks good to me, taking into account the TODOs.
(didn't yet look at the code) Did you then also remove support for rounding timestamps with a timezone? Or how does rounding tz-aware timestamps work? It's also not really clear to me what
At the moment, rounding for some units (ns, us, ms, s, m, h, d, w) is UTC-based even when a timezone is present. A UTC instant always maps to a valid local time, so nonexistent times are not an issue, while ambiguous ones could be.
Origin is meant like the origin in Euclidean space. Let's say you want to floor today's date:
floor("2021-12-22", unit="year", multiple=100, origin="1970-01-01") == "1970-01-01"
floor("2021-12-22", unit="year", multiple=100, origin="0-01-01") == "2000-01-01"
R's clock has this concept.
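The arithmetic behind these two examples can be sketched in a few lines of Python. This is an illustration only: `floor_year` is a hypothetical helper, not the kernel's API, and it works on integer years because Python's `datetime` cannot represent year 0.

```python
def floor_year(year: int, multiple: int, origin_year: int) -> int:
    """Floor `year` to a multiple of `multiple` years counted from `origin_year`."""
    # Python's // rounds toward negative infinity, so years before the
    # origin are floored downward as well.
    steps = (year - origin_year) // multiple
    return origin_year + steps * multiple


print(floor_year(2021, 100, 1970))  # origin 1970
print(floor_year(2021, 100, 0))     # origin year 0
```

With origin 1970 the 100-year bins start at 1970, 2070, and so on, so 2021 floors to 1970. With origin 0 the bins align with calendar centuries, so 2021 floors to 2000.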
I think tests are now mostly in order. Remaining work:
Am I forgetting something? Which of these could be pushed out of this scope? I'm thinking shorthand and perhaps timezone handling.
b241c3c to
e1e55d0
Compare
4429c20 to
d84408d
Compare
ed1f711 to
b93f075
Compare
@lidavidm @jorisvandenbossche this now rounds time since epoch in local time (see here). It works OK outside of DST transitions and is debatable within them. It's a different approach from what e.g. Pandas does.
python/pyarrow/tests/test_compute.py (outdated)
    timestamps = [
        # "1899-04-18 01:57:09.190202880",
        # "1899-09-12 07:03:30.080325120",
        # "1904-06-21 20:55:36.493869056",
Are these still meant to be commented out?
There are some odd changes in timezone offsets that are causing problems. I'll try changing ceil to match behaviour.
This is the kind of error I'm seeing for these three timestamps:
actual:
Timestamp('1899-04-19 00:31:50+0553', tz='Asia/Kolkata')
Timestamp('1899-09-13 00:31:50+0553', tz='Asia/Kolkata')
Timestamp('1904-06-22 23:59:50+0521', tz='Asia/Kolkata')
expected:
Timestamp('1899-04-19 00:00:00+0553', tz='Asia/Kolkata'),
Timestamp('1899-09-13 00:00:00+0553', tz='Asia/Kolkata'),
Timestamp('1904-06-23 00:00:00+0521', tz='Asia/Kolkata')
Calcutta time was one of the two time zones established in British India in 1884. It was established during the International Meridian Conference held at Washington, D.C. in the United States. It was decided that India had two time zones: Calcutta (now Kolkata) would use the 90th meridian east and Bombay (now Mumbai) the 75th meridian east. It was determined as 5 hours, 53 minutes and 20 seconds ahead of Greenwich Mean Time (UTC+5:53:20).
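A sub-minute offset such as UTC+5:53:20 matters for rounding in local time. The sketch below is plain Python with a hypothetical `ceil_day_local` helper. It assumes that the first commented-out timestamp (truncated to whole seconds) is a UTC instant and that the offset is fixed. It does not try to reproduce the exact discrepancy reported in this thread; it only shows that dropping the 20 seconds moves the result:

```python
from datetime import datetime, timedelta


def ceil_day_local(utc_dt, offset):
    """Ceil a naive UTC datetime to the next local midnight; return it in UTC."""
    local = utc_dt + offset                            # local wall time
    midnight = datetime(local.year, local.month, local.day)
    if local > midnight:                               # already at midnight? keep it
        midnight += timedelta(days=1)
    return midnight - offset                           # back to UTC


utc = datetime(1899, 4, 18, 1, 57, 9)
full_offset = timedelta(hours=5, minutes=53, seconds=20)  # offset with seconds
minute_offset = timedelta(hours=5, minutes=53)            # seconds dropped

with_seconds = ceil_day_local(utc, full_offset)
without_seconds = ceil_day_local(utc, minute_offset)

print(with_seconds, without_seconds, without_seconds - with_seconds)
```

The two results are 20 seconds apart, so libraries that disagree on the historical offset for a zone will also disagree on where the local day boundary falls.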
Oh boy. So, if I'm reading this right, it's rounding using the "modern" TZ definition instead of the "contemporary" one? Maybe we should just not test this case then, since this is down to whatever the timezone database for a system contains…? Or am I misunderstanding.
I think it's that but I'm not 100% and I really want to be :))
Sorry I mislabeled that (switched actual and expected, fixed now). It appears my rounding in local time is not done correctly.
I'm back to thinking Pandas only uses "modern" TZs while date.h uses a historical database.
I've added a test for this in C++ and removed it from Python.
4be4c66 to
e2b6d9f
Compare
nit: for context, can we note that this test is here to cover a case where date (and hence our kernels) uses historical TZ info while Pandas (and possibly other libraries) does not?
Otherwise I fear confusion when someone in the future looks and wonders why Kolkata was an especially important time zone to test :)
Yeah, that makes sense.
It seems reality is more nuanced than Wikipedia states.
Oh, and we need to add this to the Python api.rst as well.
c7cf795 to
e8f9a62
Compare
Thanks @lidavidm & @jorisvandenbossche ! :) Follow-up Jiras:
Benchmark runs are scheduled for baseline = acce03b and contender = edab145. edab145 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
This is to resolve ARROW-14822.