Skip to content

Conversation

@rok
Copy link
Member

@rok rok commented Nov 30, 2021

This is to resolve ARROW-14822.

@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@rok rok force-pushed the ARROW-14822 branch 4 times, most recently from c986aeb to 5b1fe4b Compare December 6, 2021 10:50
@rok rok force-pushed the ARROW-14822 branch 2 times, most recently from 0988d48 to 0814df6 Compare December 14, 2021 23:37
@rok
Copy link
Member Author

rok commented Dec 14, 2021

This is getting review ready, couple of issues remaining:

  • RoundTemporal, CeilTemporal and FloorTemporal could maybe be templated from a single struct
  • Nonexistent in RoundTemporalOptions might be unnecessary as we assume timestamps are in UTC so they always exist. Probably holds for Ambiguous too.
  • Test data needs some work, especially pre-1970 and edge-cases.

@rok
Copy link
Member Author

rok commented Dec 14, 2021

@lidavidm @jorisvandenbossche could you take a quick glance to see if something needs to be done fundamentally different here? If not I hope to have this review ready tomorrow.

@lidavidm
Copy link
Member

  • Nonexistent in RoundTemporalOptions might be unnecessary as we assume timestamps are in UTC so they always exist. Probably holds for Ambiguous too.

Would we want to eventually support rounding in a timezone? (That can be deferred since I can see that getting tricky.) e.g. rounding to the hour may differ in a timezone that is xx:30 offset from UTC.

@rok
Copy link
Member Author

rok commented Dec 14, 2021

  • Nonexistent in RoundTemporalOptions might be unnecessary as we assume timestamps are in UTC so they always exist. Probably holds for Ambiguous too.

Would we want to eventually support rounding in a timezone? (That can be deferred since I can see that getting tricky.) e.g. rounding to the hour may differ in a timezone that is xx:30 offset from UTC.

I've been thinking about that exact situation (xx:30) and I'm not sure yet how to tackle it but I'd like to in this ticket.

@lidavidm
Copy link
Member

I only took a quick scan but I don't see anything fundamentally wrong.

  • It gets confusing with all the chrono types but we should stick to UpperCamelCase for Arrow functions/methods
  • Do we want QUARTER in addition to SEASON?

@rok
Copy link
Member Author

rok commented Dec 15, 2021

  • Do we want QUARTER in addition to SEASON?

I had it originally but it felt redundant as it's just RoundTemporal(arg, RoundTemporalOptions(unit=month, multiple=3)). I'll add it back as it will probably get requested.

@lidavidm
Copy link
Member

Ah, that is true - we could just document it.

@rok rok marked this pull request as ready for review December 21, 2021 12:04
@rok
Copy link
Member Author

rok commented Dec 21, 2021

@lidavidm - I think this is ready for first round of review.
I've decided to remove Nonexistent and Ambiguous.
I'm working out some issues with rounding multiple seasons but other than that I think it's good to review. I'll also try adding origin to set the "zero time" in one of the future commits.

@rok
Copy link
Member Author

rok commented Dec 21, 2021

@pitrou any thoughts on this?

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me, taking into account the TODOs.

@jorisvandenbossche
Copy link
Member

I've decided to remove Nonexistent and Ambiguous.

(didn't yet look at the code) Did you then also remove support rounding timestamps with a timezone? Or how does rounding tz-aware timestamps work?

It's also not really clear to me what origin would do. Can you explain with an example what it would do? (I also don't see it in lubridate or pandas?)

@rok
Copy link
Member Author

rok commented Dec 22, 2021

I've decided to remove Nonexistent and Ambiguous.

(didn't yet look at the code) Did you then also remove support rounding timestamps with a timezone? Or how does rounding tz-aware timestamps work?

At the moment some of the code (ns, us, ms, s, m, h, d, w) is UTC based even when timezones are present. UTC always converts to a local time so nonexistent is not an issue while ambiguous could be.
Month, quarter, year are currently rounded in local time and integer multiples of those "units" are less problematic because beginnings of months seem to always exist. Perhaps this needs more study.

It's also not really clear to me what origin would do. Can you explain with an example what it would do? (I also don't see it in lubridate or pandas?)

Origin is meant like the origin in Euclidean space. Let's say you want to floor today's date to 100year:

floor("2021-12-22", unit="year", multiple=100, origin="1970-01-01") == "1970-01-01"
floor("2021-12-22", unit="year", multiple=100, origin="0-01-01") == "2000-01-01"

R's clock has this concept.
It's also a "replacement" for week_starts_monday.
e.g.: 1970-01-01 was a Thursday so floor(t, unit="week", multiple=1, origin="1970-01-05") is equivalent to floor(t, unit="week", multiple=1, week_starts_monday=True)

@rok
Copy link
Member Author

rok commented Dec 23, 2021

I think tests are now mostly in order. Remaining work:

  • origin
  • timezone handling (nonexistent, ambiguous)
  • shorthand

Am I forgetting something?

Which of these could be pushed out of this scope? I'm thinking shorthand and perhaps perhaps timezone handling.
@lidavidm @jorisvandenbossche

@rok rok force-pushed the ARROW-14822 branch 2 times, most recently from b241c3c to e1e55d0 Compare December 23, 2021 11:11
@rok rok force-pushed the ARROW-14822 branch 3 times, most recently from 4429c20 to d84408d Compare January 1, 2022 17:43
@rok rok force-pushed the ARROW-14822 branch 2 times, most recently from ed1f711 to b93f075 Compare January 3, 2022 01:25
@rok
Copy link
Member Author

rok commented Jan 3, 2022

@lidavidm @jorisvandenbossche this now rounds time since epoch in local time (see here). It works ok outside of DST and discussible within. It's a different approach to what e.g. Pandas does.
I'm going to look into a better approach for this.

timestamps = [
# "1899-04-18 01:57:09.190202880",
# "1899-09-12 07:03:30.080325120",
# "1904-06-21 20:55:36.493869056",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these still meant to be commented out?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some odd changes in timezone offsets that are causing problems. I'll try changing ceil to match behaviour.

Copy link
Member Author

@rok rok Jan 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the kind of error I'm seeing for these three timestamps:

actual:

Timestamp('1899-04-19 00:31:50+0553', tz='Asia/Kolkata')
Timestamp('1899-09-13 00:31:50+0553', tz='Asia/Kolkata')
Timestamp('1904-06-22 23:59:50+0521', tz='Asia/Kolkata')

expected:

Timestamp('1899-04-19 00:00:00+0553', tz='Asia/Kolkata'),
Timestamp('1899-09-13 00:00:00+0553', tz='Asia/Kolkata'),
Timestamp('1904-06-23 00:00:00+0521', tz='Asia/Kolkata')

Copy link
Member Author

@rok rok Jan 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calcutta time was one of the two time zones established in British India in 1884. It was established during the International Meridian Conference held at Washington, D.C. in the United States. It was decided that India had two time zones: Calcutta (now Kolkata) would use the 90th meridian east and Bombay (now Mumbai) the 75th meridian east. It was determined as 5 hours, 53 minutes and 20 seconds ahead of Greenwich Mean Time(UTC+5:53:20).

https://en.wikipedia.org/wiki/Calcutta_Time

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh boy. So, if I'm reading this right, it's rounding using the "modern" TZ definition instead of the "contemporary" one? Maybe we should just not test this case then, since this is down to whatever the timezone database for a system contains…? Or am I misunderstanding.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's that but I'm not 100% and I really want to be :))

Copy link
Member Author

@rok rok Jan 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I mislabeled that (switched actual and expected, fixed now). It appears my rounding in local time is not done correctly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm back to thinking Pandas only uses "modern" TZs while date.h uses a historical database.
I've added a test for this in c++ and removed it from Python.

@rok rok force-pushed the ARROW-14822 branch 3 times, most recently from 4be4c66 to e2b6d9f Compare January 4, 2022 19:35
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: for context, can we note that this test is here to test the a case where date (and hence our kernels) uses historical TZ info while Pandas (and possibly other libraries) do not?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise I fear confusion when someone in the future looks and wonders why Kolkata was an especially important time zone to test :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that makes sense.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment added.

It seems reality is more nuanced than Wikipedia states.

@lidavidm
Copy link
Member

lidavidm commented Jan 4, 2022

Oh, and we need to add this to the python api.rst as well.

@rok rok force-pushed the ARROW-14822 branch 2 times, most recently from c7cf795 to e8f9a62 Compare January 4, 2022 20:10
@lidavidm lidavidm closed this in edab145 Jan 4, 2022
@rok
Copy link
Member Author

rok commented Jan 4, 2022

Thanks @lidavidm & @jorisvandenbossche ! :)

Follow-up Jiras:

@ursabot
Copy link

ursabot commented Jan 5, 2022

Benchmark runs are scheduled for baseline = acce03b and contender = edab145. edab145 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.45% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.09% ⬆️0.04%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants