-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-15251: [C++] Temporal floor/ceil/round handle ambiguous/nonexistent local time #12528
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
354bef6 to
2e018c5
Compare
|
@rok could you check this?, I tested different |
|
A somewhat contrived example that currently gives a nonexistent error with rounding (it requires an atypical |
rok
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this @AlvinJ15! This looks pretty complete already. You can find an example test for nonexistent and ambiguous here: https://howardhinnant.github.io/date/tz.html#nonexistent_local_time
I'll do another pass tonight or tomorrow.
cpp/src/arrow/compute/api_scalar.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if AssumeTimezoneOptions::Ambiguous and RoundTemporalOptions::Ambiguous would have the same options long-term (same for nonexistent). For now this change seems like the way to go, I'm just wondering if the name compute::Ambiguous should maybe be compute::AmbiguousTime (and compute::NonexistentTime?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I take the suggestion and changed compute::Ambiguous to compute::AmbiguousTime and compute::Nonexistent to compute::NonexistentTime
Are you saying that |
2e018c5 to
11c1c52
Compare
|
@rok comments solved, the |
c6cb7e6 to
acdf872
Compare
0c31bd1 to
b16eeed
Compare
rok
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments. Looks good overall!
We need a review from a commiter as well @jorisvandenbossche @pitrou
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps AssumeTimezone kernel could reuse this now that you have it nicely factored out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to have arrow_vendored::date::choose as a template parameter rather than passing options every time the kernel is called? (I'm not certain just asking)
Also note that we could probably do more templating for the ceil/floor/round kernels, but that's out of scope here.
|
Can you explain the motivation for this functionality? |
The issue is that rounding is done on local time not UTC. So if the rounded-to moment does not exist in local time rounding will fail and we need to handle it at that point. |
|
Right... but as I said this usually means the original timestamp was already invalid, no? So instead of catching errors, would it be more/less useful to have a function purely to fix invalid timestamps? |
|
Original timestamp can be valid and rounded-to not. Let's take an example from date.h docs: If we start with a valid |
|
Hmm, I see. While ceiling to 2h30min sounds exotic, this is a valid use case. |
|
Still, this example doesn't make sense to me: const char* times = R"(["2018-10-28 01:20:00"])";
const char* times_earliest = R"(["2018-10-28 00:30:00"])";
const char* times_latest = R"(["2018-10-28 01:30:00"])";The only correct answer here is |
I think you can achieve this with less exotic intervals too.
Indeed. So choice should only be raise or latest? |
|
Taken more abstractly, the contract is the following:
So it should be possible to implement the expected semantics without exposing any additional options to the user, possibly by examining the two |
Currently we only raise and we want to control raise vs earliest/latest. That would mean exposing additional option I think? Or are you proposing to not raise? I think that's a valid for nonexistent, but I'm not sure about ambiguous. E.g. ceil(t) falls to exact moment of DST switch and could return t or t + dst_offset. We can take the same approach here and see if users eventually complain :). Here's an ambiguous example from date.h. |
Rok already mentioned it, but while it's true that non-existent times from rounding are a bit exotic, the ambiguous is certainly not. To give a concrete example, assume the local time "2021-10-31 02:25:00" in Europe (during a DST switch) and rounding that to the hour: But indeed, also in this case we can know that "00:00::00 UTC" is closer to the original timestamp than "01:00:00 UTC" (since the original timestamp in UTC was "00:25:00 UTC"). That adds some more logic to this kernel, but this would actually make those round kernels more useful!
If there is no ambiguity left (eg as in the example above), I think we should not raise by default. But it might be that for some cases it's still better to raise by default. For example in the case of "non-existent" times, we are actually changing the resulting timestamp, and thus that also means it will not necessarily "follow" the rounding |
By definition of ceil(), it should return the smallest applicable value, so there is no ambiguity.
That's true, though it only seems to trigger for "unusual" roundings. So we may want to add an option for non-existent timestamps, but it sounds less important than getting ambiguous timestamps right (which shouldn't require an option). |
We could catch these cases and implement logic to return correct multiple rounding. Then we wouldn't need any options and would never have to raise. I agree with the other conclusions. |
|
@AlvinJ15 any progress on this? Can I help somehow? |
|
I think monotonicity in UTC is important for analytics use cases (such as bucketing). Monotonicity in wall time not so much.
I'm not sure I understand precisely what you mean, but is that part of an analytics workload? |
Let's say you want to analyse trades in arbitrary intervals. In UTC things would make sense but when you view them in wall time buckets with broken order you could have negative inventory because a sell could come before a buy. But I'm not sure anyone wants this right now and I'm ok dropping it.
Another example would be creating a histogram of taxi pick-up times in local time. I think we had local rounding discussion here. |
If you want your histogram to be useful, the buckets have to have equal durations in physical time. Looking at your graph, that would not be the case for the "preserve wall" variants? |
Physically equal buckets would be great but we can't have them given the constraints. I chose to flatten first fold for floor and second for ceil because I felt functions imply that. What we also could do is double-size buckets in wall time to distribute the flattening over the ambiguous period. Either way we don't have a "good excuse" for changing buckets. Again we can kick "preserve wall" out from this scope and make a jira linking to this discussion and see if there's a need to implement this later on. |
|
@jorisvandenbossche what's your take on regular and "preserve_wall" behaviours here? |
|
Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍 |
fff7bd4 to
a48f69d
Compare
a48f69d to
c027b76
Compare
f77fd93 to
107c413
Compare
Tweaking nonexistent/ambiguous rounding Moving nonexistent/ambiguous logic to AssumeTimezone Revert AssumeTimezoneOptions::Nonexistent changes Fixing compiler warnings Fixing ceil/round issues Apply suggestions from code review Review feedback Review feedback Changes to ceil/floor, more tests Refactoring refactoring Review feedback review feedback Review feedback adding python tests adding ambiguous round test python Update cpp/src/arrow/compute/kernels/scalar_temporal_test.cc change nonexistent/ambiguous behaviour Add preserve_wall_time_order flag
107c413 to
35cab06
Compare
|
Thank you for your contribution. Unfortunately, this pull request has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label or comment below, or this PR will be closed in 14 days. Feel free to re-open this if it has been closed in error. If you do not have repository permissions to reopen the PR, please tag a maintainer. |
Temporal floor/ceil/round handle ambiguous/nonexistent local time