@AlvinJ15
Contributor

@AlvinJ15 AlvinJ15 commented Mar 1, 2022

Temporal floor/ceil/round handle ambiguous/nonexistent local time

@github-actions

github-actions bot commented Mar 1, 2022

⚠️ Ticket has no components in JIRA, make sure you assign one.

@AlvinJ15 AlvinJ15 force-pushed the ARROW-15251-Temporal_floor/ceil/round_handle_ambiguous/nonexistent_local_time branch from 354bef6 to 2e018c5 Compare March 1, 2022 08:34
@AlvinJ15
Contributor Author

AlvinJ15 commented Mar 1, 2022

@rok could you check this? I tested different NonExistentTimeError cases, but floor/ceil/round didn't raise the exception; it seems like the FloorTimePoint handles this

@jorisvandenbossche
Member

A somewhat contrived example that currently gives a nonexistent error with rounding (it requires an atypical multiple to end up in a gap):

>>> arr = pc.assume_timezone(pa.array([pd.Timestamp("2015-03-29 02:30:00")]), "Europe/Brussels", nonexistent="latest")
>>> pc.round_temporal(arr, 16, "minute")
...
ArrowInvalid: Local time does not exist: 2015-03-29 02:56:00.000000 is in a gap between
2015-03-29 02:00:00 CET and
2015-03-29 03:00:00 CEST which are both equivalent to
2015-03-29 01:00:00 UTC

Member

@rok rok left a comment

Thanks for doing this @AlvinJ15! This looks pretty complete already. You can find an example test for nonexistent and ambiguous here: https://howardhinnant.github.io/date/tz.html#nonexistent_local_time
I'll do another pass tonight or tomorrow.

Comment on lines 103 to 106
Member

I'm not sure if AssumeTimezoneOptions::Ambiguous and RoundTemporalOptions::Ambiguous would have the same options long-term (same for nonexistent). For now this change seems like the way to go; I'm just wondering if the name compute::Ambiguous should maybe be compute::AmbiguousTime (and compute::Nonexistent should be compute::NonexistentTime)?

Contributor Author

I took the suggestion and changed compute::Ambiguous to compute::AmbiguousTime and compute::Nonexistent to compute::NonexistentTime.

@rok
Member

rok commented Mar 1, 2022

it seems like the FloorTimePoint handles this

Are you saying that CeilTimePoint and RoundTimePoint don't raise at all? Or that FloorTimePoint raises for them?

@AlvinJ15 AlvinJ15 force-pushed the ARROW-15251-Temporal_floor/ceil/round_handle_ambiguous/nonexistent_local_time branch from 2e018c5 to 11c1c52 Compare March 2, 2022 05:54
@AlvinJ15
Contributor Author

AlvinJ15 commented Mar 2, 2022

@rok comments addressed. The re-request review button doesn't work for me.

@AlvinJ15 AlvinJ15 force-pushed the ARROW-15251-Temporal_floor/ceil/round_handle_ambiguous/nonexistent_local_time branch 2 times, most recently from c6cb7e6 to acdf872 Compare March 2, 2022 06:22
@AlvinJ15 AlvinJ15 force-pushed the ARROW-15251-Temporal_floor/ceil/round_handle_ambiguous/nonexistent_local_time branch 2 times, most recently from 0c31bd1 to b16eeed Compare March 2, 2022 07:04
Member

@rok rok left a comment

Some comments. Looks good overall!
We need a review from a committer as well @jorisvandenbossche @pitrou

Comment on lines 124 to 367
Member

Perhaps AssumeTimezone kernel could reuse this now that you have it nicely factored out?

Member

Would it make sense to have arrow_vendored::date::choose as a template parameter rather than passing options every time the kernel is called? (I'm not certain, just asking.)

Also note that we could probably do more templating for the ceil/floor/round kernels, but that's out of scope here.

@pitrou
Member

pitrou commented Mar 7, 2022

Can you explain the motivation for this functionality?
Usually, if the rounded time is ambiguous/non-existent, the input time was already ambiguous/non-existent, no?

@rok
Member

rok commented Mar 7, 2022

Can you explain the motivation for this functionality? Usually, if the rounded time is ambiguous/non-existent, the input time was already ambiguous/non-existent, no?

The issue is that rounding is done on local time, not UTC. So if the rounded-to moment does not exist in local time, rounding will fail, and we need to handle it at that point.

@pitrou
Member

pitrou commented Mar 7, 2022

Right... but as I said this usually means the original timestamp was already invalid, no? So instead of catching errors, would it be more/less useful to have a function purely to fix invalid timestamps?

@rok
Member

rok commented Mar 7, 2022

The original timestamp can be valid while the rounded-to one is not. Let's take an example from the date.h docs:

2016-03-13 02:30:00 is in a gap between
2016-03-13 02:00:00 EST and
2016-03-13 03:00:00 EDT which are both equivalent to
2016-03-13 07:00:00 UTC

If we start with a valid 2016-03-13 00:01:00 EST and ceil it to 2h30min, then the local time will fall into the nonexistent gap.
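
A minimal sketch of this case in plain Python, not the kernel's implementation. It assumes multiples are counted from local midnight (the kernel's actual origin may differ) and hard-codes the gap bounds from the date.h message above. `ceil_wall` and `in_gap` are hypothetical helper names.

```python
from datetime import datetime, timedelta

# Gap in local wall-clock time on 2016-03-13, taken from the date.h message.
GAP_START = datetime(2016, 3, 13, 2, 0)  # first nonexistent wall-clock time
GAP_END = datetime(2016, 3, 13, 3, 0)    # first valid wall-clock time after the gap


def ceil_wall(local, step):
    """Ceil a naive wall-clock time to a multiple of `step`.

    Assumption: multiples are counted from local midnight.
    """
    midnight = local.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = local - midnight
    n = -((-elapsed) // step)  # ceiling division on timedeltas
    return midnight + n * step


def in_gap(local):
    """True if the naive wall-clock time falls inside the hard-coded gap."""
    return GAP_START <= local < GAP_END


start = datetime(2016, 3, 13, 0, 1)  # valid local time (EST)
rounded = ceil_wall(start, timedelta(hours=2, minutes=30))

print(start, in_gap(start))      # the input is outside the gap
print(rounded, in_gap(rounded))  # the ceiled wall time lands inside the gap
```

The input is a valid local time, but the ceiled wall-clock value has no UTC equivalent, which is where the rounding kernel would have to either raise or pick a neighbouring valid time.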

@pitrou
Member

pitrou commented Mar 7, 2022

Hmm, I see. While ceiling to 2h30min sounds exotic, this is a valid use case.

@pitrou
Member

pitrou commented Mar 7, 2022

Still, this example doesn't make sense to me:

  const char* times = R"(["2018-10-28 01:20:00"])";
  const char* times_earliest = R"(["2018-10-28 00:30:00"])";
  const char* times_latest = R"(["2018-10-28 01:30:00"])";

The only correct answer here is 2018-10-28 01:30:00 (because ceil should produce a timestamp that is not before the input timestamp). And so there is no ambiguity.

@rok
Member

rok commented Mar 7, 2022

Hmm, I see. While ceiling to 2h30min sounds exotic, this is a valid use case.

I think you can achieve this with less exotic intervals too.

The only correct answer here is 2018-10-28 01:30:00 (because ceil should produce a timestamp that is not before the input timestamp). And so there is no ambiguity.

Indeed. So the choice should only be raise or latest?

@pitrou
Member

pitrou commented Mar 7, 2022

Taken more abstractly, the contract is the following:

  • round returns the possible output that is closest to the input timestamp
  • floor returns the possible output that is closest to but not after the input timestamp
  • ceil returns the possible output that is closest to but not before the input timestamp

So it should be possible to implement the expected semantics without exposing any additional options to the user, possibly by examining the two earliest and latest values and choosing the best one.
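
The contract above could be sketched roughly as follows. This is an illustration only, not the PR's code: `pick` is a hypothetical helper, and the three timestamps are the values from the test snippet quoted earlier, treated here as instants on a single (UTC) timeline, which is an assumption.

```python
from datetime import datetime


def pick(op, t, candidates):
    """Choose among candidate outputs that share one timeline with `t`.

    floor: closest candidate not after t
    ceil:  closest candidate not before t
    round: closest candidate overall (ties go to the first one listed)
    Returns None if floor/ceil has no eligible candidate.
    """
    if op == "floor":
        eligible = [c for c in candidates if c <= t]
        return max(eligible) if eligible else None
    if op == "ceil":
        eligible = [c for c in candidates if c >= t]
        return min(eligible) if eligible else None
    if op == "round":
        return min(candidates, key=lambda c: abs(c - t))
    raise ValueError(f"unknown op: {op}")


t = datetime(2018, 10, 28, 1, 20)         # input timestamp
earliest = datetime(2018, 10, 28, 0, 30)  # "earliest" interpretation
latest = datetime(2018, 10, 28, 1, 30)    # "latest" interpretation

ceil_result = pick("ceil", t, [earliest, latest])
floor_result = pick("floor", t, [earliest, latest])
round_result = pick("round", t, [earliest, latest])
print(ceil_result, floor_result, round_result)
```

With these inputs, ceil can only return the later candidate, since the earlier one lies before the input, so no user-facing option is needed to resolve the ambiguity.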

@rok
Member

rok commented Mar 8, 2022

So it should be possible to implement the expected semantics without exposing any additional options to the user, possibly by examining the two earliest and latest values and choosing the best one.

Currently we only raise, and we want to control raise vs. earliest/latest. That would mean exposing an additional option, I think?

Or are you proposing to not raise? I think that's valid for nonexistent, but I'm not sure about ambiguous. E.g. ceil(t) falls on the exact moment of the DST switch and could return t or t + dst_offset. We can take the same approach here and see if users eventually complain :). Here's an ambiguous example from date.h.

@jorisvandenbossche
Member

While ceiling to 2h30min sounds exotic, this is a valid use case.

Rok already mentioned it, but while it's true that non-existent times from rounding are a bit exotic, the ambiguous case is certainly not.

To give a concrete example, assume the local time "2021-10-31 02:25:00" in Europe (during a DST switch) and rounding that to the hour:

>>> arr = pa.array([pd.Timestamp("2021-10-31 02:25:00")])
>>> arr = pc.assume_timezone(arr, "Europe/Brussels", ambiguous="earliest")
>>> arr
<pyarrow.lib.TimestampArray object at 0x7f00c1e04760>
[
  2021-10-31 00:25:00.000000
]

>>> pc.round_temporal(arr, 1, "hour")
...
ArrowInvalid: Local time is ambiguous: 2021-10-31 02:00:00.000000 is ambiguous.  It could be
2021-10-31 02:00:00.000000 CEST == 2021-10-31 00:00:00.000000 UTC or
2021-10-31 02:00:00.000000 CET == 2021-10-31 01:00:00.000000 UTC

But indeed, also in this case we can know that "00:00:00 UTC" is closer to the original timestamp than "01:00:00 UTC" (since the original timestamp in UTC was "00:25:00 UTC").

That adds some more logic to this kernel, but this would actually make those round kernels more useful!
(for example, if you have a regular timeseries (say of minute interval) and you round it to the hour, you could never pick an ambiguous="latest"/"earliest" option that is correct for all values in your timeseries)
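
To make the "closest candidate" reasoning concrete, here is a small sketch using only the standard library. It hard-codes the two offsets from the error message (CEST = UTC+2, CET = UTC+1) instead of consulting a time zone database, and it is not how the kernel resolves the ambiguity; it only illustrates the comparison.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets copied from the error message above (assumed, not looked up).
CEST = timezone(timedelta(hours=2), "CEST")
CET = timezone(timedelta(hours=1), "CET")

# Original instant: local 02:25 CEST, i.e. 00:25 UTC.
original = datetime(2021, 10, 31, 0, 25, tzinfo=timezone.utc)

# Rounding the wall clock to the hour gives 02:00, which occurs twice.
wall = datetime(2021, 10, 31, 2, 0)
candidates = [wall.replace(tzinfo=CEST), wall.replace(tzinfo=CET)]

# Pick the interpretation closest to the original instant in physical time.
nearest = min(candidates, key=lambda c: abs(c - original))

print(nearest.astimezone(timezone.utc), nearest.tzname())
```

The CEST interpretation is 25 minutes away from the original instant and the CET one is 35 minutes away, so the former is the rounded result without any ambiguous="earliest"/"latest" choice by the user.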

Or are you proposing to not raise?

If there is no ambiguity left (eg as in the example above), I think we should not raise by default.

But it might be that for some cases it's still better to raise by default. For example in the case of "non-existent" times, we are actually changing the resulting timestamp, and thus that also means it will not necessarily "follow" the rounding multiple and unit. I think in such cases, it might still be better to raise by default?

@pitrou
Member

pitrou commented Mar 9, 2022

@rok

Or are you proposing to not raise? I think that's valid for nonexistent, but I'm not sure about ambiguous. E.g. ceil(t) falls on the exact moment of the DST switch and could return t or t + dst_offset.

By definition of ceil(), it should return the smallest applicable value, so there is no ambiguity.

@jorisvandenbossche

For example in the case of "non-existent" times, we are actually changing the resulting timestamp, and thus that also means it will not necessarily "follow" the rounding multiple and unit. I think in such cases, it might still be better to raise by default?

That's true, though it only seems to trigger for "unusual" roundings. So we may want to add an option for non-existent timestamps, but it sounds less important than getting ambiguous timestamps right (which shouldn't require an option).

@rok
Member

rok commented Mar 9, 2022

For example in the case of "non-existent" times, we are actually changing the resulting timestamp, and thus that also means it will not necessarily "follow" the rounding multiple and unit. I think in such cases, it might still be better to raise by default?

We could catch these cases and implement logic to return correct multiple rounding. Then we wouldn't need any options and would never have to raise.

I agree with the other conclusions.

@rok
Member

rok commented Mar 30, 2022

@AlvinJ15 any progress on this? Can I help somehow?
I have another rounding PR and I'd like to use your changes there.

@pitrou
Member

pitrou commented Jun 8, 2022

I think monotonicity in UTC is important for analytics use cases (such as bucketing). Monotonicity in wall time not so much.

A use case would be something like: when do users from a certain country come to my website? Rounding in UTC would be inconsistent. I believe it is a needed feature.

I'm not sure I understand precisely what you mean, but is that part of an analytics workload?

@rok
Member

rok commented Jun 8, 2022

I think monotonicity in UTC is important for analytics use cases (such as bucketing). Monotonicity in wall time not so much.

Let's say you want to analyse trades in arbitrary intervals. In UTC things would make sense but when you view them in wall time buckets with broken order you could have negative inventory because a sell could come before a buy. But I'm not sure anyone wants this right now and I'm ok dropping it.

A use case would be something like: when do users from a certain country come to my website? Rounding in UTC would be inconsistent. I believe it is a needed feature.

I'm not sure I understand precisely what you mean, but is that part of an analytics workload?

Another example would be creating a histogram of taxi pick-up times in local time. I think we had a local rounding discussion here.

@pitrou
Member

pitrou commented Jun 8, 2022

Another example would be creating a histogram of taxi pick-up times in local time.

If you want your histogram to be useful, the buckets have to have equal durations in physical time. Looking at your graph, that would not be the case for the "preserve wall" variants?
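
The unequal-duration concern can be illustrated with a short sketch. It reuses the Europe/Brussels fall-back date from earlier in the thread, hard-codes the two offsets, and says nothing about how the "preserve wall" variants are actually implemented; it only measures how long an hourly wall-clock bucket lasts in physical time.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets around the 2021-10-31 fall-back (assumed, not looked up).
CEST = timezone(timedelta(hours=2), "CEST")
CET = timezone(timedelta(hours=1), "CET")

# Wall-clock bucket 01:00-02:00 lies entirely in CEST.
normal_bucket = (
    datetime(2021, 10, 31, 2, 0, tzinfo=CEST)
    - datetime(2021, 10, 31, 1, 0, tzinfo=CEST)
)

# Wall-clock bucket 02:00-03:00 starts at the first 02:00 (CEST) and only
# ends at 03:00 CET, so it contains the repeated hour.
fold_bucket = (
    datetime(2021, 10, 31, 3, 0, tzinfo=CET)
    - datetime(2021, 10, 31, 2, 0, tzinfo=CEST)
)

print(normal_bucket, fold_bucket)
```

A bucket labelled by wall-clock hour therefore spans twice as much physical time across the fold, which is the distortion being questioned for histogram use.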

@rok
Member

rok commented Jun 8, 2022

Another example would be creating a histogram of taxi pick-up times in local time.

If you want your histogram to be useful, the buckets have to have equal durations in physical time. Looking at your graph, that would not be the case for the "preserve wall" variants?

Physically equal buckets would be great, but we can't have them given the constraints. I chose to flatten the first fold for floor and the second for ceil because I felt the functions imply that. What we could also do is double-size buckets in wall time to distribute the flattening over the ambiguous period. Either way we don't have a "good excuse" for changing buckets.
Also, the user would have to choose "preserve wall", so they would presumably know what they're doing and why.

Again we can kick "preserve wall" out from this scope and make a jira linking to this discussion and see if there's a need to implement this later on.

@rok
Member

rok commented Jun 9, 2022

@jorisvandenbossche what's your take on regular and "preserve_wall" behaviours here?

@rok
Member

rok commented Jun 16, 2022

@raulcd

@rok rok self-assigned this Jan 14, 2023
@amol-
Member

amol- commented Mar 30, 2023

Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍

@amol- amol- closed this Mar 30, 2023
@rok rok reopened this Mar 30, 2023
@rok rok requested review from AlenkaF and westonpace as code owners March 30, 2023 17:51
@westonpace westonpace removed their request for review July 6, 2023 14:09
@rok rok force-pushed the ARROW-15251-Temporal_floor/ceil/round_handle_ambiguous/nonexistent_local_time branch from fff7bd4 to a48f69d Compare December 23, 2023 22:31
@github-actions github-actions bot added the awaiting review label Dec 23, 2023
@rok rok force-pushed the ARROW-15251-Temporal_floor/ceil/round_handle_ambiguous/nonexistent_local_time branch from a48f69d to c027b76 Compare April 8, 2024 00:22
@rok rok force-pushed the ARROW-15251-Temporal_floor/ceil/round_handle_ambiguous/nonexistent_local_time branch 2 times, most recently from f77fd93 to 107c413 Compare April 19, 2024 17:12
AlvinJ15 and others added 5 commits April 20, 2024 23:57
Tweaking nonexistent/ambiguous rounding
Moving nonexistent/ambiguous logic to AssumeTimezone
Revert AssumeTimezoneOptions::Nonexistent changes
Fixing compiler warnings
Fixing ceil/round issues
Apply suggestions from code review
Review feedback
Review feedback
Changes to ceil/floor, more tests
Refactoring
refactoring
Review feedback
review feedback
Review feedback
adding python tests
adding ambiguous round test python
Update cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
change nonexistent/ambiguous behaviour
Add preserve_wall_time_order flag
@rok rok force-pushed the ARROW-15251-Temporal_floor/ceil/round_handle_ambiguous/nonexistent_local_time branch from 107c413 to 35cab06 Compare April 28, 2024 19:56
@github-actions

Thank you for your contribution. Unfortunately, this pull request has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label or comment below, or this PR will be closed in 14 days. Feel free to re-open this if it has been closed in error. If you do not have repository permissions to reopen the PR, please tag a maintainer.

@github-actions github-actions bot added the Status: stale-warning Issues and PRs flagged as stale which are due to be closed if no indication otherwise label Nov 18, 2025
@github-actions github-actions bot closed this Dec 5, 2025

Labels

awaiting review
Component: C++
Component: Python
Status: stale-warning (Issues and PRs flagged as stale which are due to be closed if no indication otherwise)

7 participants