Description
From the comments, we've decided to go with option 3:
- Set the timezone to local time without changing the integer value of the timestamp. We store whatever integer R passes to us (21600), with CST as the timezone set. Display is then "1970-01-01 00:00:00 CST"
This is surprising because we are asserting the local timezone when that is not specified in R.
============================================
POSIXct in R can have timezones specified as "", which is typically interpreted as the session local timezone. This can lead to surprising results like:
> Sys.timezone()
[1] "America/Chicago"
> as.integer(as.POSIXct("1970-01-01"))
[1] 21600
> Sys.setenv(TZ = "UTC")
> as.integer(as.POSIXct("1970-01-01"))
[1] 0
> Sys.setenv(TZ = "Australia/Brisbane")
> as.integer(as.POSIXct("1970-01-01"))
[1] -36000
This runs counter to what timestamps without timezones are interpreted as in Arrow:
Lines 333 to 336 in 0366943
/// stored as a struct with Date and Time fields. However, it may also be
/// encoded into a Timestamp column with an empty timezone. The timestamp
/// values should be computed "as if" the timezone of the date-time values
/// was UTC; for example, the naive date-time "January 1st 1970, 00h00" would
However, it may also be encoded into a Timestamp column with an empty timezone. The timestamp values should be computed "as if" the timezone of the date-time values was UTC; for example, the naive date-time "January 1st 1970, 00h00" would be encoded as timestamp value 0.
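A minimal base-R sketch of the difference between the two conventions. It passes `tz` explicitly instead of relying on the session timezone, so the result does not depend on where it is run. `America/Chicago` is the example zone from this issue, and the variable names are illustrative only.

```r
# A naive wall-clock string with no timezone information attached.
naive <- "1970-01-01 00:00:00"

# Arrow's convention for an empty timezone: compute the value
# "as if" the wall-clock time were in UTC.
as_if_utc <- as.integer(as.POSIXct(naive, tz = "UTC"))

# What R does when tzone is "": compute the value in the session
# timezone. The session timezone is simulated here by passing tz.
as_if_chicago <- as.integer(as.POSIXct(naive, tz = "America/Chicago"))

as_if_utc      # 0
as_if_chicago  # 21600
```

The 21600-second gap is the six-hour CST offset from UTC, which is the amount by which the displayed time shifts if the R integer is stored unchanged in a timezoneless Arrow timestamp.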
Critically, in R, when as.POSIXct("1970-01-01 00:00:00") is run, the timestamp value is computed "as if" the timezone of the date-time values was the local timezone (and not UTC like the Arrow spec says).
This can lead to some surprising results when converting these timezoneless timestamps from R to Arrow. Using as.POSIXct("1970-01-01 00:00:00") as an example, and presuming US Central time, we have a few options:
1. Warn when the timezone is "" or not set that the behavior might be surprising. We store whatever integer R passes to us (21600), with no timezone set. When someone sees this formatted, the times/dates will be what the time was at UTC ("1970-01-01 06:00:00")
2. Set the timezone to UTC without changing the integer value of the timestamp. We store whatever integer R passes to us (21600), with UTC as the timezone set. When someone sees this formatted, the times/dates will be in UTC ("1970-01-01 06:00:00 UTC"). This might be surprising / counterintuitive because the timestamps will suddenly be different and will be based in UTC and not local time like people are expecting.
3. Set the timezone to local time without changing the integer value of the timestamp. We store whatever integer R passes to us (21600), with CST as the timezone set. Display is then "1970-01-01 00:00:00 CST". This is surprising because we are asserting the local timezone when that is not specified in R.
If someone is using a timestamp without tzone in R to represent a timezoneless timestamp, options 2 and 3 above violate that when it is put into Arrow. Whereas, if someone is using a timestamp that just so happens to be without a tzone but they assume it's in local time, option 1 leads to (very) surprising results.
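The display strings for the three options can be reproduced in base R alone, as a rough sketch that does not call the arrow package. The stored integer stays at 21600 throughout and only the timezone used for formatting changes. Option 1 has no timezone in Arrow, so formatting it against UTC without a zone label is an approximation made for this example.

```r
# Integer R produces for as.POSIXct("1970-01-01 00:00:00") in US Central time.
value <- 21600
fmt <- "%Y-%m-%d %H:%M:%S"

# Option 1: no timezone stored; the value is read against UTC and
# shown without a zone label (approximated here with tz = "UTC").
option_1 <- format(.POSIXct(value, tz = "UTC"), format = fmt)

# Option 2: same integer, explicitly labelled as UTC.
option_2 <- format(.POSIXct(value, tz = "UTC"), format = fmt, usetz = TRUE)

# Option 3: same integer, labelled with the local zone (US Central here).
option_3 <- format(.POSIXct(value, tz = "America/Chicago"), format = fmt, usetz = TRUE)

option_1  # "1970-01-01 06:00:00"
option_2  # "1970-01-01 06:00:00 UTC"
option_3  # "1970-01-01 00:00:00 CST"
```

Only option 3 shows the wall-clock time the R user originally typed, at the cost of attaching a timezone that was never specified in R.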
Reporter: Jonathan Keane / @jonkeane
Assignee: Dragoș Moldovan-Grünfeld / @dragosmg
Watchers: Rok Mihevc / @rok
Related issues:
- [C++] Timezone database configuration and access (is blocked by)
- [C++][Python][R] PrettyPrint ignores timezone (is related to)
- [R] Revisit binding_format_datetime and remove manual casting (is depended upon by)
PRs and other links:
Note: This issue was originally created as ARROW-14442. Please see the migration documentation for further details.