-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-12198: [R] bindings for strptime #10334
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename pull request title in the following format? or See also: |
nealrichardson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this! Some questions and suggestions.
r/R/compute.R
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you want collect_arrays_from_dots here. This function exists to support the base R behavior like:
> sum(1, 2)
[1] 3
But strptime doesn't take ... like that.
r/R/compute.R
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how useful this function is since it is a thin wrapper around call_function() and we can't set it as an S3 method.. More useful would be to add a version of this in the nse_funcs in dplyr-functions.R.
In either case, we should match the base::strptime() signature: function (x, format, tz = "") with the possible addition of unit if that's an Arrow feature.
Also, should format and unit have default arguments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it.
Arrow function uses format and unit as an FunctionOption if I understand correctly, haven't found tz yet.
I think they should have defaults, yes: format = "%Y-%m-%d %H:%M:%S" and unit = TimeUnit$MICRO/2L/"us".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nealrichardson I have trouble with calling strptime() function from nse_funcs - possible name collision with base. Am I missing something? Thank you!
As for defaults I correct myself, format shouldn't have default argument - to match base::strptime() signature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I was suggesting was
nse_funcs$strptime <- function(x, format = "%Y-%m-%d %H:%M:%S", tz = "", unit = "ms") {
}following the model of the other functions there that build Expressions. And if tz is not supported somehow, we stop() if tz is provided.
r/src/compute.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does StrptimeOptions have a Defaults() method in C++? If so, we should call it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do I understand correctly that in scalar_string.cc line 1744 would suggest StrptimeOptions do not have Defaults() in C++?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah looks like it. I made ARROW-12809 to evaluate whether that's correct, but for the purposes of the PR, it's fine.
r/tests/testthat/test-Array.R
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't figured out yet how to pass tz argument to strptime_arrow wrapper function as it is written now. After going through your comments, making the necessary changes then it will not be an issue I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right that strptime in the C++ library doesn't take a timezone argument. Maybe it is expected that if there is a timezone, it will be encoded in the string and parseable by strptime (with the right format string)? But this gets us into the always tricky area of timezone-aware vs. timezone-naive data. @jorisvandenbossche do you have any thoughts/experience with this code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly system strptime is used so %Z or %z would work. E.g.: 2020-01-01 23:23:14 Europe/Amsterdam would be captured by format = "%Y-%m-%d %H:%M:%S %Z".
Capturing timezones would be great IMO but I would listen to Joris here for sure :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rok do you know how the system strptime handles such a timezone if it is present? (the docs don't specify that, and the output struct doesn't have an entry for that information)
Maybe it is expected that if there is a timezone, it will be encoded in the string and parseable by strptime (with the right format string)?
The problem here is that if a timezone is recorded in the Timestamp type's tz field, then the timestamp value is expected to be in UTC, and not localized to the timezone in question (which is what you get from just parsing the string without the timezone information). So basically that means the timestamp needs to be converted from the specific timezone to UTC (if strptime doesn't do that for us). And for now, that's not yet something we have implemented, I think (although at some point we probably should?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. I'm not sure, can you cast timestamp[ms] to timestamp[ms, tz="UTC"] or whatever (without modifying the values in the array, just to set the tz)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche it seems that we don't really use or pass zone information even if strptime captures it. The following passes:
options.timestamp_parsers = {TimestampParser::MakeStrptime("%Y-%m-%d %H:%M:%S %Z")};
AssertConversion<TimestampType, int64_t>(type, {"1970-01-01 00:00:00 Etc/GMT+6,1970-01-01 00:00:00 UTC\n"}, {{0}, {0}}, options);
So timestamp's timezone is currently ignored and the local time is returned. It might be good to document this or even block %z and %Z to avoid surprises?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure, can you cast
timestamp[ms]totimestamp[ms, tz="UTC"]or whatever (without modifying the values in the array, just to set the tz)?
Casting actually works, but it's simply setting the tz and not changing the actual values, so it's not necessarily the behaviour you would expect (the behaviour would be correct if you assume the tz-naive data to be in UTC, but that seems a wrong assumption).
So timestamp's timezone is currently ignored and the local time is returned. It might be good to document this or even block
%zand%Zto avoid surprises?
Indeed, it seems we now simply ignore any timezone information in strptime:
>>> pc.strptime(["2012-01-01 01:02:03+01:00"], format="%Y-%m-%d %H:%M:%S%z", unit="s")
<pyarrow.lib.TimestampArray object at 0x7fad84855220>
[
2012-01-01 01:02:03
]
>>> pc.strptime(["2012-01-01 01:02:03+01:00"], format="%Y-%m-%d %H:%M:%S%Z", unit="s").type
TimestampType(timestamp[s])
I can see some value in keeping that working, so you can at least parse strings that include such information (otherwise you would always get an error with arrow, or you would need to do some string preprocessing to be able to pass them to strptime). But then we certainly need to document that.
On the other hand, if we want to support it in the future, that would change behaviour and erroring now might then be better ..
It seems that at least some strptime implementation support %z offsets, and store that in tm->gmt_offset, which we currently don't use (https://code.woboq.org/userspace/glibc/time/strptime_l.c.html#776).
At least supporting fixed offsets (%z) seems doable (and the result could then be a timestamp type with tz="UTC"), properly supporting %Z timezone names will be harder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've created ARROW-12820 and referenced this discussion.
In context of this issue we could leave a reference to ARROW-12820 in the tests and postpone the timezone functionally?
r/tests/testthat/test-Array.R
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unit should also take human-friendly strings ("s", "ms", etc.); see how this is done in the timestamp() function in type.R.
|
@jonkeane @nealrichardson - in failed test Thank you for the help! |
|
Oh, this was a fun one dig through and figure out what was going on. As I'm sure you've seen, the failure is only in the devel build, and it turns out that Interestingly, comment 4 seems to indicate For now, I think if you add > all.equal(
list(lubridate::ymd_hms("2018-10-07 19:04:05", tz = NULL)),
list(lubridate::ymd_hms("2018-10-07 19:04:05")),
check.attributes = FALSE
)
[1] "Component 1: 'tzone' attributes are inconsistent ('' and 'UTC')"
> all.equal(
list(lubridate::ymd_hms("2018-10-07 19:04:05", tz = NULL)),
list(lubridate::ymd_hms("2018-10-07 19:04:05")),
check.tzone = FALSE
)
[1] TRUE
> |
|
Thank you @jonkeane for your feedback. |
jonkeane
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking good! I have a few comments (though mostly about the tests)
Would it be good to test that we error like we think we do when a tz argument is given?
r/R/dplyr-functions.R
Outdated
| grepl("[.\\|()[{^$*+?]", string) | ||
| } | ||
|
|
||
| nse_funcs$strptime <- function(x, format = "%Y-%m-%d %H:%M:%S", tz = NULL, unit = 1L) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this default be the more readable "s"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes, of course. Sorry about that.
But would do "ms" as @neal already mentioned in ARROW-12809 to match with https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L1236.
|
|
||
| t_string <- tibble(x = c("2018-10-07 19:04:05", NA)) | ||
| t_stamp <- tibble(x = c(lubridate::ymd_hms("2018-10-07 19:04:05"), NA)) | ||
| t_stampPDT <- tibble(x = c(lubridate::ymd_hms("2018-10-07 19:04:05", tz = "PDT"), NA)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't look like this is used later, though I could be missing something. If not, could you remove it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed a typo in code, line 554: t_stampPDT should be used instead of t_stamp.
But it relates to your comment 1 and comment 2.
I added example t_stampPDT to the test to see if I get a warning as tz agrument is given. I do but then data is pulled into R. Test now correctly fails as lubridate converts time to match PDT time zone. But then it should stop() as Neal suggested but I am not sure I know how to do that.
Adding separate test to check if we error correctly could be something in the lines of:
test_that("errors in strptime", {
# Error when tz is passed
x <- Expression$field_ref("x")
expect_error(
nse_funcs$strptime(x, tz = "PDT"),
'Time zone argument not supported by Arrow'
)
})
and then lines from comment are redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that that's what we want actually.
In many other cases where something isn't (yet) supported in Arrow we automatically pull the data into R with a warning (in some circumstances like this). You might have found this already, but the pattern you propose for the test in your comment matches what we do elsewhere https://github.com/apache/arrow/blob/master/r/tests/testthat/test-dplyr-string-functions.R#L360-L369 which is good (comments about that test also have a bit more explanation about what's going on when the data warnings+is pulled in)
| check.tzone = FALSE | ||
| ) | ||
|
|
||
| expect_equivalent( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know if these would be equal? I'm not super familiar with how this precision is measured/handled in lubridate/R.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I use check.tzone = FALSE they are equal. Should I use expect_equal() instead of expect_equivalent() in the test?
| collect(), | ||
| t_stamp, | ||
| check.tzone = FALSE | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious what you're testing with this test. Could you explain a little bit more about the case that it's testing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
|
||
| tstring <- tibble(x = c("08-05-2008", NA)) | ||
| tstamp <- tibble(x = c(lubridate::mdy("08/05/2008"), NA)) | ||
| tstamp[[1]] <- as.POSIXct(tstamp[[1]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it would be clearer to do something like strptime("08-05-2008", format = "%m-%d-%Y") to generate the expectation here?
r/R/dplyr-functions.R
Outdated
| Expression$create("strptime", | ||
| x, | ||
| options = list( | ||
| format = format, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: indentation seems off here (and you may just be able to do options = list(format =...) inline
Co-authored-by: Jonathan Keane <jkeane@gmail.com>
|
@jonkeane @nealrichardson I need approval for the check and the code is ready for another review round. Thank you! |
jonkeane
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to go. Thank you so much or your contribution, and bearing with us as we went through reviews.
No description provided.