@AlenkaF AlenkaF commented May 15, 2021

No description provided.

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@AlenkaF AlenkaF changed the title from "Arrow 12198: [R] bindings for strptime" to "ARROW-12198: [R] bindings for strptime" May 15, 2021

@nealrichardson nealrichardson left a comment

Thanks for this! Some questions and suggestions.

r/R/compute.R Outdated
Member

I don't think you want collect_arrays_from_dots here. This function exists to support the base R behavior like:

> sum(1, 2)
[1] 3

But strptime doesn't take ... like that.

r/R/compute.R Outdated
Member

I'm not sure how useful this function is, since it is a thin wrapper around call_function() and we can't set it as an S3 method. More useful would be to add a version of this to the nse_funcs in dplyr-functions.R.

In either case, we should match the base::strptime() signature: function (x, format, tz = "") with the possible addition of unit if that's an Arrow feature.

Also, should format and unit have default arguments?

Member Author

Got it.
The Arrow function takes format and unit as FunctionOptions, if I understand correctly; I haven't found tz yet.

I think they should have defaults, yes: format = "%Y-%m-%d %H:%M:%S" and unit = TimeUnit$MICRO/2L/"us".

Member Author

@nealrichardson I'm having trouble calling the strptime() function from nse_funcs, possibly because of a name collision with base. Am I missing something? Thank you!

As for defaults, I stand corrected: format shouldn't have a default argument, to match the base::strptime() signature.

Member

What I was suggesting was

nse_funcs$strptime <- function(x, format = "%Y-%m-%d %H:%M:%S", tz = "", unit = "ms") {

}

following the model of the other functions there that build Expressions. And if tz is not supported somehow, we stop() if tz is provided.

Member

Does StrptimeOptions have a Defaults() method in C++? If so, we should call it.

Member Author

Do I understand correctly that line 1744 in scalar_string.cc suggests StrptimeOptions does not have Defaults() in C++?

Member

Yeah looks like it. I made ARROW-12809 to evaluate whether that's correct, but for the purposes of the PR, it's fine.

Member

Why not?

Member Author

I haven't yet figured out how to pass a tz argument to the strptime_arrow wrapper function as it is currently written. Once I've gone through your comments and made the necessary changes, I don't think it will be an issue.

Member

You're right that strptime in the C++ library doesn't take a timezone argument. Maybe it is expected that if there is a timezone, it will be encoded in the string and parseable by strptime (with the right format string)? But this gets us into the always tricky area of timezone-aware vs. timezone-naive data. @jorisvandenbossche do you have any thoughts/experience with this code?

Member

If I understand correctly, the system strptime is used, so %Z or %z would work. E.g., 2020-01-01 23:23:14 Europe/Amsterdam would be captured by format = "%Y-%m-%d %H:%M:%S %Z".
Capturing timezones would be great IMO but I would listen to Joris here for sure :).

Member

@rok do you know how the system strptime handles such a timezone if it is present? (the docs don't specify that, and the output struct doesn't have an entry for that information)

> Maybe it is expected that if there is a timezone, it will be encoded in the string and parseable by strptime (with the right format string)?

The problem here is that if a timezone is recorded in the Timestamp type's tz field, then the timestamp value is expected to be in UTC, and not localized to the timezone in question (which is what you get from just parsing the string without the timezone information). So basically that means the timestamp needs to be converted from the specific timezone to UTC (if strptime doesn't do that for us). And for now, that's not yet something we have implemented, I think (although at some point we probably should?)
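To make the UTC expectation above concrete, here is a minimal Python sketch using only the standard library (not Arrow). The fixed +01:00 offset and all variable names are assumptions chosen for illustration. It shows that treating a parsed wall-clock value as if it were already UTC yields a different instant than interpreting it in its own zone and converting to UTC first.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical input: a wall-clock time recorded in a zone that is UTC+01:00.
wall_clock = datetime.strptime("2020-01-01 23:23:14", "%Y-%m-%d %H:%M:%S")
local_zone = timezone(timedelta(hours=1))  # assumed fixed offset, for illustration

# Naive approach: label the parsed wall-clock value as UTC without changing it.
assumed_utc = wall_clock.replace(tzinfo=timezone.utc)

# Conversion approach: interpret the value in its own zone, then convert to UTC.
converted_utc = wall_clock.replace(tzinfo=local_zone).astimezone(timezone.utc)

# The two instants differ by exactly the zone offset.
difference = assumed_utc - converted_utc
print(converted_utc.isoformat())
print(difference)
```

The second form is the kind of conversion that would be needed before a timezone could safely be recorded on the resulting type.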

Member

Right. I'm not sure, can you cast timestamp[ms] to timestamp[ms, tz="UTC"] or whatever (without modifying the values in the array, just to set the tz)?

Member

@jorisvandenbossche it seems that we don't really use or pass zone information even if strptime captures it. The following passes:

  options.timestamp_parsers = {TimestampParser::MakeStrptime("%Y-%m-%d %H:%M:%S %Z")};
  AssertConversion<TimestampType, int64_t>(type, {"1970-01-01 00:00:00 Etc/GMT+6,1970-01-01 00:00:00 UTC\n"}, {{0}, {0}}, options);

So timestamp's timezone is currently ignored and the local time is returned. It might be good to document this or even block %z and %Z to avoid surprises?

Member

> I'm not sure, can you cast timestamp[ms] to timestamp[ms, tz="UTC"] or whatever (without modifying the values in the array, just to set the tz)?

Casting actually works, but it's simply setting the tz and not changing the actual values, so it's not necessarily the behaviour you would expect (the behaviour would be correct if you assume the tz-naive data to be in UTC, but that seems a wrong assumption).

> So timestamp's timezone is currently ignored and the local time is returned. It might be good to document this or even block %z and %Z to avoid surprises?

Indeed, it seems we now simply ignore any timezone information in strptime:

>>> pc.strptime(["2012-01-01 01:02:03+01:00"], format="%Y-%m-%d %H:%M:%S%z", unit="s")
<pyarrow.lib.TimestampArray object at 0x7fad84855220>
[
  2012-01-01 01:02:03
]
>>> pc.strptime(["2012-01-01 01:02:03+01:00"], format="%Y-%m-%d %H:%M:%S%Z", unit="s").type
TimestampType(timestamp[s])

I can see some value in keeping that working, so you can at least parse strings that include such information (otherwise you would always get an error with arrow, or you would need to do some string preprocessing to be able to pass them to strptime). But then we certainly need to document that.
On the other hand, if we want to support it in the future, that would change behaviour, and erroring now might then be better.

It seems that at least some strptime implementations support %z offsets and store the result in tm->tm_gmtoff, which we currently don't use (https://code.woboq.org/userspace/glibc/time/strptime_l.c.html#776).
At least supporting fixed offsets (%z) seems doable (and the result could then be a timestamp type with tz="UTC"); properly supporting %Z timezone names will be harder.
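As a rough illustration of what fixed-offset support could look like, the following Python sketch parses a %z offset and normalizes the result to UTC. It relies on Python's own datetime.strptime (which accepts a colon in %z offsets from Python 3.7), not on Arrow's kernel or the system strptime, and parse_to_utc is a hypothetical helper name.

```python
from datetime import datetime, timezone

def parse_to_utc(text: str, fmt: str) -> datetime:
    """Parse a string containing a %z offset and normalize it to UTC.

    parse_to_utc is a hypothetical helper, not an Arrow or pyarrow function.
    """
    parsed = datetime.strptime(text, fmt)
    if parsed.tzinfo is None:
        # Without %z in the format no offset is captured, so refuse to guess.
        raise ValueError("format must include %z so that an offset is captured")
    return parsed.astimezone(timezone.utc)

result = parse_to_utc("2012-01-01 01:02:03+01:00", "%Y-%m-%d %H:%M:%S%z")
print(result.isoformat())  # 2012-01-01T00:02:03+00:00
```

Here the stored value is shifted by the parsed offset, in contrast to the current behaviour shown above, where the offset is dropped and 01:02:03 is returned unchanged.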

Member

I've created ARROW-12820 and referenced this discussion.
In the context of this issue, could we leave a reference to ARROW-12820 in the tests and postpone the timezone functionality?

Member

unit should also take human-friendly strings ("s", "ms", etc.); see how this is done in the timestamp() function in type.R.
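A minimal sketch of that idea, written in Python for illustration rather than as the R code in type.R: the pairings 1 with "ms" and 2 with "us" follow the values mentioned in this thread, while the codes for "s" and "ns" are assumed to continue the same ordering. normalize_unit and UNIT_CODES are hypothetical names.

```python
# Assumed integer codes; 1 -> "ms" and 2 -> "us" follow the thread, the other
# two are assumed to continue the same ordering.
UNIT_CODES = {"s": 0, "ms": 1, "us": 2, "ns": 3}

def normalize_unit(unit):
    """Return the integer code for a unit given as a string or an integer.

    normalize_unit is a hypothetical helper for illustration only.
    """
    if isinstance(unit, str):
        if unit not in UNIT_CODES:
            raise ValueError(f"unknown time unit: {unit!r}")
        return UNIT_CODES[unit]
    # Accept integer codes as-is, but reject booleans and out-of-range values.
    if isinstance(unit, int) and not isinstance(unit, bool) and unit in UNIT_CODES.values():
        return unit
    raise ValueError(f"unknown time unit: {unit!r}")

print(normalize_unit("ms"))
print(normalize_unit(2))
```

Accepting both forms keeps existing integer callers working while letting users write the more readable string.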

rok commented May 24, 2021

@kszucs - could you enable @AlenkaF to build on CI please?

kszucs commented May 24, 2021

> @kszucs - could you enable @AlenkaF to build on CI please?

Enabled.

AlenkaF commented May 24, 2021

> > @kszucs - could you enable @AlenkaF to build on CI please?
>
> Enabled.

Thanks, I guess I need help (approval for CI build) once more =)

AlenkaF commented May 25, 2021

@jonkeane @nealrichardson - in the failed test, expect_equivalent() fails because the time zone is stored in the lubridate timestamp, but it should ignore attributes, if I understand correctly. I am running out of ideas for how to solve this one :(

Thank you for the help!

@jonkeane

Oh, this was a fun one to dig through and figure out what was going on. As I'm sure you've seen, the failure is only in the devel build, and it turns out that all.equal.POSIXt() has changed recently: https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17277 is the bug report that adds timezone checking.

Interestingly, comment 4 seems to indicate check.attributes should be respected, though it is not (currently) being respected. I've sent an email to the r-devel list asking if check.attributes is supposed to be ignored in this case (you can see a replication of this in the code below).

For now, I think if you add check.tzone = FALSE to the expect_equivalent() calls that will fix this (and not cause problems in other versions).

> all.equal(
  list(lubridate::ymd_hms("2018-10-07 19:04:05", tz = NULL)),
  list(lubridate::ymd_hms("2018-10-07 19:04:05")),
  check.attributes = FALSE
)
[1] "Component 1: 'tzone' attributes are inconsistent ('' and 'UTC')"
> all.equal(
  list(lubridate::ymd_hms("2018-10-07 19:04:05", tz = NULL)),
  list(lubridate::ymd_hms("2018-10-07 19:04:05")),
  check.tzone = FALSE
)
[1] TRUE

AlenkaF commented May 27, 2021

Thank you @jonkeane for your feedback.
I corrected the code and would like to ask you for a review, if I may.

@jonkeane jonkeane left a comment

This is looking good! I have a few comments (though mostly about the tests)

Would it be good to test that we error like we think we do when a tz argument is given?

grepl("[.\\|()[{^$*+?]", string)
}

nse_funcs$strptime <- function(x, format = "%Y-%m-%d %H:%M:%S", tz = NULL, unit = 1L) {
Member

Should this default be the more readable "s"?

Member Author

Oh yes, of course. Sorry about that.

But I would use "ms", as @neal already mentioned in ARROW-12809, to match https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L1236.


t_string <- tibble(x = c("2018-10-07 19:04:05", NA))
t_stamp <- tibble(x = c(lubridate::ymd_hms("2018-10-07 19:04:05"), NA))
t_stampPDT <- tibble(x = c(lubridate::ymd_hms("2018-10-07 19:04:05", tz = "PDT"), NA))
Member

It doesn't look like this is used later, though I could be missing something. If not, could you remove it?

Member Author

I missed a typo in the code: on line 554, t_stampPDT should be used instead of t_stamp.

But it relates to your comment 1 and comment 2.

I added the t_stampPDT example to the test to see whether I get a warning when the tz argument is given. I do, but then the data is pulled into R. The test now correctly fails because lubridate converts the time to match the PDT time zone. It should stop() instead, as Neal suggested, but I am not sure how to do that.

A separate test to check that we error correctly could be something along the lines of:

test_that("errors in strptime", {
  # Error when tz is passed

  x <- Expression$field_ref("x")
  expect_error(
    nse_funcs$strptime(x, tz = "PDT"),
    'Time zone argument not supported by Arrow'
  )
})

and then the lines from the comment are redundant.

Member

I think that that's what we want actually.

In many other cases where something isn't (yet) supported in Arrow, we automatically pull the data into R with a warning (in circumstances like this one). You might have found this already, but the pattern you propose for the test in your comment matches what we do elsewhere, https://github.com/apache/arrow/blob/master/r/tests/testthat/test-dplyr-string-functions.R#L360-L369, which is good (the comments on that test also explain a bit more about what's going on when the data is pulled in with a warning).

check.tzone = FALSE
)

expect_equivalent(
Member

Do you know if these would be equal? I'm not super familiar with how this precision is measured/handled in lubridate/R.

Member Author

If I use check.tzone = FALSE they are equal. Should I use expect_equal() instead of expect_equivalent() in the test?

collect(),
t_stamp,
check.tzone = FALSE
)
Member

I'm curious what you're testing with this test. Could you explain a little bit more about the case that it's testing?

Member Author


tstring <- tibble(x = c("08-05-2008", NA))
tstamp <- tibble(x = c(lubridate::mdy("08/05/2008"), NA))
tstamp[[1]] <- as.POSIXct(tstamp[[1]])
Member

I wonder if it would be clearer to do something like strptime("08-05-2008", format = "%m-%d-%Y") to generate the expectation here?

Expression$create("strptime",
x,
options = list(
format = format,
Member

nit: indentation seems off here (and you may just be able to do options = list(format =...) inline)

AlenkaF commented Jun 2, 2021

@jonkeane @nealrichardson I need approval for the check and the code is ready for another review round. Thank you!

@jonkeane jonkeane left a comment

This looks good to go. Thank you so much for your contribution, and for bearing with us as we went through reviews.

@jonkeane jonkeane closed this in 99fd3b8 Jun 4, 2021
@AlenkaF AlenkaF deleted the ARROW-12198 branch June 5, 2021 11:12