-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-14819: [R] Binding for lubridate::qday #13440
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
77dc96c to
8a8e467
Compare
|
(I'm on vacation this week but look forward to taking a look on Monday!) |
thisisnic
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just one comment needs adding to clarify things, but otherwise looks fine to me. Thanks!
409b3e1 to
b1926a3
Compare
More updates were needed - review no longer valid
Merged, thank you! |
4cb6f6b to
37ee4c1
Compare
|
@thisisnic after the rebase this is now (almost) c++ free :). |
paleolimbot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than test_df, it might make sense for this binding to use a sequence of dates (or datetimes) that spans a year (or maybe that spans a year and a leap year if that matters here). Maybe something like
tibble::tibble(date = seq(as.Date("2000-01-01"), as.Date("2000-12-31"), by = "day"))
#> # A tibble: 366 × 1
#> date
#> <date>
#> 1 2000-01-01
#> 2 2000-01-02
#> 3 2000-01-03
#> 4 2000-01-04
#> 5 2000-01-05
#> 6 2000-01-06
#> 7 2000-01-07
#> 8 2000-01-08
#> 9 2000-01-09
#> 10 2000-01-10
#> # … with 356 more rows
tibble::tibble(datetime = seq(as.POSIXct("2000-01-01 00:00:00"), as.POSIXct("2000-12-31 23:00:00"), by = "day"))
#> # A tibble: 366 × 1
#> datetime
#> <dttm>
#> 1 2000-01-01 00:00:00
#> 2 2000-01-02 00:00:00
#> 3 2000-01-03 00:00:00
#> 4 2000-01-04 00:00:00
#> 5 2000-01-05 00:00:00
#> 6 2000-01-06 00:00:00
#> 7 2000-01-07 00:00:00
#> 8 2000-01-08 00:00:00
#> 9 2000-01-09 00:00:00
#> 10 2000-01-10 00:00:00
#> # … with 356 more rowsThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love the idea of more complete tests! I would actually propose a development-time test suite (it seems overkill for CI) that tests every moment over the past century.
The test you're proposing however hits this bug where rounding kernels interpret 32 bit arrays as 64 bit ones (ARROW-16142) so I suppose we really need to fix this now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could try that and see how long it takes...it might only be a few ms and then I'd say keep it in the regular test suite. We do have a mechanism for running extra tests but right now it's limited to the large memory tests (via the env var ARROW_LARGE_MEMORY_TESTS). Given that a single real-world poke at this exposed an error, I'd say at least a year is a must in our normal test suite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, why not. I'll give it a try.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, ARROW-16142 was resolved so this is now ready.
e372aec to
c0a8da9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good. One very minor suggestion for even more comment 🚲 🏠 ing (do feel free to ignore it thought, it is really minor)
r/R/dplyr-funcs-datetime.R
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is extremely minor, and take it or leave it: but it did take me a second to think through why we are adding 1 here. Maybe we could add it to the comment up above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| # We calculate day of quarter by flooring timestamp to beginning of quarter and | |
| # calculating days between beginning of quarter and timestamp/date in question. | |
| floored_x <- build_expr("floor_temporal", x, options = list(unit = 9L)) | |
| build_expr("days_between", floored_x, x) + Expression$scalar(1L) | |
| # We calculate day of quarter by flooring timestamp to beginning of quarter and | |
| # calculating days between beginning of quarter and timestamp/date in question. | |
| # Since we use one one-based numbering we add one. | |
| floored_x <- build_expr("floor_temporal", x, options = list(unit = 9L)) | |
| build_expr("days_between", floored_x, x) + Expression$scalar(1L) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jonkeane that's a fair question especially since documentation on qday is not really available.
paleolimbot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think one test will be better if you make the timestamp UTC but other than that this looks good to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would seq(as.POSIXct("1999-12-31", tz = "UTC"), as.POSIXct("2001-01-01", tz = "UTC"), by = "day") let you drop the ignore_attr bit below? As a new reader of this code I'm wondering why the timezone needs to be ignored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point @paleolimbot! It's as if calling qday with mutate in compare_dplyr_binding returns an int64 with tzone = "UTC". Meanwhile calling it with transmute returns correctly.
Would this be considered a sharp edge?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that's an error in restoring the R metadata (this PR doesn't touch R attributes to my reading). It would be helpful to file a reprex() in a JIRA so that we can fix this later (an int64 should never have a timezone attribute!), but it's definitely not this PR's problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case I'll open a Jira and merge this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3a9fba4 to
cbfb599
Compare
Co-authored-by: Dragoș Moldovan-Grünfeld <dragos.mold@gmail.com>
|
Benchmark runs are scheduled for baseline = 39980dc and contender = 0330353. 0330353 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
This adds lubridate-like
qdayfunction. Counts number of days elapsed since beginning of the quarter.