Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions r/NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,7 @@
Instead of these, use the `read_ipc_file()` and `write_ipc_file()` for IPC files, or,
`read_ipc_stream()` and `write_ipc_stream()` for IPC streams.
* `write_parquet()` now defaults to writing Parquet format version 2.4 (was 1.0). Previously deprecated arguments `properties` and `arrow_properties` have been removed; if you need to deal with these lower-level properties objects directly, use `ParquetFileWriter`, which `write_parquet()` wraps.
* added `lubridate::qday()` (day of quarter)

# arrow 8.0.0

Expand Down
8 changes: 8 additions & 0 deletions r/R/dplyr-funcs-datetime.R
Original file line number Diff line number Diff line change
Expand Up @@ -209,6 +209,14 @@ register_bindings_datetime_components <- function() {
build_expr("month", x)
})

register_binding("lubridate::qday", function(x) {
# We calculate day of quarter by flooring timestamp to beginning of quarter and
# calculating days between beginning of quarter and timestamp/date in question.
# Since we use one one-based numbering we add one.
floored_x <- build_expr("floor_temporal", x, options = list(unit = 9L))
build_expr("days_between", floored_x, x) + Expression$scalar(1L)
})

register_binding("lubridate::am", function(x) {
hour <- Expression$create("hour", x)
hour < 12
Expand Down
29 changes: 29 additions & 0 deletions r/src/compute.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -519,6 +519,35 @@ std::shared_ptr<arrow::compute::FunctionOptions> make_compute_options(
return out;
}

if (func_name == "round_temporal" || func_name == "floor_temporal" ||
func_name == "ceil_temporal") {
using Options = arrow::compute::RoundTemporalOptions;

int64_t multiple = 1;
enum arrow::compute::CalendarUnit unit = arrow::compute::CalendarUnit::DAY;
bool week_starts_monday = true;
bool ceil_is_strictly_greater = true;
bool calendar_based_origin = true;

if (!Rf_isNull(options["multiple"])) {
multiple = cpp11::as_cpp<int64_t>(options["multiple"]);
}
if (!Rf_isNull(options["unit"])) {
unit = cpp11::as_cpp<enum arrow::compute::CalendarUnit>(options["unit"]);
}
if (!Rf_isNull(options["week_starts_monday"])) {
week_starts_monday = cpp11::as_cpp<bool>(options["week_starts_monday"]);
}
if (!Rf_isNull(options["ceil_is_strictly_greater"])) {
ceil_is_strictly_greater = cpp11::as_cpp<bool>(options["ceil_is_strictly_greater"]);
}
if (!Rf_isNull(options["calendar_based_origin"])) {
calendar_based_origin = cpp11::as_cpp<bool>(options["calendar_based_origin"]);
}
return std::make_shared<Options>(multiple, unit, week_starts_monday,
ceil_is_strictly_greater, calendar_based_origin);
}

if (func_name == "round_to_multiple") {
using Options = arrow::compute::RoundToMultipleOptions;
auto out = std::make_shared<Options>(Options::Defaults());
Expand Down
40 changes: 40 additions & 0 deletions r/tests/testthat/test-dplyr-funcs-datetime.R
Original file line number Diff line number Diff line change
Expand Up @@ -574,6 +574,26 @@ test_that("extract yday from timestamp", {
)
})

test_that("extract qday from timestamp", {
test_df <- tibble::tibble(
time = as.POSIXct(seq(as.Date("1999-12-31", tz = "UTC"), as.Date("2001-01-01", tz = "UTC"), by = "day"))
)

compare_dplyr_binding(
.input %>%
transmute(x = qday(time)) %>%
collect(),
test_df
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than test_df, it might make sense for this binding to use a sequence of dates (or datetimes) that spans a year (or maybe that spans a year and a leap year if that matters here). Maybe something like

tibble::tibble(date = seq(as.Date("2000-01-01"), as.Date("2000-12-31"), by = "day"))
#> # A tibble: 366 × 1
#>    date      
#>    <date>    
#>  1 2000-01-01
#>  2 2000-01-02
#>  3 2000-01-03
#>  4 2000-01-04
#>  5 2000-01-05
#>  6 2000-01-06
#>  7 2000-01-07
#>  8 2000-01-08
#>  9 2000-01-09
#> 10 2000-01-10
#> # … with 356 more rows
tibble::tibble(datetime = seq(as.POSIXct("2000-01-01 00:00:00"), as.POSIXct("2000-12-31 23:00:00"), by = "day"))
#> # A tibble: 366 × 1
#>    datetime           
#>    <dttm>             
#>  1 2000-01-01 00:00:00
#>  2 2000-01-02 00:00:00
#>  3 2000-01-03 00:00:00
#>  4 2000-01-04 00:00:00
#>  5 2000-01-05 00:00:00
#>  6 2000-01-06 00:00:00
#>  7 2000-01-07 00:00:00
#>  8 2000-01-08 00:00:00
#>  9 2000-01-09 00:00:00
#> 10 2000-01-10 00:00:00
#> # … with 356 more rows

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love the idea of more complete tests! I would actually propose a development-time test suite (it seems overkill for CI) that tests every moment over the past century.

The test you're proposing however hits this bug where rounding kernels interpret 32 bit arrays as 64 bit ones (ARROW-16142) so I suppose we really need to fix this now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could try that and see how long it takes...it might only be a few ms and then I'd say keep it in the regular test suite. We do have a mechanism for running extra tests but right now it's limited to the large memory tests (via the env var ARROW_LARGE_MEMORY_TESTS). Given that a single real-world poke at this exposed an error, I'd say at least a year is a must in our normal test suite.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, why not. I'll give it a try.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, ARROW-16142 was resolved so this is now ready.

)

compare_dplyr_binding(
.input %>%
transmute(x = qday(as.POSIXct("2022-06-29 12:35"))) %>%
collect(),
test_df
)
})

test_that("extract hour from timestamp", {
compare_dplyr_binding(
.input %>%
Expand Down Expand Up @@ -790,6 +810,26 @@ test_that("extract yday from date", {
)
})

test_that("extract qday from date", {
test_df <- tibble::tibble(
date = seq(as.Date("1999-12-31"), as.Date("2001-01-01"), by = "day")
)

compare_dplyr_binding(
.input %>%
mutate(x = qday(date)) %>%
collect(),
test_df
)

compare_dplyr_binding(
.input %>%
mutate(y = qday(as.Date("2022-06-29"))) %>%
collect(),
test_df
)
})

test_that("leap_year mirror lubridate", {
compare_dplyr_binding(
.input %>%
Expand Down