diff --git a/r/_pkgdown.yml b/r/_pkgdown.yml
index 565a8cdd640..38e808a6dd4 100644
--- a/r/_pkgdown.yml
+++ b/r/_pkgdown.yml
@@ -84,6 +84,8 @@ navbar:
             href: articles/developers/install_details.html
           - text: Docker
             href: articles/developers/docker.html
+          - text: Writing Bindings
+            href: articles/developers/bindings.html
 reference:
   - title: Multi-file datasets
     contents:
diff --git a/r/vignettes/developers/bindings.Rmd b/r/vignettes/developers/bindings.Rmd
new file mode 100644
index 00000000000..e2878fdeafc
--- /dev/null
+++ b/r/vignettes/developers/bindings.Rmd
@@ -0,0 +1,225 @@
+# Writing Bindings
+
+```{r, include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+```
+
+When writing bindings between C++ compute functions and R functions, the aim is
+to expose the C++ functionality via the same interface as existing R functions. The syntax and
+functionality should match that of the existing R functions
+(though there are some exceptions) so that users are able to use existing tidyverse
+or base R syntax, whilst taking advantage of the speed and functionality of the
+underlying arrow package.
+
+One of the main ways in which users interact with arrow is via
+[dplyr](https://dplyr.tidyverse.org/) syntax called on Arrow objects. For
+example, when a user calls `dplyr::mutate()` on an Arrow Tabular,
+Dataset, or arrow data query object, the Arrow implementation of `mutate()` is
+used and, under the hood, translates the dplyr code into Arrow C++ code.
+
+When using `dplyr::mutate()` or `dplyr::filter()`, you may want to use functions
+from other packages. The example below uses `stringr::str_detect()`.
+
+```{r}
+library(dplyr)
+library(stringr)
+starwars %>%
+  filter(str_detect(name, "Darth"))
+```
+This functionality has also been implemented in Arrow, e.g.:
+
+```{r}
+library(arrow)
+arrow_table(starwars) %>%
+  filter(str_detect(name, "Darth")) %>%
+  collect()
+```
+
+This is possible because a **binding** has been created between the call to the
+stringr function `str_detect()` and the Arrow C++ code, here as a direct mapping
+to `match_substring_regex`. You can see this for yourself by inspecting the
+arrow data query object without retrieving the results via `collect()`.
+
+
+```{r}
+arrow_table(starwars) %>%
+  filter(str_detect(name, "Darth"))
+```
+
+In the following sections, we'll walk through how to create a binding between an
+R function and an Arrow C++ function.
+
+# Walkthrough
+
+Imagine you are writing the bindings for the C++ function
+[`starts_with()`](https://arrow.apache.org/docs/cpp/compute.html#containment-tests)
+and want to bind it to the (base) R function `startsWith()`.
+
+First, take a look at the docs for both of those functions.
+
+## Examining the R function
+
+Here are the docs for R's `startsWith()` (also available at https://stat.ethz.ch/R-manual/R-devel/library/base/html/startsWith.html):
+
+```{r, echo=FALSE, out.width="50%"}
+knitr::include_graphics("./startswithdocs.png")
+```
+
+It takes two parameters: `x`, the input, and `prefix`, the characters to check
+for at the start of `x`.
+
+## Examining the C++ function
+
+Now, go to
+[the compute function documentation](https://arrow.apache.org/docs/cpp/compute.html#containment-tests)
+and look for the Arrow C++ library's `starts_with()` function:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./starts_with_docs.png")
+```
+
+The docs show that `starts_with()` is a unary function, which means that it takes a
+single data input. The data input must be a string-like class, and the returned
+value is boolean, both of which match up to R's `startsWith()`.
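+
+To make the R side of that comparison concrete, here is a quick base R check of
+the input and output types. This is only a sketch: the input vector is an
+illustrative assumption, not something taken from the Arrow docs.
+
+```{r}
+# Illustrative input only (an assumption for this sketch)
+x <- c("Apache", "Arrow", "R", "package")
+result <- startsWith(x, "A")
+result
+#> [1]  TRUE  TRUE FALSE FALSE
+
+# One logical value per input string, which is what we'd hope to see
+# from a boolean-returning unary compute function as well
+is.logical(result) && length(result) == length(x)
+#> [1] TRUE
+```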
+
+There is an options class associated with `starts_with()`, called [`MatchSubstringOptions`](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute21MatchSubstringOptionsE),
+so let's take a look at that.
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./matchsubstringoptions.png")
+```
+
+Options classes allow the user to control the behaviour of the function. In
+this case, there are two possible options which can be supplied - `pattern` and
+`ignore_case`, which are described in the docs shown above.
+
+## Comparing the R and C++ functions
+
+What conclusions can be drawn from what you've seen so far?
+
+Base R's `startsWith()` and Arrow's `starts_with()` operate on equivalent data
+types and return equivalent data types. As there are no options implemented in
+R that Arrow doesn't have, this should be fairly simple to map without a great
+deal of extra work.
+
+As `starts_with()` has an options class associated with it, we'll need to make
+sure that it's linked up with that class in the R package code.
+
+In case you're wondering about the difference between arguments in R and options
+in Arrow: in R, arguments to functions can include the actual data to be
+analysed as well as options governing how the function works, whereas in the
+C++ compute functions, the arguments are the data to be analysed and the
+options are for specifying how exactly the function works.
+
+So let's get started.
+
+## Step 1 - add unit tests
+
+We recommend a test-driven-development approach - write failing tests first,
+then check that they fail, and then write the code needed to make them pass.
+Thinking up-front about the behavior which needs testing can make it easier to
+reason about the code which needs writing later.
+
+Look up the R function that you want to bind the compute kernel to, and write a
+set of unit tests that use a dplyr pipeline and `compare_dplyr_binding()` (and
+perhaps even `compare_dplyr_error()` if necessary).
+These functions compare the
+output of the original function with the dplyr bindings and make sure they match.
+We recommend looking at the documentation next to the source code for these
+functions to get a better understanding of how they work.
+
+You should make sure you're testing all parameters of the R function in your
+tests.
+
+Below is a possible example test for `startsWith()`.
+
+```{r, eval = FALSE}
+test_that("startsWith behaves identically in dplyr and Arrow", {
+  df <- tibble(x = c("Foo", "bar", "baz", "qux"))
+  # Run the same pipeline on the data frame and on Arrow data, then compare
+  compare_dplyr_binding(
+    .input %>%
+      filter(startsWith(x, "b")) %>%
+      collect(),
+    df
+  )
+
+})
+```
+
+## Step 2 - Hook up the compute function with options class if necessary
+
+If the C++ compute function can have options specified, make sure that the
+function is linked with its options class in `make_compute_options()` in the
+file `arrow/r/src/compute.cpp`. You can find out if a compute function requires
+options by looking in the docs here: https://arrow.apache.org/docs/cpp/compute.html
+
+In the case of `starts_with()`, it looks something like this:
+
+```cpp
+  if (func_name == "starts_with") {
+    using Options = arrow::compute::MatchSubstringOptions;
+    // Default to case-sensitive matching unless ignore_case was supplied
+    bool ignore_case = false;
+    if (!Rf_isNull(options["ignore_case"])) {
+      ignore_case = cpp11::as_cpp<bool>(options["ignore_case"]);
+    }
+    // pattern is always passed through from the R side
+    return std::make_shared<Options>(cpp11::as_cpp<std::string>(options["pattern"]),
+                                     ignore_case);
+  }
+```
+
+You can usually copy and paste from a similar existing example. In this case,
+as the option `ignore_case` doesn't map to any parameters of `startsWith()`, we
+give it a default value of `false`, but if it's been set, use the set value
+instead. As the `pattern` argument maps directly to `prefix` in `startsWith()`,
+we can pass it straight through.
+
+## Step 3 - Map the R function to the C++ kernel
+
+The next task is writing the code which binds the R function to the C++ kernel.
+
+### Step 3a - See if direct mapping is appropriate
+Compare the C++ function and R function. If they are simple functions with no
+options, it might be possible to map directly between the C++ and R functions in
+`unary_function_map`, in the case of compute functions that operate on single
+columns of data, or `binary_function_map` for those which operate on 2 columns
+of data.
+
+As `starts_with()` requires options, direct mapping is not appropriate.
+
+### Step 3b - If direct mapping is not possible, try a modified implementation
+If the function cannot be mapped directly, some extra work may be needed to
+ensure that calling the arrow version of the function produces the same result
+as calling the R version of the function. In this case, the function will need
+to be added to the `nse_funcs` list in `arrow/r/R/dplyr-functions.R`. Here is how
+this might look for `startsWith()`:
+
+```{r, eval = FALSE}
+nse_funcs$startsWith <- function(x, prefix) {
+  Expression$create(
+    "starts_with",
+    x,
+    options = list(pattern = prefix)
+  )
+}
+```
+
+Hint: you can use `call_function()` to call a compute function directly from R.
+This might be useful if you want to experiment with a compute function while
+you're writing bindings for it, e.g.
+
+```{r}
+call_function(
+  "starts_with",
+  Array$create(c("Apache", "Arrow", "R", "package")),
+  options = list(pattern = "A")
+)
+```
+
+## Step 4 - Run (and potentially add to) your tests.
+
+In the process of implementing the function, you may end up implementing more
+tests, for example if you discover unusual edge cases. This is fine - add them
+to the ones you wrote originally, and run them all. If they pass, you're done!
+Submit a PR. If you've modified the C++ code in the
+R package (for example, when hooking up a binding to its options class), you
+should make sure to run `arrow/r/lint.sh` to lint the code.
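+
+When looking for edge cases to add, it can help to first pin down what the base
+R function itself does. The sketch below checks a few cases for `startsWith()`
+in plain R. The inputs are illustrative assumptions rather than cases taken
+from the arrow test suite, and whether each one deserves its own test is a
+judgment call.
+
+```{r}
+# Missing values propagate: an NA input gives an NA result
+na_result <- startsWith(c("bar", NA_character_), "b")
+
+# An empty prefix matches every non-missing string
+empty_prefix <- startsWith(c("Foo", "bar"), "")
+
+# Matching is case-sensitive
+case_result <- startsWith(c("Bar", "bar"), "b")
+```
+
+Any of these could then be tried as an extra `filter()` or `mutate()` expression
+inside a `compare_dplyr_binding()` call like the one in Step 1.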
diff --git a/r/vignettes/developers/matchsubstringoptions.png b/r/vignettes/developers/matchsubstringoptions.png
new file mode 100644
index 00000000000..2dff3c5858e
Binary files /dev/null and b/r/vignettes/developers/matchsubstringoptions.png differ
diff --git a/r/vignettes/developers/starts_with_docs.png b/r/vignettes/developers/starts_with_docs.png
new file mode 100644
index 00000000000..a55e888128f
Binary files /dev/null and b/r/vignettes/developers/starts_with_docs.png differ
diff --git a/r/vignettes/developers/startswithdocs.png b/r/vignettes/developers/startswithdocs.png
new file mode 100644
index 00000000000..6e1f3df1b9b
Binary files /dev/null and b/r/vignettes/developers/startswithdocs.png differ