Skip to content

Conversation

@thisisnic
Copy link
Member

Adds bindings for stddev and variance, and also uses these kernels in the dplyr expressions when calls to sd/var

@github-actions
Copy link

r/R/dplyr.R Outdated
sd = function(x, na.rm = FALSE){
if (!na.rm && x$null_count > 0) {
return(Scalar$create(NA_real_))
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is wrong somehow, as when I call the code below, I get the warning Warning: Expression mass - sd(mass, na.rm = FALSE) not supported in Arrow; pulling data into R, but yet when I set na.rm to TRUE, I don't.

Table$create(starwars) %>%
     select(name, mass, species) %>%
     mutate(mass_diff = mass-sd(mass, na.rm = FALSE)) %>%
     collect()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't support aggregations in our dplyr backend yet, so this should never succeed. If sd() doesn't cleanly and always error when called on an arrow Expression, we should force it to--see the "fail" handling inside of arrow_eval where this is done for mean.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

r/R/compute.R Outdated
}


#' `variance` and `stddev` for Arrow objects
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this. Unfortunately, sd() and var() aren't generics so we can't just define methods for them. So it might not be worth adding these wrappers at all.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, removed now.

r/R/compute.R Outdated
#' @return A `Scalar` containing the calculated value.
#' @export
stddev <- function(x, ddof = 0) {
call_function("stddev", x, options = list(ddof = ddof))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there no na.rm handling in the Arrow stddev and variance functions? If not, there should be (please JIRA).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r/R/dplyr.R Outdated
sd = function(x, na.rm = FALSE){
if (!na.rm && x$null_count > 0) {
return(Scalar$create(NA_real_))
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't support aggregations in our dplyr backend yet, so this should never succeed. If sd() doesn't cleanly and always error when called on an arrow Expression, we should force it to--see the "fail" handling inside of arrow_eval where this is done for mean.

}


if (func_name == "variance" || func_name == "stddev") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH this is probably the only code addition we want to keep here.

Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a test of call_function("stddev", array, options = list(ddof = something)) just to confirm we're passing the argument? ("no" is an acceptable answer)

@thisisnic
Copy link
Member Author

Should we add a test of call_function("stddev", array, options = list(ddof = something)) just to confirm we're passing the argument? ("no" is an acceptable answer)

I want to say "yes" to be consistent with the logic I've applied elsewhere and I don't see any harm in doing it, so I'll add one unless there are any compelling reasons not to.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants