-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-12199: [R] bindings for stddev, variance #10215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
r/R/dplyr.R
Outdated
| sd = function(x, na.rm = FALSE){ | ||
| if (!na.rm && x$null_count > 0) { | ||
| return(Scalar$create(NA_real_)) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this is wrong somehow, as when I call the code below, I get the warning Warning: Expression mass - sd(mass, na.rm = FALSE) not supported in Arrow; pulling data into R, but yet when I set na.rm to TRUE, I don't.
Table$create(starwars) %>%
select(name, mass, species) %>%
mutate(mass_diff = mass-sd(mass, na.rm = FALSE)) %>%
collect()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't support aggregations in our dplyr backend yet, so this should never succeed. If sd() doesn't cleanly and always error when called on an arrow Expression, we should force it to--see the "fail" handling inside of arrow_eval where this is done for mean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
r/R/compute.R
Outdated
| } | ||
|
|
||
|
|
||
| #' `variance` and `stddev` for Arrow objects |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about this. Unfortunately, sd() and var() aren't generics so we can't just define methods for them. So it might not be worth adding these wrappers at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, removed now.
r/R/compute.R
Outdated
| #' @return A `Scalar` containing the calculated value. | ||
| #' @export | ||
| stddev <- function(x, ddof = 0) { | ||
| call_function("stddev", x, options = list(ddof = ddof)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there no na.rm handling in the Arrow stddev and variance functions? If not, there should be (please JIRA).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is not. Done - https://issues.apache.org/jira/browse/ARROW-12615
r/R/dplyr.R
Outdated
| sd = function(x, na.rm = FALSE){ | ||
| if (!na.rm && x$null_count > 0) { | ||
| return(Scalar$create(NA_real_)) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't support aggregations in our dplyr backend yet, so this should never succeed. If sd() doesn't cleanly and always error when called on an arrow Expression, we should force it to--see the "fail" handling inside of arrow_eval where this is done for mean.
| } | ||
|
|
||
|
|
||
| if (func_name == "variance" || func_name == "stddev") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH this is probably the only code addition we want to keep here.
nealrichardson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a test of call_function("stddev", array, options = list(ddof = something)) just to confirm we're passing the argument? ("no" is an acceptable answer)
I want to say "yes" to be consistent with the logic I've applied elsewhere and I don't see any harm in doing it, so I'll add one unless there are any compelling reasons not to. |
Adds bindings for stddev and variance, and also uses these kernels in the dplyr expressions when calls to sd/var