ARROW-12105: [R] Replace vars_select, vars_rename with eval_select, eval_rename #14371
Conversation
)
})

test_that("multiple select/rename and group_by", {
I added these in because they are needed to make sure the implementation of the utility function column_select() is working properly.
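For reference, here is a hedged sketch of the kind of pipeline a "multiple select/rename and group_by" test would exercise (this is not the PR's actual test code; the verbs and columns are illustrative):

```r
library(arrow)
library(dplyr)

# Chain several select()/rename() calls before a group_by() so that the
# column-selection logic has to track renames across multiple steps.
arrow_table(mtcars) %>%
  select(mpg, cyl, gear) %>%
  rename(cylinders = cyl) %>%
  select(-gear) %>%
  group_by(cylinders) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  collect()
```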
nealrichardson left a comment:
Thanks for taking this on!
r/R/arrow-package.R
Outdated
  #' @importFrom rlang quo_set_env quo_get_env is_formula quo_is_call f_rhs parse_expr f_env new_quosure
  #' @importFrom rlang new_quosures expr_text
- #' @importFrom tidyselect vars_pull vars_rename vars_select eval_select
+ #' @importFrom tidyselect vars_pull vars_select eval_select eval_rename
Do we need to update the other usages of vars_select in the package too?
Yep - this is still on my to-do list, but I think all other feedback has been addressed.
Done now
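For context on the switch this thread is about, a hedged sketch of the difference between the superseded and current tidyselect interfaces (illustrative only, not code from this PR):

```r
library(rlang)
library(tidyselect)

df <- data.frame(x = 1, y = 2, z = 3)

# Superseded: vars_select()/vars_rename() work on a character vector of
# names and return a named character vector (new_name = old_name).
vars_select(names(df), y, z2 = z)
#>   y  z2
#> "y" "z"

# Current: eval_select()/eval_rename() work on the data itself and return
# a named integer vector of column locations.
eval_select(expr(c(y, z2 = z)), data = df)
#>  y z2
#>  2  3

# eval_rename() returns only the columns being renamed.
eval_rename(expr(c(z2 = z)), data = df)
#> z2
#>  3
```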
# TODO(ARROW-17384): implement where
"Use of `where()` selection helper not yet supported"
)
docs[["dplyr::across"]] <- character(0)
🎉
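As a hedged illustration of the behaviour the error string above tests (the exact triggering call and error formatting are assumptions, not taken from the PR):

```r
library(arrow)
library(dplyr)

# Expected to fail with the message tested above; where() selection is
# tracked in ARROW-17384 and not yet supported on Arrow objects.
arrow_table(mtcars) %>%
  select(where(is.numeric))
#> Error: Use of `where()` selection helper not yet supported
```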
r/R/util.R
Outdated
}

simulate_data_frame <- function(schema) {
Add a comment explaining why we need this function.
And do we need to export this? I think I saw some discussion between @paleolimbot and @krlmlr about needing this function in DBI or adbc.
In DBI, we need to infer the SQL data types from a RecordBatchReader, even if the DBI backend is not aware of Arrow (yet).
I wonder if there should be a way to convert a schema to a zero-row Arrow table. One way to do this is via schema -> zero-row table -> data frame. For DBI, schema -> data frame is sufficient, but perhaps schema -> zero-row table is easier to implement in C++, and the table -> data frame operation is already efficient enough for zero-row tables.
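To make the helper's shape concrete, here is a minimal sketch of the schema -> zero-row data frame idea, assuming `concat_arrays(type = )` yields a length-0 Array of the given type (this is not the PR's actual implementation):

```r
library(arrow)

# Hedged sketch: build a zero-row data frame whose columns match the fields
# of an Arrow schema by creating a length-0 Array per field and converting
# each one to an R vector.
simulate_data_frame <- function(schema) {
  cols <- lapply(schema$fields, function(field) {
    as.vector(concat_arrays(type = field$type))
  })
  names(cols) <- schema$names
  as.data.frame(cols)
}

simulate_data_frame(schema(x = int32(), y = utf8()))
#> [1] x y
#> <0 rows> (or 0-length row.names)
```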
r/R/util.R
Outdated
abort(msg, call = call)
}

simulate_data_frame <- function(schema) {
Since this is going to be called every time someone does select/rename/relocate, I'd like for this function to be cheaper. In other PRs I've been noticing the overhead of creating R6 objects, which generally is not terrible (~150 microseconds on my machine) but it adds up. And here, we're creating lots of objects we're throwing away: for each column, we create a Field, then a DataType from that, then in concat_arrays, we create a null DataType, an Array with that, and a new Array that is cast to the correct DataType. That adds up to around 1ms per column, every time this function is called. That's enough to get noticed.
Can we move this to C++? Should be a simple enough switch statement to map Arrow type ids to the corresponding R length-0 vector.
Just to confirm, I benchmarked this function on the schema of the taxi dataset (20 columns), and the median time was 15ms, so a little under 1ms per column.
Thanks for confirming! Currently working on a C++ simplification, though I may ask for help to get it over the line ahead of the release if I can't figure it out by the end of tomorrow.
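A hedged sketch of how the timing mentioned above might be reproduced (the dataset path and access to the internal helper via `:::` are assumptions):

```r
library(arrow)
library(bench)

# Assumed local copy of the taxi dataset; any schema with ~20 columns works.
taxi <- open_dataset("nyc-taxi/")

# Time the schema -> zero-row data frame helper on the dataset's schema.
bench::mark(
  arrow:::simulate_data_frame(taxi$schema),
  iterations = 100
)
```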
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Currently getting this error message in the unit tests, due to weirdness when trying to create 0-length arrays using extension types 😬 Will try to create a nice reprex/solution/workaround, but if anyone has any suggestions in the meantime, let me know. It can be reproduced here via:

I'm sure it's solvable since we can roundtrip extension type data with >0 elements. Dewey may be best positioned to advise. Can you defer that to a followup, and find some workaround here, perhaps swap in

My favourite 5 words. Opened ARROW-18043

@nealrichardson Mind giving this another look over? I'm going to get to the "swap

Looks great to me!

@nealrichardson Waiting on the CI, but otherwise I think this is ready for a final round of reviewing :D

Benchmark runs are scheduled for baseline = bd785c9 and contender = 2cbf489. 2cbf489 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

['Python', 'R'] benchmarks have a high level of regressions.

@nealrichardson I think I may need to refactor this in a follow-up PR - check out the regressions :\ Any suggestions for what I can do instead?