ARROW-12105: [R] Replace vars_select, vars_rename with eval_select, eval_rename #14371
Conversation
)
})

test_that("multiple select/rename and group_by", {
I added these in because they are needed to make sure the implementation of the utility function column_select() is working properly.
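For reference, here is a hedged sketch of the kind of pipeline a "multiple select/rename and group_by" test would exercise (this is not the PR's actual test code; the verbs and columns are illustrative):

```r
library(arrow)
library(dplyr)

# Chain several select()/rename() calls before a group_by() so that the
# column-selection logic has to track renames across multiple steps.
arrow_table(mtcars) %>%
  select(mpg, cyl, gear) %>%
  rename(cylinders = cyl) %>%
  select(-gear) %>%
  group_by(cylinders) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  collect()
```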
nealrichardson left a comment:
Thanks for taking this on!
r/R/arrow-package.R
Outdated
  #' @importFrom rlang quo_set_env quo_get_env is_formula quo_is_call f_rhs parse_expr f_env new_quosure
  #' @importFrom rlang new_quosures expr_text
- #' @importFrom tidyselect vars_pull vars_rename vars_select eval_select
+ #' @importFrom tidyselect vars_pull vars_select eval_select eval_rename
Do we need to update the other usages of vars_select in the package too?
Yep - this is still on my to-do list, but I think all other feedback has been addressed.
Done now
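For context on the switch this thread is about, a hedged sketch of the difference between the superseded and current tidyselect interfaces (illustrative only, not code from this PR):

```r
library(rlang)
library(tidyselect)

df <- data.frame(x = 1, y = 2, z = 3)

# Superseded: vars_select()/vars_rename() work on a character vector of
# names and return a named character vector (new_name = old_name).
vars_select(names(df), y, z2 = z)
#>   y  z2
#> "y" "z"

# Current: eval_select()/eval_rename() work on the data itself and return
# a named integer vector of column locations.
eval_select(expr(c(y, z2 = z)), data = df)
#>  y z2
#>  2  3

# eval_rename() returns only the columns being renamed.
eval_rename(expr(c(z2 = z)), data = df)
#> z2
#>  3
```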
# TODO(ARROW-17384): implement where
"Use of `where()` selection helper not yet supported"
)
docs[["dplyr::across"]] <- character(0)
🎉
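As a hedged illustration of the behaviour the error string above tests (the exact triggering call and error formatting are assumptions, not taken from the PR):

```r
library(arrow)
library(dplyr)

# Expected to fail with the message tested above; where() selection is
# tracked in ARROW-17384 and not yet supported on Arrow objects.
arrow_table(mtcars) %>%
  select(where(is.numeric))
#> Error: Use of `where()` selection helper not yet supported
```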
r/R/util.R
Outdated
}

simulate_data_frame <- function(schema) {
Add a comment explaining why we need this function.
And do we need to export this? I think I saw some discussion between @paleolimbot and @krlmlr about needing this function in DBI or adbc.
In DBI, we need to infer the SQL data types from a RecordBatchReader, even if the DBI backend is not aware of Arrow (yet).
I wonder if there should be a way to convert a schema to a zero-row Arrow table. One way to do this is via schema -> zero-row table -> data frame. For DBI, schema -> data frame is sufficient, but perhaps schema -> zero-row table is easier to implement in C++, and the table -> data frame operation is already efficient enough for zero-row tables.
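To make the helper's shape concrete, here is a minimal sketch of the schema -> zero-row data frame idea, assuming `concat_arrays(type = )` yields a length-0 Array of the given type (this is not the PR's actual implementation):

```r
library(arrow)

# Hedged sketch: build a zero-row data frame whose columns match the fields
# of an Arrow schema by creating a length-0 Array per field and converting
# each one to an R vector.
simulate_data_frame <- function(schema) {
  cols <- lapply(schema$fields, function(field) {
    as.vector(concat_arrays(type = field$type))
  })
  names(cols) <- schema$names
  as.data.frame(cols)
}

simulate_data_frame(schema(x = int32(), y = utf8()))
#> [1] x y
#> <0 rows> (or 0-length row.names)
```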
r/R/util.R
Outdated
abort(msg, call = call)
}

simulate_data_frame <- function(schema) {
Since this is going to be called every time someone does select/rename/relocate, I'd like for this function to be cheaper. In other PRs I've been noticing the overhead of creating R6 objects, which generally is not terrible (~150 microseconds on my machine) but it adds up. And here, we're creating lots of objects we're throwing away: for each column, we create a Field, then a DataType from that, then in concat_arrays, we create a null DataType, an Array with that, and a new Array that is cast to the correct DataType. That adds up to around 1ms per column, every time this function is called. That's enough to get noticed.
Can we move this to C++? Should be a simple enough switch statement to map Arrow type ids to the corresponding R length-0 vector.
Just to confirm, I benchmarked this function on the schema of the taxi dataset (20 columns), and the median time was 15ms, so a little under 1ms per column.
Thanks for confirming! Currently working on a C++ simplification, though I may ask for help to get it over the line ahead of the release if I can't figure it out by the end of tomorrow.
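A hedged sketch of how the timing mentioned above might be reproduced (the dataset path and access to the internal helper via `:::` are assumptions):

```r
library(arrow)
library(bench)

# Assumed local copy of the taxi dataset; any schema with ~20 columns works.
taxi <- open_dataset("nyc-taxi/")

# Time the schema -> zero-row data frame helper on the dataset's schema.
bench::mark(
  arrow:::simulate_data_frame(taxi$schema),
  iterations = 100
)
```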
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Currently getting this error message in the unit tests, due to weirdness when trying to create 0-length arrays using extension types 😬 Will try to create a nice reprex/solution/workaround, but if anyone has any suggestions in the meantime, let me know. It can be reproduced here via:

I'm sure it's solvable since we can roundtrip extension type data with >0 elements. Dewey may be best positioned to advise. Can you defer that to a followup, and find some workaround here, perhaps swap in

My favourite 5 words. Opened ARROW-18043

@nealrichardson Mind giving this another look over? I'm going to get to the "swap

Looks great to me!

@nealrichardson Waiting on the CI, but otherwise I think this is ready for a final round of reviewing :D

Benchmark runs are scheduled for baseline = bd785c9 and contender = 2cbf489. 2cbf489 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

['Python', 'R'] benchmarks have a high level of regressions.

@nealrichardson I think I may need to refactor this in a follow-up PR - check out the regressions :\ Any suggestions for what I can do instead?