Skip to content

Conversation

@thisisnic
Copy link
Member

No description provided.

)
})

test_that("multiple select/rename and group_by", {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added these in because they are needed to make sure the implementation of the utility function column_select() is working properly.

Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this on!

#' @importFrom rlang quo_set_env quo_get_env is_formula quo_is_call f_rhs parse_expr f_env new_quosure
#' @importFrom rlang new_quosures expr_text
#' @importFrom tidyselect vars_pull vars_rename vars_select eval_select
#' @importFrom tidyselect vars_pull vars_select eval_select eval_rename
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to update the other usages of vars_select in the package too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep - this is still on my to-do list, but I think all other feedback has been addressed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done now

# TODO(ARROW-17384): implement where
"Use of `where()` selection helper not yet supported"
)
docs[["dplyr::across"]] <- character(0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

r/R/util.R Outdated
}

simulate_data_frame <- function(schema) {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment explaining why we need this function.

And do we need to export this? I think I saw some discussion between @paleolimbot and @krlmlr about needing this function in DBI or adbc.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In DBI, we need to infer the SQL data types from a RecordBatchReader, even if the DBI backend is not aware of Arrow (yet).

I wonder if there should be a way to convert a schema to a zero-row Arrow table. One way to do this is via schema -> zero-row table -> data frame. For DBI, schema -> data frame is sufficient, but perhaps schema -> zero-row table is easier to implemented in C++, and the table -> data frame operation is already efficient enough for zero-row tables.

r/R/util.R Outdated
abort(msg, call = call)
}

simulate_data_frame <- function(schema) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is going to be called every time someone does select/rename/relocate, I'd like for this function to be cheaper. In other PRs I've been noticing the overhead of creating R6 objects, which generally is not terrible (~150 microseconds on my machine) but it adds up. And here, we're creating lots of objects we're throwing away: for each column, we create a Field, then a DataType from that, then in concat_arrays, we create a null DataType, an Array with that, and a new Array that is cast to the correct DataType. That adds up to around 1ms per column, every time this function is called. That's enough to get noticed.

Can we move this to C++? Should be a simple enough switch statement to map Arrow type ids to the corresponding R length-0 vector.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm, I benchmarked this function on the schema of the taxi dataset (20 columns), and the median time was 15ms, so a little under 1ms per column.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for confirming! Currently working on a C++ simplification though may ask for help to get it over the line ahead of the release if I can't figure it out by the end of tomorrow.

Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
@thisisnic
Copy link
Member Author

thisisnic commented Oct 13, 2022

Currently getting this error message in the unit tests, due to weirdness with trying to create 0-length arrays using extension types 😬

Will try to create a nice reprex/solution/workaround, but if anyone has any suggestions in the meantime, let me know.

Error (test-extension.R:340:3): Dataset/arrow_dplyr_query can roundtrip extension types
Error: NotImplemented: MakeBuilder: cannot construct builder for type <arrow_custom_vctr[0]>
/home/nic2/arrow/cpp/src/arrow/builder.cc:289  VisitTypeInline(*type, &impl)
Backtrace:
 1. ... %>% dplyr::collect()
      at test-extension.R:340:2
 4. arrow::select.Dataset(., number, letter, extension)
 5. arrow::column_select(.data, enquos(...), op = "select")
      at r/R/dplyr-select.R:24:2
 6. arrow::simulate_data_frame(implicit_schema(.data))
      at r/R/dplyr-select.R:97:2
 8. arrow::Table__from_schema(schema)

It can be reproduced here via:

extension_vec <- vctrs::new_vctr(letters[1:10], class = "arrow_custom_vctr")
df <- tibble::tibble(x = extension_vec)
arrow_df <- arrow_table(df)
simulate_data_frame(arrow_df$schema)


Error: NotImplemented: MakeBuilder: cannot construct builder for type <arrow_custom_vctr[0]>
/home/nic2/arrow/cpp/src/arrow/builder.cc:289  VisitTypeInline(*type, &impl)

@nealrichardson
Copy link
Member

I'm sure it's solvable since we can roundtrip extension type data with >0 elements. Dewey may be best positioned to advise. Can you defer that to a followup, and find some workaround here, perhaps swap in null() type for extension types?

@thisisnic
Copy link
Member Author

defer that to a followup

My favourite 5 words. Opened ARROW-18043

@thisisnic
Copy link
Member Author

@nealrichardson Mind giving this another look over? I'm going to get to the "swap vars_select for eval_select" remaining bits tomorrow morning, but it'd be good to have you look at the rest of it in case there's any changes I need to make there too.

@nealrichardson
Copy link
Member

Looks great to me!

@thisisnic
Copy link
Member Author

@nealrichardson Waiting on the CI, but otherwise I think this is ready for a final round of reviewing :D

@thisisnic thisisnic merged commit 2cbf489 into apache:master Oct 14, 2022
@ursabot
Copy link

ursabot commented Oct 16, 2022

Benchmark runs are scheduled for baseline = bd785c9 and contender = 2cbf489. 2cbf489 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️53.33% ⬆️0.0%] test-mac-arm
[Failed ⬇️28.22% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.39% ⬆️0.04%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 2cbf4891 ec2-t3-xlarge-us-east-2
[Failed] 2cbf4891 test-mac-arm
[Failed] 2cbf4891 ursa-i9-9960x
[Finished] 2cbf4891 ursa-thinkcentre-m75q
[Finished] bd785c99 ec2-t3-xlarge-us-east-2
[Failed] bd785c99 test-mac-arm
[Failed] bd785c99 ursa-i9-9960x
[Finished] bd785c99 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented Oct 16, 2022

['Python', 'R'] benchmarks have high level of regressions.
test-mac-arm
ursa-i9-9960x

@thisisnic
Copy link
Member Author

@nealrichardson I think I may need to refactor this in a follow-up PR - check out the regressions :\ Any suggestion for what I can do instead?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants