Skip to content

[R] experimental map_batches cannot find columns #27303

@asfimport

Description

@asfimport

With dataset:

 

Schema
X3: timestamp[us]
user_id: dictionary<values=string, indices=int32>
classification_name: dictionary<values=string, indices=int32>
X2: string
X1: dictionary<values=string, indices=int32>
X4: string
X5: dictionary<values=string, indices=int32>
X6: dictionary<values=string, indices=int32>

The following succeeds:

chunk <- ds %>%
    select(user_id) %>%
    collect() %>%
    count(user_id) %>%
    as_tibble() %>%
    count(user_id, wt=n)

While the following fails:

chunk <- ds %>%
    select(user_id) %>%
    arrow::map_batches(~count(., user_id)) %>%
    as_tibble() %>%
    count(user_id, wt=x)

With error:

Error: Can't subset columns that don't exist.
✖ Column `.drop` doesn't exist.
Traceback:

1. ds %>% select(user_id) %>% arrow::map_batches(~count(., 
 .     user_id)) %>% as_tibble() %>% count(user_id, wt = x)
2. count(., user_id, wt = x)
3. group_by(x, ..., .add = TRUE, .drop = .drop)
4. as_tibble(.)
5. arrow::map_batches(., ~count(., user_id))
6. lapply(scanner$Scan(), function(scan_task) {
 .     lapply(scan_task$Execute(), function(batch) {
 .         FUN(batch, ...)
 .     })
 . })
7. map(.x, .f, ...)
8. .f(.x[[i]], ...)
9. lapply(scan_task$Execute(), function(batch) {
 .     FUN(batch, ...)
 . })
10. map(.x, .f, ...)
11. .f(.x[[i]], ...)
12. FUN(batch, ...)
13. count(., user_id)
14. tally(out, wt = !!enquo(wt), sort = sort, name = name)
15. (function() {
  .     old.options <- options(dplyr.summarise.inform = FALSE)
  .     on.exit(options(old.options))
  .     summarise(x, `:=`(!!name, !!n))
  . })()
16. summarise(x, `:=`(!!name, !!n))
17. summarise.arrow_dplyr_query(x, `:=`(!!name, !!n))
18. dplyr::select(.data, vars_to_keep)
19. select.arrow_dplyr_query(.data, vars_to_keep)
20. column_select(arrow_dplyr_query(.data), !!!enquos(...))
21. .FUN(names(.data), !!!enquos(...))
22. eval_select_impl(NULL, .vars, expr(c(!!!dots)), include = .include, 
  .     exclude = .exclude, strict = .strict, name_spec = unique_name_spec, 
  .     uniquely_named = TRUE)
23. with_subscript_errors(vars_select_eval(vars, expr, strict, data = x, 
  .     name_spec = name_spec, uniquely_named = uniquely_named, allow_rename = allow_rename, 
  .     type = type), type = type)
24. tryCatch(instrument_base_errors(expr), vctrs_error_subscript = function(cnd) {
  .     cnd$subscript_action <- subscript_action(type)
  .     cnd$subscript_elt <- "column"
  .     cnd_signal(cnd)
  . })
25. tryCatchList(expr, classes, parentenv, handlers)
26. tryCatchOne(expr, names, parentenv, handlers[[1L]])
27. value[[3L]](cond)
28. cnd_signal(cnd)
29. rlang:::signal_abort(x)

The dataset is 8 parquet files with no hive partitioning.

 

sessionInfo():

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTSMatrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.solocale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     other attached packages:
 [1] forcats_0.5.0     stringr_1.4.0     dplyr_1.0.2       purrr_0.3.4      
 [5] readr_1.4.0       tidyr_1.1.2       tibble_3.0.4      ggplot2_3.3.2    
 [9] tidyverse_1.3.0   dtplyr_1.0.1      data.table_1.13.2loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5            lubridate_1.7.9.2     aws.ec2metadata_0.2.0
 [4] ps_1.5.0              arrow_2.0.0           assertthat_0.2.1     
 [7] digest_0.6.27         utf8_1.1.4            aws.signature_0.6.0  
[10] mime_0.9              IRdisplay_0.7.0       R6_2.5.0             
[13] cellranger_1.1.0      repr_1.1.0            backports_1.2.0      
[16] reprex_0.3.0          evaluate_0.14         httr_1.4.2           
[19] pillar_1.4.7          rlang_0.4.9           curl_4.3             
[22] uuid_0.1-4            readxl_1.3.1          rstudioapi_0.13      
[25] bit_4.0.4             munsell_0.5.0         broom_0.7.2          
[28] compiler_4.0.3        modelr_0.1.8          pkgconfig_2.0.3      
[31] base64enc_0.1-3       htmltools_0.5.0       tidyselect_1.1.0     
[34] fansi_0.4.1           crayon_1.3.4          dbplyr_2.0.0         
[37] withr_2.3.0           grid_4.0.3            jsonlite_1.7.1       
[40] gtable_0.3.0          lifecycle_0.2.0       DBI_1.1.0            
[43] magrittr_2.0.1        scales_1.1.1          cli_2.2.0            
[46] stringi_1.5.3         fs_1.5.0              xml2_1.3.2           
[49] ellipsis_0.3.1        generics_0.1.0        vctrs_0.3.5          
[52] IRkernel_1.1.1        tools_4.0.3           bit64_4.0.5          
[55] glue_1.4.2            hms_0.5.3             aws.s3_0.3.22        
[58] colorspace_2.0-0      rvest_0.3.6           pbdZMQ_0.3-3.1

 

Reporter: Will Jones / @wjones127
Assignee: Will Jones / @wjones127

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-11415. Please see the migration documentation for further details.

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions