
Inconsistent behavior when combining GForce, non-GForce functions in single j expression #5554

@berg-michael

It seems likely to me that this behavior has already been reported, but I was unable to find an issue for it.

Basically, it appears that gmax and max handle NAs differently, which means that two operations in j that each work fine on their own (in this case, calls to max and any) can throw an error when they are combined to create the same two columns in a single j. I have a minimal example below.

My understanding is that because any is not GForce-optimized, its presence in j alongside max means the standard max function is called instead of gmax. For a group with no non-NA values, max returns -Inf, a double; for every other group it returns an integer, so the column types differ between groups. gmax, by contrast, seems to recognize this problem and coerces the results for the integer groups to double.
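The type difference described above can be seen in base R alone, without data.table. This is only a small illustration; the vector names `x_some` and `x_none` are made up for the example.

```r
# Base R max() on an integer vector with at least one non-NA value
# returns an integer.
x_some <- c(1L, NA)
typeof(max(x_some, na.rm = TRUE))
#> [1] "integer"

# With no non-NA values, max() warns and returns -Inf, which is a double.
x_none <- NA_integer_
typeof(suppressWarnings(max(x_none, na.rm = TRUE)))
#> [1] "double"
```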

I think this could lead to confusion as the output of one function is determined by the presence of another.

library(data.table)
options(datatable.verbose = TRUE)

dt <- data.table(group = c("a", "b", "c"),
                 var1 = c(1L, NA, 2L),
                 var2 = c(F, F, F))
# Works
dt[, .(max_var1 = max(var1, na.rm = T)), group]
#> Detected that j uses these columns: var1 
#> Finding groups using forderv ... forder.c received 3 rows and 1 columns
#> 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> lapply optimization is on, j unchanged as 'list(max(var1, na.rm = T))'
#> GForce optimized j to 'list(gmax(var1, na.rm = TRUE))'
#> Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
#> gforce assign high and low took 0.000
#> Warning in gmax(var1, na.rm = TRUE): No non-missing values found in at least
#> one group. Coercing to numeric type and returning 'Inf' for such groups to be
#> consistent with base
#> gforce eval took 0.000
#> 0.001s elapsed (0.000s cpu)
#>    group max_var1
#> 1:     a        1
#> 2:     b     -Inf
#> 3:     c        2
dt[, .(any_var2 = any(var2, na.rm = T)), group]
#> Detected that j uses these columns: var2 
#> Finding groups using forderv ... forder.c received 3 rows and 1 columns
#> 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> lapply optimization is on, j unchanged as 'list(any(var2, na.rm = T))'
#> GForce is on, left j unchanged
#> Old mean optimization is on, left j unchanged.
#> Making each group and running j (GForce FALSE) ... 
#>   memcpy contiguous groups took 0.000s for 3 groups
#>   eval(j) took 0.000s for 3 calls
#> 0.000s elapsed (0.000s cpu)
#>    group any_var2
#> 1:     a    FALSE
#> 2:     b    FALSE
#> 3:     c    FALSE

# Breaks
dt[, .(max_var1 = max(var1, na.rm = T),
       any_var2 = any(var2, na.rm = T)),
   group]
#> Detected that j uses these columns: var1,var2 
#> Finding groups using forderv ... forder.c received 3 rows and 1 columns
#> 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> lapply optimization is on, j unchanged as 'list(max(var1, na.rm = T), any(var2, na.rm = T))'
#> GForce is on, left j unchanged
#> Old mean optimization is on, left j unchanged.
#> Making each group and running j (GForce FALSE) ...
#> Warning in max(var1, na.rm = T): no non-missing arguments to max; returning -Inf
#> Error in `[.data.table`(dt, , .(max_var1 = max(var1, na.rm = T), any_var2 = any(var2, : Column 1 of result for group 2 is type 'double' but expecting type 'integer'. Column types must be consistent for each group.

# Works
dt[, .(max_var1 = max(var1, na.rm = T),
       max_var2 = max(var2, na.rm = T)),
   group]
#> Detected that j uses these columns: var1,var2 
#> Finding groups using forderv ... forder.c received 3 rows and 1 columns
#> 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> lapply optimization is on, j unchanged as 'list(max(var1, na.rm = T), max(var2, na.rm = T))'
#> GForce optimized j to 'list(gmax(var1, na.rm = TRUE), gmax(var2, na.rm = TRUE))'
#> Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
#> gforce assign high and low took 0.000
#> Warning in gmax(var1, na.rm = TRUE): No non-missing values found in at least
#> one group. Coercing to numeric type and returning 'Inf' for such groups to be
#> consistent with base
#> gforce eval took 0.000
#> 0.000s elapsed (0.000s cpu)
#>    group max_var1 max_var2
#> 1:     a        1        0
#> 2:     b     -Inf        0
#> 3:     c        2        0

# Without GForce optimization, original command breaks
options(datatable.optimize=0L)
dt[, .(max_var1 = max(var1, na.rm = T)), group]
#> Detected that j uses these columns: var1 
#> Finding groups using forderv ... forder.c received 3 rows and 1 columns
#> 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> All optimizations are turned off
#> Making each group and running j (GForce FALSE) ...
#> Warning in max(var1, na.rm = T): no non-missing arguments to max; returning -Inf
#> Error in `[.data.table`(dt, , .(max_var1 = max(var1, na.rm = T)), group): Column 1 of result for group 2 is type 'double' but expecting type 'integer'. Column types must be consistent for each group.

sessionInfo()
#> R version 4.2.2 (2022-10-31)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Monterey 12.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] data.table_1.14.4
#> 
#> loaded via a namespace (and not attached):
#>  [1] withr_2.5.0     digest_0.6.30   lifecycle_1.0.3 magrittr_2.0.3 
#>  [5] reprex_2.0.2    evaluate_0.17   highr_0.9       stringi_1.7.8  
#>  [9] rlang_1.0.6     cli_3.4.1       rstudioapi_0.14 fs_1.5.2       
#> [13] rmarkdown_2.17  tools_4.2.2     stringr_1.4.1   glue_1.6.2     
#> [17] xfun_0.34       yaml_2.3.6      fastmap_1.1.0   compiler_4.2.2 
#> [21] htmltools_0.5.3 knitr_1.40

Created on 2022-12-05 with reprex v2.0.2
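One possible workaround, sketched here as an assumption rather than as a recommended fix, is to convert var1 to double before calling max so that every group returns the same type whether or not GForce is used. The `suppressWarnings()` wrapper is only there to silence the base max warning for the all-NA group.

```r
library(data.table)

dt <- data.table(group = c("a", "b", "c"),
                 var1 = c(1L, NA, 2L),
                 var2 = c(FALSE, FALSE, FALSE))

# as.numeric() makes max() return a double for every group, so the
# per-group column types stay consistent even on the non-GForce path.
res <- suppressWarnings(
  dt[, .(max_var1 = max(as.numeric(var1), na.rm = TRUE),
         any_var2 = any(var2, na.rm = TRUE)),
     by = group]
)
res
```

With this change max_var1 is a double column throughout, with -Inf for the all-NA group, which matches what the gmax path produces in the working examples above.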

    Labels

    GForce (issues relating to optimized grouping calculations)
