Skip to content

keyby= may use the wrong index #3498

@shrektan

Description

@shrektan

The keyby may use the wrong index if the keyby column name is the leading part of the index column name. For example, the index is CLASS_L3 while keyby is CLASS.

After a shallow investigation, I suspect the below line is the cause of this bug. I don't understand the reason of using substring() at all. It seems not necessary. @mattdowle Any ideas?

w = which.first(substring(indices(x),1L,nchar(tt)) == tt) # substring to avoid the overhead of grep

I will be happy to file a PR with tests if there's no particular reason to compare to a subset of the indices.

Thanks.

The reproducible example

(Note, although the below example is running under R3.4.4, they are reproducible on R3.5.3 as well.)

See after set up the indices, the keyby expression returns wrong results. In comparison, using by still returns correct results.

library(data.table)
dt <- data.table(
  CLASS_L3 = c("gggg", "iiii", "bbbb", "bbbb", "gggg", 
               "ffff", "bbbb", "Repo", "bbbb", "dddd", "hhhh", 
               "dddd", "gggg", "dddd", 
               "dddd", "hhhh", "dddd", "dddd", 
               "Repo", "bbbb", "dddd", "dddd", "dddd", 
               "dddd", "cccc", "aaaa", "dddd", 
               "cccc", "dddd", "dddd", "dddd", 
               "dddd", "dddd", "bbbb", "dddd", 
               "dddd", "cccc", "dddd", "dddd", 
               "dddd", "dddd", "dddd", "bbbb", 
               "dddd", "cccc", "cccc", "dddd", 
               "bbbb", "cccc", "aaaa", "cccc", "dddd", 
               "cccc", "cccc", "cccc", "aaaa", 
               "dddd", "dddd", "dddd", "dddd", 
               "b1111", "cccc", "dddd", "dddd", 
               "cccc", "cccc", "cccc", "Repo", 
               "bbbb", "bbbb", "dddd", "dddd", "cccc", 
               "dddd", "cccc", "dddd"),
  CLASS = c("aaaa", "dddd", "gggg", "gggg", 
            "aaaa", "eeee", "eeee", "ffff", "gggg", 
            "aaaa", "aaaa", "aaaa", "aaaa", "aaaa", 
            "aaaa", "aaaa", "aaaa", "aaaa", "ffff", 
            "cccc", "bbbb", "bbbb", 
            "aaaa", "aaaa", "aaaa", "dddd", "aaaa", 
            "aaaa", "aaaa", "aaaa", "aaaa", "aaaa", 
            "aaaa", "gggg", "bbbb", 
            "bbbb", "aaaa", "bbbb", 
            "aaaa", "aaaa", "aaaa", "aaaa", "cccc", 
            "aaaa", "aaaa", "aaaa", "aaaa", "eeee", 
            "aaaa", "dddd", "aaaa", "aaaa", "aaaa", 
            "aaaa", "aaaa", "dddd", "bbbb", 
            "aaaa", "aaaa", "aaaa", "eeee", "aaaa", 
            "aaaa", "aaaa", "aaaa", "aaaa", "aaaa", 
            "ffff", "gggg", "cccc", "bbbb", 
            "aaaa", "aaaa", "aaaa", "aaaa", "bbbb"
  )
)
indices(dt)
#> NULL
dt[, .N, keyby = CLASS, verbose = TRUE]
#> Detected that j uses these columns: <none> 
#> Finding groups using forderv ... 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> lapply optimization is on, j unchanged as '.N'
#> GForce optimized j to '.N'
#> Making each group and running j (GForce TRUE) ... 0.000s elapsed (0.000s cpu)
#>    CLASS  N
#> 1:  aaaa 49
#> 2:  bbbb  8
#> 3:  cccc  3
#> 4:  dddd  4
#> 5:  eeee  4
#> 6:  ffff  3
#> 7:  gggg  5
setindex(dt, CLASS_L3)
indices(dt)
#> [1] "CLASS_L3"
dt[, .N, keyby = CLASS, verbose = TRUE]
#> Detected that j uses these columns: <none> 
#> Finding groups using uniqlist on index 'CLASS_L3' ... 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> lapply optimization is on, j unchanged as '.N'
#> GForce optimized j to '.N'
#> Making each group and running j (GForce TRUE) ... 0.000s elapsed (0.000s cpu)
#>     CLASS  N
#>  1:  ffff  3
#>  2:  dddd  3
#>  3:  eeee  1
#>  4:  gggg  2
#>  5:  eeee  1
#>  6:  gggg  1
#>  7:  cccc  1
#>  8:  gggg  1
#>  9:  cccc  1
#> 10:  eeee  1
#> 11:  gggg  1
#> 12:  cccc  1
#> 13:  aaaa 22
#> 14:  bbbb  2
#> 15:  aaaa  8
#> 16:  bbbb  3
#> 17:  aaaa  7
#> 18:  bbbb  1
#> 19:  aaaa  5
#> 20:  bbbb  1
#> 21:  aaaa  2
#> 22:  bbbb  1
#> 23:  eeee  1
#> 24:  aaaa  5
#> 25:  dddd  1
#>     CLASS  N
dt[, .N, by = CLASS, verbose = TRUE]
#> Detected that j uses these columns: <none> 
#> Finding groups using forderv ... 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> Getting back original order ... 0.000s elapsed (0.000s cpu) 
#> lapply optimization is on, j unchanged as '.N'
#> GForce optimized j to '.N'
#> Making each group and running j (GForce TRUE) ... 0.000s elapsed (0.000s cpu)
#>    CLASS  N
#> 1:  aaaa 49
#> 2:  dddd  4
#> 3:  gggg  5
#> 4:  eeee  4
#> 5:  ffff  3
#> 6:  cccc  3
#> 7:  bbbb  8

Created on 2019-04-10 by the reprex package (v0.2.1)

Session info
devtools::session_info()
#> - Session info ----------------------------------------------------------
#>  setting  value                                              
#>  version  R version 3.4.4 (2018-03-15)                       
#>  os       Windows 7 x64 SP 1                                 
#>  system   x86_64, mingw32                                    
#>  ui       RTerm                                              
#>  language (EN)                                               
#>  collate  Chinese (Simplified)_People's Republic of China.936
#>  ctype    Chinese (Simplified)_People's Republic of China.936
#>  tz       Asia/Taipei                                        
#>  date     2019-04-10                                         
#> 
#> - Packages --------------------------------------------------------------
#>  package     * version    date       lib source        
#>  assertthat    0.2.1      2019-03-21 [2] CRAN (R 3.4.4)
#>  backports     1.1.3      2018-12-14 [2] CRAN (R 3.4.4)
#>  callr         3.2.0      2019-03-15 [2] CRAN (R 3.4.4)
#>  cli           1.1.0      2019-03-19 [2] CRAN (R 3.4.4)
#>  crayon        1.3.4      2017-09-16 [2] CRAN (R 3.4.4)
#>  data.table  * 1.12.3     2019-04-10 [1] local         
#>  desc          1.2.0      2018-05-01 [2] CRAN (R 3.4.4)
#>  devtools      2.0.1      2018-10-26 [2] CRAN (R 3.4.4)
#>  digest        0.6.18     2018-10-10 [2] CRAN (R 3.4.4)
#>  evaluate      0.13       2019-02-12 [2] CRAN (R 3.4.4)
#>  fs            1.2.7      2019-03-19 [2] CRAN (R 3.4.4)
#>  glue          1.3.1      2019-03-12 [2] CRAN (R 3.4.4)
#>  highr         0.8        2019-03-20 [2] CRAN (R 3.4.4)
#>  htmltools     0.3.6.9000 2018-05-07 [2] local         
#>  knitr         1.22       2019-03-08 [2] CRAN (R 3.4.4)
#>  magrittr      1.5        2014-11-22 [2] CRAN (R 3.4.4)
#>  memoise       1.1.0      2017-04-21 [2] CRAN (R 3.4.4)
#>  pkgbuild      1.0.3      2019-03-20 [2] CRAN (R 3.4.4)
#>  pkgload       1.0.2      2018-10-29 [2] CRAN (R 3.4.4)
#>  prettyunits   1.0.2      2015-07-13 [2] CRAN (R 3.4.1)
#>  processx      3.3.0      2019-03-10 [2] CRAN (R 3.4.4)
#>  ps            1.3.0      2018-12-21 [2] CRAN (R 3.4.4)
#>  R6            2.4.0      2019-02-14 [2] CRAN (R 3.4.4)
#>  Rcpp          1.0.1      2019-03-17 [2] CRAN (R 3.4.4)
#>  remotes       2.0.2      2018-10-30 [2] CRAN (R 3.4.4)
#>  rlang         0.3.3      2019-03-29 [2] CRAN (R 3.4.4)
#>  rmarkdown     1.12       2019-03-14 [2] CRAN (R 3.4.4)
#>  rprojroot     1.3-2      2018-01-03 [2] CRAN (R 3.4.3)
#>  sessioninfo   1.1.1      2018-11-05 [2] CRAN (R 3.4.4)
#>  stringi       1.4.3      2019-03-12 [2] CRAN (R 3.4.4)
#>  stringr       1.4.0      2019-02-10 [2] CRAN (R 3.4.4)
#>  testthat      2.0.1      2018-10-13 [2] CRAN (R 3.4.4)
#>  usethis       1.4.0      2018-08-14 [2] CRAN (R 3.4.4)
#>  withr         2.1.2      2018-03-15 [2] CRAN (R 3.4.4)
#>  xfun          0.6        2019-04-02 [2] CRAN (R 3.4.4)
#>  yaml          2.2.0      2018-07-25 [2] CRAN (R 3.4.4)
#> 
#> [1] D:/R/R-dev
#> [2] D:/app/R_lib/3.1
#> [3] D:/app/R/library

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugencodingissues related to Encoding

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions