library(data.table)
dt <- data.table(
  CLASS_L3 = c("gggg", "iiii", "bbbb", "bbbb", "gggg",
    "ffff", "bbbb", "Repo", "bbbb", "dddd", "hhhh",
    "dddd", "gggg", "dddd",
    "dddd", "hhhh", "dddd", "dddd",
    "Repo", "bbbb", "dddd", "dddd", "dddd",
    "dddd", "cccc", "aaaa", "dddd",
    "cccc", "dddd", "dddd", "dddd",
    "dddd", "dddd", "bbbb", "dddd",
    "dddd", "cccc", "dddd", "dddd",
    "dddd", "dddd", "dddd", "bbbb",
    "dddd", "cccc", "cccc", "dddd",
    "bbbb", "cccc", "aaaa", "cccc", "dddd",
    "cccc", "cccc", "cccc", "aaaa",
    "dddd", "dddd", "dddd", "dddd",
    "b1111", "cccc", "dddd", "dddd",
    "cccc", "cccc", "cccc", "Repo",
    "bbbb", "bbbb", "dddd", "dddd", "cccc",
    "dddd", "cccc", "dddd"),
  CLASS = c("aaaa", "dddd", "gggg", "gggg",
    "aaaa", "eeee", "eeee", "ffff", "gggg",
    "aaaa", "aaaa", "aaaa", "aaaa", "aaaa",
    "aaaa", "aaaa", "aaaa", "aaaa", "ffff",
    "cccc", "bbbb", "bbbb",
    "aaaa", "aaaa", "aaaa", "dddd", "aaaa",
    "aaaa", "aaaa", "aaaa", "aaaa", "aaaa",
    "aaaa", "gggg", "bbbb",
    "bbbb", "aaaa", "bbbb",
    "aaaa", "aaaa", "aaaa", "aaaa", "cccc",
    "aaaa", "aaaa", "aaaa", "aaaa", "eeee",
    "aaaa", "dddd", "aaaa", "aaaa", "aaaa",
    "aaaa", "aaaa", "dddd", "bbbb",
    "aaaa", "aaaa", "aaaa", "eeee", "aaaa",
    "aaaa", "aaaa", "aaaa", "aaaa", "aaaa",
    "ffff", "gggg", "cccc", "bbbb",
    "aaaa", "aaaa", "aaaa", "aaaa", "bbbb"
  )
)
indices(dt)
#> NULL
dt[, .N, keyby = CLASS, verbose = TRUE]
#> Detected that j uses these columns: <none>
#> Finding groups using forderv ... 0.000s elapsed (0.000s cpu)
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
#> lapply optimization is on, j unchanged as '.N'
#> GForce optimized j to '.N'
#> Making each group and running j (GForce TRUE) ... 0.000s elapsed (0.000s cpu)
#> CLASS N
#> 1: aaaa 49
#> 2: bbbb 8
#> 3: cccc 3
#> 4: dddd 4
#> 5: eeee 4
#> 6: ffff 3
#> 7: gggg 5
setindex(dt, CLASS_L3)
indices(dt)
#> [1] "CLASS_L3"
dt[, .N, keyby = CLASS, verbose = TRUE]
#> Detected that j uses these columns: <none>
#> Finding groups using uniqlist on index 'CLASS_L3' ... 0.000s elapsed (0.000s cpu)
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
#> lapply optimization is on, j unchanged as '.N'
#> GForce optimized j to '.N'
#> Making each group and running j (GForce TRUE) ... 0.000s elapsed (0.000s cpu)
#> CLASS N
#> 1: ffff 3
#> 2: dddd 3
#> 3: eeee 1
#> 4: gggg 2
#> 5: eeee 1
#> 6: gggg 1
#> 7: cccc 1
#> 8: gggg 1
#> 9: cccc 1
#> 10: eeee 1
#> 11: gggg 1
#> 12: cccc 1
#> 13: aaaa 22
#> 14: bbbb 2
#> 15: aaaa 8
#> 16: bbbb 3
#> 17: aaaa 7
#> 18: bbbb 1
#> 19: aaaa 5
#> 20: bbbb 1
#> 21: aaaa 2
#> 22: bbbb 1
#> 23: eeee 1
#> 24: aaaa 5
#> 25: dddd 1
#> CLASS N
dt[, .N, by = CLASS, verbose = TRUE]
#> Detected that j uses these columns: <none>
#> Finding groups using forderv ... 0.000s elapsed (0.000s cpu)
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
#> Getting back original order ... 0.000s elapsed (0.000s cpu)
#> lapply optimization is on, j unchanged as '.N'
#> GForce optimized j to '.N'
#> Making each group and running j (GForce TRUE) ... 0.000s elapsed (0.000s cpu)
#> CLASS N
#> 1: aaaa 49
#> 2: dddd 4
#> 3: gggg 5
#> 4: eeee 4
#> 5: ffff 3
#> 6: cccc 3
#> 7: bbbb 8
`keyby` may use the wrong index if the `keyby` column name is the leading part of the index column name. For example, the index is `CLASS_L3` while `keyby` is `CLASS`.
After a shallow investigation, I suspect the line referenced below is the cause of this bug. I don't understand the reason for using `substring()` at all; it does not seem necessary. @mattdowle Any ideas?
`data.table/R/data.table.R`, line 867 in 426b71d
I will be happy to file a PR with tests if there's no particular reason to compare to a subset of the indices.
Thanks.
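
To make the suspicion concrete, here is a minimal base-R sketch of how a prefix comparison built on `substring()` could confuse the two names. It is an illustration under assumptions, not the actual code at the referenced line: the variable names `index_names` and `by_col` are hypothetical, and the real matching logic may differ.

```r
# Hypothetical illustration only; this is NOT the code in data.table.R.
index_names <- c("CLASS_L3")  # index names, as indices(dt) would report them
by_col <- "CLASS"             # the column passed to keyby

# Prefix comparison: truncate each index name to the length of the keyby
# name before comparing. "CLASS_L3" is cut down to "CLASS", so it matches.
prefix_match <- substring(index_names, 1L, nchar(by_col)) == by_col

# Exact comparison: the whole index name must equal the keyby name.
exact_match <- index_names == by_col

print(prefix_match)
#> [1] TRUE
print(exact_match)
#> [1] FALSE
```

If the matching works along these lines, an index on `CLASS_L3` would be accepted as if it were an index on `CLASS`, which is consistent with the verbose message "Finding groups using uniqlist on index 'CLASS_L3'" in the example.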
The reproducible example
(Note: although the example was run under R 3.4.4, the results are reproducible on R 3.5.3 as well.)
After the index is set, the `keyby` expression returns wrong results. In comparison, using `by` still returns correct results.
Created on 2019-04-10 by the reprex package (v0.2.1)
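
For anyone who wants to check whether their installed version is affected, the comparison can be scripted. This is a rough sketch on a small made-up table (the data and the names `expected`, `actual`, and `workaround` are mine, not from the report). It assumes that `by` followed by an explicit sort is a trustworthy reference, and that dropping all indices with `setindex(dt, NULL)` sends `keyby` back to the `forderv` path shown in the first verbose output.

```r
library(data.table)

# Small made-up table; the index name starts with the keyby column name.
dt <- data.table(
  CLASS    = c("b", "a", "b", "c", "a", "b"),
  CLASS_L3 = c("x", "y", "z", "x", "z", "y")
)
setindex(dt, CLASS_L3)

# Reference result: group with `by`, then sort the groups explicitly.
expected <- dt[, .N, by = CLASS][order(CLASS)]

# Result under test: keyby while the CLASS_L3 index is present.
actual <- dt[, .N, keyby = CLASS]

# TRUE means the installed version is not affected for this table.
# check.attributes = FALSE ignores the key that keyby sets on its result.
print(isTRUE(all.equal(expected, actual, check.attributes = FALSE)))

# Possible workaround (an assumption, not a confirmed fix): drop all
# indices before grouping so that no index can be picked up by mistake.
setindex(dt, NULL)
workaround <- dt[, .N, keyby = CLASS]
print(isTRUE(all.equal(expected, workaround, check.attributes = FALSE)))
```

Dropping the indices discards whatever speed-up they gave to other queries, so this is only a stopgap until the matching itself is fixed.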