I see the following behavior which I believe indicates a bug in merge.data.table:
library(data.table)
packageDescription("data.table")$Version
# [1] "1.14.2"
some_letters <- c("c", "b", "a")
some_more_letters <- rep(c("a", "b", "c"), 2L)
dt1 <- data.table(x = some_letters, y=1:3)
dt2 <- data.table(x = factor(some_more_letters, levels=some_letters), z=1:6, key=c("x", "z"))
dt3 <- merge(dt1, dt2, by="x")
str(dt3)
# Classes ‘data.table’ and 'data.frame': 6 obs. of 3 variables:
# $ x: chr "c" "c" "b" "b" ...
# $ y: int 1 1 2 2 3 3
# $ z: int 3 6 2 5 1 4
# - attr(*, "sorted")= chr "x"
# - attr(*, ".internal.selfref")=<externalptr>
dt3[x %in% "c", ]
# Empty data.table (0 rows and 3 cols): x,y,z
dt3[(x %in% "c"), ]
#    x y z
# 1: c 1 3
# 2: c 1 6
I believe the problem is that dt3 thinks column x is sorted (and it would be if it were a factor), but it is not sorted as a character column. I assume that data.table has an internal optimized %in% operator that uses this information and then gives the wrong result when we attempt to subset on x %in% "c". Finally, I assume that wrapping the subset expression in parentheses avoids data.table's internal %in%, so the subset works correctly because it no longer uses the incorrect sorted attribute on dt3. Even if the last two assumptions are wrong, the behavior above seems incorrect.
The closest thing I could find is issue #499, but I think it is different.
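To make the suspicion above easier to check, here is a minimal sketch that compares what dt3 claims about its sort order with what is actually true of the column, and then drops the key as a possible workaround. It rests on my assumption that the stale key is what misleads the subset. The variable names claimed_key, x_is_unsorted, and rows_c are only illustrative. No expected values are shown for the first two, since they may differ between data.table versions.

```r
library(data.table)

# Same setup as in the report above.
some_letters <- c("c", "b", "a")
dt1 <- data.table(x = some_letters, y = 1:3)
dt2 <- data.table(x = factor(rep(c("a", "b", "c"), 2L), levels = some_letters),
                  z = 1:6, key = c("x", "z"))
dt3 <- merge(dt1, dt2, by = "x")

# What the table claims: the key stored in the "sorted" attribute (may be NULL).
claimed_key <- key(dt3)

# What is actually true: check the column values directly with base R.
# as.character() keeps the check meaningful if x comes back as a factor.
x_is_unsorted <- is.unsorted(as.character(dt3$x))

# Possible workaround (an assumption, not a confirmed fix): remove the key
# so the subset cannot rely on the claimed sort order.
setkey(dt3, NULL)
rows_c <- dt3[x %in% "c"]
```

If claimed_key contains "x" while x_is_unsorted is TRUE, the key on dt3 is invalid, which would be consistent with the empty result from dt3[x %in% "c", ].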
> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.14.2
loaded via a namespace (and not attached):
[1] compiler_3.6.3