
Unexpected missing matches with non-equi join with grouping by .EACHI #4911

@adamaltmejd

I'm working with proprietary data, so creating a reproducible example was a bit tricky, but I think this works.

X <- setDT(structure(list(id = c(6456372L, 6456372L, 6456372L, 6456372L, 
6456372L, 6456372L, 6456372L, 6456372L, 6456372L, 6456372L, 6456372L, 
6456372L, 6456372L, 6456372L), id_round = c(197801L, 199405L, 
199501L, 197901L, 197905L, 198001L, 198005L, 198101L, 198105L, 
198201L, 198205L, 198301L, 198305L, 198401L), field = c(NA, NA, 
NA, "medicine", "medicine", "medicine", "medicine", "medicine", 
"medicine", "medicine", "medicine", "medicine", "medicine", "medicine"
)), class = c("data.table", "data.frame"
), sorted = "id"))

Y <- setDT(structure(list(id = c(6456372L, 6456345L, 6456356L), id_round = c(197705L, 
197905L, 201705L), field = c("medicine", "teaching", "health"
), prio = c(6L, 1L, 10L)), class = c("data.table", 
"data.frame"), sorted = c("id_round", 
"id", "prio", "field")))

X[Y, on = .(id, id_round > id_round, field), .(x.id_round[1], i.id_round[1]), by = .EACHI]
        id id_round    field     V1     V2
1: 6456372   197705 medicine 197901 197705
2: 6456345   197905 teaching     NA 197905
3: 6456356   201705   health     NA 201705

So everything seems to work fine, but these results are supposed to be merged back into the main data set Y, and this is where I run into trouble. They do not merge, and moreover I can no longer subset the result by id:

> X[Y, on = .(id, id_round > id_round, field), .(x.id_round[1], i.id_round[1]), by = .EACHI][id == 6456372]              
Empty data.table (0 rows and 5 cols): id,id_round,field,V1,V2

I would expect to find a match here, of course. The strange thing is that it works if I drop by = .EACHI, or if I drop the key column "prio" from Y:

> X[Y, on = .(id, id_round > id_round, field), .(id, field, x.id_round[1], i.id_round[1])][id == 6456372]                
         id    field     V3     V4
 1: 6456372 medicine 197901 197705
 2: 6456372 medicine 197901 197705
 3: 6456372 medicine 197901 197705
 4: 6456372 medicine 197901 197705
 5: 6456372 medicine 197901 197705
 6: 6456372 medicine 197901 197705
 7: 6456372 medicine 197901 197705
 8: 6456372 medicine 197901 197705
 9: 6456372 medicine 197901 197705
10: 6456372 medicine 197901 197705
11: 6456372 medicine 197901 197705
> X[Y[, .(id, id_round, field)], on = .(id, id_round > id_round, field), .(x.id_round[1], i.id_round[1]), by = .EACHI][id == 6456372]                                                                                                             
        id id_round    field     V1     V2
1: 6456372   197705 medicine 197901 197705

Y's key includes "prio", but that column is not part of the join. The problem seems to depend on how the id value compares with the other ids in Y, because if I change 6456372 to 6456344 or anything lower I get the expected results.
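
A small diagnostic sketch, in case it helps narrow this down. It assumes (this is a guess from the observations above, not a confirmed cause) that the join result carries key metadata that misleads the optimized `id == ...` subset. It rebuilds X and Y with data.table() and setkey() instead of structure(), which may or may not reproduce the problem on a given version. `res` and `n_plain` are names introduced here for illustration only.

```r
library(data.table)

# Rebuild the example tables; the keys mirror the "sorted" attributes above.
X <- data.table(
  id = rep(6456372L, 14),
  id_round = c(197801L, 199405L, 199501L, 197901L, 197905L, 198001L, 198005L,
               198101L, 198105L, 198201L, 198205L, 198301L, 198305L, 198401L),
  field = c(rep(NA_character_, 3), rep("medicine", 11))
)
setkey(X, id)

Y <- data.table(
  id = c(6456372L, 6456345L, 6456356L),
  id_round = c(197705L, 197905L, 201705L),
  field = c("medicine", "teaching", "health"),
  prio = c(6L, 1L, 10L)
)
setkey(Y, id_round, id, prio, field)

res <- X[Y, on = .(id, id_round > id_round, field),
         .(x.id_round[1], i.id_round[1]), by = .EACHI]

# Which columns, if any, does the result claim to be sorted by?
key(res)

# Count matches with a plain vector comparison, which does not go through
# data.table's subset optimization at all.
n_plain <- sum(res$id == 6456372L)

# The optimized subset; if this has fewer rows than n_plain,
# the key metadata on res is a likely suspect.
res[id == 6456372]

# Clear the key and retry the same subset.
setkey(res, NULL)
res[id == 6456372]
```

If the subset only starts returning the row after setkey(res, NULL), that would point at the key attached to the by = .EACHI result rather than at the join itself.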

Running latest dev:

> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.13.7 colorout_1.2-2   

loaded via a namespace (and not attached):
[1] compiler_4.0.4 jsonlite_1.7.2 rlang_0.4.10  
