This was brought up on SO.
The goal is to find out if each row in DT1 has a match in DT2 in the sense of on=.(RANDOM_STRING, DATE >= START_DATE, DATE <= EXPIRY_DATE):
set.seed(123)
library(data.table)
library(stringi)
# Sorry that it requires stringi; I couldn't find another way forward besides the OP's verbatim example.
n <- 100000
DT1 <- data.table(RANDOM_STRING = stri_rand_strings(n, 5, pattern = "[a-k]"),
DATE = sample(seq(as.Date('2016-01-01'), as.Date('2016-12-31'), by="day"), n, replace=T))
DT2 <- data.table(RANDOM_STRING = stri_rand_strings(n, 5, pattern = "[a-k]"),
START_DATE = sample(seq(as.Date('2015-01-01'), as.Date('2017-12-31'), by="day"), n, replace=T))
DT2[, EXPIRY_DATE := START_DATE + floor(runif(1000, 200,300))]
My usual approach is to do a join, counting matches with .N and by=.EACHI. However, the OP found that this fails here:
# correct result (takes a long time)
DT1[, m_ok := DT2[.BY, on=.(RANDOM_STRING), inrange(DATE, START_DATE, EXPIRY_DATE)], by=RANDOM_STRING]
# my usual approach
DT1[, MATCHED := FALSE]
DT1[!(MATCHED), MATCHED := DT2[.SD, on=.(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI ]$N > 0L ]
# comparison
DT1[MATCHED != m_ok | (MATCHED & is.na(m_ok))]
# shows many cases of failure from the attempt
# for example...
DT1[RANDOM_STRING == "egkja"]; DT2[RANDOM_STRING == "egkja"]
# RANDOM_STRING DATE MATCHED m_ok
# 1: egkja 2016-05-19 TRUE FALSE
# 2: egkja 2016-06-02 FALSE FALSE
# 3: egkja 2016-05-20 TRUE FALSE
# 4: egkja 2016-03-16 FALSE TRUE
# RANDOM_STRING START_DATE EXPIRY_DATE
# 1: egkja 2015-09-07 2016-04-17
There is probably a way to come up with the correct result in a less slow way (foverlaps?), but my point is that I expect the .N, by=.EACHI]$N > 0L way to work. Is it failing thanks to a bug or am I mistaken in using it here?
I had trouble making a smaller example. Drop the n parameter by a factor of 10 and you'll see that the problem disappears. Stranger, the OP noticed that if you repeatedly run the DT1[!(MATCHED), MATCHED := ... ] line, it will keep making changes over many iterations. Also, the OP said they couldn't construct an example when the on= condition only contained one inequality.
EDIT: one faster way of coming up with the correct result, thanks to SO OP:
w = DT1[DT2, on=.(RANDOM_STRING, DATE >= START_DATE, DATE <= EXPIRY_DATE), which=TRUE, nomatch=0]
DT1[, m := DT1_ID %in% w ]
This was brought up on SO.
The goal is to find out if each row in DT1 has a match in DT2 in the sense of
on=.(RANDOM_STRING, DATE >= START_DATE, DATE <= EXPIRY_DATE):My usual approach is to do a join, counting matches with
.Nandby=.EACHI. However, the OP found that this fails here:There is probably a way to come up with the correct result in a less slow way (foverlaps?), but my point is that I expect the
.N, by=.EACHI]$N > 0Lway to work. Is it failing thanks to a bug or am I mistaken in using it here?I had trouble making a smaller example. Drop the
nparameter by a factor of 10 and you'll see that the problem disappears. Stranger, the OP noticed that if you repeatedly run theDT1[!(MATCHED), MATCHED := ... ]line, it will keep making changes over many iterations. Also, the OP said they couldn't construct an example when theon=condition only contained one inequality.EDIT: one faster way of coming up with the correct result, thanks to SO OP: