Skip to content

Strange results from a non-equi join with multiple conditions #2275

@franknarf1

Description

@franknarf1

This was brought up on SO.

The goal is to find out if each row in DT1 has a match in DT2 in the sense of on=.(RANDOM_STRING, DATE >= START_DATE, DATE <= EXPIRY_DATE):

set.seed(123)
library(data.table)
library(stringi)
# Sorry that it requires stringi; I couldn't find another way forward besides the OP's verbatim example.

n <- 100000

DT1 <- data.table(RANDOM_STRING = stri_rand_strings(n, 5, pattern = "[a-k]"),
                  DATE = sample(seq(as.Date('2016-01-01'), as.Date('2016-12-31'), by="day"), n, replace=T))

DT2 <- data.table(RANDOM_STRING = stri_rand_strings(n, 5, pattern = "[a-k]"),
                  START_DATE = sample(seq(as.Date('2015-01-01'), as.Date('2017-12-31'), by="day"), n, replace=T))

DT2[, EXPIRY_DATE := START_DATE + floor(runif(1000, 200,300))]

My usual approach is to do a join, counting matches with .N and by=.EACHI. However, the OP found that this fails here:

# correct result (takes a long time)

    DT1[, m_ok := DT2[.BY, on=.(RANDOM_STRING), inrange(DATE, START_DATE, EXPIRY_DATE)], by=RANDOM_STRING]

# my usual approach

    DT1[, MATCHED := FALSE]

    DT1[!(MATCHED), MATCHED := DT2[.SD, on=.(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI ]$N > 0L ]

# comparison

    DT1[MATCHED != m_ok | (MATCHED & is.na(m_ok))] 

    # shows many cases of failure from the attempt
    # for example...

    DT1[RANDOM_STRING == "egkja"]; DT2[RANDOM_STRING == "egkja"]

    #    RANDOM_STRING       DATE MATCHED  m_ok
    # 1:         egkja 2016-05-19    TRUE FALSE
    # 2:         egkja 2016-06-02   FALSE FALSE
    # 3:         egkja 2016-05-20    TRUE FALSE
    # 4:         egkja 2016-03-16   FALSE  TRUE

    #    RANDOM_STRING START_DATE EXPIRY_DATE
    # 1:         egkja 2015-09-07  2016-04-17

There is probably a way to come up with the correct result in a less slow way (foverlaps?), but my point is that I expect the .N, by=.EACHI]$N > 0L way to work. Is it failing thanks to a bug or am I mistaken in using it here?

I had trouble making a smaller example. Drop the n parameter by a factor of 10 and you'll see that the problem disappears. Stranger, the OP noticed that if you repeatedly run the DT1[!(MATCHED), MATCHED := ... ] line, it will keep making changes over many iterations. Also, the OP said they couldn't construct an example when the on= condition only contained one inequality.

EDIT: one faster way of coming up with the correct result, thanks to SO OP:

w = DT1[DT2, on=.(RANDOM_STRING, DATE >= START_DATE, DATE <= EXPIRY_DATE), which=TRUE, nomatch=0]
DT1[, m := DT1_ID %in% w ]

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugnon-equi joinsrolling, overlapping, non-equi joins

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions