Strange results from a non-equi join with multiple conditions

This was brought up [on SO](https://stackoverflow.com/q/45212781/). 

The goal is to find out if each row in DT1 has a match in DT2 in the sense of `on=.(RANDOM_STRING, DATE >= START_DATE, DATE <= EXPIRY_DATE)`:

```
set.seed(123)
library(data.table)
library(stringi)
# Sorry that it requires stringi; I couldn't find another way forward besides the OP's verbatim example.

n <- 100000

DT1 <- data.table(RANDOM_STRING = stri_rand_strings(n, 5, pattern = "[a-k]"),
                  DATE = sample(seq(as.Date('2016-01-01'), as.Date('2016-12-31'), by="day"), n, replace=T))

DT2 <- data.table(RANDOM_STRING = stri_rand_strings(n, 5, pattern = "[a-k]"),
                  START_DATE = sample(seq(as.Date('2015-01-01'), as.Date('2017-12-31'), by="day"), n, replace=T))

DT2[, EXPIRY_DATE := START_DATE + floor(runif(1000, 200,300))]
```
My usual approach is to do a join, counting matches with `.N` and `by=.EACHI`. However, the OP found that this fails here:

```
# correct result (takes a long time)

    DT1[, m_ok := DT2[.BY, on=.(RANDOM_STRING), inrange(DATE, START_DATE, EXPIRY_DATE)], by=RANDOM_STRING]

# my usual approach

    DT1[, MATCHED := FALSE]

    DT1[!(MATCHED), MATCHED := DT2[.SD, on=.(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI ]$N > 0L ]

# comparison

    DT1[MATCHED != m_ok | (MATCHED & is.na(m_ok))] 

    # shows many cases of failure from the attempt
    # for example...

    DT1[RANDOM_STRING == "egkja"]; DT2[RANDOM_STRING == "egkja"]

    #    RANDOM_STRING       DATE MATCHED  m_ok
    # 1:         egkja 2016-05-19    TRUE FALSE
    # 2:         egkja 2016-06-02   FALSE FALSE
    # 3:         egkja 2016-05-20    TRUE FALSE
    # 4:         egkja 2016-03-16   FALSE  TRUE

    #    RANDOM_STRING START_DATE EXPIRY_DATE
    # 1:         egkja 2015-09-07  2016-04-17
```

There is probably a way to come up with the correct result in a less slow way (foverlaps?), but my point is that I expect the `.N, by=.EACHI]$N > 0L` way to work. Is it failing thanks to a bug or am I mistaken in using it here?

I had trouble making a smaller example. Drop the `n` parameter by a factor of 10 and you'll see that the problem disappears. Stranger, the OP noticed that if you repeatedly run the `DT1[!(MATCHED), MATCHED := ... ]` line, it will keep making changes over many iterations. Also, the OP said they couldn't construct an example when the `on=` condition only contained one inequality.

EDIT: one faster way of coming up with the correct result, thanks to SO OP:

```
w = DT1[DT2, on=.(RANDOM_STRING, DATE >= START_DATE, DATE <= EXPIRY_DATE), which=TRUE, nomatch=0]
DT1[, m := DT1_ID %in% w ]
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Strange results from a non-equi join with multiple conditions #2275

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Strange results from a non-equi join with multiple conditions #2275

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions