Skip to content

Expand allow.cartesian error hints for non-equi joins #2086

@franknarf1

Description

@franknarf1

Taken from a good question on SO...

library(data.table)
DT = setDT(structure(list(City = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 1L, 
1L, 1L), .Label = c("LA", "NY"), class = "factor"), YearFrom = c(2001L, 
2003L, 2002L, 2006L, 2008L, 2004L, 2005L, 2005L, 2002L), YearTo = c(NA, 
2005L, NA, NA, 2009L, NA, 2008L, NA, NA), quasiYearTo = c(2017L, 
2005L, 2017L, 2017L, 2009L, 2017L, 2008L, 2017L, 2017L)), .Names = c("City", 
"YearFrom", "YearTo", "quasiYearTo"), row.names = c(NA, -9L), class = "data.frame"))


   City YearFrom YearTo quasiYearTo
1:   NY     2001     NA        2017
2:   NY     2003   2005        2005
3:   NY     2002     NA        2017
4:   NY     2006     NA        2017
5:   NY     2008   2009        2009
6:   LA     2004     NA        2017
7:   LA     2005   2008        2008
8:   LA     2005     NA        2017
9:   LA     2002     NA        2017

The goal is to count, for every city and year in the data, how many firms are active (have YearFrom <= Year < quasiYearTo). This code works:

df[CJ(City = City, Year = min(YearFrom):max(YearTo, na.rm=TRUE), unique=TRUE), 
  on=.(City, YearFrom <= Year, quasiYearTo > Year), allow.cartesian = TRUE, 
  .N
, by=.EACHI][, .(City, Year = YearFrom, N)]


    City Year N
 1:   LA 2001 0
 2:   LA 2002 1
 3:   LA 2003 1
 4:   LA 2004 2
 5:   LA 2005 4
 6:   LA 2006 4
 7:   LA 2007 4
 8:   LA 2008 3
 9:   LA 2009 3
10:   NY 2001 1
11:   NY 2002 2
12:   NY 2003 3
13:   NY 2004 3
14:   NY 2005 2
15:   NY 2006 3
16:   NY 2007 3
17:   NY 2008 4
18:   NY 2009 3

However, if I drop allow.cartesian, I get the usual error message:

DT[CJ(City = City, Year = min(YearFrom):max(YearTo, na.rm=TRUE), unique=TRUE), 
  on=.(City, YearFrom <= Year, quasiYearTo > Year), .N, by=.EACHI]

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in 47 rows; more than 27 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

So my i has no duplicates:

DT[, CJ(City = City, Year = min(YearFrom):max(YearTo, na.rm=TRUE), unique=TRUE)][, 
  anyDuplicated(.SD)] # 0

...and I am using by=.EACHI already. It would be nice if the warning message said gave a more relevant hint here (though I don't know what that might be, since I'm not actually clear on why I need allow.cartesian on here). Anyway, a minor suggestion.

An alternative would be to turn off the cartesian check (or downgrade it from error to verbose message) if it's a non-equi join and by=.EACHI, since that case seems pretty safe.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions