Taken from a good question on SO...
library(data.table)
DT = setDT(structure(list(City = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 1L), .Label = c("LA", "NY"), class = "factor"), YearFrom = c(2001L,
2003L, 2002L, 2006L, 2008L, 2004L, 2005L, 2005L, 2002L), YearTo = c(NA,
2005L, NA, NA, 2009L, NA, 2008L, NA, NA), quasiYearTo = c(2017L,
2005L, 2017L, 2017L, 2009L, 2017L, 2008L, 2017L, 2017L)), .Names = c("City",
"YearFrom", "YearTo", "quasiYearTo"), row.names = c(NA, -9L), class = "data.frame"))
City YearFrom YearTo quasiYearTo
1: NY 2001 NA 2017
2: NY 2003 2005 2005
3: NY 2002 NA 2017
4: NY 2006 NA 2017
5: NY 2008 2009 2009
6: LA 2004 NA 2017
7: LA 2005 2008 2008
8: LA 2005 NA 2017
9: LA 2002 NA 2017
The goal is to count, for every city and year in the data, how many firms are active (have YearFrom <= Year < quasiYearTo). This code works:
df[CJ(City = City, Year = min(YearFrom):max(YearTo, na.rm=TRUE), unique=TRUE),
on=.(City, YearFrom <= Year, quasiYearTo > Year), allow.cartesian = TRUE,
.N
, by=.EACHI][, .(City, Year = YearFrom, N)]
City Year N
1: LA 2001 0
2: LA 2002 1
3: LA 2003 1
4: LA 2004 2
5: LA 2005 4
6: LA 2006 4
7: LA 2007 4
8: LA 2008 3
9: LA 2009 3
10: NY 2001 1
11: NY 2002 2
12: NY 2003 3
13: NY 2004 3
14: NY 2005 2
15: NY 2006 3
16: NY 2007 3
17: NY 2008 4
18: NY 2009 3
However, if I drop allow.cartesian, I get the usual error message:
DT[CJ(City = City, Year = min(YearFrom):max(YearTo, na.rm=TRUE), unique=TRUE),
on=.(City, YearFrom <= Year, quasiYearTo > Year), .N, by=.EACHI]
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in 47 rows; more than 27 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
So my i has no duplicates:
DT[, CJ(City = City, Year = min(YearFrom):max(YearTo, na.rm=TRUE), unique=TRUE)][,
anyDuplicated(.SD)] # 0
...and I am using by=.EACHI already. It would be nice if the warning message said gave a more relevant hint here (though I don't know what that might be, since I'm not actually clear on why I need allow.cartesian on here). Anyway, a minor suggestion.
An alternative would be to turn off the cartesian check (or downgrade it from error to verbose message) if it's a non-equi join and by=.EACHI, since that case seems pretty safe.
Taken from a good question on SO...
The goal is to count, for every city and year in the data, how many firms are active (have YearFrom <= Year < quasiYearTo). This code works:
However, if I drop allow.cartesian, I get the usual error message:
So my
ihas no duplicates:...and I am using
by=.EACHIalready. It would be nice if the warning message said gave a more relevant hint here (though I don't know what that might be, since I'm not actually clear on why I need allow.cartesian on here). Anyway, a minor suggestion.An alternative would be to turn off the cartesian check (or downgrade it from error to verbose message) if it's a non-equi join and
by=.EACHI, since that case seems pretty safe.