I have some time series with values. Apart from the different values, the structure of the data sets is the same. A small example:
library(data.table)
set.seed(2)
d = data.table(time = 1:8, v = sample(8))
d
# time v
# 1: 1 5
# 2: 2 7
# 3: 3 6
# 4: 4 1
# 5: 5 8
# 6: 6 4
# 7: 7 2
# 8: 8 3
For each time, I want to determine the number of rows after the focal time where the values are larger than the focal value. E.g. for time 2 (value 7) there is one subsequent value which is higher (8 at time 5).
There may be better ways to reach my goal, but here's the code I used and which gave me the desired result:
d[d, on = .(time > time, v > v), .N, by = .EACHI]
# time v N
# 1: 1 5 3
# 2: 2 7 1
# 3: 3 6 1
# 4: 4 1 4
# 5: 5 8 0
# 6: 6 4 0
# 7: 7 2 1
# 8: 8 3 0
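As a cross-check of the expected result, here is the same count computed by the plain definition above, independent of data.table (a minimal sketch in Python, not part of my actual code):

```python
# For each position, count the later values strictly greater than the focal value.
# Same values as d$v in the example above.
v = [5, 7, 6, 1, 8, 4, 2, 3]
counts = [sum(later > focal for later in v[i + 1:]) for i, focal in enumerate(v)]
print(counts)  # matches the N column: [3, 1, 1, 4, 0, 0, 1, 0]
```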
The code works for most data, but in some rare instances it errors. After some digging back and forth, I thought I had found the trigger: the error occurs when the value 'v' is set equal to 'time' on two rows. E.g.:
d[time == 2L, v := 2L]
d[time == 7L, v := 7L]
d[d, on = .(time > time, v > v), .N, by = .EACHI, verbose = TRUE]
# This chunk of the message is the same as in the code above:
# i.time has same type (integer) as x.time. No coercion needed.
# i.v has same type (integer) as x.v. No coercion needed.
# Non-equi join operators detected ...
# forder took ... forder.c received 8 rows and 2 columns
# 0.020s elapsed (0.000s cpu)
# Generating non-equi group ids ... done in 0.000s elapsed (0.000s cpu)
# Recomputing forder with non-equi ids ... Assigning to all 8 rows
# RHS_list_of_columns == false
# RHS for item 1 has been duplicated because NAMED==7 MAYBE_SHARED==1, but then is being plonked. length(values)==8; length(cols)==1)
# forder.c received 8 rows and 3 columns
# done in 0.000s elapsed (0.000s cpu)
# From here there are some differences in the message:
# Found 3 non-equi group(s) ... # Above: Found 4 non-equi group(s) ...
# Starting bmerge ...
# forder.c received 8 rows and 2 columns
# bmerge done in 0.000s elapsed (0.000s cpu)
# forder.c received 14 rows and 3 columns # Above: forder.c received 11 rows and 3 columns
# Constructing irows for '!byjoin || nqbyjoin'
# Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
# Join results in 18 rows; more than 16 = nrow(x)+nrow(i).
# Check for duplicate key values in i each of which join to the same group in x over and over again.
# If that's ok, try by=.EACHI to run j for each group to avoid the large allocation.
However, if the value at time 5 is changed instead of the one at time 7, it works...
set.seed(2)
d = data.table(time = 1:8, v = sample(8))
d[time == 2L, v := 2L]
d[time == 5L, v := 5L]
d[d, on = .(time > time, v > v), .N, by = .EACHI, verbose = TRUE]
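For reference, applying the plain definition (count of later, strictly larger values) to both modified data sets shows what the join should return in each case. This is just a sketch in Python of the counting rule, to show the data itself is not degenerate; it says nothing about the data.table code path:

```python
def count_later_larger(v):
    """For each position, count later values strictly greater than the focal value."""
    return [sum(later > x for later in v[i + 1:]) for i, x in enumerate(v)]

# v after v[time == 2] <- 2 and v[time == 7] <- 7 (the variant that errors)
print(count_later_larger([5, 2, 6, 1, 8, 4, 7, 3]))  # [3, 5, 2, 4, 0, 1, 0, 0]
# v after v[time == 2] <- 2 and v[time == 5] <- 5 (the variant that works)
print(count_later_larger([5, 2, 6, 1, 5, 4, 2, 3]))  # [1, 4, 0, 4, 0, 0, 1, 0]
```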
What's going on here? (The most parsimonious guess: user error. ;) I think I'm slightly confused by the error message. Although "Join results in 18 rows; more than 16 = nrow(x)+nrow(i)." is true for the join per se, I use .EACHI to aggregate the data to fewer rows than the nrow(x)+nrow(i) limit. In addition, "try by=.EACHI" is indeed a great suggestion, but I already use it. I have clearly misunderstood something here.
Tried on Windows with R version 3.6.3 and data.table 1.12.9 (IN DEVELOPMENT, built 2020-02-20), and on Windows with R version 4.0.0 and data.table 1.12.8.