
non-equi join with .EACHI fails unexpectedly (seemingly related to values in join columns) #4489

@Henrik-P

Description


I have some time series with values. Apart from different values, the structure of the data sets is the same. A small example:

set.seed(2)
d = data.table(time = 1:8, v = sample(8))
d
#    time v
# 1:    1 5
# 2:    2 7
# 3:    3 6
# 4:    4 1
# 5:    5 8
# 6:    6 4
# 7:    7 2
# 8:    8 3

For each time, I want to determine the number of rows after the focal time where the values are larger than the focal value. E.g. for time 2 (value 7) there is one subsequent value which is higher (8 at time 5).

There may be better ways to reach my goal, but here's the code I used, which gave me the desired result:

d[d, on = .(time > time, v > v), .N, by = .EACHI]
#    time v N
# 1:    1 5 3
# 2:    2 7 1
# 3:    3 6 1
# 4:    4 1 4
# 5:    5 8 0
# 6:    6 4 0
# 7:    7 2 1
# 8:    8 3 0
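As a sanity check, the same counts can be computed in plain base R with a brute-force loop over rows. This is just a sketch to make the expected result explicit; it hardcodes the `v` vector printed above rather than relying on `sample()`, since `sample()` output depends on the R version's RNG:

```r
time <- 1:8
v <- c(5L, 7L, 6L, 1L, 8L, 4L, 2L, 3L)  # the values printed above for set.seed(2)

# For each row i, count later rows whose value is strictly larger
N <- sapply(seq_along(time), function(i) sum(time > time[i] & v > v[i]))
N
# [1] 3 1 1 4 0 0 1 0
```

This matches the `.EACHI` output above, so the query itself expresses the intended computation.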

The code works for most data, but in some rare instances it errors. After digging back and forth, I thought I had found the trigger: the error appears when the value `v` is set equal to `time` on two rows. E.g.:

d[time == 2L, v := 2L]
d[time == 7L, v := 7L]

d[d, on = .(time > time, v > v), .N, by = .EACHI, verbose = TRUE]

# This chunk of the message is the same as in the code above:

# i.time has same type (integer) as x.time. No coercion needed.
# i.v has same type (integer) as x.v. No coercion needed.
# Non-equi join operators detected ... 
# forder took ... forder.c received 8 rows and 2 columns
# 0.020s elapsed (0.000s cpu) 
# Generating non-equi group ids ... done in 0.000s elapsed (0.000s cpu) 
# Recomputing forder with non-equi ids ... Assigning to all 8 rows
# RHS_list_of_columns == false
# RHS for item 1 has been duplicated because NAMED==7 MAYBE_SHARED==1, but then is being plonked. length(values)==8; length(cols)==1)
# forder.c received 8 rows and 3 columns
# done in 0.000s elapsed (0.000s cpu)

# From here there are some differences in the message:
 
# Found 3 non-equi group(s) ...     # Above: Found 4 non-equi group(s) ...
# Starting bmerge ...
# forder.c received 8 rows and 2 columns
# bmerge done in 0.000s elapsed (0.000s cpu) 
# forder.c received 14 rows and 3 columns     # Above: forder.c received 11 rows and 3 columns
# Constructing irows for '!byjoin || nqbyjoin'

# Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
# Join results in 18 rows; more than 16 = nrow(x)+nrow(i).
# Check for duplicate key values in i each of which join to the same group in x over and over again.
# If that's ok, try by=.EACHI to run j for each group to avoid the large allocation.

However, if the value at time 5 is changed instead of the one at time 7, it works...

set.seed(2)
d = data.table(time = 1:8, v = sample(8))
d[time == 2L, v := 2L]
d[time == 5L, v := 5L]

d[d, on = .(time > time, v > v), .N, by = .EACHI, verbose = TRUE]

What's going on here? (Most parsimonious guess: user error ;) I think I'm slightly confused by the error message. Although `Join results in 18 rows; more than 16 = nrow(x)+nrow(i).` is true for the join per se, I use `.EACHI` to aggregate the data to fewer rows than the `nrow(x)+nrow(i)` limit. In addition, `try by=.EACHI` is indeed a great suggestion, but I already use it. I have clearly misunderstood something here.
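For what it's worth, the over-allocation check can be bypassed with `allow.cartesian = TRUE`, which is exactly the condition tested in the `vecseq` call shown in the traceback. This is only a workaround sketch, not a fix; if the allocation check trips where it shouldn't, the counts it then returns deserve a cross-check against a brute-force computation:

```r
library(data.table)

# Rebuild the failing example (same as above, values as printed for set.seed(2))
d <- data.table(time = 1:8, v = c(5L, 7L, 6L, 1L, 8L, 4L, 2L, 3L))
d[time == 2L, v := 2L]
d[time == 7L, v := 7L]

# Same query as before, but allow the intermediate join to exceed nrow(x)+nrow(i)
res <- d[d, on = .(time > time, v > v), .N, by = .EACHI, allow.cartesian = TRUE]
res  # one row per row of i, as .EACHI guarantees
```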


Tried on Windows, R version 3.6.3, data.table 1.12.9 IN DEVELOPMENT built 2020-02-20, and on Windows, R version 4.0.0, data.table 1.12.8

Labels: non-equi joins