Join results in more than 2^31 rows #16

@Arshammik

We hit an error when creating the M2 matrix for large datasets. The error comes from a row limit in data.table: a single table, including a join result, cannot hold more than 2^31 - 1 = 2,147,483,647 rows. Base R itself has supported long vectors (more than 2^31 - 1 elements) on 64-bit builds since R 3.0.0, but data.table still indexes rows with 32-bit signed integers, so running 64-bit R does not lift the limit for joins.
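One way to confirm that a join would cross this limit is to count the matching rows per key before running the join. The following is a minimal sketch, not code from this package: `estimate_join_rows`, `x`, `i`, and the `id` column are hypothetical names, and it assumes an inner join on a single set of key columns.

```r
library(data.table)

# Hypothetical helper: estimate how many rows an inner join of `x` and `i`
# on `key_cols` would produce, without materialising the join itself.
estimate_join_rows <- function(x, i, key_cols) {
  # Count rows per key on each side; use doubles so the product cannot
  # overflow R's 32-bit integer range.
  nx <- x[, .(n_x = as.numeric(.N)), by = key_cols]
  ni <- i[, .(n_i = as.numeric(.N)), by = key_cols]
  counts <- merge(nx, ni, by = key_cols)
  # Each shared key contributes n_x * n_i rows to the join result.
  sum(counts$n_x * counts$n_i)
}

# Toy data standing in for the real inputs.
x <- data.table(id = c(1, 1, 2, 3), a = 1:4)
i <- data.table(id = c(1, 1, 1, 2), b = 1:4)

n_rows <- estimate_join_rows(x, i, "id")
too_large <- n_rows > .Machine$integer.max
```

If `too_large` is `TRUE`, the join cannot be materialised as a single data.table and has to be restructured.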

┌── Step 1 | Modifying the m1_inclusion_matrix
├── Step 2 | Creating M2
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
  Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
Calls: <Anonymous> ... merge -> merge.data.table -> [ -> [.data.table -> vecseq
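The error message suggests `by = .EACHI` as a way to avoid the large allocation. A minimal sketch of that pattern follows, assuming the downstream step only needs a per-row summary of the matches rather than the full joined table; the tables `x` and `i` and the columns `id` and `a` are toy placeholders, not the package's data.

```r
library(data.table)

# Toy placeholder tables.
x <- data.table(id = c(1, 1, 2, 3), a = c(10, 20, 5, 40))
i <- data.table(id = c(1, 1, 1, 2))

# With by = .EACHI, j is evaluated once per row of `i` on the matching rows
# of `x`, so the full cartesian result is never allocated.
res <- x[i, .(total_a = sum(a)), by = .EACHI, on = "id"]
```

Here `res` has one row per row of `i`. This only helps when the per-group result is much smaller than the expanded join.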

A fix is needed, e.g., building M2 in batches and then combining the per-batch M2 matrices into a single M2 matrix.
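The batching idea could look roughly like the sketch below. It is an illustration under stated assumptions, not the package's implementation: `batched_join`, `summarise_batch`, and the toy tables are hypothetical, and it assumes each batch can be reduced before the pieces are combined. Without that reduction, the combined table would hit the same 2^31 - 1 row limit.

```r
library(data.table)

# Hypothetical helper: join `x` and `i` on `key_col` one key batch at a time,
# reduce each batch with `summarise_batch`, then stack the reduced pieces.
batched_join <- function(x, i, key_col, n_batches, summarise_batch) {
  keys <- unique(i[[key_col]])
  # Assign keys to batches round-robin.
  batch_id <- (seq_along(keys) - 1L) %% n_batches
  key_batches <- split(keys, batch_id)

  parts <- lapply(key_batches, function(k) {
    keep_x <- x[[key_col]] %in% k
    keep_i <- i[[key_col]] %in% k
    # Only this batch's rows are expanded, so the intermediate stays small.
    joined <- merge(x[keep_x], i[keep_i], by = key_col, allow.cartesian = TRUE)
    summarise_batch(joined)
  })
  rbindlist(parts)
}

# Toy data and a toy reduction (row count per id).
x <- data.table(id = c(1, 1, 2, 3), a = 1:4)
i <- data.table(id = c(1, 1, 1, 2), b = 1:4)
res <- batched_join(x, i, "id", n_batches = 2,
                    summarise_batch = function(d) d[, .(n = .N), by = id])
```

Batching by key keeps all rows for a given key in the same batch, so per-key summaries stay correct after the pieces are stacked.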

Labels: bug, enhancement