We are experiencing an error when creating the M2 matrix for large datasets. The error stems from a row limit in data.table: it indexes rows with 32-bit signed integers, so a data.table, including the result of a join, can hold at most 2^31 - 1 = 2,147,483,647 rows. 64-bit builds of R have supported longer atomic vectors since R 3.0.0, but data.table does not use them, so running 64-bit R does not lift the limit. In our case, the merge in Step 2 would produce more rows than that.
┌── Step 1 | Modifying the m1_inclusion_matrix
├── Step 2 | Creating M2
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
Calls: <Anonymous> ... merge -> merge.data.table -> [ -> [.data.table -> vecseq
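As the error message suggests, it may help to check how many rows the join would produce before running it. The following is a minimal sketch, not the package's code: the tables `x` and `i` and the key column `id` are hypothetical stand-ins for the two tables merged in Step 2.

```r
library(data.table)

# Hypothetical stand-ins for the two merged tables; "id" is an assumed key column.
x <- data.table(id = c(1L, 1L, 2L, 3L), a = 1:4)
i <- data.table(id = c(1L, 1L, 1L, 2L), b = 1:4)

# Count rows per key in each table, then match the counts on the key.
x_counts <- x[, .(n_x = .N), by = id]
i_counts <- i[, .(n_i = .N), by = id]
counts <- merge(x_counts, i_counts, by = "id")

# An inner join yields n_x * n_i rows per key.
# Sum as double so the estimate itself cannot overflow an integer.
expected_rows <- counts[, sum(as.numeric(n_x) * as.numeric(n_i))]
exceeds_limit <- expected_rows > .Machine$integer.max
```

If `exceeds_limit` is `TRUE`, the keys with the largest `n_x * n_i` products in `counts` are the ones driving the blow-up, and they are worth inspecting for unintended duplicates.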
Some modifications are needed. One option is to batch the process: build an M2 matrix for each batch, then combine the batched M2 matrices into one universal M2 matrix.
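A batched merge could look roughly like the sketch below. The tables `m1` and `other`, the key column `id`, the `batch_size` value, and the per-batch reduction are all assumptions for illustration, since the actual M2 construction is not shown here.

```r
library(data.table)

# Hypothetical inputs; the real tables and key column are not shown in this report.
m1 <- data.table(id = rep(1:6, each = 2), v = 1:12)
other <- data.table(id = rep(1:6, times = 3), w = 1:18)

batch_size <- 2L  # distinct keys per batch (assumed tuning knob)
keys <- unique(m1$id)
batches <- split(keys, ceiling(seq_along(keys) / batch_size))

m2_parts <- lapply(batches, function(k) {
  # Join only the rows that belong to this batch of keys.
  part <- merge(m1[id %in% k], other[id %in% k], by = "id",
                allow.cartesian = TRUE)
  # Reduce each batch before combining, so the combined table stays small.
  part[, .(total = sum(v * w)), by = id]
})

# Stack the per-batch results into one table.
m2 <- rbindlist(m2_parts)
```

Batching only helps if each batch is reduced (aggregated, filtered, or written to disk) before the parts are combined. If the combined M2 itself needs more than 2^31 - 1 rows, `rbindlist` will run into the same limit.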