I have a data.table whose row count is close to the MAXINT row limit. Reducing the table to unique rows on its key attempts to allocate far too much memory.
example
> affils = readRDS(file=sprintf(ALL_DATA, 'affils_all.rds'))
> setDT(affils)
> sapply(affils, class)
$fan_id
[1] "integer"
$contact_id
[1] "integer"
$is_first
[1] "integer"
$is_second
[1] "integer"
$created_at
[1] "POSIXct" "POSIXt"
> nrow(affils)
[1] 2127968526
> key(affils)
[1] "fan_id" "contact_id"
> affils = affils[, .(is_first=max(is_first), is_second=max(is_second),
+ created_at=min(created_at)),
+ keyby=.(fan_id, contact_id)]
Error in uniqlist(byval) :
'Realloc' could not re-allocate memory (18446744065119617024 bytes)
Enter a frame number, or 0 to exit
1: affils[, .(is_first = max(is_first), is_second = max(is_second), created_at = min(created_at)), keyby = .(fan_id, contact_id)]
2: `[.data.table`(affils, , .(is_first = max(is_first), is_second = max(is_second), created_at = min(created_at)), keyby = .(fan_id, contact_id))
3: uniqlist(byval)
I really don't think it should be necessary to allocate that much memory to run this operation. There are tens of millions of unique users and tens of millions of unique contacts. It is probable that there are in fact no duplicate keys in that table; I was really just running this as a sanity check.
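Since the table is already sorted by its key, a duplicate check does not strictly need a grouped aggregation: any duplicate keys must sit in adjacent rows. The following is a minimal sketch of that idea in base R. The helper name `has_adjacent_dup` and the toy data are hypothetical, and it assumes the key columns contain no NAs. It builds temporary vectors as long as the table, so on roughly 2.1 billion rows it would still need several GB and might have to be run in chunks.

```r
# Hypothetical helper: TRUE if any two adjacent rows agree on all `cols`.
# Assumes `dt` is already sorted by `cols` and that `cols` contain no NAs.
has_adjacent_dup <- function(dt, cols) {
  n <- nrow(dt)
  if (n < 2L) return(FALSE)
  same <- rep(TRUE, n - 1L)
  for (col in cols) {
    v <- dt[[col]]
    # Compare every row with the row directly before it.
    same <- same & (v[-1L] == v[-n])
  }
  any(same)
}

# Made-up toy data, sorted by (fan_id, contact_id); row 4 repeats row 3's key.
toy <- data.frame(
  fan_id     = c(1L, 1L, 2L, 2L, 3L),
  contact_id = c(10L, 11L, 10L, 10L, 12L)
)
toy_unique <- toy[-4L, ]

has_dup <- has_adjacent_dup(toy, c("fan_id", "contact_id"))
no_dup  <- has_adjacent_dup(toy_unique, c("fan_id", "contact_id"))
```

The helper only uses `nrow()` and `[[`, so it should work unchanged on a keyed data.table, for example by passing `key(affils)` as `cols`.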
Potentially related: I think I am seeing memory count overflows (i.e. attempts to allocate a negative amount of memory) in rbind and/or forderv. Unfortunately I don't have the output, as I had to kill the screen window those R sessions were running in. Basically, I had tables structurally similar to the one above but with far fewer rows, and was rbind'ing them and then running the same unique operation as above. I had checked that the total number of rows was less than MAXINT. I do have part of the error from my search history:
failed to realloc working memory stack data.table
None of this should be constrained by the amount of memory on the machine, as the process was only using about 10-15% of the total available RAM.
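The byte count in the 'Realloc' error is consistent with an overflow rather than a genuine memory need: 18446744065119617024 is exactly 2^64 - 2^33, which is what -2^33 looks like when treated as an unsigned 64-bit size. The arithmetic can be checked with the short sketch below. The final step assumes 4-byte (integer) elements, which is a guess and has not been confirmed against the data.table source.

```r
# Every value here is exactly representable as a double,
# because 2^64 - 2^33 == 2^33 * (2^31 - 1), so the arithmetic is exact.
wrapped_bytes <- 2^64 - 2^33

# Print the value in full for comparison with the error message.
sprintf("%.0f", wrapped_bytes)

# Reinterpret the unsigned 64-bit value as a signed one.
signed_bytes <- wrapped_bytes - 2^64

# Assumption: the buffer holds 4-byte integers.
signed_count <- signed_bytes / 4

# The resulting element count is one below the most negative value
# R itself allows for an integer, i.e. the 32-bit minimum.
is_int_min <- signed_count == -(.Machine$integer.max) - 1
```

If that assumption holds, the requested element count is -2^31, the value a 32-bit signed counter wraps to when it goes one past MAXINT, for example when a buffer length of 2^30 is doubled.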
sessionInfo
> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS
Matrix products: default
BLAS: /usr/lib/atlas-base/libf77blas.so.3.0
LAPACK: /usr/lib/lapack/liblapack.so.3.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] Matrix_1.2-12 lubridate_1.7.1 fasttime_1.0-2 data.table_1.10.4-3
loaded via a namespace (and not attached):
[1] compiler_3.4.2 magrittr_1.5 tools_3.4.2 Rcpp_0.12.14 stringi_1.1.6 grid_3.4.2 stringr_1.2.0 lattice_0.20-35