regression in big grouping #4818

@jangorecki

While running the grouping benchmark on a 2e9-row dataset (96GB csv) using the recent stable data.table 1.13.2, I am getting the following exception:

> system.time( DT[, sum(v1), keyby=id1] )
Error in gforce(thisEnv, jsub, o__, f__, len__, irows) :
  Internal error: Failed to allocate counts or TMP when assigning g in gforce
Calls: system.time -> [ -> [.data.table -> gforce

The message comes from this check in the C source:

if (!counts || !TMP ) error(_("Internal error: Failed to allocate counts or TMP when assigning g in gforce"));

It is the same machine as the one used in 2014: 32 cores and 244GB memory.
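
For anyone who wants to run the same queries without a machine of this size, here is a minimal sketch of a scaled-down table. The column layout is inferred from the queries in this report (six id columns, three value columns at positions 7:9); the column types, the cardinalities, and the values of `N` and `K` are assumptions for illustration only. At this size it will not trigger the allocation error, it only exercises the same query shapes.

```r
library(data.table)

# Illustrative sizes (assumptions): the report uses 2e9 rows.
N <- 1e6L
K <- 100L
set.seed(108)

# Assumed layout: id1-id3 character, id4-id6 integer, v1-v3 values.
DT <- data.table(
  id1 = sample(sprintf("id%03d", 1:K), N, TRUE),         # few large groups
  id2 = sample(sprintf("id%03d", 1:K), N, TRUE),
  id3 = sample(sprintf("id%010d", 1:(N / K)), N, TRUE),  # many small groups
  id4 = sample(K, N, TRUE),                              # few large groups
  id5 = sample(K, N, TRUE),
  id6 = sample(N / K, N, TRUE),                          # many small groups
  v1  = sample(5L, N, TRUE),
  v2  = sample(5L, N, TRUE),
  v3  = round(runif(N, max = 100), 4)
)

# The five query shapes timed in this report.
system.time(r1 <- DT[, sum(v1), keyby = id1])
system.time(r2 <- DT[, sum(v1), keyby = "id1,id2"])
system.time(r3 <- DT[, list(sum(v1), mean(v3)), keyby = id3])
system.time(r4 <- DT[, lapply(.SD, mean), keyby = id4, .SDcols = 7:9])
system.time(r5 <- DT[, lapply(.SD, sum), keyby = id6, .SDcols = 7:9])
```

Raising `N` step by step on a large-memory machine is one way to find the row count at which the error first appears.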


I also ran data.table 1.9.2 to confirm that the version which previously worked fine for this data size continues to work on a recent R version.

> system.time( DT[, sum(v1), keyby=id1] )
   user  system elapsed
 58.113  17.098  75.219
> system.time( DT[, sum(v1), keyby=id1] )
   user  system elapsed
 59.185  15.303  74.496
> system.time( DT[, sum(v1), keyby="id1,id2"] )
   user  system elapsed
180.160  19.953 200.137
> system.time( DT[, sum(v1), keyby="id1,id2"] )
   user  system elapsed
204.208  39.651 243.889
> system.time( DT[, list(sum(v1),mean(v3)), keyby=id3] )
    user   system  elapsed
1037.451   51.269 1088.853
> system.time( DT[, list(sum(v1),mean(v3)), keyby=id3] )
    user   system  elapsed
1023.068   29.556 1052.753
> system.time( DT[, lapply(.SD, mean), keyby=id4, .SDcols=7:9] )
   user  system elapsed
 73.123  18.026  91.160
> system.time( DT[, lapply(.SD, mean), keyby=id4, .SDcols=7:9] )
   user  system elapsed
 70.523   8.951  79.483
> system.time( DT[, lapply(.SD, sum), keyby=id6, .SDcols=7:9] )
   user  system elapsed
489.294  36.192 525.548
> system.time( DT[, lapply(.SD, sum), keyby=id6, .SDcols=7:9] )
   user  system elapsed 
488.316  28.808 517.188 

Timings are slower than they were in the past, but AFAIK this matches what we observed in other issues: newer versions of R introduced overhead that data.table later addressed in its own newer versions. So users who upgrade R should also upgrade data.table.

Labels: GForce, benchmark, regression