Hi, I found a very weird behavior that could be critical for all other data.table operation. When you have a large data set, and try to take the sum of a binary indicator by a group indicator, you will get the inconsistent and wrong results.
# [Minimal reproducible example] here's the code
N = 19926003
dt = data.table(
id = as.integer64(sample(1:3000,N,replace=TRUE)),
X = ifelse(rnorm(N) > 1, 1, 0))
dt[id == 2, sum(X)]
sum_dt = dt[, .(sumX = sum(X)),by='id']
sum_dt[id==2,]
When you run this code, you get the following output.
> dt[id == 2, sum(X)]
[1] 1069
> sum_dt[id==2,]
id sumX
1: 2 1081
These two numbers must be the same, but different. But interestingly, the following code will give you the correct results.
> unique(dt[, sumX := sum(X),by='id'][, c('id','sumX')])[id==2,]
id sumX
1: 2 1069
That said, actually you can fix this issue by setting your data type as integer, for example,
> dt = data.table(
+ id = as.integer64(sample(1:3000,N,replace=TRUE)),
+ X = as.integer(ifelse(rnorm(N) > 1, 1, 0)))
>
> dt[id == 2, sum(X)]
sum_dt = dt[, .(sumX = sum(X)),by='id']
sum_dt[id==2,]
[1] 1108
> sum_dt = dt[, .(sumX = sum(X)),by='id']
> sum_dt[id==2,]
id sumX
1: 2 1108
I start to make it very explicit about the variable type in my code, but I think that it is a big issue given that people in general use numeric as a default data type.
# Output of sessionInfo()
> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: SUSE Linux Enterprise Server 15 SP1
Matrix products: default
BLAS: /geode2/soft/hps/cle7/r/4.0.3/lib64/R/lib/libRblas.so
LAPACK: /geode2/soft/hps/cle7/r/4.0.3/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bit64_4.0.5 bit_4.0.4 data.table_1.14.0
loaded via a namespace (and not attached):
[1] compiler_4.0.3
Hi, I found a very weird behavior that could be critical for all other data.table operation. When you have a large data set, and try to take the sum of a binary indicator by a group indicator, you will get the inconsistent and wrong results.
#[Minimal reproducible example] here's the codeWhen you run this code, you get the following output.
These two numbers must be the same, but different. But interestingly, the following code will give you the correct results.
That said, actually you can fix this issue by setting your data type as integer, for example,
I start to make it very explicit about the variable type in my code, but I think that it is a big issue given that people in general use
numericas a default data type.#Output of sessionInfo()