Bug in grouping external variables - related to #495. #875

@arunsrinivasan

This bug is related to #495.

require(data.table)
DT = data.table(x=c(1,1,1,2,2), y=1:5)
z = 1:5
options(datatable.verbose=TRUE)

# correct answer (?)
options(datatable.optimize = Inf)
DT[, list(mean(z), mean(y)), by=x]
# GForce optimized j to 'list(gmean(z), gmean(y))'
#    x  V1  V2
#1: 1 2.0 2.0
#2: 2 4.5 4.5

# incorrect answer (?)
options(datatable.optimize = 1L) # no GForce
DT[, list(mean(z), mean(y)), by=x]
#    x V1  V2
#1: 1  3 2.0
#2: 2  3 4.5

In the second case (where mean gets optimised to fastmean internally), mean is computed on the entire z rather than on each group's share of it. This is most likely because .SD doesn't contain this variable, so it comes back to #495.

For the same reason, calculating, say, the variance or standard deviation won't work even if the optimize value is Inf (because GForce isn't implemented for those functions).

options(datatable.optimize=Inf)
DT[, list(sd(z), sd(y)), by=x]
#    x       V1        V2
#1: 1 1.581139 1.0000000
#2: 2 1.581139 0.7071068

For now, using external variables in j while grouping has a bug. This observation came from this SO post. Thanks to drstevok.
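Until this is settled, one possible way to sidestep the inconsistency is to make the per-group subsetting of z explicit. The sketch below assumes z has exactly one element per row of DT, in the same row order; the names DT2, res_I and res_col are illustrative and not part of the original report.

```r
library(data.table)

DT = data.table(x = c(1, 1, 1, 2, 2), y = 1:5)
z = 1:5   # external vector, assumed to line up with the rows of DT

# Option 1: subset z by the row numbers of the current group (.I),
# so each group only sees its own elements of z
res_I = DT[, list(mean(z[.I]), mean(y)), by = x]

# Option 2: work on a copy that carries z as a real column,
# so z is part of .SD and is split by group like any other column
DT2 = copy(DT)
DT2[, z := z]
res_col = DT2[, list(sd(z), sd(y)), by = x]
```

With either option the result should no longer depend on whether GForce or fastmean is applied, because z is split by group explicitly instead of being looked up whole in the calling scope.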
