Skip to content

Unexpected results when combining on with keyby or by  #1943

@Henrik-P

Description

@Henrik-P

I was working my way through the data.table vignette Secondary indices and auto indexing. To make it easier to compare original data and results, I replaced the non-minimal "flights" data with much smaller datasets throughout.

It worked fine until section 2f "Aggregation using by", where on is combined with keyby. There we find code to "Get the maximum departure delay for each month corresponding to origin = "JFK". Order the result by month":

flights["JFK", max(dep_delay), keyby = month, on = "origin"]

When I tried the equivalent code on two different data sets, two different issues appeared:

1. Label of the keyby and by variable identical for different levels

d1 <- data.table(x = rep(c("a", "b"), each = 4), y = 1:0, z = c(3, 6, 8, 5, 4, 1, 2, 7))
d1
#    x y z
# 1: a 1 3
# 2: a 0 6
# 3: a 1 8
# 4: a 0 5
# 5: b 1 4 <~~ max z for x = b & y = 1
# 6: b 0 1
# 7: b 1 2
# 8: b 0 7 <~~ max z for x = b & y = 0

Translating the code from the vignette: Get the maximum z for each y corresponding to x = "b". Order the result by y

d1["b", max(z), keyby = y, on = "x"]
#    y V1
# 1: 1  7 <~~ y should be 0 here
# 2: 1  4

The label of the keyby variable y is erroneously 1 also for the 0 level.

Also when using by instead of keyby together with on the labels are wrong:

d1["b", max(z), by = y, on = "x"]
#    y V1
# 1: 1  4 
# 2: 1  7 <~~ y should be 0 here

Just to verify that corresponding code without on works fine:

d1[x == "b", max(z), keyby = y]
#    y V1
# 1: 0  7
# 2: 1  4

d1[x == "b", max(z), by = y]
#    y V1
# 1: 1  4
# 2: 0  7

2. (a) Label of the keyby (or by) variable and the resulting value do not match. (b) Result not sorted by the keyby variable.

d2 <- as.data.table(mtcars[mtcars$cyl %in% c(6, 8), c("am", "vs", "hp")])

Again, equivalent desired outcome: get the maximum hp for each vs corresponding to am = 0. Order the result by vs.

First, just look at the data corresponding to am = 0 to easier spot the desired result:

d2[.(0), on = "am"]

#     am vs  hp
# 1:   0  1 110
# 2:   0  0 175
# 3:   0  1 105
# 4:   0  0 245 <~~ max hp for vs = 0
# 5:   0  1 123 <~~ max hp for vs = 1 
# 6:   0  1 123
# 7:   0  0 180
# 8:   0  0 180
# 9:   0  0 180
# 10:  0  0 205
# 11:  0  0 215
# 12:  0  0 230
# 13:  0  0 150
# 14:  0  0 150
# 15:  0  0 245
# 16:  0  0 175

When combining keyby and on, the result is not sorted and the labels don't match the values:

d2[.(0), max(hp), keyby = vs, on = "am"]
#    vs  V1
# 1:  1 245
# 2:  0 123

When combining by and on, the labels don't match the values:

d2[.(0), max(hp), by = vs, on = "am"]
#    vs  V1
# 1:  0 123
# 2:  1 245

Subsetting without on works fine:

d2[am == 0, max(hp), keyby = vs]
#    vs  V1
# 1:  0 245
# 2:  1 123


d2[am == 0 , max(hp), by = vs]
#    vs  V1
# 1:  1 123
# 2:  0 245

So far I have not been able to discern any particular pattern in the data which generates these results (e.g. any particular order of on and/or keyby variables in the original and/or subset data).

Can you spot any mistakes in my code or is there something strange going on here?


R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
data.table_1.9.8

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions