I was working my way through the data.table vignette "Secondary indices and auto indexing". To make it easier to compare the original data with the results, I replaced the large "flights" data with much smaller datasets throughout.
It worked fine until section 2f "Aggregation using by", where on is combined with keyby. There we find code to 'Get the maximum departure delay for each month corresponding to origin = "JFK". Order the result by month':
flights["JFK", max(dep_delay), keyby = month, on = "origin"]
When I tried the equivalent code on two different data sets, two different issues appeared:
1. The label of the keyby (or by) variable is identical for different levels
d1 <- data.table(x = rep(c("a", "b"), each = 4), y = 1:0, z = c(3, 6, 8, 5, 4, 1, 2, 7))
d1
# x y z
# 1: a 1 3
# 2: a 0 6
# 3: a 1 8
# 4: a 0 5
# 5: b 1 4 <~~ max z for x = b & y = 1
# 6: b 0 1
# 7: b 1 2
# 8: b 0 7 <~~ max z for x = b & y = 0
Translating the code from the vignette: get the maximum z for each y corresponding to x = "b". Order the result by y.
d1["b", max(z), keyby = y, on = "x"]
# y V1
# 1: 1 7 <~~ y should be 0 here
# 2: 1 4
The label of the keyby variable y is erroneously 1 for the 0 level as well.
The labels are also wrong when by is used instead of keyby together with on:
d1["b", max(z), by = y, on = "x"]
# y V1
# 1: 1 4
# 2: 1 7 <~~ y should be 0 here
Just to verify, the corresponding code without on works fine:
d1[x == "b", max(z), keyby = y]
# y V1
# 1: 0 7
# 2: 1 4
d1[x == "b", max(z), by = y]
# y V1
# 1: 1 4
# 2: 0 7
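A possible workaround, assuming the mislabelling only shows up when on is combined with by or keyby in a single call, is to split the subset and the aggregation into two chained steps. This is only a sketch on the d1 data from above (with the recycling of y written out via rep()); it sidesteps the issue rather than explaining it.

```r
library(data.table)

# Same toy data as above; rep() makes the recycling of y explicit
d1 <- data.table(x = rep(c("a", "b"), each = 4),
                 y = rep(1:0, 4),
                 z = c(3, 6, 8, 5, 4, 1, 2, 7))

# Step 1: subset with on; step 2: aggregate the already-subset table
res <- d1["b", on = "x"][, max(z), keyby = y]
res
# expected: y = 0 -> 7, y = 1 -> 4
```

Because the grouping runs on the intermediate table, on and keyby never appear in the same call.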
2. (a) Label of the keyby (or by) variable and the resulting value do not match. (b) Result not sorted by the keyby variable.
d2 <- as.data.table(mtcars[mtcars$cyl %in% c(6, 8), c("am", "vs", "hp")])
Again, the equivalent desired outcome: get the maximum hp for each vs corresponding to am = 0. Order the result by vs.
First, just look at the data corresponding to am = 0 to make the desired result easier to spot:
d2[.(0), on = "am"]
# am vs hp
# 1: 0 1 110
# 2: 0 0 175
# 3: 0 1 105
# 4: 0 0 245 <~~ max hp for vs = 0
# 5: 0 1 123 <~~ max hp for vs = 1
# 6: 0 1 123
# 7: 0 0 180
# 8: 0 0 180
# 9: 0 0 180
# 10: 0 0 205
# 11: 0 0 215
# 12: 0 0 230
# 13: 0 0 150
# 14: 0 0 150
# 15: 0 0 245
# 16: 0 0 175
When combining keyby and on, the result is not sorted and the labels don't match the values:
d2[.(0), max(hp), keyby = vs, on = "am"]
# vs V1
# 1: 1 245
# 2: 0 123
When combining by and on, the labels don't match the values:
d2[.(0), max(hp), by = vs, on = "am"]
# vs V1
# 1: 0 123
# 2: 1 245
Subsetting without on works fine:
d2[am == 0, max(hp), keyby = vs]
# vs V1
# 1: 0 245
# 2: 1 123
d2[am == 0, max(hp), by = vs]
# vs V1
# 1: 1 123
# 2: 0 245
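To cross-check the expected values independently of data.table, the same aggregation can be done in base R. A minimal sketch; the names d2_df, sub0 and expected are introduced here purely for illustration.

```r
# Same subset of mtcars as d2, kept as a plain data.frame
d2_df <- mtcars[mtcars$cyl %in% c(6, 8), c("am", "vs", "hp")]

# Rows with am = 0
sub0 <- d2_df[d2_df$am == 0, ]

# Maximum hp per level of vs; the names of the result are the vs levels
expected <- tapply(sub0$hp, sub0$vs, max)
expected
# expected: vs = 0 -> 245, vs = 1 -> 123
```

This agrees with the results obtained without on, which suggests the subset itself is fine and only the grouped output is affected.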
So far I have not been able to discern any particular pattern in the data that generates these results (e.g. any particular order of the on and/or keyby variables in the original and/or subset data).
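When hunting for a pattern across several small datasets, it may help to compare the two forms programmatically instead of by eye. The following is a rough sketch on d2; the names with_on, without_on and same are illustrative only, and whether the comparison comes out TRUE or FALSE presumably depends on the installed data.table version.

```r
library(data.table)

d2 <- as.data.table(mtcars[mtcars$cyl %in% c(6, 8), c("am", "vs", "hp")])

# Subset and group in one call using on
with_on <- d2[.(0), max(hp), keyby = vs, on = "am"]

# Same aggregation with an ordinary logical subset
without_on <- d2[am == 0, max(hp), keyby = vs]

# TRUE only if both the group labels and the values agree
same <- identical(with_on$vs, without_on$vs) &&
  identical(with_on$V1, without_on$V1)
same
```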
Can you spot any mistakes in my code or is there something strange going on here?
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
data.table_1.9.8