I don't see any reason for the following two to have substantially different runtimes:
library(data.table)
library(microbenchmark)  # for get_nanotime()

set.seed(210349)
NN = 1e6
DT = data.table(l1 = sample(letters, NN, TRUE),
                l2 = sample(letters, NN, TRUE))

# column 1: single combined filter; column 2: chained filters
times = matrix(nrow = 500, ncol = 2)
for (ii in seq_len(nrow(times))) {
  # fresh copy each time so auto-indexing can't carry over between runs
  DT_copy = copy(DT)
  t0 = get_nanotime()
  DT_copy[l1 == 'm' & l2 == 'd']
  t1 = get_nanotime()
  times[ii, 1L] = t1 - t0

  DT_copy = copy(DT)
  t0 = get_nanotime()
  DT_copy[l1 == 'm'][l2 == 'd']
  t1 = get_nanotime()
  times[ii, 2L] = t1 - t0
}
median(times[ , 1L])
# [1] 17620043
median(times[ , 2L])
# [1] 12605714
mean(times[ , 1L]/times[ , 2L])
# [1] 1.888558
It surprised me all the more that this continues to hold when DT has these columns as an index or even a key:
# INDEXING
# benchmarking code same as above, but add setindex(DT, l1, l2) before the loop
mean(times[ , 1L]/times[ , 2L])
# [1] 1.47573
# KEYING
mean(times[ , 1L]/times[ , 2L])
# [1] 7.293718
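For anyone reproducing the indexed and keyed variants, here is a minimal sketch of the setup step. The keyed setup is not shown above, so setkey(DT, l1, l2) is my assumption, mirroring the setindex call. NN is reduced here only to keep the sketch quick, and the names DT_idx, DT_key, single, chained and same_rows are illustrative only.

```r
library(data.table)

set.seed(210349)
NN = 1e5  # smaller than the 1e6 used in the benchmark, just for a quick check
DT = data.table(l1 = sample(letters, NN, TRUE),
                l2 = sample(letters, NN, TRUE))

# Indexed variant: secondary index on (l1, l2); row order is left unchanged
DT_idx = copy(DT)
setindex(DT_idx, l1, l2)

# Keyed variant (assumed setup): physically sorts the table by (l1, l2)
DT_key = copy(DT)
setkey(DT_key, l1, l2)

# Sanity check: the two subsetting forms should select the same rows
single  = DT_key[l1 == 'm' & l2 == 'd']
chained = DT_key[l1 == 'm'][l2 == 'd']
same_rows = nrow(single) == nrow(chained)
```

With either setup in place before the loop, the timing code itself stays the same as in the first benchmark.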
EDIT: I previously reported a large difference when pre-declaring l1 and l2 as logical, but it went away once I fixed the benchmarks to remove the influence of auto-indexing on the timings. Why that made a difference is a question for another issue. (For posterity, the mean ratio in the logical benchmark was 1.55.)