data.table should be smarter about compound logical subsetting #2472

@MichaelChirico

I don't see any reason for the following two to have substantially different runtimes:

library(data.table)
library(microbenchmark)  # provides get_nanotime()

set.seed(210349)

NN = 1e6
DT = data.table(l1 = sample(letters, NN, TRUE),
                l2 = sample(letters, NN, TRUE))

times = matrix(nrow = 500, ncol = 2)

for (ii in seq_len(nrow(times))) {
  # column 1: compound condition in a single call
  DT_copy = copy(DT)
  t0 = get_nanotime()
  DT_copy[l1 == 'm' & l2 == 'd']
  t1 = get_nanotime()
  times[ii, 1L] = t1 - t0

  # column 2: two chained single-condition subsets
  DT_copy = copy(DT)
  t0 = get_nanotime()
  DT_copy[l1 == 'm'][l2 == 'd']
  t1 = get_nanotime()
  times[ii, 2L] = t1 - t0
}

median(times[ , 1L])
# [1] 17620043
median(times[ , 2L])
# [1] 12605714
mean(times[ , 1L]/times[ , 2L])
# [1] 1.888558
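
As a sanity check on the premise, the two forms should select the same rows in the same order, so any timing gap would be overhead rather than extra work. A minimal sketch on a smaller table (the size 1e5 and the names `compound` and `chained` are chosen here for illustration and are not part of the benchmark above):

```r
library(data.table)

set.seed(210349)
NN = 1e5  # smaller than the benchmark, just for a quick check
DT = data.table(l1 = sample(letters, NN, TRUE),
                l2 = sample(letters, NN, TRUE))

# Compound condition evaluated in one call
compound = DT[l1 == 'm' & l2 == 'd']

# Same filter expressed as two chained subsets
chained = DT[l1 == 'm'][l2 == 'd']

# Compare column by column to ignore any attribute differences
same_rows = nrow(compound) == nrow(chained) &&
  identical(compound$l1, chained$l1) &&
  identical(compound$l2, chained$l2)
same_rows
```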

I was all the more surprised that this remains true when DT has these columns as an index or even a key:

# INDEXING
# benchmarking code same as above, but add setindex(DT, l1, l2) before the loop
mean(times[ , 1L]/times[ , 2L])
# [1] 1.47573

# KEYING
# benchmarking code same as above, but with DT keyed on these columns before the loop
mean(times[ , 1L]/times[ , 2L])
# [1] 7.293718
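
For reference, the indexed and keyed variants could be set up roughly as follows. This is a sketch that assumes both columns are used in the order l1, l2; `DT_idx` and `DT_key` are illustrative names, and the table is smaller than in the benchmark.

```r
library(data.table)

set.seed(210349)
NN = 1e5  # smaller than the benchmark, for a quick check
DT = data.table(l1 = sample(letters, NN, TRUE),
                l2 = sample(letters, NN, TRUE))

# Indexed variant: a secondary index leaves the row order untouched
DT_idx = copy(DT)
setindex(DT_idx, l1, l2)
indices(DT_idx)  # lists the indices attached to DT_idx

# Keyed variant: setkey() physically sorts the rows by l1, then l2
DT_key = copy(DT)
setkey(DT_key, l1, l2)
key(DT_key)      # lists the key columns of DT_key
```

Either object can then stand in for DT in the timing loop.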

EDIT: I previously reported a large difference when pre-declaring l1 & l2 as logical, but it went away once I fixed the benchmarks to account for auto-indexing's influence on the timings. Why that made a difference is a question for another issue. (For posterity, the mean ratio in the logical benchmark was 1.55.)
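
On the auto-indexing point: data.table can build an index as a side effect of a single-column `==` or `%in%` subset, which can make later iterations of a benchmark faster than the first. One way to keep that out of the timings is the `datatable.auto.index` option. A minimal sketch, with an arbitrary table size:

```r
library(data.table)

set.seed(210349)
DT = data.table(l1 = sample(letters, 1e5, TRUE),
                l2 = sample(letters, 1e5, TRUE))

# Turn off automatic index creation, remembering the old setting
old = options(datatable.auto.index = FALSE)

invisible(DT[l1 == 'm'])  # with auto-indexing on, this may add an index on l1
idx_after = indices(DT)   # inspect which indices DT now carries

options(old)  # restore the previous setting
```

Subsetting a fresh copy(DT) on every iteration, as the loop above does, appears to serve the same purpose, since DT itself is never subset and so never picks up an index.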
