data.table should be smarter about compound logical subsetting #2472

I don't see any reason for the following two to have substantially different runtimes:

```
library(data.table)
library(microbenchmark)  # provides get_nanotime()

set.seed(210349)

NN = 1e6
DT = data.table(l1 = sample(letters, NN, TRUE),
                l2 = sample(letters, NN, TRUE))

times = matrix(nrow = 500, ncol = 2)

for (ii in seq_len(nrow(times))) {
  # fresh copy each time, so an auto-index built by an earlier subset can't be reused
  DT_copy = copy(DT)
  t0 = get_nanotime()
  DT_copy[l1 == 'm' & l2 == 'd']
  t1 = get_nanotime()
  times[ii, 1L] = t1 - t0
  
  DT_copy = copy(DT)
  t0 = get_nanotime()
  DT_copy[l1 == 'm'][l2 == 'd']
  t1 = get_nanotime()
  times[ii, 2L] = t1 - t0
}

median(times[ , 1L])
# [1] 17620043
median(times[ , 2L])
# [1] 12605714
mean(times[ , 1L]/times[ , 2L])
# [1] 1.888558
```
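For reference, a minimal sketch checking that the two forms really do select the same rows, so any runtime gap comes from the implementation rather than from the result. The smaller `NN` and the helper column `id` are assumptions added for this illustration; they are not part of the benchmark above.

```r
library(data.table)

set.seed(210349)
NN = 1e5  # assumption: smaller than the 1e6 used above, to keep the check quick
DT = data.table(l1 = sample(letters, NN, TRUE),
                l2 = sample(letters, NN, TRUE))
DT[, id := .I]  # hypothetical helper column recording the original row number

compound = DT[l1 == 'm' & l2 == 'd']   # one compound condition
chained  = DT[l1 == 'm'][l2 == 'd']    # two chained single-condition subsets

# Compare the original row numbers rather than the whole objects,
# so differences in attributes (e.g. indices) don't affect the check
same_rows = identical(compound$id, chained$id)
same_rows
```

Since both expressions pick out the same rows in the same order, the compound form could in principle be evaluated the way the chained form is.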

More surprising still, the gap persists when `DT` has these columns as an `index` or even a `key`:

```
# INDEXING
# benchmarking code same as above, but with setindex(DT, l1, l2) added before the loop
mean(times[ , 1L]/times[ , 2L])
# [1] 1.47573

# KEYING
mean(times[ , 1L]/times[ , 2L])
# [1] 7.293718
```

---
EDIT: I previously reported a large difference when `l1` and `l2` were pre-declared as logical, but it went away once I fixed the benchmarks to remove auto-indexing's influence on the timings. Why auto-indexing made a difference there is worth discussing, but in another issue. For posterity, the mean ratio for the logical benchmark was 1.55.
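As a rough illustration of the auto-indexing effect mentioned in the edit, the sketch below inspects a table's indices after a single `==` subset, then repeats the subset with index creation switched off through the `datatable.auto.index` option. The table size and the names `fresh` and `DT2` are assumptions for this example only.

```r
library(data.table)

set.seed(210349)
DT = data.table(l1 = sample(letters, 1e5, TRUE),
                l2 = sample(letters, 1e5, TRUE))

# With default options, a single == subset may build an index as a side effect
fresh = copy(DT)
invisible(fresh[l1 == 'm'])
indices(fresh)  # typically reports an index on l1 after the subset

# Switch off automatic index creation, then subset another fresh copy
old = options(datatable.auto.index = FALSE)
DT2 = copy(DT)
invisible(DT2[l1 == 'm'])
no_index = is.null(indices(DT2))  # no index should have been created
options(old)  # restore the previous setting

no_index
```

Taking a fresh `copy(DT)` before each timed expression, as in the benchmark loop, is another way to keep an index built by one subset from speeding up the next.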
