This would allow taking unique() on a subset of columns, saving memory overhead. It's essentially equivalent to
DT[, unique(.SD, by = BY_COLS), .SDcols = c(BY_COLS, KEEP_COLS)]
but with a more natural API:
unique(DT, by = BY_COLS, cols = KEEP_COLS)
I believe the other workarounds are still memory-inefficient (as well as clunkier), e.g.
unique(DT[, .SD, .SDcols = c(BY_COLS, KEEP_COLS)], by = BY_COLS)
whereas the first .SD approach (if I'm not mistaken) uses a shallow copy and is thus faster.
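To make the proposal concrete, here is a minimal sketch of the suggested signature as a user-level wrapper over the existing .SD idiom. `unique_cols` is a hypothetical name for illustration; the `cols` argument does not exist in data.table's unique.data.table today.

```r
library(data.table)

# Hypothetical wrapper sketching the proposed unique(DT, by =, cols =) API.
# Selecting via .SDcols shallow-copies only the needed columns before
# deduplicating by the `by` columns.
unique_cols <- function(DT, by, cols) {
  DT[, unique(.SD, by = by), .SDcols = c(by, cols)]
}
```

Usage would look like `unique_cols(DT, by = "grp", cols = paste0("V", 1:5))`, returning one row per unique `grp` with only the requested columns materialized.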
library(data.table)
library(bench)

NN = 1e7
DT = data.table(grp = sample(c(letters, LETTERS, 0:9), NN, TRUE))
JJ = 100
for (jj in seq_len(JJ)) set(DT, NULL, paste0("V", jj), rnorm(NN))
BY_COLS = "grp"
KEEP_COLS = paste0("V", 1:5)
f1 <- function() DT[, unique(.SD, by = BY_COLS), .SDcols = c(BY_COLS, KEEP_COLS)]
f2 <- function() unique(DT[, .SD, .SDcols = c(BY_COLS, KEEP_COLS)], by = BY_COLS)
f3 <- function() unique(DT, by = BY_COLS)[, .SD, .SDcols = c(BY_COLS, KEEP_COLS)]
f4 <- function() DT[, head(.SD, 1L), by = BY_COLS, .SDcols = KEEP_COLS]
bench::mark(min_iterations = 10L, f1(), f2(), f3(), f4())
# A tibble: 4 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
# <bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
# 1 f1() 39.7ms 56.4ms 19.3 NA 0 10 0 519.25ms
# 2 f2() 223ms 225.7ms 4.43 NA 1.90 7 3 1.58s
# 3 f3() 38.5ms 39.4ms 25.2 NA 2.29 11 1 436.45ms
# 4 f4() 97.6ms 107.3ms 9.26 NA 1.03 9 1 972.44ms
# … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>