unique.data.table could get a cols argument #5243

@MichaelChirico

Description

This would allow unique() to return only a subset of columns, saving memory overhead. It's basically equivalent to

DT[, unique(.SD, by = BY_COLS), .SDcols = c(BY_COLS, KEEP_COLS)]

with a more natural API:

unique(DT, by = BY_COLS, cols = KEEP_COLS)

I believe other workarounds are still memory-inefficient (as well as clunkier), e.g.

unique(DT[, .SD, .SDcols = c(BY_COLS, KEEP_COLS)], by = BY_COLS)

while the first .SD approach (if I'm not mistaken) uses a shallow copy of the selected columns and is thus faster.

NN = 1e7
DT = data.table(grp = sample(c(letters, LETTERS, 0:9), NN, TRUE))
JJ = 100
for (jj in seq_len(JJ)) set(DT, NULL, paste0("V", jj), rnorm(NN))

BY_COLS = "grp"
KEEP_COLS = paste0("V", 1:5)

f1 <- function() DT[, unique(.SD, by = BY_COLS), .SDcols = c(BY_COLS, KEEP_COLS)]
f2 <- function() unique(DT[, .SD, .SDcols = c(BY_COLS, KEEP_COLS)], by = BY_COLS)
f3 <- function() unique(DT, by = BY_COLS)[, .SD, .SDcols = c(BY_COLS, KEEP_COLS)]
f4 <- function() DT[, head(.SD, 1L), by = BY_COLS, .SDcols = KEEP_COLS]

bench::mark(min_iterations = 10L, f1(), f2(), f3(), f4())
# A tibble: 4 x 13
#   expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#   <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
# 1 f1()        39.7ms  56.4ms     19.3         NA     0       10     0   519.25ms
# 2 f2()         223ms 225.7ms      4.43        NA     1.90     7     3      1.58s
# 3 f3()        38.5ms  39.4ms     25.2         NA     2.29    11     1   436.45ms
# 4 f4()        97.6ms 107.3ms      9.26        NA     1.03     9     1   972.44ms
# … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>
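For reference, the proposed API could be approximated today with a small user-level wrapper built on the fast f1() pattern above (unique_cols is a hypothetical name for illustration, not an existing data.table function):

```r
library(data.table)

# Sketch of the proposed semantics: deduplicate on by_cols, but keep only
# by_cols plus keep_cols in the result. Internally this is the shallow-copy
# .SD approach benchmarked as f1() above.
unique_cols <- function(DT, by_cols, keep_cols) {
  DT[, unique(.SD, by = by_cols), .SDcols = union(by_cols, keep_cols)]
}

# e.g. unique_cols(DT, by_cols = "grp", keep_cols = paste0("V", 1:5))
```

A native cols argument in unique.data.table would make this the default code path without the .SD indirection.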
