unique.data.table could get a cols argument #5243

@MichaelChirico

Description

This would allow unique() to return only a subset of columns, saving memory overhead. It's basically equivalent to

DT[, unique(.SD, by = BY_COLS), .SDcols = c(BY_COLS, KEEP_COLS)]

with a more natural API:

unique(DT, by = BY_COLS, cols = KEEP_COLS)

I believe other workarounds are still memory-inefficient (as well as clunkier), e.g.

unique(DT[, .SD, .SDcols = c(BY_COLS, KEEP_COLS)], by = BY_COLS)

while the first .SD approach (if I'm not mistaken) uses a shallow copy of the selected columns and is thus faster.

NN = 1e7
DT = data.table(grp = sample(c(letters, LETTERS, 0:9), NN, TRUE))
JJ = 100
for (jj in seq_len(JJ)) set(DT, NULL, paste0("V", jj), rnorm(NN))

BY_COLS = "grp"
KEEP_COLS = paste0("V", 1:5)

f1 <- function() DT[, unique(.SD, by = BY_COLS), .SDcols = c(BY_COLS, KEEP_COLS)]
f2 <- function() unique(DT[, .SD, .SDcols = c(BY_COLS, KEEP_COLS)], by = BY_COLS)
f3 <- function() unique(DT, by = BY_COLS)[, .SD, .SDcols = c(BY_COLS, KEEP_COLS)]
f4 <- function() DT[, head(.SD, 1L), by = BY_COLS, .SDcols = KEEP_COLS]

bench::mark(min_iterations = 10L, f1(), f2(), f3(), f4())
# A tibble: 4 x 13
#   expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#   <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
# 1 f1()        39.7ms  56.4ms     19.3         NA     0       10     0   519.25ms
# 2 f2()         223ms 225.7ms      4.43        NA     1.90     7     3      1.58s
# 3 f3()        38.5ms  39.4ms     25.2         NA     2.29    11     1   436.45ms
# 4 f4()        97.6ms 107.3ms      9.26        NA     1.03     9     1   972.44ms
# … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>
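For reference, the proposed API could be approximated today with a small user-level wrapper built on the fast f1() pattern above (unique_cols is a hypothetical name for illustration, not an existing data.table function):

```r
library(data.table)

# Sketch of the proposed semantics: deduplicate on by_cols, but keep only
# by_cols plus keep_cols in the result. Internally this is the shallow-copy
# .SD approach benchmarked as f1() above.
unique_cols <- function(DT, by_cols, keep_cols) {
  DT[, unique(.SD, by = by_cols), .SDcols = union(by_cols, keep_cols)]
}

# e.g. unique_cols(DT, by_cols = "grp", keep_cols = paste0("V", 1:5))
```

A native cols argument in unique.data.table would make this the default code path without the .SD indirection.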
