
cols argument for unique.data.table#5244

Merged
mattdowle merged 6 commits into master from unique-cols
Dec 3, 2021
@MichaelChirico
Member

Closes #5243.

Extending the benchmark in the issue to include the new approach:

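library(data.table)

# 20M rows: one grouping column with 62 distinct values, plus 100 numeric columns
NN = 2e7
DT = data.table(grp = sample(c(letters, LETTERS, 0:9), NN, TRUE))
JJ = 100
for (jj in seq_len(JJ)) set(DT, NULL, paste0("V", jj), rnorm(NN))

BY_COLS = "grp"
KEEP_COLS = paste0("V", 1:5)

# f1: unique() on .SD, with .SDcols restricting the columns
f1 <- function() DT[, unique(.SD, by = BY_COLS), .SDcols = c(BY_COLS, KEEP_COLS)]
# f2: materialize the column subset first, then deduplicate
f2 <- function() unique(DT[, .SD, .SDcols = c(BY_COLS, KEEP_COLS)], by = BY_COLS)
# f3: deduplicate across all columns, then subset the columns
f3 <- function() unique(DT, by = BY_COLS)[, .SD, .SDcols = c(BY_COLS, KEEP_COLS)]
# f4: take the first row of each group
f4 <- function() DT[, head(.SD, 1L), by = BY_COLS, .SDcols = KEEP_COLS]
# f5: the new approach, using the cols argument added in this PR
f5 <- function() unique(DT, by = BY_COLS, cols = KEEP_COLS)

bench::mark(min_iterations = 10L, f1(), f2(), f3(), f4(), f5())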
# A tibble: 5 x 13
#  expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#   <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
# 1 f1()        75.7ms  77.7ms     12.9         NA    0        10     0   776.75ms
# 2 f2()       443.2ms 445.5ms      2.24        NA    0.960     7     3      3.12s
# 3 f3()        75.5ms  75.8ms     13.1         NA    0        10     0   763.13ms
# 4 f4()         191ms 219.7ms      4.41        NA    0.490     9     1      2.04s
# 5 f5()        75.1ms  76.5ms     13.0         NA    0        10     0   768.13ms
# … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>

So f5() runs as fast as the quickest existing approaches (f1 and f3, all roughly 76-78ms median), which gives us a clean API for something that was previously fast only through more convoluted code.
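
For readers who want to try the new argument on something smaller than the benchmark, here is a minimal sketch. It assumes an installed data.table that already includes this PR (going by the milestone on this PR, that should be 1.15.0 or later). The toy table and the names `res_new` and `res_old` are made up for illustration and do not come from the PR.

```r
library(data.table)

# Toy data (hypothetical): "grp" has repeated values, V1..V3 are payload columns
DT = data.table(
  grp = c("a", "b", "a", "c", "b"),
  V1  = 1:5,
  V2  = 6:10,
  V3  = 11:15
)

# New API: deduplicate on grp, keeping only grp plus the columns named in cols
res_new = unique(DT, by = "grp", cols = "V1")

# Equivalent to the f3() pattern from the benchmark: deduplicate, then subset columns
res_old = unique(DT, by = "grp")[, .SD, .SDcols = c("grp", "V1")]

print(res_new)
stopifnot(isTRUE(all.equal(res_new, res_old)))
```

Both calls keep the first row seen for each value of `grp`; the only difference is that `cols` states the column subset inside the `unique()` call itself.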

@codecov

codecov bot commented Oct 31, 2021

Codecov Report

Merging #5244 (255ea43) into master (0404ed8) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #5244   +/-   ##
=======================================
  Coverage   99.50%   99.50%           
=======================================
  Files          77       77           
  Lines       14645    14647    +2     
=======================================
+ Hits        14573    14575    +2     
  Misses         72       72           
Impacted Files Coverage Δ
R/duplicated.R 100.00% <100.00%> (ø)


@ben-schwen
Member

If we include it in unique.data.table shouldn't we also include it in duplicated.data.table and anyDuplicated.data.table since these go hand in hand?

@MichaelChirico
Member Author

Not really -- cols only affects which columns appear in the output object. For both of those functions, the output wouldn't be affected, right?
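
A small sketch of the point being made here, using a made-up toy table (the names `DT`, `dup_all` and `dup_sub` are hypothetical): `duplicated()` returns one logical per row and `anyDuplicated()` returns a single row index, so neither result has columns that a `cols` argument could select.

```r
library(data.table)

# Toy data (hypothetical), with repeated values in grp
DT = data.table(grp = c("a", "b", "a", "c", "b"), V1 = 1:5, V2 = 6:10)

# One logical per row, determined only by the `by` columns
dup_all = duplicated(DT, by = "grp")

# Dropping a non-`by` column beforehand gives the same answer
dup_sub = duplicated(DT[, .(grp, V1)], by = "grp")
stopifnot(identical(dup_all, dup_sub))

# A single integer: the index of the first duplicated row (0 if there is none)
first_dup = anyDuplicated(DT, by = "grp")

print(dup_all)
print(first_dup)
```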

@mattdowle mattdowle added this to the 1.14.3 milestone Dec 3, 2021
@mattdowle mattdowle merged commit 6f3b7c1 into master Dec 3, 2021
@mattdowle mattdowle deleted the unique-cols branch December 3, 2021 21:46
@jangorecki jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023

Development

Successfully merging this pull request may close these issues.

unique.data.table could get a cols argument

4 participants