Rdatatable · mattdowle · Dec 3, 2021 · Oct 31, 2021
@@ -272,6 +272,8 @@
 
 36. `as.xts.data.table()` now supports non-numeric xts coredata matrices, [#5268](https://github.com/Rdatatable/data.table/issues/5268). Existing numeric-only functionality is supported by a new `numeric.only` parameter, which defaults to `TRUE` for backward compatibility and the most common use case. To convert non-numeric columns, set this parameter to `FALSE`. Conversions of `data.table` columns to a `matrix` now use `data.table::as.matrix`, with all its performance benefits. Thanks to @ethanbsmith for the report and fix.
 
+37. `unique.data.table()` gains `cols` to specify a subset of columns to include in the resulting `data.table`, [#5243](https://github.com/Rdatatable/data.table/issues/5243). This saves the memory overhead of subsetting unneeded columns, and provides a cleaner API for a common operation previously needing more convoluted code. Thanks to @MichaelChirico for the suggestion & implementation.
+
 ## BUG FIXES
 
 1. `by=.EACHI` when `i` is keyed but `on=` different columns than `i`'s key could create an invalidly keyed result, [#4603](https://github.com/Rdatatable/data.table/issues/4603) [#4911](https://github.com/Rdatatable/data.table/issues/4911). Thanks to @myoung3 and @adamaltmejd for reporting, and @ColeMiller1 for the PR. An invalid key is where a `data.table` is marked as sorted by the key columns but the data is not sorted by those columns, leading to incorrect results from subsequent queries.

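The NEWS entry for `cols` can be illustrated with a minimal sketch. It assumes a data.table build that already includes this change; the table `DT` mirrors the one in the PR's tests, and the names `res_new` and `res_old` are purely illustrative. The second call is only one possible way to express the same operation without `cols`, not necessarily the code the entry has in mind.

```r
library(data.table)

# Same shape as the table used in the PR's tests
DT = data.table(g = rep(letters, 3), v1 = 1:78, v2 = 78:1)

# With this change: first row per group of g, keeping only g and v1
res_new = unique(DT, by = "g", cols = "v1")

# Without cols: select the wanted columns first, then deduplicate
res_old = unique(DT[, c("g", "v1")], by = "g")

# Both approaches are expected to give the same 26-row table
print(all.equal(res_new, res_old))
```
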
@@ -23,14 +23,17 @@ duplicated.data.table = function(x, incomparables=FALSE, fromLast=FALSE, by=seq_
   res
 }
 
-unique.data.table = function(x, incomparables=FALSE, fromLast=FALSE, by=seq_along(x), ...) {
+unique.data.table = function(x, incomparables=FALSE, fromLast=FALSE, by=seq_along(x), cols=NULL, ...) {
   if (!cedta()) return(NextMethod("unique")) # nocov
   if (!isFALSE(incomparables)) {
     .NotYetUsed("incomparables != FALSE")
   }
   if (nrow(x) <= 1L) return(x)
   if (!length(by)) by = NULL  #4594
   o = forderv(x, by=by, sort=FALSE, retGrp=TRUE)
+  if (!is.null(cols)) {
+    x = .shallow(x, c(by, cols), retain.key=TRUE)
+  }
   # if by=key(x), forderv tests for orderedness within it quickly and will short-circuit
   # there isn't any need in unique() to call uniqlist like duplicated does; uniqlist returns a new nrow(x) vector anyway and isn't
   # as efficient as forderv returning empty o when input is already ordered

@@ -18550,3 +18550,12 @@ test(2231.51, DT[, weighted.mean(x, w, na.rm=FALSE), g, verbose=TRUE], data.tabl
 test(2231.52, DT[, weighted.mean(x, w, na.rm=TRUE), g, verbose=TRUE], data.table(g=c(1L,2L), V1=c(2, 5)), output="GForce optimized j to")
 options(old)
 
+# cols argument for unique.data.table, #5243
+DT = data.table(g = rep(letters, 3), v1=1:78, v2=78:1)
+test(2232.1, unique(DT, by='g', cols='v1'), DT[1:26, !'v2'])
+test(2232.2, unique(DT, by='g', cols='v2'), DT[1:26, !'v1'])
+## no duplicates
+test(2232.3, unique(DT[1:26], by='g', cols='v1'), DT[1:26, !'v2'])
+## invalid columns fail as expected
+test(2232.4, unique(DT, by='g', cols='v3'), error="non-existing column(s)")
+
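The behaviour exercised by tests 2232.2 and 2232.4 can also be reproduced outside the test harness. This is a sketch under the assumption that the installed data.table includes the `cols` argument; `first_by_g` and `err` are illustrative names, and the error is captured with base R's `tryCatch()` rather than the package's internal `test()` helper.

```r
library(data.table)

DT = data.table(g = rep(letters, 3), v1 = 1:78, v2 = 78:1)

# Keep the first row per g, returning only the by column and v2
first_by_g = unique(DT, by = "g", cols = "v2")
print(names(first_by_g))

# A column that does not exist in DT is expected to raise an error;
# capture its message instead of stopping the script
err = tryCatch(
  unique(DT, by = "g", cols = "v3"),
  error = function(e) conditionMessage(e)
)
print(err)
```
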
@@ -28,7 +28,8 @@ memory efficient.
 \usage{
 \method{duplicated}{data.table}(x, incomparables=FALSE, fromLast=FALSE, by=seq_along(x), \dots)
 
-\method{unique}{data.table}(x, incomparables=FALSE, fromLast=FALSE, by=seq_along(x), \dots)
+\method{unique}{data.table}(x, incomparables=FALSE, fromLast=FALSE,
+by=seq_along(x), cols=NULL, \dots)
 
 \method{anyDuplicated}{data.table}(x, incomparables=FALSE, fromLast=FALSE, by=seq_along(x), \dots)
 
@@ -46,6 +47,8 @@ correspond to \code{duplicated = FALSE}.}
 of columns from \code{x} to use for uniqueness checks. By default all columns
 are used. This default was changed for consistency with the data.frame methods;
 in versions \code{< 1.9.8} the default was \code{key(x)}.}
+\item{cols}{Columns (in addition to \code{by}) from \code{x} to include in the
+  resulting \code{data.table}.}
 \item{na.rm}{Logical (default is \code{FALSE}). Should missing values (including
 \code{NaN}) be removed?}
 }
@@ -59,7 +62,11 @@ handle cases where limitations in floating point representation is undesirable.
 
 \code{v1.9.4} introduces an \code{anyDuplicated} method for data.tables that is
 similar to base in functionality. It also implements the logical argument
-\code{fromLast} for all three functions, with default value \code{FALSE}.
+\code{fromLast} for all three functions, with default value
+\code{FALSE}.
+
+Note: When \code{cols} is specified, the resulting table will have
+columns \code{c(by, cols)}, in that order.
 }
 \value{
 \code{duplicated} returns a logical vector of length \code{nrow(x)}