diff --git a/NEWS.md b/NEWS.md
index ea55888f23..859a8fc1b0 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -272,6 +272,8 @@
 
 36. `as.xts.data.table()` now supports non-numeric xts coredata matrixes, [5268](https://github.com/Rdatatable/data.table/issues/5268). Existing numeric only functionality is supported by a new `numeric.only` parameter, which defaults to `TRUE` for backward compatability and the most common use case. To convert non-numeric columns, set this parameter to `FALSE`. Conversions of `data.table` columns to a `matrix` now uses `data.table::as.matrix`, with all its performance benefits. Thanks to @ethanbsmith for the report and fix.
 
+37. `unique.data.table()` gains `cols` to specify a subset of columns to include in the resulting `data.table`, [#5243](https://github.com/Rdatatable/data.table/issues/5243). This saves the memory overhead of subsetting unneeded columns, and provides a cleaner API for a common operation previously needing more convoluted code. Thanks to @MichaelChirico for the suggestion & implementation.
+
 ## BUG FIXES
 
 1. `by=.EACHI` when `i` is keyed but `on=` different columns than `i`'s key could create an invalidly keyed result, [#4603](https://github.com/Rdatatable/data.table/issues/4603) [#4911](https://github.com/Rdatatable/data.table/issues/4911). Thanks to @myoung3 and @adamaltmejd for reporting, and @ColeMiller1 for the PR. An invalid key is where a `data.table` is marked as sorted by the key columns but the data is not sorted by those columns, leading to incorrect results from subsequent queries.
diff --git a/R/duplicated.R b/R/duplicated.R
index 4fc7c8d166..901d6e3c01 100644
--- a/R/duplicated.R
+++ b/R/duplicated.R
@@ -23,7 +23,7 @@ duplicated.data.table = function(x, incomparables=FALSE, fromLast=FALSE, by=seq_
   res
 }
 
-unique.data.table = function(x, incomparables=FALSE, fromLast=FALSE, by=seq_along(x), ...) {
+unique.data.table = function(x, incomparables=FALSE, fromLast=FALSE, by=seq_along(x), cols=NULL, ...) {
   if (!cedta()) return(NextMethod("unique")) # nocov
   if (!isFALSE(incomparables)) {
     .NotYetUsed("incomparables != FALSE")
@@ -31,6 +31,9 @@ unique.data.table = function(x, incomparables=FALSE, fromLast=FALSE, by=seq_alon
   if (nrow(x) <= 1L) return(x)
   if (!length(by)) by = NULL #4594
   o = forderv(x, by=by, sort=FALSE, retGrp=TRUE)
+  if (!is.null(cols)) {
+    x = .shallow(x, c(by, cols), retain.key=TRUE)
+  }
   # if by=key(x), forderv tests for orderedness within it quickly and will short-circuit
   # there isn't any need in unique() to call uniqlist like duplicated does; uniqlist returns a new nrow(x) vector anyway and isn't
   # as efficient as forderv returning empty o when input is already ordered
diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw
index 54f60d31ca..a2e9fa6e04 100644
--- a/inst/tests/tests.Rraw
+++ b/inst/tests/tests.Rraw
@@ -18550,3 +18550,12 @@ test(2231.51, DT[, weighted.mean(x, w, na.rm=FALSE), g, verbose=TRUE], data.tabl
 test(2231.52, DT[, weighted.mean(x, w, na.rm=TRUE), g, verbose=TRUE], data.table(g=c(1L,2L), V1=c(2, 5)), output="GForce optimized j to")
 options(old)
 
+# cols argument for unique.data.table, #5243
+DT = data.table(g = rep(letters, 3), v1=1:78, v2=78:1)
+test(2232.1, unique(DT, by='g', cols='v1'), DT[1:26, !'v2'])
+test(2232.2, unique(DT, by='g', cols='v2'), DT[1:26, !'v1'])
+## no duplicates
+test(2232.3, unique(DT[1:26], by='g', cols='v1'), DT[1:26, !'v2'])
+## invalid columns fail as expected
+test(2232.4, unique(DT, by='g', cols='v3'), error="non-existing column(s)")
+
diff --git a/man/duplicated.Rd b/man/duplicated.Rd
index a9c333beb5..daf7c39d58 100644
--- a/man/duplicated.Rd
+++ b/man/duplicated.Rd
@@ -28,7 +28,8 @@ memory efficient.
 \usage{
 \method{duplicated}{data.table}(x, incomparables=FALSE, fromLast=FALSE, by=seq_along(x), \dots)
 
-\method{unique}{data.table}(x, incomparables=FALSE, fromLast=FALSE, by=seq_along(x), \dots)
+\method{unique}{data.table}(x, incomparables=FALSE, fromLast=FALSE,
+by=seq_along(x), cols=NULL, \dots)
 
 \method{anyDuplicated}{data.table}(x, incomparables=FALSE, fromLast=FALSE, by=seq_along(x), \dots)
 
@@ -46,6 +47,8 @@ correspond to \code{duplicated = FALSE}.}
 of columns from \code{x} to use for uniqueness checks. By default all columns
 are being used. That was changed recently for consistency to data.frame methods.
 In version \code{< 1.9.8} default was \code{key(x)}.}
+\item{cols}{Columns (in addition to \code{by}) from \code{x} to include in the
+  resulting \code{data.table}.}
 \item{na.rm}{Logical (default is \code{FALSE}). Should missing values (including
 \code{NaN}) be removed?}
 }
@@ -59,7 +62,11 @@ handle cases where limitations in floating point representation is undesirable.
 
 \code{v1.9.4} introduces \code{anyDuplicated} method for data.tables and is
 similar to base in functionality. It also implements the logical argument
-\code{fromLast} for all three functions, with default value \code{FALSE}.
+\code{fromLast} for all three functions, with default value
+\code{FALSE}.
+
+Note: When \code{cols} is specified, the resulting table will have
+columns \code{c(by, cols)}, in that order.
 }
 \value{
 \code{duplicated} returns a logical vector of length \code{nrow(x)}