Merged
2 changes: 2 additions & 0 deletions NEWS.md
@@ -272,6 +272,8 @@

36. `as.xts.data.table()` now supports non-numeric xts coredata matrices, [#5268](https://github.com/Rdatatable/data.table/issues/5268). Existing numeric-only functionality is supported by a new `numeric.only` parameter, which defaults to `TRUE` for backward compatibility and the most common use case. To convert non-numeric columns, set this parameter to `FALSE`. Conversions of `data.table` columns to a `matrix` now use `data.table::as.matrix`, with all its performance benefits. Thanks to @ethanbsmith for the report and fix.

+37. `unique.data.table()` gains `cols` to specify a subset of columns to include in the resulting `data.table`, [#5243](https://github.com/Rdatatable/data.table/issues/5243). This saves the memory overhead of subsetting unneeded columns, and provides a cleaner API for a common operation previously needing more convoluted code. Thanks to @MichaelChirico for the suggestion & implementation.

## BUG FIXES

1. `by=.EACHI` when `i` is keyed but `on=` different columns than `i`'s key could create an invalidly keyed result, [#4603](https://github.com/Rdatatable/data.table/issues/4603) [#4911](https://github.com/Rdatatable/data.table/issues/4911). Thanks to @myoung3 and @adamaltmejd for reporting, and @ColeMiller1 for the PR. An invalid key is where a `data.table` is marked as sorted by the key columns but the data is not sorted by those columns, leading to incorrect results from subsequent queries.
5 changes: 4 additions & 1 deletion R/duplicated.R
@@ -23,14 +23,17 @@ duplicated.data.table = function(x, incomparables=FALSE, fromLast=FALSE, by=seq_
res
}

-unique.data.table = function(x, incomparables=FALSE, fromLast=FALSE, by=seq_along(x), ...) {
+unique.data.table = function(x, incomparables=FALSE, fromLast=FALSE, by=seq_along(x), cols=NULL, ...) {
if (!cedta()) return(NextMethod("unique")) # nocov
if (!isFALSE(incomparables)) {
.NotYetUsed("incomparables != FALSE")
}
if (nrow(x) <= 1L) return(x)
if (!length(by)) by = NULL #4594
o = forderv(x, by=by, sort=FALSE, retGrp=TRUE)
+if (!is.null(cols)) {
+x = .shallow(x, c(by, cols), retain.key=TRUE)
+}
# if by=key(x), forderv tests for orderedness within it quickly and will short-circuit
# there isn't any need in unique() to call uniqlist like duplicated does; uniqlist returns a new nrow(x) vector anyway and isn't
# as efficient as forderv returning empty o when input is already ordered
9 changes: 9 additions & 0 deletions inst/tests/tests.Rraw
@@ -18550,3 +18550,12 @@ test(2231.51, DT[, weighted.mean(x, w, na.rm=FALSE), g, verbose=TRUE], data.tabl
test(2231.52, DT[, weighted.mean(x, w, na.rm=TRUE), g, verbose=TRUE], data.table(g=c(1L,2L), V1=c(2, 5)), output="GForce optimized j to")
options(old)

+# cols argument for unique.data.table, #5243
+DT = data.table(g = rep(letters, 3), v1=1:78, v2=78:1)
+test(2232.1, unique(DT, by='g', cols='v1'), DT[1:26, !'v2'])
+test(2232.2, unique(DT, by='g', cols='v2'), DT[1:26, !'v1'])
+## no duplicates
+test(2232.3, unique(DT[1:26], by='g', cols='v1'), DT[1:26, !'v2'])
+## invalid columns fail as expected
+test(2232.4, unique(DT, by='g', cols='v3'), error="non-existing column(s)")
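
For orientation, the behaviour these tests pin down can be sketched outside the test harness as below. This is an illustrative sketch, not part of the diff. It assumes an installed data.table build that already includes the `cols` argument, and the names `old_way` and `new_way` are hypothetical.

```r
library(data.table)  # assumption: a build that includes the `cols` argument

# Same shape of data as in the tests: 26 groups, each repeated 3 times
DT = data.table(g = rep(letters, 3), v1 = 1:78, v2 = 78:1)

# Without `cols`: de-duplicate on `g`, carrying every column, then drop `v2`
old_way = unique(DT, by = "g")[, !"v2"]

# With `cols`: keep only the `by` column(s) plus the requested extra column
new_way = unique(DT, by = "g", cols = "v1")

# Both routes should give the same table, with columns ordered c(by, cols)
print(names(new_way))
print(isTRUE(all.equal(old_way, new_way)))
```

The two-step form subsets `v2` during `unique()` and then discards it, while the `cols` form never includes it in the result, which is the memory saving the NEWS entry describes.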

11 changes: 9 additions & 2 deletions man/duplicated.Rd
@@ -28,7 +28,8 @@ memory efficient.
\usage{
\method{duplicated}{data.table}(x, incomparables=FALSE, fromLast=FALSE, by=seq_along(x), \dots)

-\method{unique}{data.table}(x, incomparables=FALSE, fromLast=FALSE, by=seq_along(x), \dots)
+\method{unique}{data.table}(x, incomparables=FALSE, fromLast=FALSE,
+by=seq_along(x), cols=NULL, \dots)

\method{anyDuplicated}{data.table}(x, incomparables=FALSE, fromLast=FALSE, by=seq_along(x), \dots)

@@ -46,6 +47,8 @@ correspond to \code{duplicated = FALSE}.}
of columns from \code{x} to use for uniqueness checks. By default all columns
are used; this was changed for consistency with the \code{data.frame} methods.
In versions \code{< 1.9.8} the default was \code{key(x)}.}
+\item{cols}{Columns (in addition to \code{by}) from \code{x} to include in the
+resulting \code{data.table}.}
\item{na.rm}{Logical (default is \code{FALSE}). Should missing values (including
\code{NaN}) be removed?}
}
@@ -59,7 +62,11 @@ handle cases where limitations in floating point representation is undesirable.

\code{v1.9.4} introduces \code{anyDuplicated} method for data.tables and is
similar to base in functionality. It also implements the logical argument
-\code{fromLast} for all three functions, with default value \code{FALSE}.
+\code{fromLast} for all three functions, with default value
+\code{FALSE}.
+
+Note: When \code{cols} is specified, the resulting table will have
+columns \code{c(by, cols)}, in that order.
}
\value{
\code{duplicated} returns a logical vector of length \code{nrow(x)}