diff --git a/NEWS.md b/NEWS.md index 1ce72afd52..eb426a6d66 100644 --- a/NEWS.md +++ b/NEWS.md @@ -185,6 +185,8 @@ 11. `melt()`'s internal C code is now more memory efficient, [#5054](https://github.com/Rdatatable/data.table/pull/5054). Thanks to Toby Dylan Hocking for the PR. +12. `?merge` and `?setkey` have been updated to clarify that the row order is retained when `sort=FALSE`, and why `NA`s are always first when `sort=TRUE`, [#2574](https://github.com/Rdatatable/data.table/issues/2574) [#2594](https://github.com/Rdatatable/data.table/issues/2594). Thanks to Davor Josipovic and Markus Bonsch for the reports, and Jan Gorecki for the PR. + # data.table [v1.14.0](https://github.com/Rdatatable/data.table/milestone/23?closed=1) (21 Feb 2021) diff --git a/man/merge.Rd b/man/merge.Rd index fe0a03f7a0..6fcbc10866 100644 --- a/man/merge.Rd +++ b/man/merge.Rd @@ -4,7 +4,8 @@ \title{Merge two data.tables} \description{ Fast merge of two \code{data.table}s. The \code{data.table} method behaves -very similarly to that of \code{data.frame}s except that, by default, it attempts to merge +similarly to \code{data.frame} except that row order is specified, and by +default the columns to merge on are chosen: \itemize{ \item at first based on the shared key columns, and if there are none, @@ -13,7 +14,7 @@ very similarly to that of \code{data.frame}s except that, by default, it attempt \item then based on the common columns between the two \code{data.table}s. } -Set the \code{by}, or \code{by.x} and \code{by.y} arguments explicitly to override this default. +Use the \code{by}, \code{by.x} and \code{by.y} arguments explicitly to override this default. 
} \usage{ @@ -32,15 +33,17 @@ If \code{y} has no key columns, this defaults to the key of \code{x}.} \item{by.x, by.y}{Vectors of column names in \code{x} and \code{y} to merge on.} \item{all}{logical; \code{all = TRUE} is shorthand to save setting both \code{all.x = TRUE} and \code{all.y = TRUE}.} -\item{all.x}{logical; if \code{TRUE}, then extra rows will be added to the -output, one for each row in \code{x} that has no matching row in \code{y}. -These rows will have 'NA's in those columns that are usually filled with values -from \code{y}. The default is \code{FALSE}, so that only rows with data from both -\code{x} and \code{y} are included in the output.} +\item{all.x}{logical; if \code{TRUE}, rows from \code{x} which have no matching row +in \code{y} are included. These rows will have 'NA's in the columns that are usually +filled with values from \code{y}. The default is \code{FALSE} so that only rows with +data from both \code{x} and \code{y} are included in the output.} \item{all.y}{logical; analogous to \code{all.x} above.} -\item{sort}{logical. If \code{TRUE} (default), the merged \code{data.table} is -sorted by setting the key to the \code{by / by.x} columns. If \code{FALSE}, the -result is not sorted.} +\item{sort}{logical. If \code{TRUE} (default), the rows of the merged +\code{data.table} are sorted by setting the key to the \code{by / by.x} columns. If +\code{FALSE}, unlike base R's \code{merge} for which row order is unspecified, the +row order in \code{x} is retained (including retaining the position of missings when +\code{all.x=TRUE}), followed by \code{y} rows that don't match \code{x} (when \code{all.y=TRUE}) +retaining the order those appear in \code{y}.} \item{suffixes}{A \code{character(2)} specifying the suffixes to be used for making non-\code{by} column names unique. 
The suffix behaviour works in a similar fashion as the \code{\link{merge.data.frame}} method does.} @@ -54,27 +57,12 @@ as any \code{by.x}.} \details{ \code{\link{merge}} is a generic function in base R. It dispatches to either the \code{merge.data.frame} method or \code{merge.data.table} method depending on -the class of its first argument. Note that, unlike \code{SQL}, \code{NA} is +the class of its first argument. Note that, unlike \code{SQL} join, \code{NA} is matched against \code{NA} (and \code{NaN} against \code{NaN}) while merging. -In versions \code{<= v1.9.4}, if the specified columns in \code{by} were not the -key (or head of the key) of \code{x} or \code{y}, then a \code{\link{copy}} is -first re-keyed prior to performing the merge. This was less performant as well as memory -inefficient. The concept of secondary keys (implemented in \code{v1.9.4}) was -used to overcome this limitation from \code{v1.9.6}+. No deep copies are made -any more, thereby improving performance and memory efficiency. Also, there is better -control for providing the columns to merge on with the help of the newly implemented -\code{by.x} and \code{by.y} arguments. - For a more \code{data.table}-centric way of merging two \code{data.table}s, see \code{\link{[.data.table}}; e.g., \code{x[y, \dots]}. See FAQ 1.11 for a detailed comparison of \code{merge} and \code{x[y, \dots]}. - -If any column names provided to \code{by.x} also occur in \code{names(y)} but not in \code{by.y}, -then this \code{data.table} method will add the \code{suffixes} to those column names. As of -R v3.4.3, the \code{data.frame} method will not (leading to duplicate column names in the result) but a patch has -been proposed (see r-devel thread \href{https://r.789695.n4.nabble.com/Duplicate-column-names-created-by-base-merge-when-by-x-has-the-same-name-as-a-column-in-y-td4748345.html}{here}) -which is looking likely to be accepted for a future version of R. } \value{ @@ -84,7 +72,7 @@ set to \code{TRUE}. 
} \seealso{ -\code{\link{data.table}}, \code{\link{as.data.table}}, \code{\link{[.data.table}}, +\code{\link{data.table}}, \code{\link{setkey}}, \code{\link{[.data.table}}, \code{\link{merge.data.frame}} } @@ -125,14 +113,14 @@ merge(d4, d1) merge(d1, d4, all=TRUE) merge(d4, d1, all=TRUE) -# new feature, no need to set keys anymore +# setkey is automatic by default set.seed(1L) d1 <- data.table(a=sample(rep(1:3,each=2)), z=1:6) d2 <- data.table(a=2:0, z=10:12) merge(d1, d2, by="a") merge(d1, d2, by="a", all=TRUE) -# new feature, using by.x and by.y arguments +# using by.x and by.y setnames(d2, "a", "b") merge(d1, d2, by.x="a", by.y="b") merge(d1, d2, by.x="a", by.y="b", all=TRUE) diff --git a/man/setkey.Rd b/man/setkey.Rd index daf10c83ad..ca386c58d7 100644 --- a/man/setkey.Rd +++ b/man/setkey.Rd @@ -9,16 +9,23 @@ \title{ Create key on a data.table } \description{ \code{setkey} sorts a \code{data.table} and marks it as sorted with an -attribute \code{sorted}. The sorted columns are the key. The key can be any -number of columns. The columns are always sorted in \emph{ascending} order. The table -is changed \emph{by reference} and \code{setkey} is very memory efficient. - -There are three reasons \code{setkey} is desirable: i) binary search and joins are faster -when they detect they can use an existing key, ii) grouping by a leading subset of the key -columns is faster because the groups are already gathered contiguously in RAM, iii) -simpler shorter syntax; e.g. \code{DT["id",]} finds the group "id" in the first column -of DT's key using binary search. It may be helpful to think of a key as -super-charged rownames: multi-column and multi-type rownames. +attribute \code{"sorted"}. The sorted columns are the key. The key can be any +number of columns. The data is always sorted in \emph{ascending} order with \code{NA}s +(if any) always first. 
The table is changed \emph{by reference} and there is +no memory used for the key (other than marking which columns the data is sorted by). + +There are three reasons \code{setkey} is desirable: +\itemize{ + \item binary search and joins are faster when they detect they can use an existing key + \item grouping by a leading subset of the key columns is faster because the groups are already gathered contiguously in RAM + \item simpler shorter syntax; e.g. \code{DT["id",]} finds the group "id" in the first column of \code{DT}'s key using binary search. It may be helpful to think of a key as super-charged rownames: multi-column and multi-type. +} + +\code{NA}s are always first because: +\itemize{ + \item \code{NA} is internally \code{INT_MIN} (a large negative number) in R. Keys and indexes are always in increasing order so if \code{NA}s are first, no special treatment or branch is needed in many \code{data.table} internals involving binary search. It is not optional to place \code{NA}s last for speed, simplicity and robustness of internals at C level. + \item if any \code{NA}s are present then we believe it is better to display them up front (rather than hiding them at the end) to reduce the risk of not realizing \code{NA}s are present. +} In \code{data.table} parlance, all \code{set*} functions change their input \emph{by reference}. That is, no copy is made at all other than for temporary