Rdatatable · mattdowle · Aug 2, 2021 · May 4, 2020 · Aug 2, 2021 · Aug 2, 2021
@@ -185,6 +185,8 @@
 
 11. `melt()`'s internal C code is now more memory efficient, [#5054](https://github.com/Rdatatable/data.table/pull/5054). Thanks to Toby Dylan Hocking for the PR.
 
+12. `?merge` and `?setkey` have been updated to clarify that the row order is retained when `sort=FALSE`, and why `NA`s are always first when `sort=TRUE`, [#2574](https://github.com/Rdatatable/data.table/issues/2574) [#2594](https://github.com/Rdatatable/data.table/issues/2594). Thanks to Davor Josipovic and Markus Bonsch for the reports, and Jan Gorecki for the PR.
+
 
 # data.table [v1.14.0](https://github.com/Rdatatable/data.table/milestone/23?closed=1)  (21 Feb 2021)
 

@@ -4,7 +4,8 @@
 \title{Merge two data.tables}
 \description{
 Fast merge of two \code{data.table}s. The \code{data.table} method behaves
-very similarly to that of \code{data.frame}s except that, by default, it attempts to merge
+similarly to \code{data.frame} except that row order is specified, and by
+default the columns to merge on are chosen:
 
 \itemize{
   \item at first based on the shared key columns, and if there are none,
@@ -13,7 +14,7 @@ very similarly to that of \code{data.frame}s except that, by default, it attempt
   \item then based on the common columns between the two \code{data.table}s.
 }
 
-Set the \code{by}, or \code{by.x} and \code{by.y} arguments explicitly to override this default.
+Use the \code{by}, \code{by.x} and \code{by.y} arguments explicitly to override this default.
 }
 
 \usage{
@@ -32,15 +33,17 @@ If \code{y} has no key columns, this defaults to the key of \code{x}.}
 \item{by.x, by.y}{Vectors of column names in \code{x} and \code{y} to merge on.}
 \item{all}{logical; \code{all = TRUE} is shorthand to save setting both
 \code{all.x = TRUE} and \code{all.y = TRUE}.}
-\item{all.x}{logical; if \code{TRUE}, then extra rows will be added to the
-output, one for each row in \code{x} that has no matching row in \code{y}.
-These rows will have 'NA's in those columns that are usually filled with values
-from \code{y}.  The default is \code{FALSE}, so that only rows with data from both
-\code{x} and \code{y} are included in the output.}
+\item{all.x}{logical; if \code{TRUE}, rows from \code{x} which have no matching row
+in \code{y} are included. These rows will have 'NA's in the columns that are usually
+filled with values from \code{y}. The default is \code{FALSE} so that only rows with
+data from both \code{x} and \code{y} are included in the output.}
 \item{all.y}{logical; analogous to \code{all.x} above.}
-\item{sort}{logical. If \code{TRUE} (default), the merged \code{data.table} is
-sorted by setting the key to the \code{by / by.x} columns. If \code{FALSE}, the
-result is not sorted.}
+\item{sort}{logical. If \code{TRUE} (default), the rows of the merged
+\code{data.table} are sorted by setting the key to the \code{by / by.x} columns. If
+\code{FALSE}, unlike base R's \code{merge} for which row order is unspecified, the
+row order in \code{x} is retained (including retaining the position of missings when
+\code{all.x=TRUE}), followed by \code{y} rows that don't match \code{x} (when \code{all.y=TRUE})
+retaining the order those appear in \code{y}.}
 \item{suffixes}{A \code{character(2)} specifying the suffixes to be used for
 making non-\code{by} column names unique. The suffix behaviour works in a similar
 fashion as the \code{\link{merge.data.frame}} method does.}
@@ -54,27 +57,12 @@ as any \code{by.x}.}
 \details{
 \code{\link{merge}} is a generic function in base R. It dispatches to either the
 \code{merge.data.frame} method or \code{merge.data.table} method depending on
-the class of its first argument. Note that, unlike \code{SQL}, \code{NA} is
+the class of its first argument. Note that, unlike a \code{SQL} join, \code{NA} is
 matched against \code{NA} (and \code{NaN} against \code{NaN}) while merging.
 
-In versions \code{<= v1.9.4}, if the specified columns in \code{by} were not the
-key (or head of the key) of \code{x} or \code{y}, then a \code{\link{copy}} is
-first re-keyed prior to performing the merge. This was less performant as well as memory
-inefficient. The concept of secondary keys (implemented in \code{v1.9.4}) was
-used to overcome this limitation from \code{v1.9.6}+. No deep copies are made
-any more, thereby improving performance and memory efficiency. Also, there is better
-control for providing the columns to merge on with the help of the newly implemented
-\code{by.x} and \code{by.y} arguments.
-
 For a more \code{data.table}-centric way of merging two \code{data.table}s, see
 \code{\link{[.data.table}}; e.g., \code{x[y, \dots]}. See FAQ 1.11 for a detailed
 comparison of \code{merge} and \code{x[y, \dots]}.
-
-If any column names provided to \code{by.x} also occur in \code{names(y)} but not in \code{by.y},
-then this \code{data.table} method will add the \code{suffixes} to those column names. As of
-R v3.4.3, the \code{data.frame} method will not (leading to duplicate column names in the result) but a patch has
-been proposed (see r-devel thread \href{https://r.789695.n4.nabble.com/Duplicate-column-names-created-by-base-merge-when-by-x-has-the-same-name-as-a-column-in-y-td4748345.html}{here})
-which is looking likely to be accepted for a future version of R.
 }
 
 \value{
@@ -84,7 +72,7 @@ set to \code{TRUE}.
 }
 
 \seealso{
-\code{\link{data.table}}, \code{\link{as.data.table}}, \code{\link{[.data.table}},
+\code{\link{data.table}}, \code{\link{setkey}}, \code{\link{[.data.table}},
 \code{\link{merge.data.frame}}
 }
 
@@ -125,14 +113,14 @@ merge(d4, d1)
 merge(d1, d4, all=TRUE)
 merge(d4, d1, all=TRUE)
 
-# new feature, no need to set keys anymore
+# setkey is automatic by default
 set.seed(1L)
 d1 <- data.table(a=sample(rep(1:3,each=2)), z=1:6)
 d2 <- data.table(a=2:0, z=10:12)
 merge(d1, d2, by="a")
 merge(d1, d2, by="a", all=TRUE)
 
-# new feature, using by.x and by.y arguments
+# using by.x and by.y
 setnames(d2, "a", "b")
 merge(d1, d2, by.x="a", by.y="b")
 merge(d1, d2, by.x="a", by.y="b", all=TRUE)
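
The row-order wording added to the `sort` argument above can be checked with a small sketch. The tables `x` and `y` and their columns `id`, `vx` and `vy` are made up for illustration and are not part of this PR; the expected orders in the comments follow from the documented behaviour.

```r
library(data.table)

# Illustrative inputs (hypothetical, not from the PR): x is deliberately
# unsorted and contains an NA in the merge column.
x <- data.table(id = c(3L, 1L, NA, 2L), vx = c("c", "a", "na", "b"))
y <- data.table(id = c(2L, 4L, 3L), vy = c("B", "D", "C"))

# sort=TRUE (default): the result is keyed on 'id', so rows are ascending
# with the NA first.
sorted <- merge(x, y, by = "id", all = TRUE)
sorted$id    # NA 1 2 3 4
key(sorted)  # "id"

# sort=FALSE: the row order of x is retained (the NA stays in position 3),
# followed by the y rows that have no match in x (here id 4).
unsorted <- merge(x, y, by = "id", all = TRUE, sort = FALSE)
unsorted$id    # 3 1 NA 2 4
key(unsorted)  # NULL
```

Because `sort=FALSE` skips the keying step, the result carries no key; call `setkey()` afterwards if keyed lookups are needed.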

@@ -9,16 +9,23 @@
 \title{ Create key on a data.table }
 \description{
 \code{setkey} sorts a \code{data.table} and marks it as sorted with an
-attribute \code{sorted}. The sorted columns are the key. The key can be any
-number of columns. The columns are always sorted in \emph{ascending} order. The table
-is changed \emph{by reference} and \code{setkey} is very memory efficient.
-
-There are three reasons \code{setkey} is desirable: i) binary search and joins are faster
-when they detect they can use an existing key, ii) grouping by a leading subset of the key
-columns is faster because the groups are already gathered contiguously in RAM, iii)
-simpler shorter syntax; e.g. \code{DT["id",]} finds the group "id" in the first column
-of DT's key using binary search. It may be helpful to think of a key as
-super-charged rownames: multi-column and multi-type rownames.
+attribute \code{"sorted"}. The sorted columns are the key. The key can be any
+number of columns. The data is always sorted in \emph{ascending} order with \code{NA}s
+(if any) always first. The table is changed \emph{by reference} and there is
+no memory used for the key (other than marking which columns the data is sorted by).
+
+There are three reasons \code{setkey} is desirable:
+\itemize{
+  \item binary search and joins are faster when they detect they can use an existing key
+  \item grouping by a leading subset of the key columns is faster because the groups are already gathered contiguously in RAM
+  \item simpler shorter syntax; e.g. \code{DT["id",]} finds the group "id" in the first column of \code{DT}'s key using binary search. It may be helpful to think of a key as super-charged rownames: multi-column and multi-type.
+}
+
+\code{NA}s are always first because:
+\itemize{
+  \item \code{NA} is internally \code{INT_MIN} (a large negative number) in R. Keys and indexes are always in increasing order so if \code{NA}s are first, no special treatment or branch is needed in many \code{data.table} internals involving binary search. Placing \code{NA}s last is not offered as an option, for speed, simplicity and robustness of the internals at C level.
+  \item if any \code{NA}s are present then we believe it is better to display them up front (rather than hiding them at the end) to reduce the risk of not realizing \code{NA}s are present.
+}
 
 In \code{data.table} parlance, all \code{set*} functions change their input
 \emph{by reference}. That is, no copy is made at all other than for temporary
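
The "`NA`s are always first" statement in the new `?setkey` description can be illustrated with a minimal sketch. The table `DT` and its columns `id` and `v` are hypothetical examples, not taken from the PR.

```r
library(data.table)

# Illustrative table (hypothetical): two NAs scattered through the key column.
DT <- data.table(id = c(2L, NA, 1L, NA), v = 1:4)

# setkey() sorts DT by reference in ascending order and marks it as sorted.
setkey(DT, id)

DT$id               # NA NA 1 2   (NAs first, then ascending)
DT$v                # 2 4 3 1     (rows moved together with their id)
key(DT)             # "id"
attr(DT, "sorted")  # "id"        (the key is only this attribute)
```

The last line shows that the key is stored as the `"sorted"` attribute alone, which matches the description's point that no extra memory is used for it.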