Merged
2 changes: 2 additions & 0 deletions NEWS.md
@@ -185,6 +185,8 @@

11. `melt()`'s internal C code is now more memory efficient, [#5054](https://github.com/Rdatatable/data.table/pull/5054). Thanks to Toby Dylan Hocking for the PR.

12. `?merge` and `?setkey` have been updated to clarify that the row order is retained when `sort=FALSE`, and why `NA`s are always first when `sort=TRUE`, [#2574](https://github.com/Rdatatable/data.table/issues/2574) [#2594](https://github.com/Rdatatable/data.table/issues/2594). Thanks to Davor Josipovic and Markus Bonsch for the reports, and Jan Gorecki for the PR.


# data.table [v1.14.0](https://github.com/Rdatatable/data.table/milestone/23?closed=1) (21 Feb 2021)

46 changes: 17 additions & 29 deletions man/merge.Rd
@@ -4,7 +4,8 @@
\title{Merge two data.tables}
\description{
Fast merge of two \code{data.table}s. The \code{data.table} method behaves
very similarly to that of \code{data.frame}s except that, by default, it attempts to merge
similarly to \code{data.frame} except that row order is specified, and by
default the columns to merge on are chosen:

\itemize{
\item at first based on the shared key columns, and if there are none,
@@ -13,7 +14,7 @@ very similarly to that of \code{data.frame}s except that, by default, it attempt
\item then based on the common columns between the two \code{data.table}s.
}

Set the \code{by}, or \code{by.x} and \code{by.y} arguments explicitly to override this default.
Use the \code{by}, \code{by.x} and \code{by.y} arguments explicitly to override this default.
}

\usage{
@@ -32,15 +33,17 @@ If \code{y} has no key columns, this defaults to the key of \code{x}.}
\item{by.x, by.y}{Vectors of column names in \code{x} and \code{y} to merge on.}
\item{all}{logical; \code{all = TRUE} is shorthand to save setting both
\code{all.x = TRUE} and \code{all.y = TRUE}.}
\item{all.x}{logical; if \code{TRUE}, then extra rows will be added to the
output, one for each row in \code{x} that has no matching row in \code{y}.
These rows will have 'NA's in those columns that are usually filled with values
from \code{y}. The default is \code{FALSE}, so that only rows with data from both
\code{x} and \code{y} are included in the output.}
\item{all.x}{logical; if \code{TRUE}, rows from \code{x} which have no matching row
in \code{y} are included. These rows will have 'NA's in the columns that are usually
filled with values from \code{y}. The default is \code{FALSE} so that only rows with
data from both \code{x} and \code{y} are included in the output.}
\item{all.y}{logical; analogous to \code{all.x} above.}
\item{sort}{logical. If \code{TRUE} (default), the merged \code{data.table} is
sorted by setting the key to the \code{by / by.x} columns. If \code{FALSE}, the
result is not sorted.}
\item{sort}{logical. If \code{TRUE} (default), the rows of the merged
\code{data.table} are sorted by setting the key to the \code{by / by.x} columns. If
\code{FALSE}, unlike base R's \code{merge} for which row order is unspecified, the
row order in \code{x} is retained (including retaining the position of missings when
\code{all.x=TRUE}), followed by \code{y} rows that don't match \code{x} (when \code{all.y=TRUE})
retaining the order those appear in \code{y}.}
\item{suffixes}{A \code{character(2)} specifying the suffixes to be used for
making non-\code{by} column names unique. The suffix behaviour works in a similar
fashion as the \code{\link{merge.data.frame}} method does.}
@@ -54,27 +57,12 @@ as any \code{by.x}.}
\details{
\code{\link{merge}} is a generic function in base R. It dispatches to either the
\code{merge.data.frame} method or \code{merge.data.table} method depending on
the class of its first argument. Note that, unlike \code{SQL}, \code{NA} is
the class of its first argument. Note that, unlike \code{SQL} join, \code{NA} is
matched against \code{NA} (and \code{NaN} against \code{NaN}) while merging.

In versions \code{<= v1.9.4}, if the specified columns in \code{by} were not the
key (or head of the key) of \code{x} or \code{y}, then a \code{\link{copy}} is
first re-keyed prior to performing the merge. This was less performant as well as memory
inefficient. The concept of secondary keys (implemented in \code{v1.9.4}) was
used to overcome this limitation from \code{v1.9.6}+. No deep copies are made
any more, thereby improving performance and memory efficiency. Also, there is better
control for providing the columns to merge on with the help of the newly implemented
\code{by.x} and \code{by.y} arguments.

For a more \code{data.table}-centric way of merging two \code{data.table}s, see
\code{\link{[.data.table}}; e.g., \code{x[y, \dots]}. See FAQ 1.11 for a detailed
comparison of \code{merge} and \code{x[y, \dots]}.

If any column names provided to \code{by.x} also occur in \code{names(y)} but not in \code{by.y},
then this \code{data.table} method will add the \code{suffixes} to those column names. As of
R v3.4.3, the \code{data.frame} method will not (leading to duplicate column names in the result) but a patch has
been proposed (see r-devel thread \href{https://r.789695.n4.nabble.com/Duplicate-column-names-created-by-base-merge-when-by-x-has-the-same-name-as-a-column-in-y-td4748345.html}{here})
which is looking likely to be accepted for a future version of R.
}

\value{
@@ -84,7 +72,7 @@ set to \code{TRUE}.
}

\seealso{
\code{\link{data.table}}, \code{\link{as.data.table}}, \code{\link{[.data.table}},
\code{\link{data.table}}, \code{\link{setkey}}, \code{\link{[.data.table}},
\code{\link{merge.data.frame}}
}

@@ -125,14 +113,14 @@ merge(d4, d1)
merge(d1, d4, all=TRUE)
merge(d4, d1, all=TRUE)

# new feature, no need to set keys anymore
# setkey is automatic by default
set.seed(1L)
d1 <- data.table(a=sample(rep(1:3,each=2)), z=1:6)
d2 <- data.table(a=2:0, z=10:12)
merge(d1, d2, by="a")
merge(d1, d2, by="a", all=TRUE)

# new feature, using by.x and by.y arguments
# using by.x and by.y
setnames(d2, "a", "b")
merge(d1, d2, by.x="a", by.y="b")
merge(d1, d2, by.x="a", by.y="b", all=TRUE)
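The row-order behaviour described for the \code{sort} argument can be illustrated with a small sketch. This is not part of the PR; the tables \code{x} and \code{y} and their columns \code{vx} and \code{vy} are hypothetical, and it assumes a \code{data.table} version that includes this documentation change.

```r
library(data.table)

# Hypothetical inputs, deliberately not in sorted order
x <- data.table(id = c(3L, 1L, 2L), vx = c("c", "a", "b"))
y <- data.table(id = c(2L, 4L, 3L), vy = c("B", "D", "C"))

# sort=TRUE (default): the result is keyed by the 'by' column,
# so its rows come back sorted on id
sorted <- merge(x, y, by = "id", all = TRUE)
print(key(sorted))
print(sorted)

# sort=FALSE: no key is set; per the documentation, rows follow the
# order of x, followed by rows of y that have no match in x
unsorted <- merge(x, y, by = "id", all = TRUE, sort = FALSE)
print(key(unsorted))
print(unsorted)
```

Comparing the two printed results shows the difference between a keyed (sorted) result and one that preserves input order.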
27 changes: 17 additions & 10 deletions man/setkey.Rd
@@ -9,16 +9,23 @@
\title{ Create key on a data.table }
\description{
\code{setkey} sorts a \code{data.table} and marks it as sorted with an
attribute \code{sorted}. The sorted columns are the key. The key can be any
number of columns. The columns are always sorted in \emph{ascending} order. The table
is changed \emph{by reference} and \code{setkey} is very memory efficient.

There are three reasons \code{setkey} is desirable: i) binary search and joins are faster
when they detect they can use an existing key, ii) grouping by a leading subset of the key
columns is faster because the groups are already gathered contiguously in RAM, iii)
simpler shorter syntax; e.g. \code{DT["id",]} finds the group "id" in the first column
of DT's key using binary search. It may be helpful to think of a key as
super-charged rownames: multi-column and multi-type rownames.
attribute \code{"sorted"}. The sorted columns are the key. The key can be any
number of columns. The data is always sorted in \emph{ascending} order with \code{NA}s
(if any) always first. The table is changed \emph{by reference} and there is
no memory used for the key (other than marking which columns the data is sorted by).

There are three reasons \code{setkey} is desirable:
\itemize{
\item binary search and joins are faster when they detect they can use an existing key
\item grouping by a leading subset of the key columns is faster because the groups are already gathered contiguously in RAM
\item simpler shorter syntax; e.g. \code{DT["id",]} finds the group "id" in the first column of \code{DT}'s key using binary search. It may be helpful to think of a key as super-charged rownames: multi-column and multi-type.
}

\code{NA}s are always first because:
\itemize{
\item \code{NA} is internally \code{INT_MIN} (a large negative number) in R. Keys and indexes are always in increasing order so if \code{NA}s are first, no special treatment or branch is needed in many \code{data.table} internals involving binary search. Placing \code{NA}s last is not offered as an option, for speed, simplicity and robustness of internals at C level.
\item if any \code{NA}s are present then we believe it is better to display them up front (rather than hiding them at the end) to reduce the risk of not realizing \code{NA}s are present.
}

In \code{data.table} parlance, all \code{set*} functions change their input
\emph{by reference}. That is, no copy is made at all other than for temporary