From be246bf3ec5588707760ffaad880d0903280c06e Mon Sep 17 00:00:00 2001
From: jangorecki
Date: Mon, 4 May 2020 22:31:14 +0100
Subject: [PATCH 01/12] merge sort arg document better

---
 man/merge.Rd | 18 +++++-------------
 1 file changed, 5 insertions(+), 13 deletions(-)

diff --git a/man/merge.Rd b/man/merge.Rd
index 65f1f14948..41618941cc 100644
--- a/man/merge.Rd
+++ b/man/merge.Rd
@@ -38,9 +38,10 @@ These rows will have 'NA's in those columns that are usually filled with values
 from \code{y}. The default is \code{FALSE}, so that only rows with data from both
 \code{x} and \code{y} are included in the output.}
 \item{all.y}{logical; analogous to \code{all.x} above.}
-\item{sort}{logical. If \code{TRUE} (default), the merged \code{data.table} is
-sorted by setting the key to the \code{by / by.x} columns. If \code{FALSE}, the
-result is not sorted.}
+\item{sort}{logical. If \code{TRUE} (default), then rows of merged
+\code{data.table} are sorted by setting the key to the \code{by / by.x} columns.
+Note that \code{NA}s are placed in front, unlike base R sort. If \code{FALSE},
+then rows are in an unspecified order.}
 \item{suffixes}{A \code{character(2)} specifying the suffixes to be used for
 making non-\code{by} column names unique. The suffix behaviour works in a similar
 fashion as the \code{\link{merge.data.frame}} method does.}
@@ -54,18 +55,9 @@ as any \code{by.x}.}
 \details{
 \code{\link{merge}} is a generic function in base R. It dispatches to either the
 \code{merge.data.frame} method or \code{merge.data.table} method depending on
-the class of its first argument. Note that, unlike \code{SQL}, \code{NA} is
+the class of its first argument. Note that, unlike \code{SQL} join, \code{NA} is
 matched against \code{NA} (and \code{NaN} against \code{NaN}) while merging.
 
-In versions \code{<= v1.9.4}, if the specified columns in \code{by} were not the
-key (or head of the key) of \code{x} or \code{y}, then a \code{\link{copy}} is
-first re-keyed prior to performing the merge. This was less performant as well as memory
-inefficient. The concept of secondary keys (implemented in \code{v1.9.4}) was
-used to overcome this limitation from \code{v1.9.6}+. No deep copies are made
-any more, thereby improving performance and memory efficiency. Also, there is better
-control for providing the columns to merge on with the help of the newly implemented
-\code{by.x} and \code{by.y} arguments.
-
 For a more \code{data.table}-centric way of merging two \code{data.table}s,
 see \code{\link{[.data.table}}; e.g., \code{x[y, \dots]}. See FAQ 1.11 for a detailed
 comparison of \code{merge} and \code{x[y, \dots]}.

From 31a6ee9874546664a741a5db9ee73731bd76371c Mon Sep 17 00:00:00 2001
From: Matt Dowle
Date: Mon, 2 Aug 2021 12:23:06 -0600
Subject: [PATCH 02/12] row order is specified when sort=FALSE

---
 man/merge.Rd | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/man/merge.Rd b/man/merge.Rd
index 07dc4c1141..8704dc3fb9 100644
--- a/man/merge.Rd
+++ b/man/merge.Rd
@@ -38,10 +38,11 @@ These rows will have 'NA's in those columns that are usually filled with values
 from \code{y}. The default is \code{FALSE}, so that only rows with data from both
 \code{x} and \code{y} are included in the output.}
 \item{all.y}{logical; analogous to \code{all.x} above.}
-\item{sort}{logical. If \code{TRUE} (default), then rows of merged
+\item{sort}{logical. If \code{TRUE} (default), the rows of the merged
 \code{data.table} are sorted by setting the key to the \code{by / by.x} columns.
-Note that \code{NA}s are placed in front, unlike base R sort. If \code{FALSE},
-then rows are in an unspecified order.}
+Note that \code{NA}s are placed in front, unlike base R sort (**add why**). If
+\code{FALSE}, unlike base R's merge for which row order is unspecified, the row
+order in `x` is retained, followed by `y` rows that don't match `x` (when \code{all.y=TRUE}) in the order they appear in `y`.}
 \item{suffixes}{A \code{character(2)} specifying the suffixes to be used for
 making non-\code{by} column names unique. The suffix behaviour works in a similar
 fashion as the \code{\link{merge.data.frame}} method does.}

From 8c33e3ca980e988611c5b9e289ede96210e73485 Mon Sep 17 00:00:00 2001
From: Matt Dowle
Date: Mon, 2 Aug 2021 12:33:14 -0600
Subject: [PATCH 03/12] remove 'NAs are placed in front' because there are
 unlikely to be any NAs in key columns, and in this context of missings in
 merge, that sentence was more likely to mislead towards putting missings
 from all.y/all.x up front

---
 man/merge.Rd | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/man/merge.Rd b/man/merge.Rd
index 8704dc3fb9..fbc4fac1d5 100644
--- a/man/merge.Rd
+++ b/man/merge.Rd
@@ -39,10 +39,10 @@ from \code{y}. The default is \code{FALSE}, so that only rows with data from bo
 \code{x} and \code{y} are included in the output.}
 \item{all.y}{logical; analogous to \code{all.x} above.}
 \item{sort}{logical. If \code{TRUE} (default), the rows of the merged
-\code{data.table} are sorted by setting the key to the \code{by / by.x} columns.
-Note that \code{NA}s are placed in front, unlike base R sort (**add why**). If
-\code{FALSE}, unlike base R's merge for which row order is unspecified, the row
-order in `x` is retained, followed by `y` rows that don't match `x` (when \code{all.y=TRUE}) in the order they appear in `y`.}
+\code{data.table} are sorted by setting the key to the \code{by / by.x} columns. If
+\code{FALSE}, unlike base R's \code{merge} for which row order is unspecified, the row order
+in `x` is retained, followed by `y` rows that don't match `x` (when \code{all.y=TRUE})
+retaining the order those appear in `y`.}
 \item{suffixes}{A \code{character(2)} specifying the suffixes to be used for
 making non-\code{by} column names unique. The suffix behaviour works in a similar
 fashion as the \code{\link{merge.data.frame}} method does.}

From a0eeb853cd0c0a9eb1acce96b9bb775789d1bce9 Mon Sep 17 00:00:00 2001
From: Matt Dowle
Date: Mon, 2 Aug 2021 12:40:11 -0600
Subject: [PATCH 04/12] retaining order of missings specified too, and
 backticks to code{}

---
 man/merge.Rd | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/man/merge.Rd b/man/merge.Rd
index fbc4fac1d5..f7528b70e1 100644
--- a/man/merge.Rd
+++ b/man/merge.Rd
@@ -40,9 +40,10 @@ from \code{y}. The default is \code{FALSE}, so that only rows with data from bo
 \item{all.y}{logical; analogous to \code{all.x} above.}
 \item{sort}{logical. If \code{TRUE} (default), the rows of the merged
 \code{data.table} are sorted by setting the key to the \code{by / by.x} columns. If
-\code{FALSE}, unlike base R's \code{merge} for which row order is unspecified, the row order
-in `x` is retained, followed by `y` rows that don't match `x` (when \code{all.y=TRUE})
-retaining the order those appear in `y`.}
+\code{FALSE}, unlike base R's \code{merge} for which row order is unspecified, the
+row order in \code{x} is retained (including retaining the position of missings when
+\code{all.x=TRUE}), followed by \code{y} rows that don't match \code{x} (when \code{all.y=TRUE})
+retaining the order those appear in \code{y}.}
 \item{suffixes}{A \code{character(2)} specifying the suffixes to be used for
 making non-\code{by} column names unique. The suffix behaviour works in a similar
 fashion as the \code{\link{merge.data.frame}} method does.}

From 0b0977f25a951d4958d8ccc76ac1278f5bb8480a Mon Sep 17 00:00:00 2001
From: Matt Dowle
Date: Mon, 2 Aug 2021 12:47:36 -0600
Subject: [PATCH 05/12] removed 'extra' and 'added' from all.x item to avoid
 being interpreted as 'added at the end'

---
 man/merge.Rd | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/man/merge.Rd b/man/merge.Rd
index f7528b70e1..06142c1e96 100644
--- a/man/merge.Rd
+++ b/man/merge.Rd
@@ -32,11 +32,10 @@ If \code{y} has no key columns, this defaults to the key of \code{x}.}
 \item{by.x, by.y}{Vectors of column names in \code{x} and \code{y} to merge on.}
 \item{all}{logical; \code{all = TRUE} is shorthand to save setting both
 \code{all.x = TRUE} and \code{all.y = TRUE}.}
-\item{all.x}{logical; if \code{TRUE}, then extra rows will be added to the
-output, one for each row in \code{x} that has no matching row in \code{y}.
-These rows will have 'NA's in those columns that are usually filled with values
-from \code{y}. The default is \code{FALSE}, so that only rows with data from both
-\code{x} and \code{y} are included in the output.}
+\item{all.x}{logical; if \code{TRUE}, rows from \code{x} which have no matching row
+in \code{y} are included. These rows will have 'NA's in those columns that are usually
+filled with values from \code{y}. The default is \code{FALSE} so that only rows with
+data from both \code{x} and \code{y} are included in the output.}
 \item{all.y}{logical; analogous to \code{all.x} above.}
 \item{sort}{logical. If \code{TRUE} (default), the rows of the merged
 \code{data.table} are sorted by setting the key to the \code{by / by.x} columns. If

From 77d6c0f6518a6ad26d6e37b2e7256d7b15882882 Mon Sep 17 00:00:00 2001
From: Matt Dowle
Date: Mon, 2 Aug 2021 12:56:28 -0600
Subject: [PATCH 06/12] remove 'new feature' as old now

---
 man/merge.Rd | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/man/merge.Rd b/man/merge.Rd
index 06142c1e96..5937b28baf 100644
--- a/man/merge.Rd
+++ b/man/merge.Rd
@@ -118,14 +118,14 @@ merge(d4, d1)
 merge(d1, d4, all=TRUE)
 merge(d4, d1, all=TRUE)
 
-# new feature, no need to set keys anymore
+# setkey is automatic by default
 set.seed(1L)
 d1 <- data.table(a=sample(rep(1:3,each=2)), z=1:6)
 d2 <- data.table(a=2:0, z=10:12)
 merge(d1, d2, by="a")
 merge(d1, d2, by="a", all=TRUE)
 
-# new feature, using by.x and by.y arguments
+# using by.x and by.y
 setnames(d2, "a", "b")
 merge(d1, d2, by.x="a", by.y="b")
 merge(d1, d2, by.x="a", by.y="b", all=TRUE)

From b8331a2428ba7bd25a4dae33fccdc234715f2a77 Mon Sep 17 00:00:00 2001
From: Matt Dowle
Date: Mon, 2 Aug 2021 13:11:04 -0600
Subject: [PATCH 07/12] link to setkey, add colon before itemized list at the
 top

---
 man/merge.Rd | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/man/merge.Rd b/man/merge.Rd
index 5937b28baf..125a061eb6 100644
--- a/man/merge.Rd
+++ b/man/merge.Rd
@@ -4,7 +4,8 @@
 \title{Merge two data.tables}
 \description{
 Fast merge of two \code{data.table}s. The \code{data.table} method behaves
-very similarly to that of \code{data.frame}s except that, by default, it attempts to merge
+very similarly to that of \code{data.frame}s except that, by default, it chooses
+the columns to merge on:
 
 \itemize{
   \item at first based on the shared key columns, and if there are none,
@@ -13,7 +14,7 @@ very similarly to that of \code{data.frame}s except that, by default, it attempt
   \item then based on the common columns between the two \code{data.table}s.
 }
 
-Set the \code{by}, or \code{by.x} and \code{by.y} arguments explicitly to override this default.
+Use the \code{by}, \code{by.x} and \code{by.y} arguments explicitly to override this default.
 }
 
 \usage{
@@ -33,7 +34,7 @@ If \code{y} has no key columns, this defaults to the key of \code{x}.}
 \item{all}{logical; \code{all = TRUE} is shorthand to save setting both
 \code{all.x = TRUE} and \code{all.y = TRUE}.}
 \item{all.x}{logical; if \code{TRUE}, rows from \code{x} which have no matching row
-in \code{y} are included. These rows will have 'NA's in those columns that are usually
+in \code{y} are included. These rows will have 'NA's in the columns that are usually
 filled with values from \code{y}. The default is \code{FALSE} so that only rows with
 data from both \code{x} and \code{y} are included in the output.}
 \item{all.y}{logical; analogous to \code{all.x} above.}
@@ -77,7 +78,7 @@ set to \code{TRUE}.
 }
 
 \seealso{
-\code{\link{data.table}}, \code{\link{as.data.table}}, \code{\link{[.data.table}},
+\code{\link{data.table}}, \code{\link{setkey}}, \code{\link{[.data.table}},
 \code{\link{merge.data.frame}}
 }
 

From b72378274c854bf39d5ff3d4b986e6db9641f035 Mon Sep 17 00:00:00 2001
From: Matt Dowle
Date: Mon, 2 Aug 2021 13:20:17 -0600
Subject: [PATCH 08/12] remove paragraph about dup names in R 3.4.3: seems it
 was fixed in R 3.5.0. And the nabble link doesn't seem to be working either,
 so removing the paragraph solves that too.

---
 man/merge.Rd | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/man/merge.Rd b/man/merge.Rd
index 125a061eb6..8f35fa1e32 100644
--- a/man/merge.Rd
+++ b/man/merge.Rd
@@ -63,12 +63,6 @@ matched against \code{NA} (and \code{NaN} against \code{NaN}) while merging.
 For a more \code{data.table}-centric way of merging two \code{data.table}s,
 see \code{\link{[.data.table}}; e.g., \code{x[y, \dots]}. See FAQ 1.11 for a detailed
 comparison of \code{merge} and \code{x[y, \dots]}.
-
-If any column names provided to \code{by.x} also occur in \code{names(y)} but not in \code{by.y},
-then this \code{data.table} method will add the \code{suffixes} to those column names. As of
-R v3.4.3, the \code{data.frame} method will not (leading to duplicate column names in the result) but a patch has
-been proposed (see r-devel thread \href{https://r.789695.n4.nabble.com/Duplicate-column-names-created-by-base-merge-when-by-x-has-the-same-name-as-a-column-in-y-td4748345.html}{here})
-which is looking likely to be accepted for a future version of R.
 }
 
 \value{

From b9d3b5e72802cfb17e591d44329d63c2c56ae4df Mon Sep 17 00:00:00 2001
From: Matt Dowle
Date: Mon, 2 Aug 2021 13:29:43 -0600
Subject: [PATCH 09/12] 'row order is specified' added to the description
 since the 'except' only mentioned choosing merge columns

---
 man/merge.Rd | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/man/merge.Rd b/man/merge.Rd
index 8f35fa1e32..6fcbc10866 100644
--- a/man/merge.Rd
+++ b/man/merge.Rd
@@ -4,8 +4,8 @@
 \title{Merge two data.tables}
 \description{
 Fast merge of two \code{data.table}s. The \code{data.table} method behaves
-very similarly to that of \code{data.frame}s except that, by default, it chooses
-the columns to merge on:
+similarly to \code{data.frame} except that row order is specified, and by
+default the columns to merge on are chosen:
 
 \itemize{
   \item at first based on the shared key columns, and if there are none,

From 053be04090b61c5124a5fc967725213b26d7f1c0 Mon Sep 17 00:00:00 2001
From: Matt Dowle
Date: Mon, 2 Aug 2021 16:53:49 -0600
Subject: [PATCH 10/12] add why NAs first to ?setkey, #2594

---
 man/setkey.Rd | 27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/man/setkey.Rd b/man/setkey.Rd
index daf10c83ad..42213a66f5 100644
--- a/man/setkey.Rd
+++ b/man/setkey.Rd
@@ -9,16 +9,23 @@
 \title{ Create key on a data.table }
 \description{
 \code{setkey} sorts a \code{data.table} and marks it as sorted with an
-attribute \code{sorted}. The sorted columns are the key. The key can be any
-number of columns. The columns are always sorted in \emph{ascending} order. The table
-is changed \emph{by reference} and \code{setkey} is very memory efficient.
-
-There are three reasons \code{setkey} is desirable: i) binary search and joins are faster
-when they detect they can use an existing key, ii) grouping by a leading subset of the key
-columns is faster because the groups are already gathered contiguously in RAM, iii)
-simpler shorter syntax; e.g. \code{DT["id",]} finds the group "id" in the first column
-of DT's key using binary search. It may be helpful to think of a key as
-super-charged rownames: multi-column and multi-type rownames.
+attribute \code{"sorted"}. The sorted columns are the key. The key can be any
+number of columns. The data is always sorted in \emph{ascending} order with \code{NA}s
+(if any) always first. The table is changed \emph{by reference} and there is
+no memory used for the key (other than marking which columns the data is sorted by).
+
+There are three reasons \code{setkey} is desirable:
+\itemize{
+  \item binary search and joins are faster when they detect they can use an existing key
+  \item grouping by a leading subset of the key columns is faster because the groups are already gathered contiguously in RAM
+  \item simpler shorter syntax; e.g. \code{DT["id",]} finds the group "id" in the first column of \code{DT}'s key using binary search. It may be helpful to think of a key as super-charged rownames: multi-column and multi-type.
+}
+
+\code{NA}s are always first because:
+\itemize{
+  \item \code{NA} is internally \code{INT_MIN} (a large negative number) in R. Keys and indexes are always in increasing order so if \code{NA}s are first, no special treatment or branch is needed in many \code{data.table} internals involving binary search. It is not optional to place \code{NA}s last for speed, simplicity and robustness of internals at C level.
+  \item if any \code{NA}s are present then we believe it is better to display them up front rather than hiding them at the end to reduce the risk of not realizing \code{NA}s are present.
+}
 
 In \code{data.table} parlance, all \code{set*} functions change their input
 \emph{by reference}. That is, no copy is made at all other than for temporary

From e11ec510e7f673103c95d4413610eb85721e597f Mon Sep 17 00:00:00 2001
From: Matt Dowle
Date: Mon, 2 Aug 2021 16:54:47 -0600
Subject: [PATCH 11/12] added news note

---
 NEWS.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/NEWS.md b/NEWS.md
index 1ce72afd52..eb426a6d66 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -185,6 +185,8 @@
 
 11. `melt()`'s internal C code is now more memory efficient, [#5054](https://github.com/Rdatatable/data.table/pull/5054). Thanks to Toby Dylan Hocking for the PR.
 
+12. `?merge` and `?setkey` have been updated to clarify that the row order is retained when `sort=FALSE`, and why `NA`s are always first when `sort=TRUE`, [#2574](https://github.com/Rdatatable/data.table/issues/2574) [#2594](https://github.com/Rdatatable/data.table/issues/2594). Thanks to Davor Josipovic and Markus Bonsch for the reports, and Jan Gorecki for the PR.
+
 
 # data.table [v1.14.0](https://github.com/Rdatatable/data.table/milestone/23?closed=1) (21 Feb 2021)
 

From c5e68be7070cb8def5fe6e16dbe03bc5acecad29 Mon Sep 17 00:00:00 2001
From: Matt Dowle
Date: Mon, 2 Aug 2021 16:58:03 -0600
Subject: [PATCH 12/12] add parens to leave no doubt what the 'to reduce'
 refers to

---
 man/setkey.Rd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/man/setkey.Rd b/man/setkey.Rd
index 42213a66f5..ca386c58d7 100644
--- a/man/setkey.Rd
+++ b/man/setkey.Rd
@@ -24,7 +24,7 @@ There are three reasons \code{setkey} is desirable:
 \code{NA}s are always first because:
 \itemize{
   \item \code{NA} is internally \code{INT_MIN} (a large negative number) in R. Keys and indexes are always in increasing order so if \code{NA}s are first, no special treatment or branch is needed in many \code{data.table} internals involving binary search. It is not optional to place \code{NA}s last for speed, simplicity and robustness of internals at C level.
-  \item if any \code{NA}s are present then we believe it is better to display them up front rather than hiding them at the end to reduce the risk of not realizing \code{NA}s are present.
+  \item if any \code{NA}s are present then we believe it is better to display them up front (rather than hiding them at the end) to reduce the risk of not realizing \code{NA}s are present.
 }
 
 In \code{data.table} parlance, all \code{set*} functions change their input
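
A note for reviewers, not part of any patch: the row-order and `NA`-placement behaviour that the series documents can be checked with a short sketch like the one below. It assumes `data.table` is installed; the tables `x`, `y` and `dt` and their column names are made up for illustration and do not appear in the package's own examples.

```r
library(data.table)

# Hypothetical inputs: ids deliberately out of order, with one unmatched row per side
x <- data.table(id = c(3L, 1L, 2L), xv = c("c", "a", "b"))
y <- data.table(id = c(2L, 4L, 3L), yv = c("B", "D", "C"))

# sort=TRUE (default): the result is keyed by the `by` column, so rows are in key order
sorted <- merge(x, y, by = "id", all = TRUE)
key(sorted)    # "id"
sorted$id      # 1 2 3 4

# sort=FALSE: the row order of x is retained (the unmatched x row stays in place),
# followed by y rows that don't match x, in the order they appear in y
unsorted <- merge(x, y, by = "id", all = TRUE, sort = FALSE)
unsorted$id    # 3 1 2 4

# setkey() sorts ascending and always places NAs first
dt <- data.table(k = c(2L, NA, 1L))
setkey(dt, k)
dt$k           # NA 1 2
```

With `sort=FALSE` the result carries no key, which is why the order of `x` can be preserved; with the default the key is set on the `by` columns and the rows follow it.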