diff --git a/NAMESPACE b/NAMESPACE index 50333f201a..193b4dedae 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -81,9 +81,14 @@ S3method(as.data.table, default) S3method(as.data.frame, data.table) S3method(as.list, data.table) S3method(as.matrix, data.table) -#S3method(cbind, data.table) -#S3method(rbind, data.table) -export(.rbind.data.table) +if (getRversion() >= "4.0.0") { + # this version number must be the same as in .onLoad + # fix in R in Sep 2019 (#3948) makes c|rbind S3 dispatch work; see FAQ 2.24. + # if we register these methods always though, the previous workaround no longer works in R<4.0.0. Hence only register in R>=4.0.0. + S3method(cbind, data.table) + S3method(rbind, data.table) +} +export(.rbind.data.table) # continue to export for now because it has been exported in the past so it may be depended on S3method(dim, data.table) S3method(dimnames, data.table) S3method("dimnames<-", data.table) diff --git a/NEWS.md b/NEWS.md index a92b241056..507a35d833 100644 --- a/NEWS.md +++ b/NEWS.md @@ -4,8 +4,6 @@ # data.table [v1.12.5](https://github.com/Rdatatable/data.table/milestone/18) (in development) -## NEW FEATURES - ## BUG FIXES 1. `shift()` on a `nanotime` with the default `fill=NA` now fills a `nanotime` missing value correctly, [#3945](https://github.com/Rdatatable/data.table/issues/3945). Thanks to @mschubmehl for reporting and fixing in PR [#3942](https://github.com/Rdatatable/data.table/pull/3942). @@ -14,10 +12,12 @@ 3. A runtime error in `fwrite`'s compression, but only observed so far on Solaris 10 32bit with zlib 1.2.8 (Apr 2013), [#3931](https://github.com/Rdatatable/data.table/issues/3931): `Error -2: one or more threads failed to allocate buffers or there was a compression error.` In case it happens again, this area has been made more robust and the error more detailed. As is often the case, investigating the Solaris problem revealed secondary issues in the same area of the code. In this case, some `%d` in verbose output should have been `%lld`. This obliquity that CRAN's Solaris provides is greatly appreciated. -4. A leak could occur in the event of an unsupported column type error, or if working memory cannot all be allocated; [#3940](https://github.com/Rdatatable/data.table/issues/3940). Found thanks to `clang`'s Leak Sanitizer (prompted by CRAN's diligent use of latest tools), and two tests in the test suite which tested the unsupported type error. +4. A leak could occur in the event of an unsupported column type error, or if working memory could only partially be allocated; [#3940](https://github.com/Rdatatable/data.table/issues/3940). Found thanks to `clang`'s Leak Sanitizer (prompted by CRAN's diligent use of latest tools), and two tests in the test suite which tested the unsupported-type error. ## NOTES +1. Many thanks to Kurt Hornik for fixing R's S3 dispatch of `rbind` and `cbind` methods, [#3948](https://github.com/Rdatatable/data.table/issues/3948). With `R>=4.0.0` (current R-devel), `data.table` now registers the S3 methods `cbind.data.table` and `rbind.data.table`, and no longer applies the workaround documented in FAQ 2.24. + # data.table [v1.12.4](https://github.com/Rdatatable/data.table/milestone/16?closed=1) (03 Oct 2019) diff --git a/R/data.table.R b/R/data.table.R index 15cc632f4a..20fcf04e03 100644 --- a/R/data.table.R +++ b/R/data.table.R @@ -2531,15 +2531,15 @@ chgroup = function(x) { if (length(o)) as.vector(o) else seq_along(x) # as.vector removes the attributes } -.rbind.data.table = function(..., use.names=TRUE, fill=FALSE, idcol=NULL) { - # See FAQ 2.23 - # Called from base::rbind.data.frame - # fix for #1626.. because some packages (like psych) bind an input - # data.frame/data.table with a matrix.. - l = lapply(list(...), function(x) if (is.list(x)) x else as.data.table(x)) +cbind.data.table = data.table + +rbind.data.table = function(..., use.names=TRUE, fill=FALSE, idcol=NULL) { + l = lapply(list(...), function(x) if (is.list(x)) x else as.data.table(x)) #1626; e.g. psych binds a data.frame|table with a matrix rbindlist(l, use.names, fill, idcol) } +.rbind.data.table = rbind.data.table # to support R<4.0.0 and has been exported for a long time; see .onLoad and #3948 + rbindlist = function(l, use.names="check", fill=FALSE, idcol=NULL) { if (is.null(l)) return(null.data.table()) if (!is.list(l) || is.data.frame(l)) stop("Input is ", class(l)[1L]," but should be a plain list of items to be stacked") diff --git a/R/onLoad.R b/R/onLoad.R index 255e6bfb7a..ccbd307926 100644 --- a/R/onLoad.R +++ b/R/onLoad.R @@ -25,31 +25,39 @@ stop("The datatable.",dll," version (",dllV,") does not match the package (",RV,"). Please close all R sessions to release the old ",toupper(dll)," and reinstall data.table in a fresh R session. The root cause is that R's package installer can in some unconfirmed circumstances leave a package in a state that is apparently functional but where new R code is calling old C code silently: https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17478. Once a package is in this mismatch state it may produce wrong results silently until you next upgrade the package. Please help by adding precise circumstances to 17478 to move the status to confirmed. This mismatch between R and C code can happen with any package not just data.table. It is just that data.table has added this check.") } - "Please read FAQ 2.23 (vignette('datatable-faq')) which explains in detail why data.table adds one for loop to the start of base::cbind.data.frame and base::rbind.data.frame. If there is a better solution we will gladly change it." - # Commented as a character string so this message is retained and seen by anyone who types data.table:::.onLoad - tt = base::cbind.data.frame - ss = body(tt) - if (class(ss)[1L]!="{") ss = as.call(c(as.name("{"), ss)) - prefix = if (!missing(pkgname)) "data.table::" else "" # R provides the arguments when it calls .onLoad, I don't in dev/test - if (!length(grep("data.table", ss[[2L]], fixed = TRUE))) { - ss = ss[c(1L, NA, 2L:length(ss))] - ss[[2L]] = parse(text=paste0("if (!identical(class(..1),'data.frame')) for (x in list(...)) { if (inherits(x,'data.table')) return(",prefix,"data.table(...)) }"))[[1L]] - body(tt)=ss - (unlockBinding)("cbind.data.frame",baseenv()) - assign("cbind.data.frame",tt,envir=asNamespace("base"),inherits=FALSE) - lockBinding("cbind.data.frame",baseenv()) - } - tt = base::rbind.data.frame - ss = body(tt) - if (class(ss)[1L]!="{") ss = as.call(c(as.name("{"), ss)) - if (!length(grep("data.table", ss[[2L]], fixed = TRUE))) { - ss = ss[c(1L, NA, 2L:length(ss))] - ss[[2L]] = parse(text=paste0("for (x in list(...)) { if (inherits(x,'data.table')) return(",prefix,".rbind.data.table(...)) }"))[[1L]] # fix for #4995 - body(tt)=ss - (unlockBinding)("rbind.data.frame",baseenv()) - assign("rbind.data.frame",tt,envir=asNamespace("base"),inherits=FALSE) - lockBinding("rbind.data.frame",baseenv()) + # c|rbind S3 dispatch now works in R-devel from Sep 2019; #3948 and FAQ 2.24, and R-devel is 4.0.0 as of now. I wanted to test + # for the presence of the R fix here rather than hard code a version number. But in this case the S3method registration needs to + # be conditional too: registering the S3 methods in R before 4.0.0 causes this workaround to no longer work. However, the R + # syntax available to use in NAMESPACE is very limited (can't call data.table() in it in a capability test, for example). + # This version number ("4.0.0") must be precisely the same as used in NAMESPACE; see PR for #3948. + if (base::getRversion() < "4.0.0") { + # continue to support R<4.0.0 + # If R 3.6.2 (not yet released) includes the c|rbind S3 dispatch fix, then this workaround still works. + tt = base::cbind.data.frame + ss = body(tt) + if (class(ss)[1L]!="{") ss = as.call(c(as.name("{"), ss)) + prefix = if (!missing(pkgname)) "data.table::" else "" # R provides the arguments when it calls .onLoad, I don't in dev/test + if (!length(grep("data.table", ss[[2L]], fixed = TRUE))) { + ss = ss[c(1L, NA, 2L:length(ss))] + ss[[2L]] = parse(text=paste0("if (!identical(class(..1),'data.frame')) for (x in list(...)) { if (inherits(x,'data.table')) return(",prefix,"data.table(...)) }"))[[1L]] + body(tt)=ss + (unlockBinding)("cbind.data.frame",baseenv()) + assign("cbind.data.frame",tt,envir=asNamespace("base"),inherits=FALSE) + lockBinding("cbind.data.frame",baseenv()) + } + tt = base::rbind.data.frame + ss = body(tt) + if (class(ss)[1L]!="{") ss = as.call(c(as.name("{"), ss)) + if (!length(grep("data.table", ss[[2L]], fixed = TRUE))) { + ss = ss[c(1L, NA, 2L:length(ss))] + ss[[2L]] = parse(text=paste0("for (x in list(...)) { if (inherits(x,'data.table')) return(",prefix,".rbind.data.table(...)) }"))[[1L]] # fix for #4995 + body(tt)=ss + (unlockBinding)("rbind.data.frame",baseenv()) + assign("rbind.data.frame",tt,envir=asNamespace("base"),inherits=FALSE) + lockBinding("rbind.data.frame",baseenv()) + } } + # Set options for the speed boost in v1.8.0 by avoiding 'default' arg of getOption(,default=) # In fread and fwrite we have moved back to using getOption's default argument since it is unlikely fread and fread will be called in a loop many times, plus they # are relatively heavy functions where the overhead in getOption() would not be noticed. It's only really [.data.table where getOption default bit. diff --git a/vignettes/datatable-faq.Rmd b/vignettes/datatable-faq.Rmd index 38f4e618b2..84254f7790 100644 --- a/vignettes/datatable-faq.Rmd +++ b/vignettes/datatable-faq.Rmd @@ -423,28 +423,15 @@ This is an unfortunate downside to get [#869](https://github.com/Rdatatable/data ## I've noticed that `base::cbind.data.frame` (and `base::rbind.data.frame`) appear to be changed by data.table. How is this possible? Why? -It is a temporary, last resort solution until we discover a better way to solve the problems listed below. Essentially, the issue is that data.table inherits from `data.frame`, _and_ `base::cbind` and `base::rbind` (uniquely) do their own S3 dispatch internally as documented by `?cbind`. The change is adding one `for` loop to the start of each function directly in `base`; _e.g._, +It was a temporary, last resort solution before rbind and cbind S3 method dispatch was fixed in R >= 4.0.0. Essentially, the issue was that `data.table` inherits from `data.frame`, _and_ `base::cbind` and `base::rbind` (uniquely) do their own S3 dispatch internally as documented by `?cbind`. The `data.table` workaround was adding one `for` loop to the start of each function directly in `base`. That modification was made dynamically, _i.e._, the `base` definition of `cbind.data.frame` was fetched, the `for` loop added to the beginning, and then assigned back to `base`. This solution was designed to be robust to different definitions of `base::cbind.data.frame` in different versions of R, including unknown future changes. It worked well. The competing requirements were: -```{r} -base::cbind.data.frame -``` - -That modification is made dynamically, _i.e._, the `base` definition of `cbind.data.frame` is fetched, the `for` loop added to the beginning and then assigned back to `base`. This solution is intended to be robust to different definitions of `base::cbind.data.frame` in different versions of R, including unknown future changes. Again, it is a last resort until a better solution is known or made available. The competing requirements are: - - - `cbind(DT, DF)` needs to work. Defining `cbind.data.table` doesn't work because `base::cbind` does its own S3 dispatch and requires that the _first_ `cbind` method for each object it is passed is _identical_. This is not true in `cbind(DT, DF)` because the first method for `DT` is `cbind.data.table` but the first method for `DF` is `cbind.data.frame`. `base::cbind` then falls through to its internal `bind` code which appears to treat `DT` as a regular `list` and returns very odd looking and unusable `matrix` output. See [below](#cbinderror). We cannot just advise users not to call `cbind(DT, DF)` because packages such as `ggplot2` make such a call ([test 167.2](https://github.com/Rdatatable/data.table/blob/master/inst/tests/tests.Rraw#L444-L447)). + - `cbind(DT, DF)` needs to work. Defining `cbind.data.table` didn't work because `base::cbind` does its own S3 dispatch and required (before R 4.0.0) that the _first_ `cbind` method for each object it is passed is _identical_. This is not true in `cbind(DT, DF)` because the first method for `DT` is `cbind.data.table` but the first method for `DF` is `cbind.data.frame`. `base::cbind` then fell through to its internal `bind` code which appears to treat `DT` as a regular `list` and returns very odd looking and unusable `matrix` output. See [below](#cbinderror). We cannot just advise users not to call `cbind(DT, DF)` because packages such as `ggplot2` make such a call ([test 167.2](https://github.com/Rdatatable/data.table/blob/master/inst/tests/tests.Rraw#L444-L447)). - - This naturally leads to trying to mask `cbind.data.frame` instead. Since a data.table is a `data.frame`, `cbind` would find the same method for both `DT` and `DF`. However, this doesn't work either because `base::cbind` appears to find methods in `base` first; _i.e._, `base::cbind.data.frame` isn't maskable. This is reproducible as follows : - -```{r} -foo = data.frame(a = 1:3) -cbind.data.frame = function(...) cat("Not printed\n") -cbind(foo) -rm("cbind.data.frame") -``` + - This naturally led to trying to mask `cbind.data.frame` instead. Since a data.table is a `data.frame`, `cbind` would find the same method for both `DT` and `DF`. However, this didn't work either because `base::cbind` appears to find methods in `base` first; _i.e._, `base::cbind.data.frame` isn't maskable. - Finally, we tried masking `cbind` itself (v1.6.5 and v1.6.6). This allowed `cbind(DT, DF)` to work, but introduced compatibility issues with package `IRanges`, since `IRanges` also masks `cbind`. It worked if `IRanges` was lower on the `search()` path than data.table, but if `IRanges` was higher then data.table's, `cbind` would never be called and the strange-looking `matrix` output occurs again (see [below](#cbinderror)). -If you know of a better solution that still solves all the issues above, then please let us know and we'll gladly change it. +Many thanks to the R core team for fixing the issue in Sep 2019. data.table v1.12.6+ no longer applies the workaround in R >= 4.0.0. ## I've read about method dispatch (_e.g._ `merge` may or may not dispatch to `merge.data.table`) but _how_ does R know how to dispatch? Are dots significant or special? How on earth does R know which function to dispatch and when? {#r-dispatch}