diff --git a/NAMESPACE b/NAMESPACE
index 00cb51a4cb..81c0fce689 100644
--- a/NAMESPACE
+++ b/NAMESPACE
@@ -57,6 +57,7 @@ export(setnafill)
 export(.Last.updated)
 export(fcoalesce)
 export(substitute2)
+export(DT) # mtcars |> DT(i,j,by) #4872
 
 S3method("[", data.table)
 S3method("[<-", data.table)
diff --git a/NEWS.md b/NEWS.md
index 48192c232b..769b98dafa 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -109,6 +109,12 @@
 
 21. `melt()` was pseudo generic in that `melt(DT)` would dispatch to the `melt.data.table` method but `melt(not-DT)` would explicitly redirect to `reshape2`. Now `melt()` is standard generic so that methods can be developed in other packages, [#4864](https://github.com/Rdatatable/data.table/pull/4864). Thanks to @odelmarcelle for suggesting and implementing.
 
+22. `DT(i, j, by, ...)` has been added, i.e. functional form of a `data.table` query, [#641](https://github.com/Rdatatable/data.table/issues/641) [#4872](https://github.com/Rdatatable/data.table/issues/4872). Thanks to Yike Lu and Elio Campitelli for filing requests, many others for comments and suggestions, and Matt Dowle for the PR. This enables the `data.table` general form query to be invoked on a `data.frame` without converting it to a `data.table` first. The class of the input object is retained.
+
+    ```R
+    mtcars |> DT(mpg>20, .(mean_hp=mean(hp)), by=cyl)
+    ```
+
 ## BUG FIXES
 
 1. `by=.EACHI` when `i` is keyed but `on=` different columns than `i`'s key could create an invalidly keyed result, [#4603](https://github.com/Rdatatable/data.table/issues/4603) [#4911](https://github.com/Rdatatable/data.table/issues/4911). Thanks to @myoung3 and @adamaltmejd for reporting, and @ColeMiller1 for the PR. An invalid key is where a `data.table` is marked as sorted by the key columns but the data is not sorted by those columns, leading to incorrect results from subsequent queries.
diff --git a/R/data.table.R b/R/data.table.R
index c06c60e817..504eb49cc7 100644
--- a/R/data.table.R
+++ b/R/data.table.R
@@ -846,10 +846,10 @@ replace_dot_alias = function(e) {
       if (!is.na(nomatch)) irows = irows[irows!=0L] # TO DO: can be removed now we have CisSortedSubset
       if (length(allbyvars)) { ############### TO DO TO DO TO DO ###############
         if (verbose) catf("i clause present and columns used in by detected, only these subset: %s\n", brackify(allbyvars))
-        xss = x[irows,allbyvars,with=FALSE,nomatch=nomatch,mult=mult,roll=roll,rollends=rollends]
+        xss = `[.data.table`(x,irows,allbyvars,with=FALSE,nomatch=nomatch,mult=mult,roll=roll,rollends=rollends)
       } else {
         if (verbose) catf("i clause present but columns used in by not detected. Having to subset all columns before evaluating 'by': '%s'\n", deparse(by))
-        xss = x[irows,nomatch=nomatch,mult=mult,roll=roll,rollends=rollends]
+        xss = `[.data.table`(x,irows,nomatch=nomatch,mult=mult,roll=roll,rollends=rollends)
       }
       if (bysub %iscall% ':' && length(bysub)==3L) {
         byval = eval(bysub, setattr(as.list(seq_along(xss)), 'names', names(xss)), parent.frame())
@@ -1910,6 +1910,8 @@ replace_dot_alias = function(e) {
   setalloccol(ans) # TODO: overallocate in dogroups in the first place and remove this line
 }
 
+DT = `[.data.table` #4872
+
 .optmean = function(expr) { # called by optimization of j inside [.data.table only. Outside for a small speed advantage.
   if (length(expr)==2L) # no parameters passed to mean, so defaults of trim=0 and na.rm=FALSE
     return(call(".External",quote(Cfastmean),expr[[2L]], FALSE))
diff --git a/man/data.table.Rd b/man/data.table.Rd
index 7418d12118..d1ab11a924 100644
--- a/man/data.table.Rd
+++ b/man/data.table.Rd
@@ -5,6 +5,7 @@
 \alias{Ops.data.table}
 \alias{is.na.data.table}
 \alias{[.data.table}
+\alias{DT}
 \alias{.}
 \alias{.(}
 \alias{.()}
@@ -217,6 +218,8 @@ The way to read this out loud is: "Take \code{DT}, subset rows by \code{i}, \emp
 # see ?assign to add/update/delete columns by reference using the same consistent interface
 }
 
+A \code{data.table} query may be invoked on a \code{data.frame} using functional form \code{DT(...)}, see examples. The class of the input is retained.
+
 A \code{data.table} is a \code{list} of vectors, just like a \code{data.frame}. However :
 \enumerate{
 \item it never has or uses rownames. Rownames based indexing can be done by setting a \emph{key} of one or more columns or done \emph{ad-hoc} using the \code{on} argument (now preferred).
@@ -431,6 +434,11 @@ dev.off()
 # using rleid, get max(y) and min of all cols in .SDcols for each consecutive run of 'v'
 DT[, c(.(y=max(y)), lapply(.SD, min)), by=rleid(v), .SDcols=v:b]
 
+# functional query DT(...)
+if (getRversion() >= "4.1.0") { # native pipe |> new in R 4.1.0
+  mtcars |> DT(mpg>20, .(mean_hp=mean(hp)), by=cyl)
+}
+
 # Support guide and links:
 # https://github.com/Rdatatable/data.table/wiki/Support
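
Note on the `R/data.table.R` hunk: the two internal `x[irows, ...]` calls are replaced by direct calls to `` `[.data.table`(x, irows, ...) ``. This is presumably because `x` may now be a plain `data.frame`, in which case `x[...]` would dispatch to `[.data.frame` instead of the data.table method. The base-R sketch below illustrates only that dispatch mechanism. The class name `tbl_demo`, its method, and the `demo_result` tag are hypothetical and are not part of data.table.

```r
# Hypothetical S3 class 'tbl_demo', used only to illustrate dispatch.
`[.tbl_demo` = function(x, i, ...) {
  # Tag the result so it is visible which method ran.
  structure(list(rows = i, via = "tbl_demo method"), class = "demo_result")
}

df = data.frame(a = 1:3, b = letters[1:3])

# Ordinary subsetting dispatches on class(df), so `[.data.frame` runs.
res_dispatch = df[2L, ]

# Calling the method function directly bypasses dispatch,
# even though df does not inherit from 'tbl_demo'.
res_direct = `[.tbl_demo`(df, 2L)

class(res_dispatch)  # "data.frame"
res_direct$via       # "tbl_demo method"
```

The same reasoning would apply to `DT = `[.data.table``: once the method is reachable under an ordinary function name, any recursive subsetting inside it has to name the method explicitly rather than rely on the class of `x`.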