Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
40 changes: 21 additions & 19 deletions man/dcast.data.table.Rd
Original file line number Diff line number Diff line change
Expand Up @@ -3,14 +3,9 @@
\alias{dcast}
\title{Fast dcast for data.table}
\description{
\code{dcast.data.table} is a much faster version of \code{reshape2::dcast}, but for \code{data.table}s. More importantly, it is capable of handling very large data quite efficiently in terms of memory usage in comparison to \code{reshape2::dcast}.

From 1.9.6, \code{dcast} is implemented as an S3 generic in \code{data.table}. To melt or cast \code{data.table}s, it is not necessary to load \code{reshape2} any more. If you have load \code{reshape2}, do so before loading \code{data.table} to prevent unwanted masking.

\bold{NEW}: \code{dcast.data.table} can now cast multiple \code{value.var} columns and also accepts multiple functions to \code{fun.aggregate}. See Examples for more.
\code{dcast.data.table} is \code{data.table}'s long-to-wide reshaping tool. It is designed as an enhancement to \code{reshape2::dcast}; in the spirit of \code{data.table}, it is very fast and memory efficient, making it well-suited to handling large data sets in RAM. More importantly, it is capable of handling very large data quite efficiently in terms of memory usage vis-a-vis \code{reshape2::dcast}. \code{dcast.data.table} can also cast multiple \code{value.var} columns and accepts multiple functions to \code{fun.aggregate}. See Examples for more.
}

% \method{dcast}{data.table}
\usage{
\method{dcast}{data.table}(data, formula, fun.aggregate = NULL, sep = "_",
\dots, margins = NULL, subset = NULL, fill = NULL,
Expand All @@ -22,48 +17,52 @@
\item{formula}{A formula of the form LHS ~ RHS to cast, see Details.}
\item{fun.aggregate}{Should the data be aggregated before casting? If the formula doesn't identify a single observation for each cell, then aggregation defaults to \code{length} with a message.

\bold{NEW}: it is possible to provide a list of functions to \code{fun.aggregate}. See Examples. }
To use multiple aggregation functions, pass a \code{list}; see Examples. }
\item{sep}{Character vector of length 1, indicating the separating character in variable names generated during casting. Default is \code{_} for backwards compatibility. }
\item{\dots}{Any other arguments that may be passed to the aggregating function.}
\item{margins}{Not implemented yet. Should take variable names to compute margins on. A value of \code{TRUE} would compute all margins.}
\item{subset}{Specified if casting should be done on a subset of the data. Ex: \code{subset = .(col1 <= 5)} or \code{subset = .(variable != "January")}.}
\item{fill}{Value with which to fill missing cells. If \code{fun.aggregate} is present, takes the value by applying the function on a 0-length vector.}
\item{drop}{\code{FALSE} will cast by including all missing combinations.

\bold{NEW:} Following \href{https://github.com/Rdatatable/data.table/issues/1512}{#1512}, \code{c(FALSE, TRUE)} will only include all missing combinations of formula \code{LHS}. And \code{c(TRUE, FALSE)} will only include all missing combinations of formula RHS. See Examples.}
\code{c(FALSE, TRUE)} will only include all missing combinations of formula \code{LHS}; \code{c(TRUE, FALSE)} will only include all missing combinations of formula RHS. See Examples.}

\item{value.var}{Name of the column whose values will be filled to cast. Function `guess()` tries to, well, guess this column automatically, if none is provided.
\item{value.var}{Name of the column whose values will be filled to cast. Function \code{guess()} tries to, well, guess this column automatically, if none is provided.

\bold{NEW}: it is now possible to cast multiple \code{value.var} columns simultaneously. See Examples. }
Cast multiple \code{value.var} columns simultaneously by passing their names as a \code{character} vector. See Examples. }
\item{verbose}{Not used yet. May be dropped in the future or used to provide informative messages through the console.}
}
\details{
The cast formula takes the form \code{LHS ~ RHS}, ex: \code{var1 + var2 ~ var3}. The order of entries in the formula is essential. There are two special variables: \code{.} and \code{\dots}. \code{.} represents no variable; \code{\dots} represents all variables not otherwise mentioned in \code{formula}; see Examples.
The cast formula takes the form \code{LHS ~ RHS}, ex: \code{var1 + var2 ~ var3}. The order of entries in the formula is essential. There are two special variables: \code{.} represents no variable, while \code{\dots} represents all variables not otherwise mentioned in \code{formula}; see Examples.

When not all combinations of LHS & RHS values are present in the data, some or all (in accordance with \code{drop}) missing combinations will replaced with the value specified by \code{fill}. Note that \code{fill} will be converted to the class of \code{value.var}; see Examples.

\code{dcast} also allows \code{value.var} columns of type \code{list}.

When variable combinations in \code{formula} doesn't identify a unique value in a cell, \code{fun.aggregate} will have to be specified, which defaults to \code{length} if unspecified. The aggregating function should take a vector as input and return a single value (or a list of length one) as output. In cases where \code{value.var} is a list, the function should be able to handle a list input and provide a single value or list of length one as output.
When variable combinations in \code{formula} don't identify a unique value, \code{fun.aggregate} will have to be specified, which defaults to \code{length}. For the formula \code{var1 ~ var2}, this means there are some \code{(var1, var2)} combinations in the data corresponding to multiple rows (i.e. \code{x} is not unique by \code{(var1, var2)}.

The aggregating function should take a vector as input and return a single value (or a list of length one) as output. In cases where \code{value.var} is a list, the function should be able to handle a list input and provide a single value or list of length one as output.

If the formula's LHS contains the same column more than once, ex: \code{dcast(DT, x+x~ y)}, then the answer will have duplicate names. In those cases, the duplicate names are renamed using \code{make.unique} so that key can be set without issues.

Names for columns that are being cast are generated in the same order (separated by an underscore, \code{_}) from the (unique) values in each column mentioned in the formula RHS.

From \code{v1.9.4}, \code{dcast} tries to preserve attributes wherever possible.

\bold{NEW}: From \code{v1.9.6}, it is possible to cast multiple \code{value.var} columns and also cast by providing multiple \code{fun.aggregate} functions. Multiple \code{fun.aggregate} functions should be provided as a \code{list}, for e.g., \code{list(mean, sum, function(x) paste(x, collapse="")}. \code{value.var} can be either a character vector or list of length=1, or a list of length equal to \code{length(fun.aggregate)}. When \code{value.var} is a character vector or a list of length 1, each function mentioned under \code{fun.aggregate} is applied to every column specified under \code{value.var} column. When \code{value.var} is a list of length equal to \code{length(fun.aggregate)} each element of \code{fun.aggregate} is applied to each element of \code{value.var} column.
From \code{v1.9.6}, it is possible to cast multiple \code{value.var} columns and also cast by providing multiple \code{fun.aggregate} functions. Multiple \code{fun.aggregate} functions should be provided as a \code{list}, for e.g., \code{list(mean, sum, function(x) paste(x, collapse="")}. \code{value.var} can be either a character vector or list of length one, or a list of length equal to \code{length(fun.aggregate)}. When \code{value.var} is a character vector or a list of length one, each function mentioned under \code{fun.aggregate} is applied to every column specified under \code{value.var} column. When \code{value.var} is a list of length equal to \code{length(fun.aggregate)} each element of \code{fun.aggregate} is applied to each element of \code{value.var} column.

}
\value{
A keyed \code{data.table} that has been cast. The key columns are equal to the variables in the \code{formula} LHS in the same order.
}

\examples{
require(data.table)
names(ChickWeight) <- tolower(names(ChickWeight))
ChickWeight = as.data.table(ChickWeight)
setnames(ChickWeight, tolower(names(ChickWeight)))
DT <- melt(as.data.table(ChickWeight), id=2:4) # calls melt.data.table

# dcast is a S3 method in data.table from v1.9.6
dcast(DT, time ~ variable, fun=mean)
# dcast is an S3 method in data.table from v1.9.6
dcast(DT, time ~ variable, fun=mean) # using partial matching of argument
dcast(DT, diet ~ variable, fun=mean)
dcast(DT, diet+chick ~ time, drop=FALSE)
dcast(DT, diet+chick ~ time, drop=FALSE, fill=0)
Expand Down Expand Up @@ -93,6 +92,10 @@ dcast(DT, v1 + v2 + v3 ~ ., value.var = "v4")
## for each combination of (v1, v2), add up all values of v4
dcast(DT, v1 + v2 ~ ., value.var = "v4", fun.aggregate = sum)

# fill and types
dcast(DT, v2 ~ v3, value.var = 'v1', fill = 0L) # 0L --> 0
dcast(DT, v2 ~ v3, value.var = 'v4', fill = 1.1) # 1.1 --> 1L

\dontrun{
# benchmark against reshape2's dcast, minimum of 3 runs
set.seed(45)
Expand All @@ -105,7 +108,7 @@ system.time(dcast(DT, bb ~ cc, fun=mean)) # 0.04 seconds
system.time(dcast(DT, aa + bb ~ cc, fun=sum)) # 1.2 seconds
}

# NEW FEATURE - multiple value.var and multiple fun.aggregate
# multiple value.var and multiple fun.aggregate
DT = data.table(x=sample(5,20,TRUE), y=sample(2,20,TRUE),
z=sample(letters[1:2], 20,TRUE), d1 = runif(20), d2=1L)
# multiple value.var
Expand All @@ -121,4 +124,3 @@ dcast(DT, x + y ~ z, fun=list(sum, mean), value.var=list("d1", "d2"))
\code{\link{melt.data.table}}, \code{\link{rowid}}, \url{https://cran.r-project.org/package=reshape}
}
\keyword{data}