I was recently profiling a data.table-heavy R script that involved many joins and linear scans on tables of 35000 to 65000 rows. I was surprised to see dim.data.table as the top contributor to the time spent. The reason was quickly identified: the problem in the following implementation of dim.data.table is hidden in x[[1L]]:
dim.data.table <- function(x) {
    if (length(x)) c(length(x[[1L]]), length(x))
    else c(0L,0L)
    # TO DO: consider placing "dim" as an attibute updated on inserts. Saves this 'if'.
}
Behind the scenes, x[[1L]] calls [[.data.frame, which is rather awful in terms of performance:
> `[[.data.frame`
function (x, ..., exact = TRUE)
{
    na <- nargs() - (!missing(exact))
    if (!all(names(sys.call()) %in% c("", "exact")))
        warning("named arguments other than 'exact' are discouraged")
    if (na < 3L)
        (function(x, i, exact) if (is.matrix(i))
            as.matrix(x)[[i]]
        else .subset2(x, i, exact = exact))(x, ..., exact = exact)
    else {
        col <- .subset2(x, ..2, exact = exact)
        i <- if (is.character(..1))
            pmatch(..1, row.names(x), duplicates.ok = TRUE)
        else ..1
        col[[i, exact = exact]]
    }
}
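To get a rough feel for the overhead, here is a minimal base R sketch that times the method against a direct .subset2 call. The data frame `df`, its size, and the repetition count `n_rep` are made up for illustration; absolute timings will vary by machine and R version.

```r
# Illustrative data only; not taken from the profiled script.
df <- data.frame(a = seq_len(50000L), b = rnorm(50000L))
n_rep <- 100000L

# Each iteration goes through S3 dispatch and the checks in `[[.data.frame`.
t_method <- system.time(for (k in seq_len(n_rep)) df[[1L]])

# Each iteration extracts the column directly, with no method dispatch.
t_subset2 <- system.time(for (k in seq_len(n_rep)) .subset2(df, 1L))

# Both approaches return the same column.
same_result <- identical(df[[1L]], .subset2(df, 1L))

elapsed <- c(method = t_method[["elapsed"]], subset2 = t_subset2[["elapsed"]])
print(elapsed)
```

The per-call cost does not depend on the number of rows, so it only becomes visible when the call sits on a hot path, as it did in my profile.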
The second-largest contributor was alloc.col, which suffered from the same problem:
alloc.col <- function(DT, n=getOption("datatable.alloccol"), verbose=getOption("datatable.verbose"))
{
    ...
    for (i in seq_along(ans)) {
        # clear the same excluded by copyMostAttrib(). Primarily for data.table and as.data.table, but added here centrally (see #4890).
        setattr(ans[[i]],"names",NULL)
        setattr(ans[[i]],"dim",NULL)
        setattr(ans[[i]],"dimnames",NULL)
    }
    ...
}
That is three [[.data.frame calls per column.
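One possible way to sidestep the dispatch, sketched here in base R only, is to extract the column with .subset2. The function name `dim_fast` is hypothetical and this is not a proposed patch, just an illustration of the idea on a plain data.frame; I am assuming the same substitution would carry over to a data.table.

```r
# Hypothetical helper: same logic as dim.data.table above,
# but the first column is fetched with .subset2() instead of `[[`.
dim_fast <- function(x) {
  if (length(x)) c(length(.subset2(x, 1L)), length(x))
  else c(0L, 0L)
}

# Illustrative data frames to compare against base dim().
df <- data.frame(a = 1:5, b = letters[1:5])
empty <- data.frame()

res_df <- dim_fast(df)
res_empty <- dim_fast(empty)
print(res_df)
print(res_empty)
```

In the alloc.col loop the analogous change would presumably be to fetch the column once per iteration with .subset2(ans, i) and pass that to the three setattr calls.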