diff --git a/NEWS.md b/NEWS.md index 57d45c2a81..40ffa071c4 100644 --- a/NEWS.md +++ b/NEWS.md @@ -6,6 +6,8 @@ 1. `:=` no longer recycles length>1 RHS vectors. There was a warning when recycling left a remainder but no warning when the LHS length was an exact multiple of the RHS length (the same behaviour as base R). Consistent feedback for several years has been that recycling is more often a bug. In rare cases where you need to recycle a length>1 vector, use `rep()` explicitly. Single values are still recycled silently as before. Early warning was given in [this tweet](https://twitter.com/MattDowle/status/1088544083499311104). The 758 CRAN and Bioconductor packages using data.table were tested and the maintainers of the 16 packages affected (2%) were consulted before going ahead, [#3310](https://github.com/Rdatatable/data.table/pull/3310). +2. `as.IDate.default` and `as.ITime.default` gain `use_lookup` argument, default `'auto'`, [#2603](https://github.com/Rdatatable/data.table/issues/2603). In data sets with millions of observations, there are almost universally a much smaller number of unique dates in any given column. In this case, it is much more efficient to parse only the unique set of dates and use this as a lookup table for the full vector; `use_lookup` implements this logic. By default this is done for any conversion of at least 1,000 observations. + #### BUG FIXES 1. `rbindlist()` of a malformed factor missing levels attribute is now a helpful error rather than a cryptic error about `STRING_ELT`, [#3315](https://github.com/Rdatatable/data.table/issues/3315). Thanks to Michael Chirico for reporting. diff --git a/R/IDateTime.R b/R/IDateTime.R index 20fece4bdf..2f09491060 100644 --- a/R/IDateTime.R +++ b/R/IDateTime.R @@ -5,9 +5,17 @@ as.IDate <- function(x, ...) UseMethod("as.IDate") -as.IDate.default <- function(x, ..., tz = attr(x, "tzone")) { +as.IDate.default <- function(x, ..., tz = attr(x, "tzone"), use_lookup = 'auto') { if (is.null(tz)) tz = "UTC" - as.IDate(as.Date(x, tz = tz, ...)) + if (isTRUE(use_lookup) || (use_lookup == 'auto' && length(x) >= 1000L)) { + DT = list(input = x) + setDT(DT) + + lookup = unique(DT) + lookup[ , 'IDate' := as.IDate(as.Date(input, tz = tz, ...))] + lookup[DT, on = 'input']$IDate + } else { + as.IDate(as.Date(x, tz = tz, ...)) } as.IDate.numeric <- function(x, origin = "1970-01-01", ...) { @@ -111,8 +119,17 @@ round.IDate <- function (x, digits=c("weeks", "months", "quarters", "years"), .. as.ITime <- function(x, ...) UseMethod("as.ITime") -as.ITime.default <- function(x, ...) { - as.ITime(as.POSIXlt(x, ...), ...) +as.ITime.default <- function(x, ..., use_lookup = 'auto') { + if (isTRUE(use_lookup) || (use_lookup == 'auto' && length(x) >= 1000)) { + DT = list(input = x) + setDT(DT) + + lookup = unique(DT) + lookup[ , 'ITime' := as.ITime(as.POSIXlt(input, ...), ...)] + lookup[DT, on = 'input']$ITime + } else { + as.ITime(as.POSIXlt(x, ...), ...) + } } as.ITime.POSIXct <- function(x, tz = attr(x, "tzone"), ...) { diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw index c0e894b989..35ff5c5dd0 100644 --- a/inst/tests/tests.Rraw +++ b/inst/tests/tests.Rraw @@ -13414,6 +13414,18 @@ DT = data.table(a = c(1:3, 3:1)) test(1984.38, rowidv(DT, prefix = 5L), error='must be NULL or a character vector') test(1984.39, rowidv(DT, prefix = c('hey','you')), error='must be NULL or a character vector') +# #2306, use_merge argument of as.IDate.default +char_date = as.character(.Date(17532:17896)) +test(1985.1, as.IDate(char_date, use_lookup = TRUE), + as.IDate(char_date, use_lookup = FALSE)) +test(1985.2, as.IDate(char_date, use_lookup = 'auto'), + as.IDate(char_date, use_lookup = FALSE)) + +char_time = format(.POSIXct(1549503263:1549503312, 'UTC'), '%T') +test(1985.3, as.ITime(char_time, use_lookup = TRUE), + as.ITime(char_time, use_lookup = FALSE)) +test(1985.4, as.ITime(char_time, use_lookup = 'auto'), + as.ITime(char_time, use_lookup = FALSE)) ################################### # Add new tests above this line # diff --git a/man/IDateTime.Rd b/man/IDateTime.Rd index 6dae27647e..4822b52ca8 100644 --- a/man/IDateTime.Rd +++ b/man/IDateTime.Rd @@ -48,14 +48,14 @@ } \usage{ as.IDate(x, \dots) -\method{as.IDate}{default}(x, \dots, tz = attr(x, "tzone")) +\method{as.IDate}{default}(x, \dots, tz = attr(x, "tzone"), use_lookup = 'auto') \method{as.IDate}{Date}(x, \dots) \method{as.Date}{IDate}(x, \dots) \method{as.POSIXct}{IDate}(x, tz = "UTC", time = 0, \dots) \method{round}{IDate}(x, digits = c("weeks", "months", "quarters","years"), \ldots) as.ITime(x, \dots) -\method{as.ITime}{default}(x, \dots) +\method{as.ITime}{default}(x, \dots, use_lookup = 'auto') \method{as.ITime}{POSIXlt}(x, ms = 'truncate', \dots) \method{as.POSIXct}{ITime}(x, tz = "UTC", date = Sys.Date(), \dots) @@ -85,6 +85,7 @@ year(x) \code{as.IDate.default}, arguments are passed to \code{as.Date}. For \code{as.ITime.default}, arguments are passed to \code{as.POSIXlt}.} \item{tz}{time zone (see \code{strptime}).} + \item{use_lookup}{ \code{logical}; should the parsing be done by via merge? See Details. } \item{date}{date object convertable with \code{as.IDate}.} \item{time}{time-of-day object convertable with \code{as.ITime}.} \item{digits}{really \code{units}; one of the units listed for rounding. May be abbreviated.} @@ -96,6 +97,17 @@ internal representation as the \code{Date} class, except the storage mode is integer. \code{IDate} is a relatively simple wrapper, and it should work in almost all situations as a replacement for \code{Date}. +The \code{use_merge} argument to \code{as.IDate.default} and \code{as.ITime.default} was implemented +based on the observation that the number of unique dates/times is +typically a small percentage of the number of total observations -- as such, +it is inefficient to repeatedly parse the same date/time potentially thousands or +millions of times. Instead, when \code{use_merge = TRUE}, conversion is done +using parsing on the set of unique dates/times, which is then merged to the complete +set of dates/times. By default, \code{use_merge = 'auto'}, in which case this method +will be applied when \code{length(x) >= 1000} since the overhead of merging +was found to exceed the efficiency gains of merging in small data sets. As a +rule, the performance improves with \code{1 - uniqueN(x)/length(x)}. + Functions that use \code{Date} objects generally work for \code{IDate} objects. This package provides specific methods for \code{IDate} objects for \code{mean}, \code{cut}, \code{seq}, \code{c},