
Use lookup (join to dictionary) for performance boost #2603

@ben519

Description


When I have a large dataset with a date column read in as character, I can convert it to Date type like so:

library(data.table)

set.seed(2018)
dt <- data.table(
  DateChar = as.character(as.Date("2000-01-01") + sample(10, size = 10^7, replace = TRUE))
)

# Simple conversion using as.Date
system.time(dt[, Date1 := as.Date(DateChar)])  # ~59 seconds

# Simple conversion using as.IDate
system.time(dt[, Date2 := as.IDate(DateChar)])  # ~59 seconds

But this is painfully slow. So I usually build a table of the unique date characters, convert those to Date type, and then do a join + update back into my original table:

system.time({
  uniqueDates <- unique(dt[, list(DateChar)])
  uniqueDates[, Date3 := as.Date(DateChar)]
  dt[uniqueDates, Date3 := i.Date3, on = "DateChar"]
})  # ~0.5 seconds (about 120 times speedup!)
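For what it's worth, the same dedup-then-map idea can be written as a small generic helper using `unique()` and `match()` (a sketch; `lookup_apply` is a hypothetical name, not an existing data.table function):

```r
library(data.table)

# Hypothetical helper: apply f to the unique values only, then map
# the results back onto the full vector with match().
lookup_apply <- function(x, f) {
  ux <- unique(x)
  f(ux)[match(x, ux)]
}

dt <- data.table(
  DateChar = as.character(as.Date("2000-01-01") + sample(10, size = 1e5, replace = TRUE))
)
dt[, Date3 := lookup_apply(DateChar, as.IDate)]
```

Because `as.IDate` is deterministic per input value, the result is identical to calling it on the full column, but `f` only runs on the unique values.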

This results in an enormous speedup. I was wondering if similar logic could be embedded in data.table to do this internally where it's appropriate. I haven't given it a ton of thought, but here are some basic points:

  • This could be extended to other methods like day(), weekday(), ... (I don't think you could generalize this behavior to arbitrary functions. Imagine trying to apply this technique when the user does something like dt[, foo := myfunc(foo)] where myfunc <- function(x) ifelse(x < 10, x, rnorm(n = 1)). With that said, having a curated set of functions where this technique can be applied would still be greatly helpful.)
  • In my example, there would obviously be a performance hit if all the dates in the table were unique. Personally, I'd be willing to live with that occasional hit, but maybe some simple logic could be embedded, like "if the number of values > 1M and the number of unique elements represents less than 50% of the total rows, then apply the technique". Counting the unique elements may itself be slow, so perhaps the user could opt in to the behavior, or it could be applied only on columns that carry an index.
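The opt-in guard described in the second bullet could look something like this (a sketch; the `maybe_lookup` name and the threshold values are illustrative, not a proposed API):

```r
# Sketch of the guard logic: use the lookup path only when the column
# is large and has relatively few unique values; otherwise fall back
# to a plain call. Thresholds are illustrative defaults.
maybe_lookup <- function(x, f, min_rows = 1e6, max_unique_frac = 0.5) {
  ux <- unique(x)  # note: this counting step itself has a cost
  if (length(x) >= min_rows && length(ux) < max_unique_frac * length(x)) {
    f(ux)[match(x, ux)]  # lookup path: f runs on unique values only
  } else {
    f(x)                 # plain path: f runs on the full vector
  }
}
```

Either branch returns the same values, so the heuristic only affects speed, not results.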

Obviously I haven't fleshed out all the details of how this would work, but hopefully my idea is clear. The performance boost from this technique would be incredibly helpful to my workflow (after all, the reason I use data.table over dplyr and pandas is its superior performance). Searching through the issues, I see @MichaelChirico touched on this a bit in #2503.

R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.2
data.table_1.10.5


Labels: IDate/ITime, joins, performance
