
Use lookup (join to dictionary) for performance boost #2603

@ben519

Description


When I have a large dataset with a date column read in as character, I can convert it to Date type like so:

library(data.table)

set.seed(2018)
dt <- data.table(
  DateChar = as.character(as.Date("2000-01-01") + sample(10, size = 10^7, replace = TRUE))
)

# Simple conversion using as.Date
system.time(dt[, Date1 := as.Date(DateChar)])  # ~59 seconds

# Simple conversion using as.IDate
system.time(dt[, Date2 := as.IDate(DateChar)])  # ~59 seconds

But this is painfully slow. So I usually build a table of the unique date characters, convert those to Date type, and then do a join + update back into my original table:

system.time({
  uniqueDates <- unique(dt[, list(DateChar)])
  uniqueDates[, Date3 := as.Date(DateChar)]
  dt[uniqueDates, Date3 := i.Date3, on = "DateChar"]
})  # ~0.5 seconds (about 120 times speedup!)
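For what it's worth, the same dedup-then-map idea can be written as a small generic helper using `unique()` and `match()` (a sketch; `lookup_apply` is a hypothetical name, not an existing data.table function):

```r
library(data.table)

# Hypothetical helper: apply f to the unique values only, then map
# the results back onto the full vector with match().
lookup_apply <- function(x, f) {
  ux <- unique(x)
  f(ux)[match(x, ux)]
}

dt <- data.table(
  DateChar = as.character(as.Date("2000-01-01") + sample(10, size = 1e5, replace = TRUE))
)
dt[, Date3 := lookup_apply(DateChar, as.IDate)]
```

Because `as.IDate` is deterministic per input value, the result is identical to calling it on the full column, but `f` only runs on the unique values.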

This results in an enormous speedup. I was wondering if similar logic could be embedded in data.table to do this internally where it's appropriate. I haven't given it a ton of thought, but here are some basic points:

  • This could be extended to other methods like day(), weekday(), ... (I don't think you could generalize this behavior to arbitrary functions. Imagine trying to apply this technique when the user does something like dt[, foo := myfunc(foo)] where myfunc <- function(x) ifelse(x < 10, x, rnorm(n = 1)). With that said, having a curated set of functions where this technique can be applied would still be greatly helpful.)
  • In my example, there would obviously be a performance hit if all the dates in the table were unique. Personally, I'd be willing to live with that occasional hit, but maybe some simple logic could be embedded, like "if the number of values > 1M and the number of unique elements represents less than 50% of the total rows, then apply the technique". Counting the unique elements may itself be slow, so perhaps the user could opt in to the behavior, or it could be applied only on columns that carry an index.
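The opt-in guard described in the second bullet could look something like this (a sketch; the `maybe_lookup` name and the threshold values are illustrative, not a proposed API):

```r
# Sketch of the guard logic: use the lookup path only when the column
# is large and has relatively few unique values; otherwise fall back
# to a plain call. Thresholds are illustrative defaults.
maybe_lookup <- function(x, f, min_rows = 1e6, max_unique_frac = 0.5) {
  ux <- unique(x)  # note: this counting step itself has a cost
  if (length(x) >= min_rows && length(ux) < max_unique_frac * length(x)) {
    f(ux)[match(x, ux)]  # lookup path: f runs on unique values only
  } else {
    f(x)                 # plain path: f runs on the full vector
  }
}
```

Either branch returns the same values, so the heuristic only affects speed, not results.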

Obviously I haven't fleshed out all the details of how this would work, but hopefully my idea is clear. The performance boost from this technique would be incredibly helpful to my workflow (after all, the reason I use data.table over dplyr and pandas is its superior performance). Searching through the issues, I see @MichaelChirico touched on this a bit in #2503.

R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.2
data.table_1.10.5


Labels: IDate/ITime, joins, performance
