When I have a large dataset with a date column read in as character, I can convert it to Date type like so:
```r
library(data.table)

set.seed(2018)
dt <- data.table(
  DateChar = as.character(as.Date("2000-01-01") + sample(10, replace = TRUE, size = 10^7))
)

# Simple conversion using as.Date
system.time(dt[, Date1 := as.Date(DateChar)])   # ~59 seconds

# Simple conversion using as.IDate
system.time(dt[, Date2 := as.IDate(DateChar)])  # ~59 seconds
```
But this is painfully slow. So, I usually build a table of the unique date characters, convert those to Date type, and then do a join + insertion back into my original table:
```r
system.time({
  uniqueDates <- unique(dt[, list(DateChar)])
  uniqueDates[, Date3 := as.Date(DateChar)]
  dt[uniqueDates, Date3 := i.Date3, on = "DateChar"]
})  # ~0.5 seconds (about a 120x speedup!)
```
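The same trick can be expressed without a join at all, using `unique()` + `match()`. Here is a minimal sketch of a generic helper; the name `map_on_uniques` is purely illustrative and not part of data.table:

```r
library(data.table)

# Hypothetical helper: apply an expensive elementwise function f to the
# unique values of x only, then broadcast the results back via match().
map_on_uniques <- function(x, f) {
  ux <- unique(x)
  f(ux)[match(x, ux)]
}

set.seed(2018)
x <- as.character(as.Date("2000-01-01") + sample(10, replace = TRUE, size = 10^5))

# Same result as converting every element directly
identical(map_on_uniques(x, as.IDate), as.IDate(x))  # TRUE
```

Because `as.IDate()` is deterministic and elementwise, converting the uniques and broadcasting back is guaranteed to match the direct conversion.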
This results in an enormous speedup. I was wondering if similar logic could be embedded in data.table so it happens internally where appropriate. I haven't given it a ton of thought, but here are some basic points:
- This could be extended to other methods like `day()`, `weekday()`, etc. (I don't think you could generalize this behavior to arbitrary functions. Imagine applying the technique when the user does something like `dt[, foo := myfunc(foo)]` where `myfunc <- function(x) ifelse(x < 10, x, rnorm(n = 1))`; the output depends on randomness, not just the input value, so computing on uniques and broadcasting back would give the wrong answer. With that said, having a curated set of functions where the technique is known to be safe would still be greatly helpful.)
- In my example, there would obviously be a performance hit if all the dates in the table were unique. Personally, I'd be willing to live with that occasional hit, but maybe some simple logic could be embedded, like "if the number of rows exceeds 1M and the unique elements represent less than 50% of the total rows, then apply the technique". Counting the unique elements may itself be slow, so perhaps the user could opt in to the behavior, or it could be applied only to columns that already have an index.
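To make the heuristic above concrete, here is a rough sketch in user code. The function name and thresholds are illustrative only; nothing here is an existing data.table API:

```r
library(data.table)

# Sketch of the opt-in heuristic: take the unique-then-broadcast path only
# when the column is long and uniqueN() suggests heavy duplication.
# Thresholds (1M rows, 50% unique) are the illustrative values from the text.
convert_dates <- function(x, min_rows = 1e6, max_unique_frac = 0.5) {
  n <- length(x)
  if (n > min_rows && uniqueN(x) < max_unique_frac * n) {
    ux <- unique(x)
    as.IDate(ux)[match(x, ux)]  # convert uniques, broadcast back
  } else {
    as.IDate(x)                 # plain elementwise conversion
  }
}
```

In practice the `uniqueN(x)` check costs a pass over the data, which is why restricting the optimization to indexed columns (where cardinality information is cheap to get) might be the more attractive route.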
Obviously I haven't fleshed out all the details of how this would work, but hopefully my idea is clear. The performance boost from this technique would be incredibly helpful to my workflow (after all, the reason I use data.table over dplyr and pandas is its superior performance). Searching through the issues, @MichaelChirico touched on this a bit in #2503.
```
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.2
data.table_1.10.5
```