Just reading the new dev notes and noticed #3333. I was actually going to file a feature request for %likep% the other day (it would make sense to conform to %plike%), but decided against it, thinking the consensus might be that fewer convenience wrappers are better for data.table. Is there any particular reason why data.table can't incorporate another one that leverages the perl = TRUE argument?
Often you get considerable speed improvements, plus a bunch of other features / behaviors that come with Perl-compatible regular expressions.
# The following packages are required.
# install.packages(c("stringi", "microbenchmark"))
# Load data.table.
library(data.table)
# Create a data.table of 100,000 random strings (20 chars in length).
DT = data.table(x = stringi::stri_rand_strings(100000, 20))
# Define a trivial regex pattern.
regex_pattern = "car|blah|far|nah"
# Create an alternative to %like% that sets `perl = TRUE`.
`%likep%` = function (vector, pattern) {
  if (is.factor(vector)) {
    # Match against the factor levels only, then map back to the elements.
    as.integer(vector) %in% grep(pattern, levels(vector), perl = TRUE)
  }
  else {
    grepl(pattern, vector, perl = TRUE)
  }
}
# Microbenchmark the results to demonstrate speed improvements.
microbenchmark::microbenchmark(
  like  = DT[x %like% regex_pattern],
  likep = DT[x %likep% regex_pattern]
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# like 84.1235 86.56265 91.51547 87.74410 91.16710 159.6292 100
# likep 16.0932 16.64750 17.81476 16.95985 17.82195 34.1415 100
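As one example of the "other features" that perl = TRUE brings, here is a minimal sketch using base R's grepl() on its own. The sample vector is made up for illustration and is not part of the benchmark above. It uses a negative lookahead, which is PCRE syntax and is not supported by the default regex engine.

```r
# Hypothetical sample data, only for illustration.
x = c("carpet", "car", "farm", "blah")

# Negative lookahead: match "car" only when it is NOT followed by "pet".
# This relies on perl = TRUE (PCRE); the default engine does not
# understand the (?!...) construct.
res = grepl("car(?!pet)", x, perl = TRUE)
print(res)
# [1] FALSE  TRUE FALSE FALSE
```

A %likep% operator would expose the same kind of pattern inside DT[...] without having to spell out the grepl() call each time.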