Hello, I would like to know the appropriate way to match patterns in character columns.
The vignettes clearly explain that character columns should be preferred over factor columns.
I know about the like function in the package, but its performance seems fairly low in the problem below, where I try to detect a pattern in a character column and in a factor column.
Is there a recommended way to do pattern matching on character columns in data.table?
library(data.table)
library(stringr)
library(microbenchmark)
set.seed(1L)
n_distinct <- 100L
n_letters <- 5L
x <- replicate(n_distinct, paste(sample(LETTERS, n_letters, replace = TRUE), collapse = ""))
n_length <- 1e6L
x <- sample(x = x, size = n_length, replace = TRUE)
f <- factor(x)
DT <- data.table(f = f, x = x)
fct_detect <- function(x, ...) {
  # match the pattern once per level, then expand to all rows via the integer codes
  stringr::str_detect(levels(x), ...)[x]
}
microbenchmark(
DT1 <- DT[x %like% "SSA"], # the like function on Character
DT2 <- DT[fct_detect(f, "SSA")], # naive detect function on Factor using str_detect on the levels
DT3 <- DT[f %like% "SSA"], # the like function on Factor
DT4 <- DT[x %chin% "SSATD"],
DT5 <- DT[x == "SSATD"],
DT6 <- DT[f == "SSATD"],
DT7 <- DT[(levels(f) == "SSATD")[f]],
times = 50L
) -> res
res
# Unit: milliseconds
# expr                                        min         lq       mean     median         uq        max neval
# DT1 <- DT[x %like% "SSA"]            137.224022 139.326541 153.066482 141.602557 148.522236 237.255871    50
# DT2 <- DT[fct_detect(f, "SSA")]        4.593743   4.896224   5.717059   5.080999   5.346516   9.457377    50
# DT3 <- DT[f %like% "SSA"]             19.556630  21.070560  24.242968  21.944684  29.048203  32.234688    50
# DT4 <- DT[x %chin% "SSATD"]            1.477981   1.629233   2.060127   1.709921   2.850643   3.092975    50
# DT5 <- DT[x == "SSATD"]                1.480074   1.654964   2.023678   1.710184   2.813284   3.068752    50
# DT6 <- DT[f == "SSATD"]                4.523973   4.854614   5.994524   5.023527   6.237378   9.392769    50
# DT7 <- DT[(levels(f) == "SSATD")[f]]   4.462589   4.716833   5.786381   5.031112   5.205085   9.351510    50
sapply(list(DT2, DT3), FUN = identical, DT1)
# [1] TRUE TRUE
sapply(list(DT5, DT6, DT7), FUN = identical, DT4)
# [1] TRUE TRUE TRUE
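
For reference, here is a rough sketch of two workarounds I could imagine for the character column. They are not something I found recommended in the documentation. The helper name chr_detect is hypothetical (it is not part of data.table); it applies the same idea as fct_detect above by matching the pattern once per distinct string and mapping the result back to the rows with chmatch. The second variant simply uses base grepl with fixed = TRUE, which only applies when the pattern is a literal string rather than a regular expression. The sketch uses smaller data than the benchmark above and I have not timed it.

```r
library(data.table)

# Hypothetical helper, not part of data.table: run the pattern match once per
# distinct string, then expand the result to every row via chmatch().
chr_detect <- function(x, pattern, ...) {
  ux <- unique(x)
  grepl(pattern, ux, ...)[chmatch(x, ux)]
}

set.seed(1L)
vals <- replicate(100L, paste(sample(LETTERS, 5L, replace = TRUE), collapse = ""))
DT_small <- data.table(x = sample(vals, 1e5L, replace = TRUE))

res_like   <- DT_small[x %like% "SSA"]                # baseline: like on every row
res_unique <- DT_small[chr_detect(x, "SSA")]          # match on distinct values only
res_fixed  <- DT_small[grepl("SSA", x, fixed = TRUE)] # literal match, no regex engine

# all three approaches should select the same rows
sapply(list(res_unique, res_fixed), FUN = identical, res_like)
```

The unique-values approach should only help when the column has few distinct values relative to its length, as in my example; with mostly unique strings it would just add the cost of unique and chmatch.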