Pattern matching for character columns compared to Factor columns

Hello, I would like to know the appropriate way to match patterns in character columns.
The vignettes clearly explain that character columns should be preferred over factor columns.
I know of the like function in the package, but its performance seems fairly low in the benchmark below, where I detect a pattern in a character column and in a factor column.
Is there a recommended way to do pattern matching on character columns in data.table?

```
library(data.table)
library(stringr)
library(microbenchmark)

set.seed(1L)
n_distinct <- 100L
n_letters <- 5L
# Build 100 random 5-letter strings, then sample 1e6 values from them
x <- replicate(n_distinct, paste(sample(LETTERS, n_letters, replace = TRUE), collapse = ""))
n_length <- 1e6L
x <- sample(x = x, size = n_length, replace = TRUE)
f <- factor(x)
DT <- data.table(f = f, x = x)

# Run str_detect on the factor levels only, then expand to rows via the integer codes
fct_detect <- function(x, ...) {
    stringr::str_detect(levels(x), ...)[x]
}

res <- microbenchmark(
    DT1 <- DT[x %like% "SSA"],             # the like function on Character
    DT2 <- DT[fct_detect(f, "SSA")],       # naive detect function on Factor using str_detect on the levels
    DT3 <- DT[f %like% "SSA"],             # the like function on Factor
    DT4 <- DT[x %chin% "SSATD"],           # exact match on Character with %chin%
    DT5 <- DT[x == "SSATD"],               # exact match on Character with ==
    DT6 <- DT[f == "SSATD"],               # exact match on Factor with ==
    DT7 <- DT[(levels(f) == "SSATD")[f]],  # exact match on the levels, expanded to rows
    times = 50L
    )
res
# Unit: milliseconds                                                                                           
#                                  expr        min         lq       mean     median         uq        max neval
#             DT1 <- DT[x %like% "SSA"] 137.224022 139.326541 153.066482 141.602557 148.522236 237.255871    50
#       DT2 <- DT[fct_detect(f, "SSA")]   4.593743   4.896224   5.717059   5.080999   5.346516   9.457377    50
#             DT3 <- DT[f %like% "SSA"]  19.556630  21.070560  24.242968  21.944684  29.048203  32.234688    50
#           DT4 <- DT[x %chin% "SSATD"]   1.477981   1.629233   2.060127   1.709921   2.850643   3.092975    50
#               DT5 <- DT[x == "SSATD"]   1.480074   1.654964   2.023678   1.710184   2.813284   3.068752    50
#               DT6 <- DT[f == "SSATD"]   4.523973   4.854614   5.994524   5.023527   6.237378   9.392769    50
#  DT7 <- DT[(levels(f) == "SSATD")[f]]   4.462589   4.716833   5.786381   5.031112   5.205085   9.351510    50

sapply(list(DT2, DT3), FUN = identical, DT1)       
# [1] TRUE TRUE                                    
sapply(list(DT5, DT6, DT7), FUN = identical, DT4)  
# [1] TRUE TRUE TRUE                               

```
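
For comparison, the levels-only trick used in fct_detect can be adapted to a plain character column by matching against the distinct values and mapping the result back to the rows. The following is only a sketch under assumptions: chr_detect is a hypothetical helper name (it is not part of data.table), it uses base grepl rather than stringr, and it has not been benchmarked here, so any speed-up would depend on how few distinct values the column has.

```r
library(data.table)

# Hypothetical helper, not a data.table function: evaluate the regex once per
# distinct value, then expand the result to every row with chmatch().
chr_detect <- function(x, pattern, ...) {
    ux <- unique(x)                 # distinct strings only
    hit <- grepl(pattern, ux, ...)  # one regex test per distinct string
    hit[chmatch(x, ux)]             # map each row to its distinct value
}

# Smaller data than the benchmark above, same construction
set.seed(1L)
pool <- replicate(100L, paste(sample(LETTERS, 5L, replace = TRUE), collapse = ""))
x <- sample(pool, size = 1e5L, replace = TRUE)
DT <- data.table(x = x)

# Take the pattern from the data so that at least one value matches
pattern <- substr(pool[1L], 1L, 3L)

res_like <- DT[x %like% pattern]        # like on the full column
res_uniq <- DT[chr_detect(x, pattern)]  # like-style match via distinct values
identical(res_like, res_uniq)
```

If the two results are identical, chr_detect could be added to the microbenchmark call above as one more expression on the x column.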

Pattern matching for character columns compared to Factor columns #4748
