Pattern matching for character columns compared to Factor columns #4748

@statquant

Hello, I would like to know the appropriate way to match patterns in character columns.
The vignettes clearly explain that character columns should be preferred over factor columns.
I know of the like function in the package, but its performance seems fairly low in the problem below, where I try to detect a pattern in a character column and in a factor column.
Is there a recommended way to do pattern matching on character columns in data.table?

library(data.table)
library(stringr)
library(microbenchmark)

set.seed(1L)
n_distinct <- 100L
n_letters <- 5L
x <- replicate(n_distinct, paste(sample(LETTERS, n_letters, replace = TRUE), collapse = ""))
n_length <- 1e6L
x <- sample(x = x, size = n_length, replace = TRUE)
f <- factor(x)
DT <- data.table(f = f, x = x)

# match the pattern once per factor level, then index the result by the factor codes
fct_detect <- function(x, ...) {
    stringr::str_detect(levels(x), ...)[x]
}

microbenchmark(
    DT1 <- DT[x %like% "SSA"],             # the like function on the character column
    DT2 <- DT[fct_detect(f, "SSA")],       # naive detect function on the factor, using str_detect on the levels
    DT3 <- DT[f %like% "SSA"],             # the like function on the factor column
    DT4 <- DT[x %chin% "SSATD"],           # exact match with %chin% on the character column
    DT5 <- DT[x == "SSATD"],               # exact match with == on the character column
    DT6 <- DT[f == "SSATD"],               # exact match with == on the factor column
    DT7 <- DT[(levels(f) == "SSATD")[f]],  # exact match on the levels, indexed by the factor
    times = 50L
    ) -> res
res
# Unit: milliseconds                                                                                           
#                                  expr        min         lq       mean     median         uq        max neval
#             DT1 <- DT[x %like% "SSA"] 137.224022 139.326541 153.066482 141.602557 148.522236 237.255871    50
#       DT2 <- DT[fct_detect(f, "SSA")]   4.593743   4.896224   5.717059   5.080999   5.346516   9.457377    50
#             DT3 <- DT[f %like% "SSA"]  19.556630  21.070560  24.242968  21.944684  29.048203  32.234688    50
#           DT4 <- DT[x %chin% "SSATD"]   1.477981   1.629233   2.060127   1.709921   2.850643   3.092975    50
#               DT5 <- DT[x == "SSATD"]   1.480074   1.654964   2.023678   1.710184   2.813284   3.068752    50
#               DT6 <- DT[f == "SSATD"]   4.523973   4.854614   5.994524   5.023527   6.237378   9.392769    50
#  DT7 <- DT[(levels(f) == "SSATD")[f]]   4.462589   4.716833   5.786381   5.031112   5.205085   9.351510    50

sapply(list(DT2, DT3), FUN = identical, DT1)       
# [1] TRUE TRUE                                    
sapply(list(DT5, DT6, DT7), FUN = identical, DT4)  
# [1] TRUE TRUE TRUE                               
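
For comparison, the levels trick used in fct_detect can in principle be applied to a character column by matching the pattern against the distinct values only and mapping the result back to the rows. The sketch below is only an illustration: chr_detect is a hypothetical helper, not a data.table function, it was not part of the benchmark above, and its speed has not been measured here.

```r
library(data.table)

# Hypothetical helper (an assumption, not part of data.table):
# run the regex once per distinct string, then map the result back to each row.
chr_detect <- function(x, pattern, ...) {
    ux <- unique(x)                  # distinct strings only
    hit <- grepl(pattern, ux, ...)   # one regex test per distinct string
    hit[chmatch(x, ux)]              # position of each row's string in ux
}

# Small illustrative data, not the benchmark data above
set.seed(1L)
pool <- c("SSATD", "ABCDE", "XSSAY", "QWERT")
DT_small <- data.table(x = sample(pool, 20L, replace = TRUE))

# The helper should agree with a plain grepl over every row
same <- identical(chr_detect(DT_small$x, "SSA"), grepl("SSA", DT_small$x))

# It returns a logical vector, so it can be used in i like %like%
DT_hits <- DT_small[chr_detect(x, "SSA")]
```

Whether this helps depends on the ratio of distinct values to rows; with 100 distinct strings over 1e6 rows, as in the setup above, the regex would run on 100 strings instead of 1e6.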
