fread spends too much time in is_url/is_secureurl/is_file for long in memory input #2531

@javrucebo

Summary

When fread is given a character string as input, it spends a considerable amount of time detecting that the supplied input is neither a filename nor a URL.

This is because grepl, as used in fread, does not scale well for large inputs.

Example

is_url <- function(x) grepl("^(http|ftp)s?://", x)

Although the pattern is anchored at the beginning of the string, grepl takes a long time on large inputs (more detailed benchmarks further down).
As a result, a full call to fread can spend about a third of its time on these supposedly simple checks (example also below).
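To make the effect concrete, here is a minimal timing sketch. It assumes a 1e6-character random string as a stand-in for real in-memory input; the variable names are illustrative only, and absolute timings will differ by machine and R version.

```r
# Illustrative timing only; not taken from fread itself.
set.seed(1)
# one long string of random letters, standing in for in-memory fread input
big <- paste(sample(c(LETTERS, letters), 1e6, replace = TRUE), collapse = "")
pattern <- "^(http|ftp)s?://"

# default regex engine on the full string
t_default <- system.time(r_default <- grepl(pattern, big))[["elapsed"]]
# PCRE engine on the full string
t_perl <- system.time(r_perl <- grepl(pattern, big, perl = TRUE))[["elapsed"]]
# default engine, but only on the first 8 characters
t_prefix <- system.time(r_prefix <- grepl(pattern, substr(big, 1L, 8L)))[["elapsed"]]

print(c(default = t_default, perl = t_perl, prefix = t_prefix))
```

All three calls return FALSE for this input; only the time needed to reach that answer should differ.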

Possible solutions could be one or more of the following:

  • changing the call to grepl to use the PERL regexp engine (perl=TRUE).
    Alternatively, use another method to determine whether the input starts with a URL (see benchmarks below)
  • adding an explicit optional argument which denotes string input, e.g. str=, so that the input is treated as the data and the tests for URL or file are skipped. This would be similar in spirit to the file= argument.
    As a side effect, this would also allow reading input that consists only of URLs and has no header, e.g.
fread("http://hkhfsk\nhttp://fhdkf\nhttp://kjfhskd\nhttp://hfkjf", header=FALSE)
  • treating any input that exceeds a certain number of characters as not being a URL
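The first and third ideas could be combined roughly as follows. This is only a sketch: is_url_cheap and its max_len default of 2048 are hypothetical and not part of data.table, and the example URL is made up.

```r
# Hypothetical helper, not data.table's implementation.
# max_len is an assumed cutoff: anything longer is treated as data, not a URL.
is_url_cheap <- function(x, max_len = 2048L) {
  # byte length is used so that the check itself does not depend on the content
  if (nchar(x, type = "bytes") > max_len) return(FALSE)
  # only the first 8 characters can matter for the anchored pattern
  grepl("^(http|ftp)s?://", substr(x, 1L, 8L), perl = TRUE)
}

is_url_cheap("https://example.com/data.csv")              # TRUE
is_url_cheap(paste(rep("a\tb\n", 1e5), collapse = ""))    # FALSE, too long
```

Note that a short URL-only input such as the four-line example above would still be classified as a URL by this helper, which is why an explicit argument for string input remains useful.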

Profiling example

library(data.table)
# create a random character input in 5 columns with 20000*200 lines
randomString <- function() {
  paste(sample(c(LETTERS, letters), sample(1:9,1)), collapse="")
}
input <- paste(rep(paste(replicate(20000,paste(replicate(5, randomString()), 
                                              collapse="\t")),
                         collapse="\n"),200),
               collapse="\n")
# profile the call to fread
Rprof()
invisible(fread(input, header=FALSE))
Rprof(NULL)
summaryRprof()
$by.self
        self.time self.pct total.time total.pct
".Call"      5.34    64.49       5.34     64.49
"grepl"      2.94    35.51       2.94     35.51

Benchmarking grepl and friends

Comparing different functions to verify whether a string starts with any of http(s)/ftp(s) or file shows that grepl with the default regex engine scales badly and is by far the slowest of the tested variants.
Simply adding perl=TRUE already improves run time by a factor of around 100 for large inputs (code further below).
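For completeness, a dependency-free variant that is not part of the benchmark could use base R's startsWith (available from R 3.3.0). The function name is_url_startswith is an assumption for illustration, and its timing has not been measured here.

```r
# Sketch of a base-R prefix test; not benchmarked in this issue.
is_url_startswith <- function(x) {
  # startsWith() returns one logical per prefix; any() reduces them to one value
  any(startsWith(x, c("http://", "https://", "ftp://", "ftps://")))
}

is_url_startswith("ftps://host/file.csv")   # TRUE
is_url_startswith("a\tb\nc\td")             # FALSE
```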

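[Figure: "Benchmark string tests for isURL", median time in milliseconds against string length, both axes on a log scale, one line per tested function]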

library(data.table)
library(ggplot2)
library(Rcpp)

# define alternative functions to detect URLs
grepl_full <- function(txt) { grepl("^(http|ftp)s?://", txt) }
grepl_substr <- function(txt) { grepl("^(http|ftp)s?://", substr(txt,1,8)) }
grepl_perl <- function(txt) { grepl("^(http|ftp)s?://", txt, perl=TRUE) }
grepl_perl_substr <- function(txt) { grepl("^(http|ftp)s?://", substr(txt,1,8), perl=TRUE) }
grepl_perl_stri_sub <- function(txt) { grepl("^(http|ftp)s?://", stringi::stri_sub(txt,1,8),
                                             perl=TRUE) }
stridetect <- function(txt) { stringi::stri_detect_regex(txt, "^(http|ftp)s?://") }
stristartswith <- function(txt) { 
  # stri_startswith_fixed returns one logical per prefix; any() reduces them to a single result
  any(stringi::stri_startswith_fixed(txt, c("http://", "https://", "ftp://", "ftps://")))
}
cppFunction('bool isURL_cpp(const std::string &x) {
               bool ret = FALSE;
               if (x.length() >= 8) {
                 std::string s = x.substr(0,8);
                 if (s.substr(0,7) == "http://" || 
                     s == "https://" ||
                     s.substr(0,6) == "ftp://" ||
                     s.substr(0,7) == "ftps://") {
                   ret = TRUE;
                 }
               }
               return(ret);
             }')
cppFunction('bool isURL_c(const char* x) {
               bool ret = FALSE;
               if (strlen(x) >= 8) {
                 if (!strncmp(x, "http://", 7) || 
                     !strncmp(x, "https://", 8) ||
                     !strncmp(x, "ftp://", 6) ||
                     !strncmp(x, "ftps://", 7)) {
                   ret = TRUE;
                 }
               }
               return(ret);
             }')


# number of string lengths to test with. max len = 1e+nn
nn <- 8
# create random strings of given length
for (i in 1:nn) {
  assign(paste0("str_",i), paste(sample(c(LETTERS, letters), 10^i, replace=TRUE), 
                                 collapse=""))
}
function_list <- c("grepl_full", "grepl_substr", "grepl_perl", "grepl_perl_substr", 
                   "grepl_perl_stri_sub", 
                   "isURL_cpp", "isURL_c", 
                   "stridetect", "stristartswith")

# cross join of functions and strings
dt <- CJ(fun=function_list, ind=1:nn)
# build code to run microbenchmark
mb_code <- paste("microbenchmark::microbenchmark(",
                 paste0("'", dt$fun, " ", 10^dt$ind, "'=", 
                        dt$fun, "(str_", dt$ind, ")", collapse=","),",times=10)")
# eval microbenchmark and calculate median
mb <- as.data.table(eval(parse(text=mb_code)))
mb[,c("fun", "nr"):=tstrsplit(as.character(expr), " ", fixed=TRUE)]
mb[,nr:=as.integer(nr)]
mb <- mb[,.(time=median(time)), by=.(fun,nr)]

# plot
ggplot(mapping=aes(nr, time/1e6)) + 
  geom_line(data=mb[fun %chin% c("grepl_full", "grepl_perl")],
            aes(col=fun), size=1.1) +
  geom_line(data=mb[!fun %chin% c("grepl_full", "grepl_perl")],
            aes(col=fun), size=0.5, linetype="dotted") +
  geom_point(data=mb, aes(col=fun)) +
  scale_y_log10(breaks=c(0.01,0.1,1,10,100,1000)) +
  scale_x_log10(breaks=10^(1:nn)) +
  labs(y="time in milliseconds", x="string length", 
       title="Benchmark string tests for isURL") +
  theme(legend.position = c(0.05,0.95), legend.justification = c(0,1))

sessionInfo

R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_IE.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_IE.UTF-8    LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_IE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_IE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] tools_3.3.1 yaml_2.1.15

Results are similar with R 3.4.3 / data.table 1.10.4 / Windows 10 64bit
