fread should un-escape escaped quotes in fields #1109

@jangorecki

According to the docs:

A quoted field must start with quote and end with a quote that is also immediately followed by sep or \n. Thus, unescaped quotes may be present in a quoted field (...,2,"Joe, "Bloggs"",3.14,...) as well as escaped quotes (...,2,"Joe \",Bloggs\"",3.14,...).

The following csv should be supported by fread.

library(data.table)
dt <- data.table(a = 1:2, b = c('f(c("a","b"))','sum(1,2)'))

dt
#    a             b
#1: 1 f(c("a","b"))
#2: 2      sum(1,2)

write.table(dt,"tbl1.csv",sep=",",na="",col.names=TRUE,row.names=FALSE,qmethod="escape")
write.table(dt,"tbl2.csv",sep=",",na="",col.names=TRUE,row.names=FALSE,qmethod="double")

system("cat tbl1.csv")
# "a","b"
#1,"f(c(\"a\",\"b\"))"
#2,"sum(1,2)"
system("cat tbl2.csv")
# "a","b"
#1,"f(c(""a"",""b""))"
#2,"sum(1,2)"

# error, no output
fread("tbl1.csv",sep=",")
# Error in fread("tbl1.csv", sep = ",") : 
#   Expected sep (',') but new line, EOF (or other non printing character) ends field 1 when detecting types (   first): 2,"sum(1,2)"
# In addition: Warning message:
#   In fread("tbl1.csv", sep = ",") :
#   Starting data input on line 2 and discarded previous non-empty line: "a","b"

# incorrect output: doubled quotes are not un-escaped
fread("tbl2.csv",sep=",")
# a                 b
#1: 1 f(c(""a"",""b""))
#2: 2          sum(1,2)

# incorrect output
read.table("tbl1.csv",sep=",",header=TRUE)
#   a                 b
#1 1 f(c(\\a\\,\\b\\))
#2 2          sum(1,2)

# correct output
read.table("tbl2.csv",sep=",",header=TRUE)
# a               b
#1 1 f(c("a","b"))
#2 2      sum(1,2)

Findings:
As of now, the only combination that round-trips this kind of data correctly is writing with qmethod="double" and reading with read.table.
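
Until fread un-escapes quotes itself, one possible workaround is to post-process the character columns after reading. The following is a minimal base R sketch, not a definitive fix: `unescape_quotes` is a hypothetical helper (it is not part of data.table), and it assumes the reader returned the field contents verbatim, as in the tbl2.csv output above. It does not handle escaped backslashes.

```r
# Hypothetical helper (not part of data.table): undo quote escaping in
# character values that a reader returned verbatim.
unescape_quotes <- function(x, qmethod = c("double", "escape")) {
  qmethod <- match.arg(qmethod)
  if (qmethod == "double") {
    # collapse each doubled quote "" into a single quote "
    gsub('""', '"', x, fixed = TRUE)
  } else {
    # turn each backslash-escaped quote \" into a plain quote "
    gsub('\\"', '"', x, fixed = TRUE)
  }
}

# Field contents as they would look if read without un-escaping
raw_double <- c('f(c(""a"",""b""))', 'sum(1,2)')
raw_escape <- c('f(c(\\"a\\",\\"b\\"))', 'sum(1,2)')

res_double <- unescape_quotes(raw_double, "double")
res_escape <- unescape_quotes(raw_escape, "escape")
```

Applied to a column, this would look like `dt[, b := unescape_quotes(b, "double")]`. Note that the "double" branch would also collapse a genuine pair of adjacent quotes in the data, so it is only safe when the file is known to have been written with qmethod="double".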

Session info (latest dev data.table, my locale, etc.):

> sessionInfo()
# R version 3.1.3 (2015-03-09)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Ubuntu 14.04.2 LTS
# 
# locale:
#  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_DK.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
#  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] data.table_1.9.5
# 
# loaded via a namespace (and not attached):
#  [1] bitops_1.0-6   chron_2.3-45   devtools_1.7.0 evaluate_0.5.5 formatR_1.0    httr_0.6.1     knitr_1.8      RCurl_1.95-4.5 stringr_0.6.2  tools_3.1.3   
