This may be related to issue #1812, but as that one does not have a reproducible example to confirm, I thought it would be more appropriate to open a new issue.
When a file with an uneven number of columns has its widest row (the one with the maximum number of columns) as the final row, fread fails with the following error:
Error in fread("foo", header = FALSE, fill = TRUE, sep = ",", col.names = paste("V", :
Expecting 3 cols, but line 9 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.
This occurs even with fill = TRUE and with the maximum number of column names passed to col.names.
Here is a small example:
text <- "12223, University\n12227, bridge, Sky\n12828, Sunset\n13801, Ground\n14853, Tranceamerica\n14854, San Francisco\n15595, shibuya, Shrine\n16126, fog, San Francisco\n16520, California, ocean, summer, golden gate, beach, San Francisco\n"
cat(text, file = "foo")
max.fields<-max(count.fields("foo", sep = ','))
fread("foo", header = FALSE, fill=TRUE, sep=",", col.names = paste("V", 1:max.fields, sep = ""))
However, when the row with the maximum number of fields is moved to the middle of the file (in this example row 6), fread behaves as expected.
text <- "12223, University\n12227, bridge, Sky\n12828, Sunset\n13801, Ground\n14853, Tranceamerica\n16520, California, ocean, summer, golden gate, beach, San Francisco\n14854, San Francisco\n15595, shibuya, Shrine\n16126, fog, San Francisco\n"
cat(text, file = "foo")
max.fields<-max(count.fields("foo", sep = ','))
fread("foo", header = FALSE, fill=TRUE, sep=",", col.names = paste("V", 1:max.fields, sep = ""))
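To make the difference between the two files concrete, here is a minimal base R sketch that counts the fields on each line of both layouts. It only uses count.fields, as in the examples above; the helper name fields_per_line and the use of a temporary file are my own choices for illustration. It does not call fread, so it shows the shape of the input rather than the failure itself.

```r
# The nine rows from the example above, one string per line.
rows <- c(
  "12223, University",
  "12227, bridge, Sky",
  "12828, Sunset",
  "13801, Ground",
  "14853, Tranceamerica",
  "14854, San Francisco",
  "15595, shibuya, Shrine",
  "16126, fog, San Francisco",
  "16520, California, ocean, summer, golden gate, beach, San Francisco"
)

# Hypothetical helper: write the lines to a temp file and count fields per line.
fields_per_line <- function(lines) {
  path <- tempfile(fileext = ".csv")
  on.exit(unlink(path))
  writeLines(lines, path)
  count.fields(path, sep = ",")
}

last_layout   <- fields_per_line(rows)                  # widest row last (fread fails)
middle_layout <- fields_per_line(rows[c(1:5, 9, 6:8)])  # widest row sixth (fread works)

print(last_layout)               # 2 3 2 2 2 2 3 3 7
print(which.max(last_layout))    # 9
print(middle_layout)             # 2 3 2 2 2 7 2 3 3
print(which.max(middle_layout))  # 6
```

In the failing layout, none of the first eight lines has more than 3 fields, which matches the "Expecting 3 cols, but line 9 contains text" part of the error message.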
I included this caveat in my answer to this Stack Overflow question.
Laptop session info:
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.3
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_2.2.1 microbenchmark_1.4-4 data.table_1.10.4-3
loaded via a namespace (and not attached):
[1] colorspace_1.3-2 scales_0.5.0 compiler_3.4.1 lazyeval_0.2.1 plyr_1.8.4 tools_3.4.1 pillar_1.2.1
[8] gtable_0.2.0 tibble_1.4.2 Rcpp_0.12.15 grid_3.4.1 rlang_0.2.0 munsell_0.4.3
Also tested on this machine, with the same results:
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] vegan_2.4-4 lattice_0.20-35 permute_0.9-4 ggplot2_2.2.1 data.table_1.10.4 reshape2_1.4.3
loaded via a namespace (and not attached):
[1] Rcpp_0.12.15 cluster_2.0.6 magrittr_1.5 MASS_7.3-49 munsell_0.4.3 colorspace_1.3-2
[7] rlang_0.1.6 stringr_1.2.0 plyr_1.8.4 tools_3.4.4 parallel_3.4.4 grid_3.4.4
[13] gtable_0.2.0 nlme_3.1-131.1 mgcv_1.8-23 digest_0.6.15 yaml_2.1.14 lazyeval_0.2.1
[19] tibble_1.4.2 Matrix_1.2-11 labeling_0.3 stringi_1.1.6 compiler_3.4.4 pillar_1.1.0
[25] scales_0.5.0