Hi,
Currently, the parameter sep in function fread defaults to the set [,\t |;:].
I suggest including "\n" as the final separator in the default set, as this might improve backward compatibility of existing code with previous versions of data.table.
An example would be a file where each line contains a single string, but occasionally one of the default sep characters is part of that string. Without explicitly specifying sep = "\n", this produces an error in 1.9.5 (but not in 1.9.4) because of the string "c:4" on line 3.
Here is an example:
(I am using the data.table 1.9.5 development version from 8 March 2015; the txt file is available at https://www.dropbox.com/s/y6cmkcza36c1qjn/ex_150309.txt?dl=0)
myfile = "/net/ifs1/san_projekte/projekte/genstat/09_nutzer/holger/39_dt_request//ex_150309.txt" # available at https://www.dropbox.com/s/y6cmkcza36c1qjn/ex_150309.txt?dl=0
aa = fread(myfile, verbose = T)
## Input contains no \n. Taking this to be a filename to open
## File opened, filesize is 0.000000 GB.
## Memory mapping ... ok
## Detected eol as \r\n (CRLF) in that order, the Windows standard.
## Positioned on line 1 after skip or autostart
## This line is the autostart and not blank so searching up for the last non-blank ... line 1
## Detecting sep ... ':'
## Detected 2 columns. Longest stretch was from line 3 to line 3
## Starting data input on line 3 (either column names or first row of data). First 10 characters: c:4
## Warning in fread(myfile, verbose = T): Starting data input on line 3 and
## discarded previous non-empty line: b
## Some fields on line 3 are not type character (or are empty). Treating as a data row and using default column names.
## Count of eol: 3 (including 1 at the end)
## Count of sep: 1
## nrow = MIN( nsep [1] / ncol [2] -1, neol [3] - nblank [1] ) = 1
## Error in fread(myfile, verbose = T): Expected sep (':') but new line, EOF (or other non printing character) ends field 0 when detecting types ( first): d
aa = fread(myfile, verbose = T, sep = "\n")
## Input contains no \n. Taking this to be a filename to open
## File opened, filesize is 0.000000 GB.
## Memory mapping ... ok
## Detected eol as \r\n (CRLF) in that order, the Windows standard.
## Positioned on line 1 after skip or autostart
## This line is the autostart and not blank so searching up for the last non-blank ... line 1
## Using supplied sep '
## ' ... Deducing this is a single column input.
## Starting data input on line 1 (either column names or first row of data). First 10 characters: a
## All the fields on line 1 are character fields. Treating as the column names.
## Count of eol: 4 (including 1 at the end)
## Count of sep: 3
## ncol==1 so sep count ignored
## Type codes ( first 5 rows): 4
## Type codes: 4 (after applying colClasses and integer64)
## Type codes: 4 (after applying drop or select (if supplied)
## Allocating 1 column slots (1 - 0 dropped)
## Read 3 rows. Exactly what was estimated and allocated up front
## 0.000s ( 71%) Memory map (rerun may be quicker)
## 0.000s ( 13%) sep and header detection
## 0.000s ( 3%) Count rows (wc -l)
## 0.000s ( 6%) Column type detection (first, middle and last 5 rows)
## 0.000s ( 3%) Allocation of 3x1 result (xMB) in RAM
## 0.000s ( 2%) Reading data
## 0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered
## 0.000s ( 0%) Coercing data already read in type bumps (if any)
## 0.000s ( 2%) Changing na.strings to NA
## 0.000s Total
aa
## a
## 1: b
## 2: c:4
## 3: d
sessionInfo()
## R version 3.1.2 (2014-10-31)
## Platform: x86_64-suse-linux-gnu (64-bit)
##
## locale:
## [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8
## [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
## [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] data.table_1.9.5 knitr_1.9
##
## loaded via a namespace (and not attached):
## [1] chron_2.3-45 evaluate_0.5.5 formatR_1.0 stringr_0.6.2
## [5] tools_3.1.2
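In case the Dropbox link is unavailable, the following sketch should recreate a comparable file. The contents (a, b, c:4, d, with CRLF line endings) are an assumption inferred from the verbose output above, not a copy of the original file; `tmp` is just a temporary file name, and `header = TRUE` is set explicitly so the result does not depend on header detection.

```r
library(data.table)

# Assumed file contents, inferred from the verbose output above:
# a header line "a" followed by three single-string lines.
lines <- c("a", "b", "c:4", "d")

# Write in binary mode so the CRLF line endings are kept as-is on every OS.
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "wb")
writeLines(lines, con, sep = "\r\n")
close(con)

# Workaround from this report: pass sep = "\n" explicitly so that the ':'
# in "c:4" is not picked up as a column separator.
aa <- fread(tmp, sep = "\n", header = TRUE)
print(aa)
```

Calling `fread(tmp)` without `sep` on the same file is the case that fails for me in 1.9.5; other versions may behave differently.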