I am working with CSV data files generated by an instrument. The instrument exports three tables, each with two header rows, into the same file. All lines end with a ',EOL'.
For example:
Head 1, Head 2, Head 3, ..., Head n,
Sub 1, Sub 2, Sub 3, ..., Sub n,
character, numeric, numeric, ..., numeric,
character, numeric, numeric, ..., numeric,
character, numeric, numeric, ..., numeric,
< 30,000 - 800,000 rows >
Using read.table() (here via its wrapper read.csv()) with a combination of the skip and nrows arguments to specify a contiguous block of data corresponding to one of the three tables in the file works fine:
df1 <- read.csv(file, header=FALSE, skip=20, nrows=27214)
dim(df1)
[1] 27214 43
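The 43rd column is presumably the all-NA column produced by the trailing comma, since the file without trailing commas yields 42 columns. As a minimal sketch of dropping such a column after reading, using a small inline sample rather than the real instrument file (the sample text and the name drop_all_na are made up for illustration):

```r
# Hypothetical two-row sample with a trailing comma on every line,
# mimicking the layout of the instrument export.
sample_text <- "P - 1,1.5,2.5,\nP - 2,3.5,4.5,"

# read.csv() turns the empty field after the trailing comma into an NA column.
df_raw <- read.csv(text = sample_text, header = FALSE)

# Helper (illustrative name): keep only columns with at least one non-NA value.
drop_all_na <- function(df) {
  keep <- colSums(!is.na(df)) > 0
  df[, keep, drop = FALSE]
}

df_clean <- drop_all_na(df_raw)
```

Here df_raw has one more column than df_clean. Note that this would also drop any genuine data column that happens to be entirely NA in the selected block.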
Using fread() with the same settings returns an error:
df <- fread(file, header=FALSE, skip=20, nrows=27214)
Error in fread(file, header = FALSE, skip = 20, nrows = 27214) :
Expected sep (',') but new line, EOF (or other non printing character) ends
field 37 on line 22 when detecting types: P - 20, ,3.897,133.436,
0.786,1.137,0.046,761.305,0.211,183.300,1.129,1337.282,0.563,385.954,
116117.274,50391.888,166509.163,2.814,2.799,396.083,0.317,4775.659,0.285,
12.336,1288.281,0.867,1.066,0.721,0.377,272.761,997.594,2668.682,1060838.391,
424835.353,1485673.719,10.000,
There are actually two problems here.
- read.table() is able to interpret a comma immediately preceding an end of line as a column with value NA. Ideally, fread() should mimic this behavior and/or provide an option to remove columns that contain only NA. Removing the trailing commas from all lines in my file allows fread() to execute successfully.
- Perhaps more importantly, the line listed in the error message above is the 5th from the end of the file. When nrows is specified, type detection should be constrained to the lines between skip and skip + nrows. That is, perform type detection in rows c(1:5) + skip, the middle 5 rows, and skip + nrows - c( 5:1 ).
In my particular case, the number of columns is not fixed between the three data tables in the file, so using rows outside the skip-nrows range will not give an accurate representation of the data within the range.
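To make the proposed sampling concrete, here is a rough sketch in base R of which file lines would be inspected. It is not how fread() is implemented. The function name type_sample_rows is made up, and centering the "middle 5 rows" on skip + nrows %/% 2 is an assumption.

```r
# Illustrative helper: file line numbers to sample for type detection,
# restricted to the block selected by skip and nrows.
type_sample_rows <- function(skip, nrows, n = 5L) {
  first <- skip + seq_len(n)                    # c(1:5) + skip
  mid   <- skip + nrows %/% 2L                  # assumed centre of the block
  middle <- (mid - n %/% 2L):(mid + n %/% 2L)   # 5 rows around the centre
  # Literal form of the proposal: skip + nrows - c(5:1).
  # This stops one line short of skip + nrows; using c(4:0) instead
  # would include the final row of the block.
  last  <- skip + nrows - rev(seq_len(n))
  list(first = first, middle = middle, last = last)
}

rows <- type_sample_rows(skip = 20, nrows = 27214)
```

With the values from the failing call, every sampled line falls between line 21 and line 27234, so the rows near the end of the file that belong to a different table would never be inspected.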
Verbose output from failed fread() with trailing commas:
fread(file, sep=",", header=FALSE, skip=cellHead, nrows=length(cell), verbose=T)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.012 GB.
Memory mapping ... ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 21 ('skip' has been supplied) ... found ok
Found 43 columns
First row with 43 fields occurs on line 21 (either column names or first row of data)
'header' changed by user from 'auto' to FALSE
Count of eol after first data row: 27993
Subtracted 2 for last eol and any trailing empty lines, leaving 27991 data rows
nrow limited to nrows passed in (27214)
Type codes ( first 5 rows): 4144444444444444444444444444444444444444410
Type codes (+ middle 5 rows): 4144444444444444444444444444444444444444410
Error in fread(file, sep = ",", header = FALSE, skip = cellHead, nrows = length(cell), :
Expected sep (',') but new line, EOF (or other non printing character) ends field 37 on line 22 when detecting types: P - 20, ,3.897,133.436,0.786,1.137,0.046,761.305,0.211,183.300,1.129,1337.282,0.563,385.954,116117.274,50391.888,166509.163,2.814,2.799,396.083,0.317,4775.659,0.285,12.336,1288.281,0.867,1.066,0.721,0.377,272.761,997.594,2668.682,1060838.391,424835.353,1485673.719,10.000,
Verbose output from successful fread() without trailing commas:
fread(file2, sep=",", header=FALSE, skip=cellHead, nrows=length(cell), verbose=T)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.008 GB.
Memory mapping ... ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 21 ('skip' has been supplied) ... found ok
Found 42 columns
First row with 42 fields occurs on line 21 (either column names or first row of data)
'header' changed by user from 'auto' to FALSE
Count of eol after first data row: 27992
Subtracted 1 for last eol and any trailing empty lines, leaving 27991 data rows
nrow limited to nrows passed in (27214)
Type codes ( first 5 rows): 413333333333333333313333333333333333333311
Type codes (+ middle 5 rows): 413333333333333333313333333333333333333311
Type codes (+ last 5 rows): 413333333333333333333333333333333333333311
Type codes: 413333333333333333333333333333333333333311 (after applying colClasses and integer64)
Type codes: 413333333333333333333333333333333333333311 (after applying drop or select (if supplied)
Allocating 42 column slots (42 - 0 dropped)
Bumping column 15 from REAL to STR on data row 2228, field contains ' NaN '
0.002s ( 0%) Memory map (rerun may be quicker)
0.003s ( 0%) sep and header detection
0.227s ( 30%) Count rows (wc -l)
0.002s ( 0%) Column type detection (first, middle and last 5 rows)
0.000s ( 0%) Allocation of 27214x42 result (xMB) in RAM
0.519s ( 69%) Reading data
0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered
0.002s ( 0%) Coercing data already read in type bumps (if any)
0.000s ( 0%) Changing na.strings to NA
0.755s Total
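For reference, a file without trailing commas (like file2 above) can be produced with something along these lines. This is a minimal base R sketch: the function name strip_trailing_commas and the temporary files are made up for illustration, and it reads the whole file into memory, which may be slow for the largest exports.

```r
# Illustrative helper: copy a file, removing a single trailing comma
# (plus any spaces or tabs after it) from the end of every line.
strip_trailing_commas <- function(infile, outfile) {
  lines <- readLines(infile, warn = FALSE)   # accepts LF and CRLF line endings
  cleaned <- sub(",[ \t]*$", "", lines)
  writeLines(cleaned, outfile)               # note: writes "\n" line endings
  invisible(outfile)
}

# Small stand-in for the instrument file.
tmp_in  <- tempfile(fileext = ".csv")
tmp_out <- tempfile(fileext = ".csv")
writeLines(c("Head 1,Head 2,Head 3,",
             "Sub 1,Sub 2,Sub 3,",
             "P - 20,3.897,133.436,"), tmp_in)

strip_trailing_commas(tmp_in, tmp_out)
cleaned <- readLines(tmp_out)
```

Because the number of lines is unchanged, the same skip and nrows values should still select the same block when the cleaned file is passed to fread().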