fread fails w/ nrow > 1000 and out of region garbage row #2621

@mneilly

First, thank you for the enormous amount of effort that must go into data.table!

This issue seems to be a variant of #831. The following entry on the NEWS page stands out to me as possibly relevant, although I do not know the code base, so it may be unrelated:

fread has always jumped to the middle and to the end of the file for a much improved column type guess. The sample size is increased from 100 rows at 10 jump points (1,000 rows) to 100 rows at 100 jump points (10,000 row sample).

A single CSV file is attached. It contains a header, followed by 1012 rows of data, followed by a final line of "# JUNK". It is intended as a minimal example of the failure. My actual use case is a single file containing multiple CSV regions, as in #831.

foo.txt
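In case the attachment is unavailable, a file with the same shape can be regenerated with base R. This is only a sketch of what is described above: the column names and the first and last values are taken from the printed output, while the assumption that every row has "blah" in the second column and the use of a temporary path are mine.

```r
# Rebuild a file shaped like the attachment: a header, 1012 data rows,
# and a trailing one-field junk line.
# Assumption: every data row is "<n>,blah"; only rows 1-5 and 996-1000
# are visible in the printed output.
path <- file.path(tempdir(), "foo.csv")  # hypothetical location
n_rows <- 1012L

lines <- c(
  "count,cacheid",                      # header
  sprintf("%d,blah", seq_len(n_rows)),  # data rows 1..1012
  "# JUNK"                              # final line with no separator
)
writeLines(lines, path)

# Sanity check: 1 header + 1012 data rows + 1 junk line
n_lines <- length(readLines(path))
```

Pointing the fread() calls at `path` instead of /tmp/foo.csv should exercise the same input.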

For the attached CSV (uploaded as foo.txt rather than foo.csv because I could not upload a .csv file, and read below as /tmp/foo.csv), the following fread() with nrow=1000 works:

> fread("/tmp/foo.csv", nrow=1000, header=TRUE, verbose=T)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.000008 GB.
Memory mapping ... ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Positioned on line 1 after skip or autostart
This line is the autostart and not blank so searching up for the last non-blank ... line 1
Detecting sep ... ','
Detected 2 columns. Longest stretch was from line 1 to line 30
Starting data input on line 1 (either column names or first row of data). First 10 characters: count,cach
'header' changed by user from 'auto' to TRUE
nrow set to nrows passed in (1000)
Type codes (point  0): 14
Type codes: 14 (after applying colClasses and integer64)
Type codes: 14 (after applying drop or select (if supplied)
Allocating 2 column slots (2 - 0 dropped)
Read 1000 rows. Exactly what was estimated and allocated up front
   0.000s ( 20%) Memory map (rerun may be quicker)
   0.000s ( 12%) sep and header detection
   0.000s (  2%) Count rows (wc -l)
   0.000s ( 26%) Column type detection (100 rows at 10 points)
   0.000s (  4%) Allocation of 1000x2 result (xMB) in RAM
   0.000s ( 34%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.000s (  2%) Changing na.strings to NA
   0.000s        Total
      count cacheid
   1:     1    blah
   2:     2    blah
   3:     3    blah
   4:     4    blah
   5:     5    blah
  ---
 996:   996    blah
 997:   997    blah
 998:   998    blah
 999:   999    blah
1000:  1000    blah
>

but the following fread() with nrow=1001 does not:

> fread("/tmp/foo.csv", nrow=1001, header=TRUE, verbose=T)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.000008 GB.
Memory mapping ... ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Positioned on line 1 after skip or autostart
This line is the autostart and not blank so searching up for the last non-blank ... line 1
Detecting sep ... ','
Detected 2 columns. Longest stretch was from line 1 to line 30
Starting data input on line 1 (either column names or first row of data). First 10 characters: count,cach
'header' changed by user from 'auto' to TRUE
nrow set to nrows passed in (1001)
Type codes (point  0): 14
Type codes (point  1): 14
Type codes (point  2): 14
Type codes (point  3): 14
Type codes (point  4): 14
Type codes (point  5): 14
Type codes (point  6): 14
Type codes (point  7): 14
Type codes (point  8): 14
Error in fread("/tmp/foo.csv", nrow = 1001, header = TRUE, verbose = T) :
  Expected sep (',') but new line, EOF (or other non printing character) ends field 0 when detecting types from point 9: # JUNK
>  

Adding a "," after "# JUNK" causes the fread() to succeed.

It appears that fread() works as long as the "# JUNK" row has at least as many columns as the region of interest, and fails if it has fewer. I have verified that read.csv() works with the same CSV and the same parameters.
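The read.csv() comparison can be sketched as follows, using only base R. The file is regenerated inline so the snippet stands alone; its contents are my assumption about the attachment (all data rows of the form "<n>,blah"), and the temporary path is hypothetical.

```r
# Recreate the assumed input: header, 1012 data rows, trailing "# JUNK".
path <- tempfile(fileext = ".csv")
writeLines(c("count,cacheid", sprintf("%d,blah", 1:1012), "# JUNK"), path)

# Same parameters as the failing fread() call. read.csv() stops after
# 1001 data rows, so the junk line further down is never parsed.
df <- read.csv(path, nrows = 1001, header = TRUE)

dim(df)      # rows and columns actually read
tail(df, 1)  # last row read, which should be count 1001
```

The difference seems to be that read.csv() only looks at the rows it was asked for, whereas fread() samples from further into the file while detecting column types.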

The following is the output from sessionInfo():

> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.10.4-2

loaded via a namespace (and not attached):
 [1] colorspace_1.3-2 scales_0.4.1     lazyeval_0.2.0   plyr_1.8.4
 [5] tools_3.3.3      gtable_0.2.0     tibble_1.3.3     Rcpp_0.12.14
 [9] ggplot2_2.2.1    grid_3.3.3       rlang_0.1.1      munsell_0.4.3
