
newline in csv file causes fread to stop #4192

@andreas-sudo


This might be related to #2800, but the fix there doesn't solve my problem.

My problem: fread stops loading a csv file if there is a newline inside a text-qualified (quoted) field.
My original file has 1 million rows. The first error occurs around line 92,980.

The line that causes trouble is:

"3";"2018";"Thing_x";"Thing_y";"Thing_k";"Thing_y";"Thing_k";"Private";"abc";"2017";"20.98.51.";"20.98.51.20";"965";;;"5708";" Individuel supplerende elevstøtte, efterkoler";"35";"Thing_x                                       ";;;;;;;"ENK";"Enkelt Postering";;;;;;;;;;;;"N";;;"Individuel supp. (ÅE 18,45*takst 1.209,00/1)
";;;"22306,05000"

The error is:

Stopped early on line 92980. Expected 45 fields but found 42. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<"3";"2018";
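
A record split like the one above can be located without fread by checking quote parity per physical line. The following is only a minimal base R sketch, under the assumption that the text qualifier is `"` and that literal quotes inside fields, if any, are doubled; `find_open_quote_lines` and `example_lines` are hypothetical names made up for illustration and are not part of data.table.

```r
# Hypothetical helper (not part of data.table): return the indices of physical
# lines that end while a quoted field is still open, i.e. lines followed by a
# continuation of the same record.
find_open_quote_lines <- function(lines, quote = "\"") {
  # Count the quote characters on each physical line (0 when there are none).
  n_quotes <- lengths(regmatches(lines, gregexpr(quote, lines, fixed = TRUE)))
  # An odd running total means the line ends inside a quoted field.
  which(cumsum(n_quotes) %% 2 == 1)
}

# Made-up miniature of the situation in this report: the record starting on
# physical line 2 continues on line 3 because of a newline inside a field.
example_lines <- c(
  '"a";"b";"c"',
  '"1";"x";"first part',
  '";"2"',
  '"3";"y";"z"'
)
open_lines <- find_open_quote_lines(example_lines)
print(open_lines)
```

On a real file the input would come from something like `readLines()`, which loads the whole file into memory. A single unbalanced quote earlier in the file flips the parity for every later line, so the first reported index is the most informative one.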

When I try to create a reproducible example with only a few lines above and below the troublesome row, fread reads the file successfully. I have attached such a file anyway (csv renamed to log). I'll be happy to share the original file, but not publicly on GitHub.

linebreaks_Example.log

Sorry I can't help more. data.table helps me very much. Thanks for the hard work.

The output from fread(, verbose=TRUE) is below, followed by the session info.


  omp_get_num_procs()            4
  R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
  R_DATATABLE_NUM_THREADS        unset
  omp_get_thread_limit()         2147483647
  omp_get_max_threads()          4
  OMP_THREAD_LIMIT               unset
  OMP_NUM_THREADS                unset
  RestoreAfterFork               true
  data.table is using 2 threads. See ?setDTthreads.
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 2 threads (omp_get_max_threads()=4, nth=2)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file C:\Users\xxxx\Downloads\to_delete\file_20200122.csv
  File opened, size = 655MB (686952672 bytes).
  Memory mapped ok
[03] Detect and skip BOM
  UTF-8 byte order mark EF BB BF found at the start of the file and skipped.
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<"PERIODE";"UDBETALINGSAAR";"IN>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=';'  with 100 lines of 45 fields using quote rule 0
  Detected 45 columns on line 1. This line is either column names or first data row. Line starts as: <<"PERIODE";"UDBETALINGSAAR";"IN>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 45
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 100 because (686952667 bytes from row 1 to eof) / (2 * 62986 jump0size) == 5453
  Type codes (jump 000)    : 55A5A5AAA5AA5225A5A5A5A5AAAAAAA5A552AAA7222AA  Quote rule 0
  Type codes (jump 001)    : 55A5A5AAA5AA5A25A5A5A5A5AAAAAAA7A555AAA752AAA  Quote rule 0
  Type codes (jump 002)    : 55A5A5AAA5AA5AA5A5A5A5A5AAAAAAA7A555AAA75AAAA  Quote rule 0
  A line with too-few fields (33/45) was found on line 38 of sample jump 14. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-few fields (33/45) was found on line 81 of sample jump 28. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-few fields (33/45) was found on line 28 of sample jump 42. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-few fields (33/45) was found on line 7 of sample jump 66. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-few fields (33/45) was found on line 66 of sample jump 68. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-few fields (33/45) was found on line 13 of sample jump 73. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-few fields (33/45) was found on line 8 of sample jump 85. Most likely this jump landed awkwardly so type bumps here will be skipped.
  Type codes (jump 100)    : 55A5A5AAA5AA5AA5A5A5A5A5AAAAAAA7A555AAA75AAAA  Quote rule 0
  'header' determined to be true due to column 1 containing a string on row 1 and a lower type (int32) in the rest of the 9594 sample rows
  =====
  Sampled 9594 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 2 to the end of last row: 686951963
  Line length: mean=574.05 sd=57.68 min=320 max=769
  Estimated number of rows: 686951963 / 574.05 = 1196673
  Initial alloc = 1497646 rows (1196673 + 25%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 55A5A5AAA5AA5AA5A5A5A5A5AAAAAAA7A555AAA75AAAA
[10] Allocate memory for the datatable
  Allocating 45 column slots (45 - 0 dropped) with 1497646 rows
[11] Read the data
  jumps=[0..656), chunk_size=1047182, total_size=686951963
|--------------------------------------------------|
|  Restarting team from jump 5. nSwept==0 quoteRule==1
  jumps=[5..656), chunk_size=1047182, total_size=686951963
  Restarting team from jump 5. nSwept==0 quoteRule==2
  jumps=[5..656), chunk_size=1047182, total_size=686951963
===  Restarting team from jump 50. nSwept==0 quoteRule==3
  jumps=[50..656), chunk_size=1047182, total_size=686951963
===============================================|
Read 92978 rows x 45 columns from 655MB (686952672 bytes) file in 00:01.151 wall clock time
[12] Finalizing the datatable
  Type counts:
        15 : int32     '5'
         2 : float64   '7'
        28 : string    'A'
=============================
   0.000s (  0%) Memory map 0.640GB file
   0.125s ( 11%) sep=';' ncol=45 and header detection
   0.000s (  0%) Column type detection using 9594 sample rows
   0.462s ( 40%) Allocation of 1497646 rows x 45 cols (0.418GB) of which 92978 (  6%) rows used
   0.564s ( 49%) Reading 656 chunks (0 swept) of 0.999MB (each chunk 141 rows) using 2 threads
   +    0.295s ( 26%) Parse to row-major thread buffers (grown 0 times)
   +    0.186s ( 16%) Transpose
   +    0.083s (  7%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
   1.151s        Total

Session info:

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=Danish_Denmark.1252  LC_CTYPE=Danish_Denmark.1252    LC_MONETARY=Danish_Denmark.1252
[4] LC_NUMERIC=C                    LC_TIME=Danish_Denmark.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.12.8

loaded via a namespace (and not attached):
[1] compiler_3.6.1 tools_3.6.1  
