Fix reading of files where fields may contain many newlines by st-pasha · Pull Request #2800 · Rdatatable/data.table

st-pasha · 2018-04-27T17:26:01Z

I tested that this fix allows fread to correctly parse the jigsaw-toxic-comments and avito-demand-prediction Kaggle datasets.

The bug is resolved by removing the safeguard that would stop reading a field after encountering 100 newlines inside it. This safeguard breaks the use cases where the user does have a dataset with fields containing many newlines (e.g. emails, extended descriptions, user comments on the web, etc.).
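To make the failure mode concrete, here is a minimal sketch of a quoted-field scanner with an optional newline cap. It is not the code from `src/fread.c`: the function name `scan_quoted_field`, the `max_newlines` parameter, and the convention that `""` escapes a quote are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, NOT the implementation in fread.c.
 *
 * Scans a quoted field that starts at s[0] == '"'.
 * Returns the number of characters consumed, including both quotes,
 * or 0 if the field is rejected (no closing quote, or too many newlines).
 *
 * max_newlines < 0 means "no limit". Otherwise the scan gives up as soon
 * as the field contains more than max_newlines newline characters.
 * A doubled quote ("") inside the field is assumed to be an escaped quote. */
size_t scan_quoted_field(const char *s, size_t len, long max_newlines)
{
    assert(s != NULL);
    if (len == 0 || s[0] != '"') return 0;

    long newlines = 0;
    size_t i = 1;
    while (i < len) {
        char c = s[i];
        if (c == '"') {
            if (i + 1 < len && s[i + 1] == '"') {
                i += 2;            /* escaped quote, stay inside the field */
                continue;
            }
            return i + 1;          /* closing quote found */
        }
        if (c == '\n') {
            newlines++;
            if (max_newlines >= 0 && newlines > max_newlines) {
                return 0;          /* the safeguard: abandon a valid field */
            }
        }
        i++;
    }
    return 0;                      /* input ended before the closing quote */
}
```

Under these assumptions, a well-formed field containing 150 newlines is rejected when `max_newlines` is 100 and accepted when the cap is disabled with a negative value, which mirrors the behavior change described above.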

At first I thought of merely raising the limit, say to 10000. But that would only make fread fail less often; it wouldn't eliminate the problem altogether. I also thought of making it an fread option exposed to the user. That would have added more complexity (such as throwing an exception suggesting that the user increase the newline limit), and then the first thing the user would probably do is increase that limit anyway. So in the end that increased complexity would have served no purpose whatsoever...

The reason the limit was there in the first place was so that if the user didn't quote their fields correctly, and there was just a single quote in the whole huge file, we wouldn't spend time reading the entire file before trying a more liberal quote rule (QR). But I feel that reading a well-formed CSV file correctly is much more important than the possibility of wasting some time when reading an ill-formed CSV, because in reality there is no limit on how many newlines you may have in a text field. It could be a billion (although once a single field reaches 2^31 bytes in size, everything will break anyway :).
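The trade-off can be sketched as a strict scan followed by a fallback. This is an illustration under stated assumptions, not fread's actual quote-rule logic: the names `strict_quoted_len`, `liberal_len` and `field_len_with_fallback` are hypothetical, and the "liberal" rule here simply treats quotes as ordinary characters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, NOT the quote-rule code in fread.c. */

/* Strict rule: a field that starts with '"' ends at the matching '"'.
 * Newlines inside the quotes are allowed, with no upper bound.
 * Returns the field length including quotes, or 0 if the input ends first.
 * With a single stray quote this walks the whole remaining input. */
size_t strict_quoted_len(const char *s, size_t len)
{
    assert(s != NULL);
    if (len == 0 || s[0] != '"') return 0;
    for (size_t i = 1; i < len; i++) {
        if (s[i] != '"') continue;
        if (i + 1 < len && s[i + 1] == '"') { i++; continue; } /* escaped "" */
        return i + 1;
    }
    return 0;
}

/* Liberal rule (assumed): quotes are ordinary characters and the field
 * ends at the next separator or newline. */
size_t liberal_len(const char *s, size_t len, char sep)
{
    assert(s != NULL);
    size_t i = 0;
    while (i < len && s[i] != sep && s[i] != '\n') i++;
    return i;
}

/* Try the strict rule first; fall back to the liberal rule only when the
 * strict scan reaches the end of the input without a closing quote.
 * *fell_back is set to 1 if the fallback was needed, else 0. */
size_t field_len_with_fallback(const char *s, size_t len, char sep,
                               int *fell_back)
{
    assert(s != NULL && fell_back != NULL);
    if (len == 0 || s[0] != '"') {       /* unquoted field, nothing to retry */
        *fell_back = 0;
        return liberal_len(s, len, sep);
    }
    size_t n = strict_quoted_len(s, len);
    if (n > 0) {
        *fell_back = 0;
        return n;
    }
    *fell_back = 1;                      /* cost: the strict pass read to EOF */
    return liberal_len(s, len, sep);
}
```

In this sketch the well-formed input is parsed correctly however many newlines it contains, and the ill-formed input (one stray quote) pays for a full pass over the remaining bytes before the fallback kicks in. That extra pass is the cost the removed limit was meant to avoid.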

Closes #2395
Closes #2600

codecov-io · 2018-04-28T00:07:19Z

Codecov Report

Merging #2800 into master will decrease coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #2800      +/-   ##
==========================================
- Coverage   93.49%   93.48%   -0.01%     
==========================================
  Files          61       61              
  Lines       12367    12364       -3     
==========================================
- Hits        11562    11558       -4     
- Misses        805      806       +1

Impacted Files	Coverage Δ
src/fread.c	`97.95% <ø> (-0.09%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 52ce9b6...0c08dc5. Read the comment docs.

mattdowle

Yep, completely agree. Great.

Fix reading of files where fields may contain many newlines

b7b54a5

st-pasha added the fread label Apr 27, 2018

st-pasha self-assigned this Apr 27, 2018

st-pasha requested a review from mattdowle April 27, 2018 17:26

Merge branch 'master' into fread1

0c08dc5

mattdowle added this to the v1.11.0 milestone Apr 28, 2018

mattdowle approved these changes Apr 28, 2018

mattdowle merged commit ffbf0f2 into master Apr 28, 2018

mattdowle deleted the fread1 branch April 28, 2018 01:26

andreas-sudo mentioned this pull request Jan 22, 2020

newline in csv file causes fread to stop #4192

Closed

mattdowle merged 2 commits into master from fread1