Skip to content

Fix reading of files where fields may contain many newlines#2800

Merged
mattdowle merged 2 commits intomasterfrom
fread1
Apr 28, 2018
Merged

Fix reading of files where fields may contain many newlines#2800
mattdowle merged 2 commits intomasterfrom
fread1

Conversation

@st-pasha
Copy link
Copy Markdown
Contributor

I tested that this fix allows fread to correctly parse the jigsaw-toxic-comments and avito-demand-prediction kaggle datasets.

The bug is resolved by removing the safeguard that would stop reading a field after encountering 100 newlines inside it. This safeguards breaks those use cases where the used does have a dataset with fields containing many newlines (eg. emails, extended descriptions, user comments on the web, etc.)

At first I thought of merely raising the limit higher -- say, to 10000. But that would merely make fread fail less often, but wouldn't eliminate the problem altogether. I also thought of making it an fread option exposed to the user. That would have added more complexity (such as throwing an exception suggesting the user to increase that newline limit), and then the first thing the user would probably do is to increase that limit anyways. So in the end that increased complexity would have served no purpose whatsoever...

The reason why the limit was there in the first place was so that if the user didn't quote their fields correctly, and there was just a single quote in the whole huge file, we didn't want to spend time reading the entire file before trying a more liberal QR. But I feel reading a well-formed CSV file correctly is much more important than the possibility of wasting some time when reading an ill-formed CSV... Because in reality there is no limit on how many newlines you may have in a text field. It could be a billion (although once a single field reaches in size 2^31 bytes, everything will break anyways :).

Closes #2395
Closes #2600

@st-pasha st-pasha added the fread label Apr 27, 2018
@st-pasha st-pasha self-assigned this Apr 27, 2018
@st-pasha st-pasha requested a review from mattdowle April 27, 2018 17:26
@codecov-io
Copy link
Copy Markdown

Codecov Report

Merging #2800 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2800      +/-   ##
==========================================
- Coverage   93.49%   93.48%   -0.01%     
==========================================
  Files          61       61              
  Lines       12367    12364       -3     
==========================================
- Hits        11562    11558       -4     
- Misses        805      806       +1
Impacted Files Coverage Δ
src/fread.c 97.95% <ø> (-0.09%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 52ce9b6...0c08dc5. Read the comment docs.

@mattdowle mattdowle added this to the v1.11.0 milestone Apr 28, 2018
Copy link
Copy Markdown
Member

@mattdowle mattdowle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, completely agree. Great.

@mattdowle mattdowle merged commit ffbf0f2 into master Apr 28, 2018
@mattdowle mattdowle deleted the fread1 branch April 28, 2018 01:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unable to parse Kaggle "toxic comments" dataset fread with many new lines in string

3 participants