Improve fread parser for floats #2629

Merged: mattdowle merged 5 commits into master from fread-parse-floats on Feb 19, 2018

@st-pasha
Contributor

This PR allows fread to recognize floating-point literals with arbitrarily many digits and parse them as doubles. This includes numbers such as 10000000000000000000000000000.0 or 2.23498720349871093847109387e-21, but not 100000000000000000000000000 (a long integer will still be parsed as a string).
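
One way to accept arbitrarily many digits is to keep only the leading significant digits and fold the rest into the decimal exponent. The following is a minimal sketch of that idea, not the code from src/fread.c: the function name `parse_long_float`, the 19-digit cutoff, and the rule that a token needs a `.` or an exponent to count as a float are all assumptions made for illustration, and the result is not guaranteed to be correctly rounded.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* 10^n by repeated squaring; overflows to infinity for very large n. */
static long double pow10_ld(int n)
{
    long double result = 1.0L, base = 10.0L;
    for (; n > 0; n >>= 1, base *= base) {
        if (n & 1) result *= base;
    }
    return result;
}

/* Hypothetical helper, NOT the code from src/fread.c.
   Returns true and stores the value in *out only when the whole token is a
   valid decimal literal that contains '.' or an exponent (assumed rule). */
bool parse_long_float(const char *s, double *out)
{
    assert(s != NULL && out != NULL);

    bool negative = (*s == '-');
    if (*s == '-' || *s == '+') s++;

    uint64_t acc = 0;   /* first 19 significant digits; 10^19 - 1 fits in 64 bits */
    int sig = 0;        /* significant digits stored in acc */
    int e10 = 0;        /* power of ten still to be applied to acc */
    int ndigits = 0;    /* mantissa digits seen, kept or not */
    bool is_float = false;

    for (; isdigit((unsigned char)*s); s++, ndigits++) {
        if (sig < 19) {
            acc = acc * 10 + (uint64_t)(*s - '0');
            if (acc != 0) sig++;
        } else {
            e10++;      /* integer digit dropped: compensate with the exponent */
        }
    }
    if (*s == '.') {
        is_float = true;
        for (s++; isdigit((unsigned char)*s); s++, ndigits++) {
            if (sig < 19) {
                acc = acc * 10 + (uint64_t)(*s - '0');
                if (acc != 0) sig++;
                e10--;
            }           /* else: fractional digit beyond the 19th is ignored */
        }
    }
    if (ndigits == 0) return false;                     /* rejects "", ".", "e5" */

    if (*s == 'e' || *s == 'E') {
        is_float = true;
        s++;
        bool exp_negative = (*s == '-');
        if (*s == '-' || *s == '+') s++;
        if (!isdigit((unsigned char)*s)) return false;  /* rejects "1e", "1e+" */
        int e = 0;
        for (; isdigit((unsigned char)*s); s++) {
            if (e < 10000) e = e * 10 + (*s - '0');     /* cap to avoid int overflow */
        }
        e10 += exp_negative ? -e : e;
    }
    if (*s != '\0' || !is_float) return false;  /* trailing junk, or a plain integer */

    long double v = (long double)acc;
    if (acc != 0) {
        v = (e10 < 0) ? v / pow10_ld(-e10) : v * pow10_ld(e10);
    }
    *out = (double)(negative ? -v : v);
    return true;
}
```

Under these assumptions, `parse_long_float("2.49102793750273308", &v)` succeeds, while a digits-only token such as `"100000000000000000000000000"` or a malformed token such as `"1.2.3"` is rejected, so a caller could fall back to reading the field as a string.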

The reason for this change is that we have already seen multiple datasets where floating-point values are stored with more than the "canonical" 17 significant digits. To an unsuspecting user, such a number looks completely innocuous, e.g. 2.49102793750273308, and it comes as a complete surprise that fread would parse it as a string. It is also relatively easy to produce these extra digits of precision "by accident": by asking for too many digits in printf(), by printing with the %f format, etc.
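
As a rough illustration of how such over-long literals can end up in a file, the sketch below formats a double with more digits than it can carry and counts what was written. The helper names `format_sig`, `format_fixed`, and `count_sig_digits` are hypothetical, and the exact digits printed beyond the 17th depend on the C library.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical exporter: write x with a caller-chosen number of significant digits. */
int format_sig(double x, int digits, char *buf, size_t n)
{
    assert(buf != NULL && n > 0 && digits > 0);
    return snprintf(buf, n, "%.*g", digits, x);
}

/* Hypothetical exporter: write x in fixed notation, six decimals by default. */
int format_fixed(double x, char *buf, size_t n)
{
    assert(buf != NULL && n > 0);
    return snprintf(buf, n, "%f", x);
}

/* Count mantissa digits from the first non-zero digit up to any exponent marker. */
int count_sig_digits(const char *s)
{
    assert(s != NULL);
    int count = 0;
    for (; *s != '\0' && *s != 'e' && *s != 'E'; s++) {
        if (!isdigit((unsigned char)*s)) continue;   /* skip sign and '.' */
        if (count == 0 && *s == '0') continue;       /* skip leading zeros */
        count++;
    }
    return count;
}
```

Calling `format_sig(0.1, 20, buf, sizeof buf)` asks for 20 significant digits, and `format_fixed(123456789012345.678, buf, sizeof buf)` writes 15 integer digits plus six decimals, so both can yield fields with more than 17 digits even though the underlying values are ordinary doubles.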

This PR also fixes issue #2625 by being more careful about checking whether the literal is valid.

Closes #2625

@st-pasha st-pasha added this to the v1.10.6 milestone Feb 16, 2018
@st-pasha st-pasha self-assigned this Feb 16, 2018
@st-pasha st-pasha requested a review from mattdowle February 16, 2018 07:42
@codecov-io

codecov-io commented Feb 16, 2018

Codecov Report

Merging #2629 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2629      +/-   ##
==========================================
+ Coverage   93.03%   93.04%   +<.01%     
==========================================
  Files          61       61              
  Lines       12116    12131      +15     
==========================================
+ Hits        11272    11287      +15     
  Misses        844      844

| Impacted Files | Coverage Δ |
|---|---|
| src/fread.c | 97.14% <100%> (+0.03%) ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update aa50640...ac974b8.

@MichaelChirico
Member

Just noticed this in the examples for fread:

DT = fread("A\n1.010203040506070809010203040506\n")  # too precise for double, so read as character
# TODO: add numerals=c("allow.loss", "warn.loss", "no.loss") from base::read.table
typeof(DT$A)=="character"   # TRUE

Seems related
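
The "too precise for double" remark in that example can be illustrated in C: digits far beyond the 17th significant one generally do not change which double a literal converts to. This is only a sketch using the standard `strtod`; the helper name `same_double` is hypothetical and is not part of data.table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: true when two decimal literals convert to the same double. */
bool same_double(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strtod(a, NULL) == strtod(b, NULL);
}
```

For instance, `same_double("1.010203040506070809010203040506", "1.010203040506070809010203040507")` compares two literals that differ only in the 31st significant digit, whereas `same_double("1.01", "1.02")` compares two that differ well within double precision.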

Member

@mattdowle mattdowle left a comment

Great catch and fix.

@mattdowle mattdowle merged commit 15abd60 into master Feb 19, 2018
@mattdowle mattdowle deleted the fread-parse-floats branch February 19, 2018 19:38
Successfully merging this pull request may close these issues.

fread parses some invalid tokens as numeric
