Improve fread parser for floats by st-pasha · Pull Request #2629 · Rdatatable/data.table

st-pasha · 2018-02-16T07:42:22Z

This PR allows fread to recognize floating-point literals with arbitrarily many digits and parse them as doubles. This includes numbers such as 10000000000000000000000000000.0 or 2.23498720349871093847109387e-21, but not 100000000000000000000000000 (a long integer with no decimal point or exponent is still parsed as a string).

The reason for this change is that we have already seen multiple datasets where floating-point values are stored with more than the "canonical" 17 significant digits. To an unsuspecting user, such a number looks completely innocuous, e.g. 2.49102793750273308, and it comes as a complete surprise that fread would parse it as a string. It is also relatively easy to produce these extra digits of precision "by accident", for example by asking for too many digits in printf() or by printing with the %f format.
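
As an illustration of how such values arise, the following sketch formats a double with `%.20g` and with `%f` and counts the significant digits in the resulting text. The helper names (`count_significant_digits`, `format_g20`, `format_f`) are made up for this example and are not part of data.table.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Count the decimal digits in the mantissa of a numeric literal,
 * ignoring the sign, leading zeros, the decimal point and any exponent. */
static int count_significant_digits(const char *s)
{
    assert(s != NULL);
    int n = 0;
    int seen_nonzero = 0;
    if (*s == '+' || *s == '-') s++;
    for (; *s && *s != 'e' && *s != 'E'; s++) {
        if (!isdigit((unsigned char)*s)) continue;   /* skips the '.' */
        if (*s != '0') seen_nonzero = 1;
        if (seen_nonzero) n++;
    }
    return n;
}

/* Asking printf for 20 significant digits: more than a double can distinguish. */
static void format_g20(double x, char *buf, size_t n)
{
    snprintf(buf, n, "%.20g", x);
}

/* %f always prints the full integer part plus six decimals,
 * so large magnitudes end up with many digits. */
static void format_f(double x, char *buf, size_t n)
{
    snprintf(buf, n, "%f", x);
}
```

For instance, writing 0.1 with `format_g20` or 1e22 with `format_f` should give text with more than 17 significant digits on common C libraries, and the literal 2.49102793750273308 quoted above has 18.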

This PR also fixes issue #2625 by being more careful about checking whether the literal is valid.

Closes #2625

codecov-io · 2018-02-16T07:45:22Z

Codecov Report

Merging #2629 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2629      +/-   ##
==========================================
+ Coverage   93.03%   93.04%   +<.01%     
==========================================
  Files          61       61              
  Lines       12116    12131      +15     
==========================================
+ Hits        11272    11287      +15     
  Misses        844      844

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/fread.c | `97.14% <100%> (+0.03%)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update aa50640...ac974b8.

MichaelChirico · 2018-02-17T07:09:09Z

Just noticed this in example(fread):

DT = fread("A\n1.010203040506070809010203040506\n")  # too precise for double, so read as character
# TODO: add numerals=c("allow.loss", "warn.loss", "no.loss") from base::read.table
typeof(DT$A)=="character"   # TRUE

Seems related.

mattdowle

Great catch and fix.

st-pasha added 3 commits February 15, 2018 23:26

Parse floating-point values with arbitrary number of decimal digits as doubles

03410d5

Add a note in NEWS.md

0db86d4

Also parse numbers such as 129801928301982370918723123e12

84b6c09

st-pasha added the bug, enhancement, and fread labels Feb 16, 2018

st-pasha added this to the v1.10.6 milestone Feb 16, 2018

st-pasha self-assigned this Feb 16, 2018

st-pasha requested a review from mattdowle February 16, 2018 07:42

MichaelChirico and others added 2 commits February 18, 2018 20:09

Merge branch 'master' into fread-parse-floats

6e8405a

Updated wording given this a good catch of dev-only behaviour.

ac974b8

mattdowle approved these changes Feb 19, 2018

mattdowle merged commit 15abd60 into master Feb 19, 2018

mattdowle deleted the fread-parse-floats branch February 19, 2018 19:38
