Skip to content

fread(file, nrows=0) file with header does not determine types#5253

Merged
mattdowle merged 6 commits intomasterfrom
fread_nrows=0L_detectTypes
Nov 18, 2021
Merged

fread(file, nrows=0) file with header does not determine types#5253
mattdowle merged 6 commits intomasterfrom
fread_nrows=0L_detectTypes

Conversation

@ben-schwen
Copy link
Copy Markdown
Member

Closes #4029

Follow up to #4694

@codecov
Copy link
Copy Markdown

codecov bot commented Nov 6, 2021

Codecov Report

Merging #5253 (39f60b3) into master (45e7da8) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #5253   +/-   ##
=======================================
  Coverage   99.50%   99.50%           
=======================================
  Files          77       77           
  Lines       14603    14603           
=======================================
  Hits        14531    14531           
  Misses         72       72           
Impacted Files Coverage Δ
src/fread.c 99.40% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 45e7da8...39f60b3. Read the comment docs.

Comment thread src/fread.c Outdated
const char *firstJumpEnd=NULL; // remember where the winning jumpline from jump 0 ends, to know its size excluding header
const char *prevStart = NULL; // the start of the non-empty line before the first not-ignored row (for warning message later, or taking as column names)
int jumpLines = (int)umin(100,nrowLimit); // how many lines from each jump point to use. If nrowLimit is supplied, nJumps is later set to 1 as well.
int jumpLines = (int)umin(100, umax(1,nrowLimit)); // how many lines from each jump point to use. If nrowLimit is supplied, nJumps is later set to 1 as well. // sample at least 1 line for files with header and nrows==0L #4029
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to remove nrowLimit from here entirely? I have to imagine the first (or any random) row will give the incorrect column type a decent % of the time.

Scanning 100 rows to guess column types should be pretty fast (it makes sense to test)...

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the default behavior pre #2623. It seems that it was user requested to only read nrows for sampling if nrows argument is supplied #1267. Bumping from 0 to 1 seems pretty safe to me (because I suspect files to have at least 1-2 "good" lines) while bumping from 0 to 100 might introduce unwanted behavior for those users?

Changing to int jumpLines = 100 already breaks example from #1267

fread("1,2,3\n1,2", nrows=1)
# > fread("1,2,3\n1,2", nrows=1)
#        1     2      3
#    <int> <int> <lgcl>
# 1:     1     2     NA
# Warning message:
# In fread("1,2,3\n1,2", nrows = 1) :
#   Detected 3 column names but the data has 2 columns. Filling rows automatically. Set fill=TRUE explicitly to avoid this warning.
fread("1,2,3\n1,2", nrows=1, header=FALSE)
# > fread("1,2,3\n1,2", nrows=1, header=FALSE)
#       V1    V2
#    <int> <int>
# 1:     1     2

While current fix gives

fread("1,2,3\n1,2", nrows=1)
# > fread("1,2,3\n1,2", nrows=1)
#       V1    V2    V3
#    <int> <int> <int>
# 1:     1     2     3

Copy link
Copy Markdown
Member

@MichaelChirico MichaelChirico Nov 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, point taken. It really seems like a clash of purpose then -- fread(., nrows=0) is essentially asking for a different behavior vs. fread(., nrows=1..100).

Maybe it makes sense to split that into a different argument? Or otherwise treat n=0 differently than n=1..100 -- we check up to 100 rows for n=0 (because the user is trying to get a quick view of their data) but only n rows for nrows=n?

Seems somewhat confusing. Happy to approve this PR as is though.

cc @mattdowle and @jangorecki for vis on essentially a design decision.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While a different argument could have made sense for this as e.g. peak=TRUE, I guess it would be pretty annoying for users who rely on this feature to change it now.

Treating n=0 differently, e.g. sampling 100 rows for nrows=0 and otherwise take min(100, nrows), seems like a good option to me and I'm happy to alter the PR in this direction.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes agree with nrows=0 running full sample. I altered as you both suggested. Now it just needs a test looking at verbose output to make sure it's doing that, will do ...

@mattdowle mattdowle added this to the 1.14.3 milestone Nov 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

when nrows = 0 in fread, field types returned are not correct

4 participants