fread(file, nrows=0) file with header does not determine types#5253
fread(file, nrows=0) file with header does not determine types#5253
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5253 +/- ##
=======================================
Coverage 99.50% 99.50%
=======================================
Files 77 77
Lines 14603 14603
=======================================
Hits 14531 14531
Misses 72 72
Continue to review full report at Codecov.
|
| const char *firstJumpEnd=NULL; // remember where the winning jumpline from jump 0 ends, to know its size excluding header | ||
| const char *prevStart = NULL; // the start of the non-empty line before the first not-ignored row (for warning message later, or taking as column names) | ||
| int jumpLines = (int)umin(100,nrowLimit); // how many lines from each jump point to use. If nrowLimit is supplied, nJumps is later set to 1 as well. | ||
| int jumpLines = (int)umin(100, umax(1,nrowLimit)); // how many lines from each jump point to use. If nrowLimit is supplied, nJumps is later set to 1 as well. // sample at least 1 line for files with header and nrows==0L #4029 |
There was a problem hiding this comment.
Would it make sense to remove nrowLimit from here entirely? I have to imagine the first (or any random) row will give the incorrect column type a decent % of the time.
Scanning 100 rows to guess column types should be pretty fast (it makes sense to test)...
There was a problem hiding this comment.
This was the default behavior pre #2623. It seems that it was user requested to only read nrows for sampling if nrows argument is supplied #1267. Bumping from 0 to 1 seems pretty safe to me (because I suspect files to have at least 1-2 "good" lines) while bumping from 0 to 100 might introduce unwanted behavior for those users?
Changing to int jumpLines = 100 already breaks example from #1267
fread("1,2,3\n1,2", nrows=1)
# > fread("1,2,3\n1,2", nrows=1)
# 1 2 3
# <int> <int> <lgcl>
# 1: 1 2 NA
# Warning message:
# In fread("1,2,3\n1,2", nrows = 1) :
# Detected 3 column names but the data has 2 columns. Filling rows automatically. Set fill=TRUE explicitly to avoid this warning.
fread("1,2,3\n1,2", nrows=1, header=FALSE)
# > fread("1,2,3\n1,2", nrows=1, header=FALSE)
# V1 V2
# <int> <int>
# 1: 1 2While current fix gives
fread("1,2,3\n1,2", nrows=1)
# > fread("1,2,3\n1,2", nrows=1)
# V1 V2 V3
# <int> <int> <int>
# 1: 1 2 3There was a problem hiding this comment.
I see, point taken. It really seems like a clash of purpose then -- fread(., nrows=0) is essentially asking for a different behavior vs. fread(., nrows=1..100).
Maybe it makes sense to split that into a different argument? Or otherwise treat n=0 differently than n=1..100 -- we check up to 100 rows for n=0 (because the user is trying to get a quick view of their data) but only n rows for nrows=n?
Seems somewhat confusing. Happy to approve this PR as is though.
cc @mattdowle and @jangorecki for vis on essentially a design decision.
There was a problem hiding this comment.
While a different argument could have made sense for this as e.g. peak=TRUE, I guess it would be pretty annoying for users who rely on this feature to change it now.
Treating n=0 differently, e.g. sampling 100 rows for nrows=0 and otherwise take min(100, nrows), seems like a good option to me and I'm happy to alter the PR in this direction.
There was a problem hiding this comment.
Yes agree with nrows=0 running full sample. I altered as you both suggested. Now it just needs a test looking at verbose output to make sure it's doing that, will do ...
Closes #4029
Follow up to #4694