fread(file, nrows=0) file with header does not determine types by ben-schwen · Pull Request #5253 · Rdatatable/data.table

ben-schwen · 2021-11-06T01:17:59Z

Follow up to #4694

codecov · 2021-11-06T01:33:58Z

Codecov Report

Merging #5253 (39f60b3) into master (45e7da8) will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #5253   +/-   ##
=======================================
  Coverage   99.50%   99.50%           
=======================================
  Files          77       77           
  Lines       14603    14603           
=======================================
  Hits        14531    14531           
  Misses         72       72

Impacted Files	Coverage Δ
src/fread.c	`99.40% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 45e7da8...39f60b3. Read the comment docs.

MichaelChirico · 2021-11-07T03:25:57Z

  const char *firstJumpEnd=NULL; // remember where the winning jumpline from jump 0 ends, to know its size excluding header
  const char *prevStart = NULL;  // the start of the non-empty line before the first not-ignored row (for warning message later, or taking as column names)
-  int jumpLines = (int)umin(100,nrowLimit);   // how many lines from each jump point to use. If nrowLimit is supplied, nJumps is later set to 1 as well.
+  int jumpLines = (int)umin(100, umax(1,nrowLimit));   // how many lines from each jump point to use. If nrowLimit is supplied, nJumps is later set to 1 as well. // sample at least 1 line for files with header and nrows==0L #4029


Would it make sense to remove nrowLimit from here entirely? I have to imagine the first (or any random) row will give the incorrect column type a decent % of the time.

Scanning 100 rows to guess column types should be pretty fast (it makes sense to test)...

This was the default behavior pre #2623. It seems that it was user requested to only read nrows for sampling if nrows argument is supplied #1267. Bumping from 0 to 1 seems pretty safe to me (because I suspect files to have at least 1-2 "good" lines) while bumping from 0 to 100 might introduce unwanted behavior for those users?

Changing to int jumpLines = 100 already breaks example from #1267

fread("1,2,3\n1,2", nrows=1) # > fread("1,2,3\n1,2", nrows=1) # 1 2 3 # <int> <int> <lgcl> # 1: 1 2 NA # Warning message: # In fread("1,2,3\n1,2", nrows = 1) : # Detected 3 column names but the data has 2 columns. Filling rows automatically. Set fill=TRUE explicitly to avoid this warning. fread("1,2,3\n1,2", nrows=1, header=FALSE) # > fread("1,2,3\n1,2", nrows=1, header=FALSE) # V1 V2 # <int> <int> # 1: 1 2

While current fix gives

fread("1,2,3\n1,2", nrows=1) # > fread("1,2,3\n1,2", nrows=1) # V1 V2 V3 # <int> <int> <int> # 1: 1 2 3

I see, point taken. It really seems like a clash of purpose then -- fread(., nrows=0) is essentially asking for a different behavior vs. fread(., nrows=1..100).

Maybe it makes sense to split that into a different argument? Or otherwise treat n=0 differently than n=1..100 -- we check up to 100 rows for n=0 (because the user is trying to get a quick view of their data) but only n rows for nrows=n?

Seems somewhat confusing. Happy to approve this PR as is though.

cc @mattdowle and @jangorecki for vis on essentially a design decision.

While a different argument could have made sense for this as e.g. peak=TRUE, I guess it would be pretty annoying for users who rely on this feature to change it now.

Treating n=0 differently, e.g. sampling 100 rows for nrows=0 and otherwise take min(100, nrows), seems like a good option to me and I'm happy to alter the PR in this direction.

Yes agree with nrows=0 running full sample. I altered as you both suggested. Now it just needs a test looking at verbose output to make sure it's doing that, will do ...

ben-schwen added 2 commits November 6, 2021 02:16

fix #4029

1902f59

NEWS

9b37b09

MichaelChirico reviewed Nov 7, 2021

View reviewed changes

MichaelChirico approved these changes Nov 7, 2021

View reviewed changes

mattdowle added 2 commits November 16, 2021 06:13

merge master

2286645

nrows=0 uses full sample

7eb9717

mattdowle added this to the 1.14.3 milestone Nov 16, 2021

mattdowle added 2 commits November 17, 2021 18:01

add output= tests of sampling when nrows=0

aa3781f

tweak comment on test

39f60b3

mattdowle merged commit 96860f2 into master Nov 18, 2021

mattdowle deleted the fread_nrows=0L_detectTypes branch November 18, 2021 02:18

ben-schwen mentioned this pull request Jan 15, 2022

fread internal error for nrow=0 on malformed csv #4355

Closed

jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023

This was referenced Dec 31, 2023

fread(nrows = 0) for a particular file enters infinite loop in dev #5868

Closed

fix fread(..., nrows=0) infinite loop #5869

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fread(file, nrows=0) file with header does not determine types#5253

fread(file, nrows=0) file with header does not determine types#5253
mattdowle merged 6 commits intomasterfrom
fread_nrows=0L_detectTypes

ben-schwen commented Nov 6, 2021

Uh oh!

codecov bot commented Nov 6, 2021 •

edited

Loading

Uh oh!

MichaelChirico Nov 7, 2021

Uh oh!

ben-schwen Nov 7, 2021

Uh oh!

MichaelChirico Nov 7, 2021 •

edited

Loading

Uh oh!

ben-schwen Nov 8, 2021

Uh oh!

mattdowle Nov 16, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

ben-schwen commented Nov 6, 2021

Uh oh!

codecov bot commented Nov 6, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

MichaelChirico Nov 7, 2021

Choose a reason for hiding this comment

Uh oh!

ben-schwen Nov 7, 2021

Choose a reason for hiding this comment

Uh oh!

MichaelChirico Nov 7, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ben-schwen Nov 8, 2021

Choose a reason for hiding this comment

Uh oh!

mattdowle Nov 16, 2021

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov bot commented Nov 6, 2021 •

edited

Loading

MichaelChirico Nov 7, 2021 •

edited

Loading