Conversation
Codecov Report
@@ Coverage Diff @@
## master #2623 +/- ##
==========================================
+ Coverage 92.94% 92.95% +0.01%
==========================================
Files 61 61
Lines 12109 12130 +21
==========================================
+ Hits 11255 11276 +21
Misses 854 854
Continue to review full report at Codecov.
|
st-pasha
approved these changes
Feb 13, 2018
Contributor
st-pasha
left a comment
There was a problem hiding this comment.
Great changes, happy to see so many issues resolved.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #1267
Closes #2518
Closes #2515 (fixed previously, actually, but including its good test in this PR)
Closes #1671
Default for
skip=now"__auto__"rather than 0. To more correctly represent what actually happens. The__are so thatskip="auto"still looks for a line containing the string"auto". Whenskip=is used, this now determines the first row (either column names or data) and automatic detection of the first consistent line is now turned off (before, it used to still run).autostart=NAmoved to the end of the argument list as it's deprecated. That was about automatically searching up from within a consistent block whenskip=was pointing inside one of many tables. That was removed a while back in dev.nrow=limit can no longer be 0. Must be>=1. If 1, then 1 row is used for type sampling. Before, the sampling proceeded on the whole file regardless ofnrow=. What I had in mind there is consistency of column types when user is extracting batches from a valid file. However, invalid files are more common and causing more pain, which is more often whyskip=andnrow=is used. Invalid lines outside this range no longer cause errors; i.e. works as user expects.If a too-few or too-many field line occurs, the result is returned up to that line with a detailed warning suggesting
fill=TRUE. If that occurs in-sample, that jump is skipped and the warning is left to after reading when we're sure previous jumps have processed correctly. Before, spurious invalid lines were tripping up sampling just because the jump point wasn't good. Sampling is simplified now.If
nrow=is supplied, multi-threading is now turned off. Since if an out-of-sample bump occurred in a jump after the jump which reachesnrow=, the bump from the later thread would still occur (and trigger reread) when that was not intended because the bump occurred after thenrow=was read. Accommodating that while keeping MT would require each thread to have its own copy oftype[], or maintain a stack of bumps to be applied in the ordered clause. Both would complicate the code. A copy oftype[]for each thread would bite memory usage in the case of 10,000 columns, too. Further, whennrow=is supplied, just the first jump is now sampled. Similarly, we don't want sampling problems or bumps after thenrow=row to affect the result.Using
nrows=also now turns off auto skip, i.e. skip is set to 0 and column names are expected on line 1 (since auto skip relies on testing 100 rows to find the biggest contiguous consistent set of rows). Ifnrow=is provided, it could be small (say 1), so for consistency, auto skip is then off.Sampling no longer attempts to find the
lastRowEnd. This relied on the last jump finding a goodnextGoodLine()and could be incorrect in edge cases. The data read step now always goes up toeofand checks for discarding footer afterwards, once all jumps have been successfully completed and therefore we're sure we're positioned correctly at the end.An out-of-sample type bump now checks that line has the correct number of fields, before applying the bump. Before, a too-few or too-many line was an error so this didn't matter, but now that it's a warning and the result is returned up to that point, bumps from the invalid line should not affect the result.