diff --git a/NEWS.md b/NEWS.md
index cfa5ec8160..537ed18966 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -3,6 +3,26 @@
 ### Changes in v1.10.5 ( in development )
+#### NOTICE OF INTENDED FUTURE POTENTIAL BREAKING CHANGES
+
+1. `fread()`'s `na.strings=` argument :
+   ```
+   "NA"                                      # old default
+   getOption("datatable.na.strings", "NA")   # this release; i.e. the same; no change yet
+   getOption("datatable.na.strings", "")     # future release
+   ```
+This option controls how `,,` is read in character columns. It does not affect numeric columns which read `,,` as `NA` regardless. We would like `,,`=>`NA` for consistency with numeric types, and `,"",`=>empty string to be the standard default for `fwrite/fread` character columns so that `fread(fwrite(DT))==DT` without needing any change to any parameters. `fwrite` has never written `NA` as `"NA"` in case `"NA"` is a valid string in the data; e.g., 2-character id columns sometimes contain it. Instead, `fwrite` has always written `,,` by default for an `<NA>` in a character column. The use of R's `getOption()` allows users to move forward now, using `options(datatable.na.strings="")`, or restore old behaviour when the default's default is changed in future, using `options(datatable.na.strings="NA")`.
+
+2. `fread()` and `fwrite()`'s `logical01=` argument :
+   ```
+   logical01 = FALSE                         # old default
+   getOption("datatable.logical01", FALSE)   # this release; i.e. the same; no change yet
+   getOption("datatable.logical01", TRUE)    # future release
+   ```
+This option controls whether a column of all 0's and 1's is read as `integer`, or `logical` directly to avoid needing to change the type afterwards to `logical` or use `colClasses`. `0/1` is smaller and faster than `"TRUE"/"FALSE"`, which can make a significant difference to space and time the more `logical` columns there are.
When the default's default changes to `TRUE` for `fread` we do not expect much impact since all arithmetic operators that are currently receiving 0's and 1's as type `integer` (think `sum()`) but instead could receive `logical`, would return exactly the same result on the 0's and 1's as `logical` type. However, code that is manipulating column types using `is.integer` or `is.logical` on `fread`'s result, could require change. It could be painful if `DT[(logical_column)]` (i.e. `DT[logical_column==TRUE]`) changed behaviour due to `logical_column` no longer being type `logical` but `integer`. But that is not the change proposed. The change is the other way around; i.e., a previously `integer` column holding only 0's and 1's would now be type `logical`. Since it's that way around, we believe the scope for breakage is limited. We think a lot of code is converting 0/1 integer columns to logical anyway, either using `colClasses=` or afterwards with an assign. For `fwrite`, the level of breakage depends on the consumer of the output file. We believe `0/1` is a better more standard default choice to move to. See notes below about improvements to `fread`'s sampling for type guessing, and automatic rereading in the rare cases of out-of-sample type surprises. + +These options are meant for temporary use to aid your migration, [#2652](https://github.com/Rdatatable/data.table/pull/2652). You are not meant to set them to the old default and then not migrate your code that is dependent on the default. Either set the argument explicitly so your code is not dependent on the default, or change the code to cope with the new default. Over the next few years we will slowly start to remove these options, warning you if you are using them, and return to a simple default. See the history of NEWS and NEWS.0 for past migrations that have, generally speaking, been successfully managed in this way. 
For example, at the end of NOTES for this version (below in this file) is a note about the usage of `datatable.old.unique.by.key` now warning, as you were warned it would do over a year ago. When that change was introduced, the default was changed and that option provided an option to restore the old behaviour. These `fread`/`fwrite` changes are even more cautious and not even changing the default's default yet. Giving you extra warning by way of this notice to move forward. And giving you a chance to object. + #### NEW FEATURES 1. `fread()`: @@ -17,25 +37,24 @@ * `dec=','` is now implemented directly so there is no dependency on locale. The options `datatable.fread.dec.experiment` and `datatable.fread.dec.locale` have been removed. * `\\r\\r\\n` line endings are now handled such as produced by `base::download.file()` when it doubles up `\\r`. Other rare line endings (`\\r` and `\\n\\r`) are now more robust. * Mixed line endings are now handled; e.g. a file formed by concatenating a Unix file and a Windows file so that some lines end with `\\n` while others end with `\\r\\n`. - * Improved automatic detection of whether the first row is column names by comparing the types of the fields on the first and second row. + * Improved automatic detection of whether the first row is column names by comparing the types of the fields on the first row against the column types ascertained by the 10,000 rows sample (or `colClasses` if provided). If a numeric column has a string value at the top, then column names are deemed present. * Detects GB-18030 and UTF-16 encodings and in verbose mode prints a message about BOM detection. * Detects and ignores trailing ^Z end-of-file control character sometimes created on MS DOS/Windows, [#1612](https://github.com/Rdatatable/data.table/issues/1612). Thanks to Gergely Daróczi for reporting and providing a file. - * Added option `logical01` to read a column of only `0`s and `1`s as `logical`, default `TRUE` for convenience in most cases. 
The large sample of rows throughout the file means that `fread` will be confident that the column really does just contain `0`s and `1`s, enabling and encouraging this convenient and efficient choice to save needing conversion afterwards or setting `colClasses` manually. In R, `logical` is `integer` anyway and can be treated as such in calculations. Further, it is no longer allowed to have mixed-case literals within a single column; i.e., a column of `TRUE/FALSE`s is ok, as well as `True/False`s and `true/false`s, but mixing different styles together is not. * Added ability to recognize and parse hexadecimal floating point numbers, as used for example in Java. Thanks for @scottstanfield [#2316](https://github.com/Rdatatable/data.table/issues/2316) for the report. * Now handles floating-point NaN values in a wide variety of formats, including `NaN`, `sNaN`, `1.#QNAN`, `NaN1234`, `#NUM!` and others, [#1800](https://github.com/Rdatatable/data.table/issues/1800). Thanks to Jori Liesenborgs for highlighting and the PR. * If negative numbers are passed to `select=` the out-of-range error now suggests `drop=` instead, [#2423](https://github.com/Rdatatable/data.table/issues/2423). Thanks to Michael Chirico for the suggestion. - * `sep=NULL` or `sep=""` (i.e., no column separator) can now be used to specify single column input reliably like `base::readLines`, [#1616](https://github.com/Rdatatable/data.table/issues/1616). `sep='\\n'` still works (even on Windows where line ending is actually `\\r\\n`) but `NULL` or `""` are now documented and recommended. Thanks to Dmitriy Selivanov for the pull request and many others for comments. As before, `sep=NA` is not valid; use the default `"auto"` for automatic separator detection. `sep='\\n'` may be deprecated in future. - * Single-column input with blank lines is now valid and the blank lines are significant (meaning an NA in the single column). 
The blank lines are significant even at the very end, which may be surprising on first glance. The change is so that `fread(fwrite(DT))==DT` for single-column inputs containing NA which are written as blank. There is no change when `ncol>1` (i.e., input stops with detailed warning at the first blank line) because a blank line when `ncol>1` is invalid input due to no separators present instead of `ncol-1` separators. + * `sep=NULL` or `sep=""` (i.e., no column separator) can now be used to specify single column input reliably like `base::readLines`, [#1616](https://github.com/Rdatatable/data.table/issues/1616). `sep='\\n'` still works (even on Windows where line ending is actually `\\r\\n`) but `NULL` or `""` are now documented and recommended. Thanks to Dmitriy Selivanov for the pull request and many others for comments. As before, `sep=NA` is not valid; use the default `"auto"` for automatic separator detection. `sep='\\n'` is now deprecated and in future will start to warn when used. + * Single-column input with blank lines is now valid and the blank lines are significant (representing `NA`). The blank lines are significant even at the very end, which may be surprising on first glance. The change is so that `fread(fwrite(DT))==DT` for single-column inputs containing `NA` which are written as blank. There is no change when `ncol>1`; i.e., input stops with detailed warning at the first blank line, because a blank line when `ncol>1` is invalid input due to no separators being present. Thanks to @skanskan, Michael Chirico, @franknarf1 and Pasha for the testing and discussions, [#2106](https://github.com/Rdatatable/data.table/issues/2106). * Too few column names are now auto filled with default column names, with warning, [#1625](https://github.com/Rdatatable/data.table/issues/1625). If there is just one missing column name it is guessed to be for the first column (row names or an index), otherwise the column names are filled at the end. 
Similarly, too many column names now automatically sets `fill=TRUE`, with warning. - * `skip=` and `nrow=` are more reliable and no longer affected by invalid lines outside the range specified. Thanks to Ziyad Saeed and Kyle Chung for reporting, [#1267](https://github.com/Rdatatable/data.table/issues/1267). Tests added. + * `skip=` and `nrow=` are more reliable and are no longer affected by invalid lines outside the range specified. Thanks to Ziyad Saeed and Kyle Chung for reporting, [#1267](https://github.com/Rdatatable/data.table/issues/1267). * Ram disk (`/dev/shm`) is no longer used for the output of system command input. Although faster when it worked, it was causing too many device full errors; e.g., [#1139](https://github.com/Rdatatable/data.table/issues/1139) and [zUMIs/19](https://github.com/sdparekh/zUMIs/issues/19). Thanks to Kyle Chung for reporting. Standard `tempdir()` is now used. If you wish to use ram disk, set TEMPDIR to `/dev/shm`; see `?tempdir`. * Detecting whether a very long input string is a file name or data is now much faster, [#2531](https://github.com/Rdatatable/data.table/issues/2531). Many thanks to @javrucebo for the detailed report, benchmarks and suggestions. + * A column of `TRUE/FALSE`s is ok, as well as `True/False`s and `true/false`s, but mixing styles (e.g. `TRUE/false`) is not and will be read as type `character`. 
     * Many thanks to @yaakovfeldman, Guillermo Ponce, Arun Srinivasan, Hugh Parsonage, Mark Klik, Pasha Stetsenko, Mahyar K, Tom Crockett, @cnoelke, @qinjs, @etienne-s, Mark Danese, Avraham Adler, @franknarf1, @MichaelChirico, @tdhock, Luke Tierney for testing dev and reporting these regressions before release to CRAN: #2070, #2073, #2087, #2091, #2107, #2118, #2092, #1888, #2123, #2167, #2194, #2238, #2228, #1464, #2201, #2287, #2299, #2285, #2251, #2347, #2222, #2352, #2246, #2370, #2371, #2404, #2196, #2322, #2453, #2446, #2464, #2457, #1895, #2481, #2499, #2516, #2520, #2512, #2523, #2542, #2526, #2518, #2515, #1671, #2267, #2561, #2625, #2265, #2548, #2535
 2. `fwrite()`:
     * empty strings are now always quoted (`,"",`) to distinguish them from `NA` which by default is still empty (`,,`) but can be changed using `na=` as before. If `na=` is provided and `quote=` is the default `'auto'` then `quote=` is set to `TRUE` so that if the `na=` value occurs in the data, it can be distinguished from `NA`. Thanks to Ethan Welty for the request [#2214](https://github.com/Rdatatable/data.table/issues/2214) and Pasha for the code change and tests, [#2215](https://github.com/Rdatatable/data.table/issues/2215).
-    * `logicalAsInt` has been renamed `logical01` and the default changed from `FALSE` to `TRUE`, both changes for consistency with `fread` (see item above). The old name `logicalAsInt` continues to work but is now deprecated. The previous default can easily be restored without any code changes by setting `options("datatable.logical01" = FALSE)`.
-    * When `DT` is a single column, `na=` is now set to `"NA"` to avoid blank lines in the output, [#2106](https://github.com/Rdatatable/data.table/issues/2106). Thanks to @skanskan, Michael Chirico and @franknarf1 for the testing and ideas.
+    * `logical01` has been added and the old name `logicalAsInt` retained. Please move to the new name when convenient for you.
The old argument name (`logicalAsInt`) will slowly be deprecated over the next few years. The default is unchanged: `FALSE`, so `logical` is still written as `"TRUE"`/`"FALSE"` in full by default. We intend to change the default's default in future to `TRUE`; see the notice at the top of these release notes.
 3. Added helpful message when subsetting by a logical column without wrapping it in parentheses, [#1844](https://github.com/Rdatatable/data.table/issues/1844). Thanks @dracodoc for the suggestion and @MichaelChirico for the PR.
@@ -158,6 +177,8 @@ Thanks to @sritchie73 for reporting and fixing [PR#2631](https://github.com/Rdat
 35. `CJ()` now fails with proper error message when results would exceed max integer, [#2636](https://github.com/Rdatatable/data.table/issues/2636).
+36. `NA` in character columns now display as `<NA>` just like base R to distinguish from `""` and `"NA"`.
+
 #### NOTES
 0. The license has been changed from GPL to MPL (Mozilla Public License). All contributors were consulted and approved. [PR#2456](https://github.com/Rdatatable/data.table/pull/2456) details the reasons for the change.
diff --git a/R/fread.R b/R/fread.R index 92ad3a1000..f6e7d774e2 100644 --- a/R/fread.R +++ b/R/fread.R @@ -1,5 +1,5 @@ -fread <- function(input="",file,sep="auto",sep2="auto",dec=".",quote="\"",nrows=Inf,header="auto",na.strings="NA",stringsAsFactors=FALSE,verbose=getOption("datatable.verbose"),skip="__auto__",select=NULL,drop=NULL,colClasses=NULL,integer64=getOption("datatable.integer64"), col.names, check.names=FALSE, encoding="unknown", strip.white=TRUE, fill=FALSE, blank.lines.skip=FALSE, key=NULL, showProgress=interactive(),data.table=getOption("datatable.fread.datatable"),nThread=getDTthreads(),logical01=TRUE,autostart=NA) +fread <- function(input="",file,sep="auto",sep2="auto",dec=".",quote="\"",nrows=Inf,header="auto",na.strings=getOption("datatable.na.strings","NA"),stringsAsFactors=FALSE,verbose=getOption("datatable.verbose",FALSE),skip="__auto__",select=NULL,drop=NULL,colClasses=NULL,integer64=getOption("datatable.integer64","integer64"), col.names, check.names=FALSE, encoding="unknown", strip.white=TRUE, fill=FALSE, blank.lines.skip=FALSE, key=NULL, showProgress=interactive(), data.table=getOption("datatable.fread.datatable",TRUE), nThread=getDTthreads(), logical01=getOption("datatable.logical01", FALSE), autostart=NA) { if (is.null(sep)) sep="\n" # C level knows that \n means \r\n on Windows, for example else { diff --git a/R/fwrite.R b/R/fwrite.R index 02b59cc63f..914c23c2cd 100644 --- a/R/fwrite.R +++ b/R/fwrite.R @@ -1,8 +1,8 @@ fwrite <- function(x, file="", append=FALSE, quote="auto", sep=",", sep2=c("","|",""), eol=if (.Platform$OS.type=="windows") "\r\n" else "\n", - na=if (length(x)>1L) "" else "NA", dec=".", row.names=FALSE, col.names=TRUE, + na="", dec=".", row.names=FALSE, col.names=TRUE, qmethod=c("double","escape"), - logical01=getOption("datatable.logical01", TRUE), + logical01=getOption("datatable.logical01", FALSE), # due to change to TRUE; see NEWS logicalAsInt=logical01, dateTimeAs = c("ISO","squash","epoch","write.csv"), buffMB=8, 
 nThread=getDTthreads(),
diff --git a/R/onLoad.R b/R/onLoad.R
index 6690b604e5..d7ffc1c5da 100644
--- a/R/onLoad.R
+++ b/R/onLoad.R
@@ -29,7 +29,9 @@
     lockBinding("rbind.data.frame",baseenv())
   }
   # Set options for the speed boost in v1.8.0 by avoiding 'default' arg of getOption(,default=)
-  # TODO: submit improvement to .Internal(getOption(x)) in base::getOption to return NULL when option not set, to avoid (relatively slow) 'x %in% names(options())' there.
+  # In fread and fwrite we have moved back to using getOption's default argument since it is unlikely fread and fwrite will be called in a loop many times, plus they
+  # are relatively heavy functions where the overhead in getOption() would not be noticed. It's only really [.data.table where the cost of getOption's default argument was noticeable.
+  # Improvement to base::getOption() now submitted (100x; 5s down to 0.05s): https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17394
   opts = c("datatable.verbose"="FALSE", # datatable.
            "datatable.nomatch"="NA_integer_", # datatable.
            "datatable.optimize"="Inf", # datatable.
@@ -43,13 +45,10 @@
            "datatable.dfdispatchwarn"="TRUE", # not a function argument
            "datatable.warnredundantby"="TRUE", # not a function argument
            "datatable.alloccol"="1024L", # argument 'n' of alloc.col. Over-allocate 1024 spare column slots
-           "datatable.integer64"="'integer64'", # datatable. integer64|double|character
            "datatable.auto.index"="TRUE", # DT[col=="val"] to auto add index so 2nd time faster
            "datatable.use.index"="TRUE", # global switch to address #1422
-           "datatable.fread.datatable"="TRUE",
            "datatable.prettyprint.char" = NULL, # FR #1091
-           "datatable.old.unique.by.key" = "FALSE", # TODO: change warnings in duplicated.R to error on or after Jan 2019 then remove in Jan 2020.
-           "datatable.logical01" = "TRUE" # fwrite/fread to revert to FALSE. TODO: warn in next release and remove after 1 year
+           "datatable.old.unique.by.key" = "FALSE" # TODO: change warnings in duplicated.R to error on or after Jan 2019 then remove in Jan 2020.
            )
   for (i in setdiff(names(opts),names(options()))) {
     eval(parse(text=paste("options(",i,"=",opts[i],")",sep="")))
diff --git a/R/print.data.table.R b/R/print.data.table.R
index ae6a8249b5..ae9791e491 100644
--- a/R/print.data.table.R
+++ b/R/print.data.table.R
@@ -58,7 +58,7 @@ print.data.table <- function(x, topn=getOption("datatable.print.topn"),
     rn = seq_len(nrow(x))
     printdots = FALSE
   }
-  toprint=format.data.table(toprint, ...)
+  toprint=format.data.table(toprint, na.encode=FALSE, ...) # na.encode=FALSE so that NA in character cols print as <NA>
   if ((!"bit64" %chin% loadedNamespaces()) && any(sapply(x,inherits,"integer64"))) require_bit64()
   # When we depend on R 3.2.0 (Apr 2015) we can use isNamespaceLoaded() added then, instead of %chin% above
diff --git a/R/utils.R b/R/utils.R
index b14579b172..51d53e77b7 100644
--- a/R/utils.R
+++ b/R/utils.R
@@ -69,3 +69,5 @@ vapply_1i <- function (x, fun, ..., use.names = TRUE) {
   vapply(X = x, FUN = fun, ..., FUN.VALUE = NA_integer_, USE.NAMES = use.names)
 }
+more = function(f) system(paste("more",f)) # nocov (just a dev helper)
+
diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw
index 04862db9c8..6c722b7d8c 100644
--- a/inst/tests/tests.Rraw
+++ b/inst/tests/tests.Rraw
@@ -2455,7 +2455,7 @@ if (test_bit64) {
 # getwd() has been set by test.data.table() to the location of this tests.Rraw file. Test files should be in the same directory.
f = testDir("ch11b.dat") # http://www.stats.ox.ac.uk/pub/datasets/csb/ch11b.dat test(900.1, fread(f, logical01=FALSE), as.data.table(read.table(f))) -test(900.2, fread(f), as.data.table(read.table(f))[,V5:=as.logical(V5)]) +test(900.2, fread(f, logical01=TRUE), as.data.table(read.table(f))[,V5:=as.logical(V5)]) f = testDir("1206FUT.txt") # a CRLF line ending file (DOS) test(901.1, DT<-fread(f,strip.white=FALSE), setDT(read.table(f,sep="\t",header=TRUE,colClasses=as.vector(sapply(DT,class))))) @@ -2759,14 +2759,14 @@ test(1009, DT[,list(mean(a), sum(a)),by=b], data.table(b=c(1,2),V1=c(NA,0),V2=c( # an fread error shouldn't hold a lock on the file on Windows # TODO: now that these are warnings and not errors, we need another way to trigger a STOP() inside fread.c. options(warn=2) isn't enough. cat('A,B\n1,2\n3\n5,6\n', file=(f<-tempfile())) -test(1010.1, fread(f), ans<-data.table(A=TRUE, B=2L), warning=(txt<-"Stopped early on line 3.*Expected 2 fields but found 1.*fill.*TRUE.*<<3>>")) -test(1010.2, fread(f), ans, warning=txt) +test(1010.1, fread(f,logical01=TRUE), ans<-data.table(A=TRUE, B=2L), warning=(txt<-"Stopped early on line 3.*Expected 2 fields but found 1.*fill.*TRUE.*<<3>>")) +test(1010.2, fread(f,logical01=TRUE), ans, warning=txt) cat('7\n8,9',file=f,append=TRUE) # that append works after error test(1010.3, fread(f,fill=TRUE), data.table(A=INT(1,3,5,7,8), B=INT(2,NA,6,NA,9))) -test(1010.4, fread(f), ans, warning=txt) +test(1010.4, fread(f,logical01=TRUE), ans, warning=txt) cat('A,B\n1,2\n3\n5,6\n', file=f) # that overwrite works after error test(1010.5, fread(f,fill=TRUE), data.table(A=INT(1,3,5), B=INT(2,NA,6))) -test(1010.6, fread(f), ans, warning=txt) +test(1010.6, fread(f,logical01=TRUE), ans, warning=txt) unlink(f) # that file can be removed after error test(1010.7, !file.exists(f)) @@ -3052,7 +3052,8 @@ b = rbind(a, data.table(z=2,x=1)) test(1080, b$z, c(1,2,3,2)) # mid row logical detection -test(1081, fread("A,B,C\n1,T,2\n"), 
data.table(A=TRUE,B="T",C=2L)) +test(1081.1, fread("A,B,C\n1,T,2\n",logical01=TRUE), data.table(A=TRUE,B="T",C=2L)) +test(1081.2, fread("A,B,C\n1,T,2\n",logical01=FALSE), data.table(A=1L,B="T",C=2L)) # cartesian join answer's key should contain only the columns considered in binary search. Fixes #2677 set.seed(45) @@ -9072,6 +9073,9 @@ test(1658.28, fwrite(data.table(a=1)[NULL,]), error="ncol(x) > 0L is not TRUE") # 0.0 written as 0, but TODO #2398, probably related to the 2 lines after l==0 missing coverage in writeFloat64 test(1658.29, fwrite(data.table(id=c("A","B","C"), v=c(1.1,0.0,9.9))), output="id,v\nA,1.1\nB,0\nC,9.9") +# logical NA as "NA" when logical01=TRUE, instead of the default na="" which writes all types including in character column as ,, consistently. +test(1658.30, fwrite(data.table(id=1:3,bool=c(TRUE,NA,FALSE)),na="NA",logical01=TRUE), output="\"id\",\"bool\"\n1,1\n2,NA\n3,0") + ## End fwrite tests # tests for #679, inrange(), FR #707 @@ -9646,9 +9650,9 @@ test(1729.2, fwrite(data.table(V2=c(9.999999999999998223643160599749535322189331 DT = data.table(V1=c(9999999999.99, 0.00000000000000099, 0.0000000000000000000009, 0.9, 9.0, 9.1, 99.9, 0.000000000000000000000999999999999999999999999, 99999999999999999999999999999.999999)) -ans = "\"V1\"\n9999999999.99\n9.9e-16\n9e-22\n0.9\n9\n9.1\n99.9\n1e-21\n1e+29" +ans = "V1\n9999999999.99\n9.9e-16\n9e-22\n0.9\n9\n9.1\n99.9\n1e-21\n1e+29" test(1729.3, fwrite(DT), output=ans) -test(1729.4, write.csv(DT,row.names=FALSE), output=ans) +test(1729.4, write.csv(DT,row.names=FALSE,quote=FALSE), output=ans) options(oldverbose) # same decimal/scientific rule (shortest format) as write.csv @@ -9720,7 +9724,7 @@ DT = data.table(unlist(.Machine[c("double.eps","double.neg.eps","double.xmin","d # double.eps double.neg.eps double.xmin double.xmax # 2.220446e-16 1.110223e-16 2.225074e-308 1.797693e+308 test(1729.13, typeof(DT[[1L]]), "double") -test(1729.14, capture.output(fwrite(DT)), 
capture.output(write.csv(DT,row.names=FALSE))) +test(1729.14, capture.output(fwrite(DT)), capture.output(write.csv(DT,row.names=FALSE,quote=FALSE))) if (test_bit64) { test(1730.1, typeof(-2147483647L), "integer") @@ -9842,8 +9846,8 @@ test(1736.6, capture.output(fwrite(DT, sep='|', sep2=c("{",",","}"), logicalAsIn c("A|B|C", "1|{1,2,3,4,5,6,7,8,9,10}|{s,t,u,v,w}", "2|{15,16,17,18}|{1.2,2.3,3.4,3.14159265358979,-9}", "3|{7}|{foo,bar}", "4|{9,10}|{1,1,0}")) DT = data.table(A=c("foo","ba|r","baz")) -test(1736.7, capture.output(fwrite(DT,na="")), c("A","foo","ba|r","baz")) # no list column so no need to quote -test(1736.8, capture.output(fwrite(DT)), c("\"A\"","\"foo\"","\"ba|r\"","\"baz\"")) # column name is quoted because na="NA" due to 1-column +test(1736.7, capture.output(fwrite(DT,na="")), c("A","foo","ba|r","baz")) # no list column so no need to quote +test(1736.8, capture.output(fwrite(DT)), c("A","foo","ba|r","baz")) DT = data.table(A=c("foo","ba|r","baz"), B=list(1:3,1:4,c("fo|o","ba,r","baz"))) # now list column and need to quote test(1736.9, capture.output(fwrite(DT)), c("A,B", "foo,1|2|3", "\"ba|r\",1|2|3|4", "baz,\"fo|o\"|\"ba,r\"|baz")) test(1736.11, capture.output(fwrite(DT,quote=TRUE)), c("\"A\",\"B\"", "\"foo\",1|2|3", "\"ba|r\",1|2|3|4", "\"baz\",\"fo|o\"|\"ba,r\"|\"baz\"")) @@ -9857,7 +9861,7 @@ test(1737.5, fwrite(list(1.2,B=c("foo","bar"))), error="Column 2's length (2) is # fwrite ITime, Date, IDate DT = data.table(A=as.ITime(c("23:59:58","23:59:59","12:00:00","00:00:01",NA,"00:00:00"))) -test(1738.1, capture.output(fwrite(DT)), c("\"A\"","23:59:58","23:59:59","12:00:00","00:00:01","NA","00:00:00")) +test(1738.1, capture.output(fwrite(DT)), c("A","23:59:58","23:59:59","12:00:00","00:00:01","","00:00:00")) test(1738.2, capture.output(fwrite(DT,na="")), capture.output(write.csv(DT,row.names=FALSE,quote=FALSE, na=""))) dts = c("1901-05-17","1907-10-22","1929-10-24","1962-05-28","1987-10-19","2008-09-15", 
"1968-12-30","1968-12-31","1969-01-01","1969-01-02") @@ -10227,9 +10231,13 @@ if (test_bit64) { } # end Grouping Sets -# for completeness, added test for NA problem to close #1837. Fixed long ago before release to CRAN. -test(1751.1, capture.output(fwrite(data.table(x=NA_integer_),verbose=FALSE)), c("\"x\"","NA")) -test(1751.2, capture.output(fwrite(data.table(x=NA_integer_),na="",verbose=FALSE)), c("x","")) +# for completeness, added test for NA problem to close #1837. +DT = data.table(x=NA) +test(1751.1, capture.output(fwrite(DT)), c("x","")) +test(1751.2, capture.output(fwrite(DT,na="")), c("x","")) +test(1751.3, capture.output(fwrite(DT,na="NA")), c("\"x\"","NA")) +test(1751.4, fread({fwrite(DT, f<-tempfile());f}), DT) # the important thing +unlink(f) if (test_nanotime) { DT = data.table(A=nanotime(tt<-c("2016-09-28T15:30:00.000000070Z", @@ -10249,8 +10257,8 @@ if (test_nanotime) { # check too many fields error from ,\n line ending highlighted in #2044 test(1753.1, fread("X,Y\n1,2\n3,4\n5,6"), data.table(X=INT(1,3,5),Y=INT(2,4,6))) -test(1753.2, fread("X,Y\n1,2\n3,4,\n5,6"), ans<-data.table(X=TRUE,Y=2L), warning="Stopped.*line 3. Expected 2 fields but found 3.*discarded.*<<3,4,>>") -test(1753.3, fread("X,Y\n1,2\n3,4,7\n5,6"), ans, warning="Stopped.*line 3. Expected 2 fields but found 3.*discarded.*<<3,4,7>>") +test(1753.2, fread("X,Y\n1,2\n3,4,\n5,6",logical01=TRUE), ans<-data.table(X=TRUE,Y=2L), warning="Stopped.*line 3. Expected 2 fields but found 3.*discarded.*<<3,4,>>") +test(1753.3, fread("X,Y\n1,2\n3,4,7\n5,6",logical01=TRUE), ans, warning="Stopped.*line 3. Expected 2 fields but found 3.*discarded.*<<3,4,7>>") # issue 2051 where a quoted field contains ", New quote rule detection handles it. 
test(1753.4, fread(testDir("issue_2051.csv"))[2,grep("^Our.*tool$",COLUMN50)], 1L) @@ -10270,7 +10278,7 @@ test(1753.4, fread(testDir("issue_2051.csv"))[2,grep("^Our.*tool$",COLUMN50)], 1 test(1754, fread(testDir("allchar.csv"))[c(1,2,17575,17576),col2], c("AAN","BAN","YZZ","ZZZ")) # unescaped embedded quotes from here: http://stackoverflow.com/questions/42939866/fread-multiple-separators-in-a-string -test(1755, fread(testDir("unescaped.csv")), +test(1755, fread(testDir("unescaped.csv"), logical01=TRUE), data.table(No =c(FALSE,TRUE), Comment=c('he said:"wonderful."', 'The problem is: reading table, and also "a problem, yes." keep going on.'), Type =c('A','A'))) @@ -10700,7 +10708,7 @@ test(1808.2, fread("A,B\r1,2\r3,4\r"), data.table(A=c(1L,3L),B=c(2L,4L))) cat("A,B\r1,2\r3,4",file=f<-tempfile()) test(1808.3, fread(f), data.table(A=c(1L,3L),B=c(2L,4L))) unlink(f) -test(1808.4, fread("A,B\r1,3\r\r\r2,4\r"), data.table(A=TRUE, B=3L), warning="Discarded single-line footer: <<2,4>>") +test(1808.4, fread("A,B\r1,3\r\r\r2,4\r", logical01=TRUE), data.table(A=TRUE, B=3L), warning="Discarded single-line footer: <<2,4>>") test(1808.5, fread("A,B\r4,3\r\r \r2,4\r"), data.table(A=4L, B=3L), warning="Discarded single-line footer: <<2,4>>") test(1808.6, fread("A,B\r1,3\r\r \r2,4\r", blank.lines.skip=TRUE), data.table(A=1:2, B=3:4)) test(1808.7, fread("A,B\r1,3\r\r \r2,4\r", fill=TRUE), data.table(A=c(1L,NA,NA,2L), B=c(3L,NA,NA,4L))) @@ -10870,8 +10878,8 @@ test(1832.2, any(grepl("^Column writers.* [.][.][.] 
", capture.output(fwrite(DT, unlink(f) # ensure explicitly setting select to default value doesn't error, #2007 -test(1833, fread('V1,V2\n1,2', select = NULL), - data.table(V1 = TRUE, V2 = 2L)) +test(1833, fread('V1,V2\n5,6', select = NULL), + data.table(V1 = 5L, V2 = 6L)) # fread file for which nextGoodLine in sampling struggles, #2404 # Every line in the first 100 has a quoted field containing sep, so the quote rule was being @@ -10908,13 +10916,14 @@ test(1838, fread("default payment next month\n0.5524\n0.2483\n0.1157\n"), data.t # better writing and reading of NA in single column input, #2106 DT = data.table(a=c(4,NA,2,3.14,999,NA)) -fwrite(DT, f<-tempfile(), na="") # old default for na was always "" +fwrite(DT, f<-tempfile(), na="") test(1839.1, fread(f), data.table(a=c(4,NA,2,3.14,999,NA))) test(1839.2, fread(f, blank.lines.skip=TRUE), data.table(a=c(4,2,3.14,999))) test(1839.3, fread(f, fill=TRUE), data.table(a=c(4,NA,2,3.14,999,NA))) test(1839.4, fread(f, fill=TRUE, blank.lines.skip=TRUE), data.table(a=c(4,2,3.14,999))) -fwrite(DT, f) # new default sets na="NA" when ncol==1 -test(1839.5, fread(f), DT) +fwrite(DT, f, na="NA") # base R does not do this though, it writes ,, for NAs in numeric columns (as does fwrite) +test(1839.5, fread(f, na.strings=""), data.table(a=c("4","NA","2","3.14","999","NA"))) +test(1839.6, fread(f, na.strings="NA"), DT) # TOOD: auto handle (unusual, even as written by R) "NA" in numeric columns unlink(f) lines = c("DECLARATION OF INDEPENDENCE", @@ -11419,11 +11428,8 @@ test(1879.5, DT[0:5], DT) DT = data.table(A=rep("True", 2200), B="FALSE", C='0') DT[111, LETTERS[1:3] := .("fread", "is", "faithful")] fwrite(DT, f<-tempfile()) -test(1879.6, fread(f, verbose=TRUE), DT, - output=paste("Column 1.*bumped from 'bool8' to 'string'", - "Column 2.*bumped from 'bool8' to 'string'", - "Column 3.*bumped from 'bool8' to 'string'", - sep = '.*')) +test(1879.6, fread(f, verbose=TRUE, logical01=TRUE), DT, + output="Column 1.*bumped from 'bool8' to 
'string'.*\nColumn 2.*bumped from 'bool8' to 'string'.*\nColumn 3.*bumped from 'bool8' to 'string'") unlink(f) # Fix duplicated names arising in merge when by.x in names(y), PR#2631, PR#2653 @@ -11470,7 +11476,8 @@ test(1884, fread('"A","B"\n', sep=NULL), data.table('"A","B"'=logical())) # sep=' ' and blank.lines.skip, #2535 test(1885.1, fread(txt<-"a b 2\nc d 3\n\ne f 4\n", blank.lines.skip=TRUE), ans<-data.table(V1=c("a","c","e"), V2=c("b","d","f"), V3=2:4)) test(1885.2, fread(txt, blank.lines.skip=TRUE, fill=TRUE), ans) -test(1885.3, fread(txt, fill=TRUE), ans[c(1,2,NA,3),][3,1:2:=""]) # TODO when blank strings are filled as NA rather than "", this test will then fail and the := can be removed +test(1885.3, fread(txt, fill=TRUE), ans[c(1,2,NA,3),][3,1:2:=""]) +test(1885.4, fread(txt, fill=TRUE, na.strings=""), ans[c(1,2,NA,3),]) # file detected as no header automatically # (TOOD: undoubling double quotes #1109, #1299) but otherwise, auto mode correct @@ -11495,6 +11502,27 @@ test(1890.1, stats::ts.plot(gpars=DT), error="object must have one or more obser # Inside ts.plot is a gpars$ylab<- which happens before its error. 
That dispatches to our $<- which does the alloc.col()
 test(1890.2, DT, data.table(A=1:5))
+# na="" default, #2524
+test(1891.1, fread('A,B,C\n1,foo,4\n2,,5\n3,bar,6\n', na.strings=""), data.table(A=1:3, B=c("foo",NA,"bar"), C=4:6))
+test(1891.2, fread('A,B,C\n1,foo,4\n2,"",5\n3,bar,6\n', na.strings=""), data.table(A=1:3, B=c("foo","","bar"), C=4:6))
+test(1891.3, fread("A,B,C\n1,foo,bar\n2", fill=TRUE, na.strings=""), data.table(A=1:2,B=c("foo",NA),C=c("bar",NA)))
+test(1891.4, fread("A,B,C\n1,foo,bar\n2", fill=TRUE, na.strings="NA"), data.table(A=1:2,B=c("foo",""),C=c("bar","")))
+
+# preserving "" and NA_character_, #2214
+DT = data.table(chr = c(NA, "", "a"), num = c(NA, NA, 2L))
+test(1892.1, fread({fwrite(DT,f<-tempfile());f}, na.strings=""), DT); unlink(f)
+test(1892.2, capture.output(fwrite(DT)), c("chr,num", ",", "\"\"," , "a,2"))
+test(1892.3, fread('A,B\n1,"foo"\n2,\n3,""\n', na.strings="")$B, c("foo", NA, "")) # for issue #2217
+
+# print(DT) should print NA in character columns using <NA> like base R to distinguish from "" and "NA"
+DT = data.table(A=1:4, B=c("FOO","",NA,"NA"))
+test(1893.1, print(DT), output=txt<-c(" A B", "1: 1 FOO", "2: 2 ", "3: 3 <NA>", "4: 4 NA"))
+DF = as.data.frame(DT)
+rownames(DF) = paste0(rownames(DF),":")
+test(1893.2, print(DF), output=txt)
+txt = 'A,B\n109,MT\n7,N\n11,NA\n41,NB\n60,ND\n1,""\n2,\n3,"NA"\n4,NA\n'
+test(1893.3, print(fread(txt,na.strings="")), output="A B\n1: 109 MT\n2: 7 N\n3: 11 NA\n4: 41 NB\n5: 60 ND\n6: 1 \n7: 2 <NA>\n8: 3 NA\n9: 4 NA")
+
 ###################################
 # Add new tests above this line #
 ###################################
diff --git a/man/fread.Rd b/man/fread.Rd
index 9610141df5..e6fff9bda0 100644
--- a/man/fread.Rd
+++ b/man/fread.Rd
@@ -10,16 +10,19 @@
 }
 \usage{
 fread(input, file, sep="auto", sep2="auto", dec=".", quote="\"",
-nrows=Inf, header="auto", na.strings="NA",
-stringsAsFactors=FALSE, verbose=getOption("datatable.verbose"),
+nrows=Inf, header="auto",
+na.strings=getOption("datatable.na.strings","NA"), # due to change to ""; see NEWS
+stringsAsFactors=FALSE, verbose=getOption("datatable.verbose", FALSE),
 skip="__auto__", select=NULL, drop=NULL, colClasses=NULL,
-integer64=getOption("datatable.integer64"), # default: "integer64"
+integer64=getOption("datatable.integer64", "integer64"),
 col.names, check.names=FALSE, encoding="unknown",
 strip.white=TRUE, fill=FALSE, blank.lines.skip=FALSE,
 key=NULL, showProgress=interactive(),
-data.table=getOption("datatable.fread.datatable"),
-nThread=getDTthreads(), logical01=TRUE, autostart=NA
+data.table=getOption("datatable.fread.datatable", TRUE),
+nThread=getDTthreads(),
+logical01=getOption("datatable.logical01", FALSE), # due to change to TRUE; see NEWS
+autostart=NA
 )
 }
 \arguments{
@@ -28,7 +31,7 @@ nThread=getDTthreads(), logical01=TRUE, autostart=NA
   \item{sep2}{ The separator \emph{within} columns. A \code{list} column will be returned where each cell is a vector of values. This is much faster using less working memory than \code{strsplit} afterwards or similar techniques. For each column \code{sep2} can be different and is the first character in the same set above [\code{,\\t |;}], other than \code{sep}, that exists inside each field outside quoted regions in the sample. NB: \code{sep2} is not yet implemented. }
   \item{nrows}{ The maximum number of rows to read. Unlike \code{read.table}, you do not need to set this to an estimate of the number of rows in the file for better speed because that is already automatically determined by \code{fread} almost instantly using the large sample of lines. `nrows=0` returns the column names and typed empty columns determined by the large sample; useful for a dry run of a large file or to quickly check format consistency of a set of files before starting to read any of them. }
   \item{header}{ Does the first data line contain column names? Defaults according to whether every non-empty field on the first data line is type character.
If so, or TRUE is supplied, any empty column names are given a default name. }
-  \item{na.strings}{ A character vector of strings which are to be interpreted as \code{NA} values. By default \code{",,"} for columns read as type character is read as a blank string (\code{""}) and \code{",NA,"} is read as \code{NA}. Typical alternatives might be \code{na.strings=NULL} (no coercion to NA at all!) or perhaps \code{na.strings=c("NA","N/A","null")}. }
+  \item{na.strings}{ A character vector of strings which are to be interpreted as \code{NA} values. With \code{na.strings=""} (the intended future default; see NEWS), \code{,,} in columns of all types, including type \code{character}, is read as \code{NA} for consistency. \code{,"",} is unambiguous and read as an empty string. To read \code{,NA,} as \code{NA}, set \code{na.strings="NA"}. To read \code{,,} as blank string \code{""}, set \code{na.strings=NULL}. When they occur in the file, the strings in \code{na.strings} should not appear quoted since that is how the string literal \code{,"NA",} is distinguished from \code{,NA,}, for example, when \code{na.strings="NA"}. }
   \item{file}{ File path, useful when we want to ensure that no shell commands will be executed. File path can also be provided to \code{input} argument. }
   \item{stringsAsFactors}{ Convert all character columns to factors? }
   \item{verbose}{ Be chatty and report timings? }
diff --git a/man/fwrite.Rd b/man/fwrite.Rd
index 229264cdf8..3fa8f7b38d 100644
--- a/man/fwrite.Rd
+++ b/man/fwrite.Rd
@@ -10,10 +10,9 @@ This is new functionality as of Nov 2016.
We may need to refine argument names a
 fwrite(x, file = "", append = FALSE, quote = "auto",
   sep = ",", sep2 = c("","|",""),
   eol = if (.Platform$OS.type=="windows") "\r\n" else "\n",
-  na = if (length(x)>1L) "" else "NA", dec = ".",
-  row.names = FALSE, col.names = TRUE,
+  na = "", dec = ".", row.names = FALSE, col.names = TRUE,
   qmethod = c("double","escape"),
-  logical01 = getOption("datatable.logical01", TRUE),
+  logical01 = getOption("datatable.logical01", FALSE), # due to change to TRUE; see NEWS
   logicalAsInt = logical01, # deprecated
   dateTimeAs = c("ISO","squash","epoch","write.csv"),
   buffMB = 8L, nThread = getDTthreads(),
diff --git a/src/fread.c b/src/fread.c
index 6014834265..364977f7d1 100644
--- a/src/fread.c
+++ b/src/fread.c
@@ -272,7 +272,7 @@ static inline bool end_of_field(const char *ch) {
 static inline const char *end_NA_string(const char *fieldStart) {
   const char* const* nastr = NAstrings;
   const char *mostConsumed = fieldStart; // tests 1550* includes both 'na' and 'nan' in nastrings. Don't stop after 'na' if 'nan' can be consumed too.
-  while (*nastr) {
+  if (nastr) while (*nastr) {
     const char *ch1 = fieldStart;
     const char *ch2 = *nastr;
     while (*ch1==*ch2 && *ch2!='\0') { ch1++; ch2++; }
@@ -918,7 +918,7 @@ static void parse_double_hexadecimal(FieldParseContext *ctx)
 }


-/* Parse numbers 0 | 1 as boolean. */
+/* Parse numbers 0 | 1 as boolean and ,, as NA (fwrite's default) */
 static void parse_bool_numeric(FieldParseContext *ctx)
 {
   const char *ch = *(ctx->ch);
@@ -932,7 +932,7 @@ static void parse_bool_numeric(FieldParseContext *ctx)
   }
 }

-/* Parse uppercase TRUE | FALSE as boolean.
*/
+/* Parse uppercase TRUE | FALSE | NA as boolean (as written by default by R's write.csv) */
 static void parse_bool_uppercase(FieldParseContext *ctx)
 {
   const char *ch = *(ctx->ch);
@@ -943,6 +943,10 @@ static void parse_bool_uppercase(FieldParseContext *ctx)
   } else if (ch[0]=='F' && ch[1]=='A' && ch[2]=='L' && ch[3]=='S' && ch[4]=='E') {
     *target = 0;
     *(ctx->ch) = ch + 5;
+  } else if (ch[0]=='N' && ch[1]=='A') {
+    // the default in R's write.csv
+    *target = NA_BOOL8;
+    *(ctx->ch) = ch + 2;
   } else {
     *target = NA_BOOL8;
   }
@@ -1095,6 +1099,7 @@ int freadMain(freadMainArgs _args) {

   int64_t nrowLimit = args.nrowLimit;
   NAstrings = args.NAstrings;
+  if (NAstrings==NULL) STOP("Internal error: NAstrings is itself NULL. When empty it should be pointer to NULL.");
   any_number_like_NAstrings = false;
   blank_is_a_NAstring = false;
   // if we know there are no nastrings which are numbers (like -999999) then in the number
@@ -1104,6 +1109,8 @@ int freadMain(freadMainArgs _args) {
   while (*nastr) {
     if (**nastr == '\0') {
       blank_is_a_NAstring = true;
+      // if blank is the only one, as is the default, clear NAstrings so that doesn't have to be checked
+      if (nastr==NAstrings && nastr[1]==NULL) NAstrings=NULL;
       nastr++;
       continue;
     }