From 30573726da34438ced856ec8b9bc3466e6967176 Mon Sep 17 00:00:00 2001 From: Matt Dowle Date: Fri, 2 Mar 2018 15:05:42 -0800 Subject: [PATCH 01/14] ,, now read as NA not empty string --- NEWS.md | 5 +- R/fread.R | 2 +- R/fwrite.R | 2 +- R/utils.R | 2 + inst/tests/tests.Rraw | 110 ++++++++++++++++++++++-------------------- man/fread.Rd | 2 +- man/fwrite.Rd | 3 +- src/fread.c | 13 +++-- 8 files changed, 77 insertions(+), 62 deletions(-) diff --git a/NEWS.md b/NEWS.md index 5b9d92507c..2900243b47 100644 --- a/NEWS.md +++ b/NEWS.md @@ -25,7 +25,7 @@ * Now handles floating-point NaN values in a wide variety of formats, including `NaN`, `sNaN`, `1.#QNAN`, `NaN1234`, `#NUM!` and others, [#1800](https://github.com/Rdatatable/data.table/issues/1800). Thanks to Jori Liesenborgs for highlighting and the PR. * If negative numbers are passed to `select=` the out-of-range error now suggests `drop=` instead, [#2423](https://github.com/Rdatatable/data.table/issues/2423). Thanks to Michael Chirico for the suggestion. * `sep=NULL` or `sep=""` (i.e., no column separator) can now be used to specify single column input reliably like `base::readLines`, [#1616](https://github.com/Rdatatable/data.table/issues/1616). `sep='\\n'` still works (even on Windows where line ending is actually `\\r\\n`) but `NULL` or `""` are now documented and recommended. Thanks to Dmitriy Selivanov for the pull request and many others for comments. As before, `sep=NA` is not valid; use the default `"auto"` for automatic separator detection. `sep='\\n'` may be deprecated in future. - * Single-column input with blank lines is now valid and the blank lines are significant (meaning an NA in the single column). The blank lines are significant even at the very end, which may be surprising on first glance. The change is so that `fread(fwrite(DT))==DT` for single-column inputs containing NA which are written as blank. 
There is no change when `ncol>1` (i.e., input stops with detailed warning at the first blank line) because a blank line when `ncol>1` is invalid input due to no separators present instead of `ncol-1` separators. + * Single-column input with blank lines is now valid and the blank lines are significant (meaning an NA in the single column). The blank lines are significant even at the very end, which may be surprising on first glance. The change is so that `fread(fwrite(DT))==DT` for single-column inputs containing NA which are written as blank. There is no change when `ncol>1` (i.e., input stops with detailed warning at the first blank line) because a blank line when `ncol>1` is invalid input due to no separators present instead of `ncol-1` separators. Thanks to @skanskan, Michael Chirico, @franknarf1 and Pasha for the testing and discussions, [#2106](https://github.com/Rdatatable/data.table/issues/2106). * Too few column names are now auto filled with default column names, with warning, [#1625](https://github.com/Rdatatable/data.table/issues/1625). If there is just one missing column name it is guessed to be for the first column (row names or an index), otherwise the column names are filled at the end. Similarly, too many column names now automatically sets `fill=TRUE`, with warning. * `skip=` and `nrow=` are more reliable and no longer affected by invalid lines outside the range specified. Thanks to Ziyad Saeed and Kyle Chung for reporting, [#1267](https://github.com/Rdatatable/data.table/issues/1267). Tests added. * Ram disk (`/dev/shm`) is no longer used for the output of system command input. Although faster when it worked, it was causing too many device full errors; e.g., [#1139](https://github.com/Rdatatable/data.table/issues/1139) and [zUMIs/19](https://github.com/sdparekh/zUMIs/issues/19). Thanks to Kyle Chung for reporting. Standard `tempdir()` is now used. If you wish to use ram disk, set TEMPDIR to `/dev/shm`; see `?tempdir`. @@ -34,8 +34,7 @@ 2. 
`fwrite()`: * empty strings are now always quoted (`,"",`) to distinguish them from `NA` which by default is still empty (`,,`) but can be changed using `na=` as before. If `na=` is provided and `quote=` is the default `'auto'` then `quote=` is set to `TRUE` so that if the `na=` value occurs in the data, it can be distinguished from `NA`. Thanks to Ethan Welty for the request [#2214](https://github.com/Rdatatable/data.table/issues/2214) and Pasha for the code change and tests, [#2215](https://github.com/Rdatatable/data.table/issues/2215). - * `logicalAsInt` has been renamed `logical01` and the default changed from `FALSE` to `TRUE`, both changes for consistency with `fread` (see item above). The old name `logicalAsInt` continues to work but is now deprecated. The previous default can easily be restored without any code changes by setting `options("datatable.logical01" = FALSE)`. - * When `DT` is a single column, `na=` is now set to `"NA"` to avoid blank lines in the output, [#2106](https://github.com/Rdatatable/data.table/issues/2106). Thanks to @skanskan, Michael Chirico and @franknarf1 for the testing and ideas. + * `logicalAsInt` has been renamed `logical01` and the default changed from `FALSE` to `TRUE`, both changes for consistency with `fread` (see item above). The old name `logicalAsInt` continues to work but is now deprecated. The previous default can easily be restored (to enable you to postpone changing your code) by setting `options("datatable.logical01" = FALSE)`. 3. Added helpful message when subsetting by a logical column without wrapping it in parentheses, [#1844](https://github.com/Rdatatable/data.table/issues/1844). Thanks @dracodoc for the suggestion and @MichaelChirico for the PR. 
diff --git a/R/fread.R b/R/fread.R index 92ad3a1000..98eafe0314 100644 --- a/R/fread.R +++ b/R/fread.R @@ -1,5 +1,5 @@ -fread <- function(input="",file,sep="auto",sep2="auto",dec=".",quote="\"",nrows=Inf,header="auto",na.strings="NA",stringsAsFactors=FALSE,verbose=getOption("datatable.verbose"),skip="__auto__",select=NULL,drop=NULL,colClasses=NULL,integer64=getOption("datatable.integer64"), col.names, check.names=FALSE, encoding="unknown", strip.white=TRUE, fill=FALSE, blank.lines.skip=FALSE, key=NULL, showProgress=interactive(),data.table=getOption("datatable.fread.datatable"),nThread=getDTthreads(),logical01=TRUE,autostart=NA) +fread <- function(input="",file,sep="auto",sep2="auto",dec=".",quote="\"",nrows=Inf,header="auto",na.strings="",stringsAsFactors=FALSE,verbose=getOption("datatable.verbose"),skip="__auto__",select=NULL,drop=NULL,colClasses=NULL,integer64=getOption("datatable.integer64"), col.names, check.names=FALSE, encoding="unknown", strip.white=TRUE, fill=FALSE, blank.lines.skip=FALSE, key=NULL, showProgress=interactive(),data.table=getOption("datatable.fread.datatable"),nThread=getDTthreads(),logical01=TRUE,autostart=NA) { if (is.null(sep)) sep="\n" # C level knows that \n means \r\n on Windows, for example else { diff --git a/R/fwrite.R b/R/fwrite.R index 02b59cc63f..3373709e10 100644 --- a/R/fwrite.R +++ b/R/fwrite.R @@ -1,6 +1,6 @@ fwrite <- function(x, file="", append=FALSE, quote="auto", sep=",", sep2=c("","|",""), eol=if (.Platform$OS.type=="windows") "\r\n" else "\n", - na=if (length(x)>1L) "" else "NA", dec=".", row.names=FALSE, col.names=TRUE, + na="", dec=".", row.names=FALSE, col.names=TRUE, qmethod=c("double","escape"), logical01=getOption("datatable.logical01", TRUE), logicalAsInt=logical01, diff --git a/R/utils.R b/R/utils.R index b14579b172..c2a5f232aa 100644 --- a/R/utils.R +++ b/R/utils.R @@ -69,3 +69,5 @@ vapply_1i <- function (x, fun, ..., use.names = TRUE) { vapply(X = x, FUN = fun, ..., FUN.VALUE = NA_integer_, USE.NAMES = 
use.names) } +more = function(f) system(paste("more",f)) + diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw index 28b236f203..3043c69123 100644 --- a/inst/tests/tests.Rraw +++ b/inst/tests/tests.Rraw @@ -2501,16 +2501,12 @@ DT[2,e:=+Inf] DT[3,e:=-Inf] DT[4,e:=NaN] # write.table writes NaN as NA, though, and all.equal considers NaN==NA. fread would read NaN as NaN if "NaN" was in file write.table(DT,f<-tempfile(),sep=",",row.names=FALSE,quote=FALSE) # na="NA" seems like a bad default for string columns here -test(880, fread(f), as.data.table(read.csv(f,stringsAsFactors=FALSE))) -test(881, fread(f), DT) -# test that columns are not coerced if nastring=NULL -DT[3,d:="NA"] -test(882, fread(f,na.strings=NULL)[['d']], DT[['d']]) -DT[3,d:=NA_character_] -unlink(f) -write.table(DT,f<-tempfile(),sep=",",row.names=FALSE,quote=TRUE) -test(883, fread(f), as.data.table(read.csv(f,stringsAsFactors=FALSE))) -test(884, fread(f), DT) +test(880, fread(f,na.strings="NA"), as.data.table(read.csv(f,stringsAsFactors=FALSE))) +write.table(DT,f,sep=",",row.names=FALSE,quote=FALSE,na="") +test(881, fread(f), as.data.table(read.csv(f,stringsAsFactors=FALSE,na.strings=""))) +fwrite(DT,f) +test(882, fread(f), DT) +test(883, fread(f, na.strings=NULL), DT[3,d:=""]) unlink(f) # Test short files. @@ -2544,7 +2540,7 @@ for (ne in seq_along(eols)) { cat(names(headDT),sep=",",file=f) # no \n at the end here for (i in seq_len(nr)) { cat(eol,file=f,append=TRUE) # on unix we simulate windows too. 
on windows \n will write \r\n (and \r\n will write \r\r\n) - write.table(headDT[i],file=f,quote=FALSE,sep=",",eol="",row.names=FALSE,col.names=FALSE,append=TRUE) + write.table(headDT[i],file=f,quote=TRUE,sep=",",eol="",row.names=FALSE,col.names=FALSE,na="",append=TRUE) # loop approach is to get no \n after last line } testIDtail = nr/100 + nc/1000 + ne/10000 @@ -2593,13 +2589,13 @@ test(900.2, fread(f), as.data.table(read.table(f))[,V5:=as.logical(V5)]) f = testDir("1206FUT.txt") # a CRLF line ending file (DOS) test(901.1, DT<-fread(f,strip.white=FALSE), setDT(read.table(f,sep="\t",header=TRUE,colClasses=as.vector(sapply(DT,class))))) -test(901.2, DT<-fread(f), setDT(read.table(f,sep="\t",header=TRUE,colClasses=as.vector(sapply(DT,class)),strip.white=TRUE))) +test(901.2, DT<-fread(f), setDT(read.table(f,sep="\t",header=TRUE,colClasses=as.vector(sapply(DT,class)),strip.white=TRUE,na.strings=""))) # Test the coerce of column 23 to character on line 179 due to the 'A' for the first time. # As from v1.9.8 the columns are guessed better and there is no longer a warning. Test 899 tests the warning. 
# Columns 'Cancelled' and 'Diverted' seem boolean (so logical01=TRUE good default for those) but Month just happens to be all-Jan f = testDir("2008head.csv") -test(902, fread(f,logical01=FALSE), as.data.table(read.csv(f,stringsAsFactors=FALSE))) +test(902, fread(f,logical01=FALSE), as.data.table(read.csv(f,stringsAsFactors=FALSE,na.strings=""))) test(903, fread("A,B\n1,3,foo,5\n2,4,barbaz,6"), data.table(A=1:2, B=3:4, V3=c("foo","barbaz"), V4=5:6), warning="Detected 2 column names but.*4.*Added 2 extra default column names at the end") test(904, fread("A,B,C,D\n1,3,foo,5\n2,4,barbaz,6"), DT<-data.table(A=1:2,B=3:4,C=c("foo","barbaz"),D=5:6)) # ok @@ -2711,7 +2707,7 @@ test(945, DT[,b:=a+1], data.table(a=numeric(),b=numeric())) # fread blank column names get default names test(946, fread('A,B,,D\n1,3,foo,5\n2,4,bar,6\n'), data.table(A=1:2,B=3:4,c("foo","bar"),D=5:6)) -test(947, fread('0,2,,4\n1,3,foo,5\n2,4,bar,6\n'), data.table(0:2,2:4,c("","foo","bar"),4:6)) +test(947, fread('0,2,,4\n1,3,foo,5\n2,4,bar,6\n'), data.table(0:2,2:4,c(NA,"foo","bar"),4:6)) test(948, fread('A,B,C\nD,E,F\n',header=TRUE), data.table(A="D",B="E",C="F")) test(949, fread('A,B,\nD,E,F\n',header=TRUE), data.table(A="D",B="E",V3="F")) @@ -5189,12 +5185,12 @@ bla <- data.table(x=c(1,1,2,2), y=c(1,1,1,1)) test(1342, unique(bla)[, bla := 2L], data.table(x=c(1,2),y=1,bla=2L)) # blank and NA fields in logical columns -test(1343.1, fread("A,B\n1,TRUE\n2,\n3,False"), data.table(A=1:3, B=c("TRUE","","False"))) -test(1343.2, fread("A,B\n1,True\n2,\n3,false"), data.table(A=1:3, B=c("True","","false"))) +test(1343.1, fread("A,B\n1,TRUE\n2,\n3,False"), data.table(A=1:3, B=c("TRUE",NA,"False"))) +test(1343.2, fread("A,B\n1,True\n2,\n3,false"), data.table(A=1:3, B=c("True",NA,"false"))) test(1343.3, fread("A,B\n1,TRUE\n2,\n3,FALSE"), data.table(A=1:3, B=c(TRUE,NA,FALSE))) test(1343.4, fread("A,B\n1,True\n2,\n3,False"), data.table(A=1:3, B=c(TRUE,NA,FALSE))) test(1343.5, fread("A,B\n1,true\n2,\n3,false"), 
data.table(A=1:3, B=c(TRUE,NA,FALSE))) -test(1343.6, fread("A,B\n1,true\n2,NA\n3,"), data.table(A=1:3, B=c(TRUE,NA,NA))) +test(1343.6, fread("A,B\n1,true\n2,NA\n3,"), data.table(A=1:3, B=c("true","NA",NA))) test(1344.1, fread("A,B\n1,2\n0,3\n,1\n", logical01=FALSE), data.table(A=c(1L,0L,NA), B=c(2L,3L,1L))) test(1344.2, fread("A,B\n1,2\n0,3\n,1\n", logical01=TRUE), data.table(A=c(TRUE,FALSE,NA), B=c(2L,3L,1L))) @@ -6188,8 +6184,8 @@ if ("package:bit64" %in% search()) { test(1449, fread(testDir("quoted_multiline.csv"))[c(1,43:44),c(1,22:24),with=FALSE], data.table(GPMLHTLN=as.integer64(c("3308386085360","3440245203140","1305220146734")), BLYBZ = c(0L,4L,6L), - ZBJBLOAJAQI = c("LHCYS AYE ZLEMYA IFU HEI JG FEYE","",""), - JKCRUUBAVQ = c("",".\\YAPCNXJ\\004570_850034_757\\VWBZSS_848482_600874_487_PEKT-6-KQTVIL-7_30\\IRVQT\\HUZWLBSJYHZ\\XFWPXQ-WSPJHC-00-0770000855383.KKZ",""))) + ZBJBLOAJAQI = c("LHCYS AYE ZLEMYA IFU HEI JG FEYE",NA,NA), + JKCRUUBAVQ = c(NA,".\\YAPCNXJ\\004570_850034_757\\VWBZSS_848482_600874_487_PEKT-6-KQTVIL-7_30\\IRVQT\\HUZWLBSJYHZ\\XFWPXQ-WSPJHC-00-0770000855383.KKZ",NA))) } # Fix for #927 @@ -6755,7 +6751,7 @@ if ("package:bit64" %in% search()) { test(1500.3, fread(ll, na=NULL), data.table(V1=x, V2=y)) - x = c("12345678901234", rep("NA", 178), "0.5") + x = c("12345678901234", rep("", 178), "0.5") y = sample(letters, length(x), TRUE) ll = paste(x,y, sep=",", collapse="\n") test(1500.4, fread(ll), data.table(V1=suppressWarnings(as.numeric(x)), V2=y)) @@ -7354,7 +7350,7 @@ test(1551.4, fread(str), data.table(V1=c("2","\"\"foo"), V2=c("3","bar"))) str = 'L1\tsome\tunquoted\tstuff\nL2\tsome\t"half" quoted\tstuff\nL3\tthis\t"should work"\tok though' test(1551.5, fread(str), data.table(L1 = c("L2", "L3"), some = c("some", "this"), unquoted = c("\"half\" quoted", "should work"), stuff = c("stuff", "ok though"))) #1095 -rhs = read.table(testDir("issue_1095_fread.txt"), sep=",", comment.char="", stringsAsFactors=FALSE, quote="", strip.white=TRUE) +rhs = 
read.table(testDir("issue_1095_fread.txt"), sep=",", comment.char="", stringsAsFactors=FALSE, quote="", strip.white=TRUE, na.strings="") test(1551.6, fread(testDir("issue_1095_fread.txt"), logical01=FALSE), setDT(rhs)) # FR #1314 rest of na.strings issue @@ -7624,14 +7620,14 @@ X = fread("a|b|c|d this|is|row|2 this|NA|NA|3 this|is|row|4", stringsAsFactors = TRUE) -test(1577.1, is.na(X[3, b]), TRUE) -test(1577.2, levels(X$b), "is") +test(1577.1, X[3, b]=="NA", TRUE) +test(1577.2, sort(levels(X$b)), sort(c("NA","is"))) # locales could mean different sort orders, so use sort() twice to be sure X = fread("a|b|c|d this|NA|row|1 this|NA|row|2 this|NA|NA|3 this|NA|row|4", colClasses="character", stringsAsFactors = TRUE) -test(1577.3, levels(X$b), character(0)) +test(1577.3, levels(X$b), "NA") # FR #530, skip blank lines input = "Header not 2 columns\n\n1,3\n2,4" @@ -7792,7 +7788,7 @@ options(datatable.optimize=optim) # Fixed a minor bug in fread when blank.lines.skip=TRUE f1 <- function(x, f=TRUE, b=FALSE) fread(x, fill=f, blank.lines.skip=b, data.table=FALSE, logical01=FALSE) -f2 <- function(x, f=TRUE, b=FALSE) read.table(x, fill=f, blank.lines.skip=b, sep=",", header=TRUE, stringsAsFactors=FALSE) +f2 <- function(x, f=TRUE, b=FALSE) read.table(x, fill=f, blank.lines.skip=b, sep=",", header=TRUE, stringsAsFactors=FALSE, na.strings="") test(1584.1, f1(testDir("fread_blank.txt"), f=FALSE, b=TRUE), f2(testDir("fread_blank.txt"), f=FALSE, b=TRUE)) test(1584.2, f1(testDir("fread_blank2.txt"), f=FALSE, b=TRUE), f2(testDir("fread_blank2.txt"), f=FALSE, b=TRUE)) test(1584.3, f1(testDir("fread_blank3.txt"), f=FALSE, b=TRUE), f2(testDir("fread_blank3.txt"), f=FALSE, b=TRUE)) @@ -7962,7 +7958,9 @@ DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9) test(1605, DT[order(-x, "D")], error="Column 2 is length 1 which differs") # fix for #1503, fread's fill argument polishing -test(1606, fread("2\n1,a,b", fill=TRUE), data.table(V1=2:1, V2=c("","a"), V3=c("","b"))) 
+test(1606.1, fread("2\n1,a,b", fill=TRUE), data.table(V1=2:1, V2=c(NA,"a"), V3=c(NA,"b"))) +test(1606.2, fread("2\n1,a,b", fill=TRUE, na.strings=NULL), data.table(V1=2:1, V2=c("","a"), V3=c("","b"))) +test(1606.3, fread("2\n1,a,b", fill=TRUE, na.strings="NA"), data.table(V1=2:1, V2=c("","a"), V3=c("","b"))) # fix for #1476 dt = data.table(resp=c(1:5)) @@ -9492,7 +9490,7 @@ dt = data.table(x=1:2, y=c(NA,"a")) f = tempfile() test(1676.1, fwrite(dt, f, na=NULL), error="is not TRUE") fwrite(dt, f, na=NA) -test(1676.2, fread(f), data.table(x=1:2, y=c(NA, "a"))) +test(1676.2, fread(f), data.table(x=1:2, y=c("NA", "a"))) unlink(f) # duplicate names in foverlaps #1730 @@ -9907,9 +9905,9 @@ test(1729.2, fwrite(data.table(V2=c(9.999999999999998223643160599749535322189331 DT = data.table(V1=c(9999999999.99, 0.00000000000000099, 0.0000000000000000000009, 0.9, 9.0, 9.1, 99.9, 0.000000000000000000000999999999999999999999999, 99999999999999999999999999999.999999)) -ans = "\"V1\"\n9999999999.99\n9.9e-16\n9e-22\n0.9\n9\n9.1\n99.9\n1e-21\n1e+29" +ans = "V1\n9999999999.99\n9.9e-16\n9e-22\n0.9\n9\n9.1\n99.9\n1e-21\n1e+29" test(1729.3, fwrite(DT), output=ans) -test(1729.4, write.csv(DT,row.names=FALSE), output=ans) +test(1729.4, write.csv(DT,row.names=FALSE,quote=FALSE), output=ans) options(oldverbose) # same decimal/scientific rule (shortest format) as write.csv @@ -9981,7 +9979,7 @@ DT = data.table(unlist(.Machine[c("double.eps","double.neg.eps","double.xmin","d # double.eps double.neg.eps double.xmin double.xmax # 2.220446e-16 1.110223e-16 2.225074e-308 1.797693e+308 test(1729.13, typeof(DT[[1L]]), "double") -test(1729.14, capture.output(fwrite(DT)), capture.output(write.csv(DT,row.names=FALSE))) +test(1729.14, capture.output(fwrite(DT)), capture.output(write.csv(DT,row.names=FALSE,quote=FALSE))) if ("package:bit64" %in% search()) { test(1730.1, typeof(-2147483647L), "integer") @@ -10138,8 +10136,8 @@ test(1736.6, capture.output(fwrite(DT, sep='|', sep2=c("{",",","}"), logicalAsIn 
c("A|B|C", "1|{1,2,3,4,5,6,7,8,9,10}|{s,t,u,v,w}", "2|{15,16,17,18}|{1.2,2.3,3.4,3.14159265358979,-9}", "3|{7}|{foo,bar}", "4|{9,10}|{1,1,0}")) DT = data.table(A=c("foo","ba|r","baz")) -test(1736.7, capture.output(fwrite(DT,na="")), c("A","foo","ba|r","baz")) # no list column so no need to quote -test(1736.8, capture.output(fwrite(DT)), c("\"A\"","\"foo\"","\"ba|r\"","\"baz\"")) # column name is quoted because na="NA" due to 1-column +test(1736.7, capture.output(fwrite(DT,na="")), c("A","foo","ba|r","baz")) # no list column so no need to quote +test(1736.8, capture.output(fwrite(DT)), c("A","foo","ba|r","baz")) DT = data.table(A=c("foo","ba|r","baz"), B=list(1:3,1:4,c("fo|o","ba,r","baz"))) # now list column and need to quote test(1736.9, capture.output(fwrite(DT)), c("A,B", "foo,1|2|3", "\"ba|r\",1|2|3|4", "baz,\"fo|o\"|\"ba,r\"|baz")) test(1736.11, capture.output(fwrite(DT,quote=TRUE)), c("\"A\",\"B\"", "\"foo\",1|2|3", "\"ba|r\",1|2|3|4", "\"baz\",\"fo|o\"|\"ba,r\"|\"baz\"")) @@ -10153,7 +10151,7 @@ test(1737.5, fwrite(list(1.2,B=c("foo","bar"))), error="Column 2's length (2) is # fwrite ITime, Date, IDate DT = data.table(A=as.ITime(c("23:59:58","23:59:59","12:00:00","00:00:01",NA,"00:00:00"))) -test(1738.1, capture.output(fwrite(DT)), c("\"A\"","23:59:58","23:59:59","12:00:00","00:00:01","NA","00:00:00")) +test(1738.1, capture.output(fwrite(DT)), c("A","23:59:58","23:59:59","12:00:00","00:00:01","","00:00:00")) test(1738.2, capture.output(fwrite(DT,na="")), capture.output(write.csv(DT,row.names=FALSE,quote=FALSE, na=""))) dts = c("1901-05-17","1907-10-22","1929-10-24","1962-05-28","1987-10-19","2008-09-15", "1968-12-30","1968-12-31","1969-01-01","1969-01-02") @@ -10542,9 +10540,13 @@ if ("package:bit64" %in% search()) { } # end Grouping Sets -# for completeness, added test for NA problem to close #1837. Fixed long ago before release to CRAN. 
-test(1751.1, capture.output(fwrite(data.table(x=NA_integer_),verbose=FALSE)), c("\"x\"","NA")) -test(1751.2, capture.output(fwrite(data.table(x=NA_integer_),na="",verbose=FALSE)), c("x","")) +# for completeness, added test for NA problem to close #1837. +DT = data.table(x=NA) +test(1751.1, capture.output(fwrite(DT)), c("x","")) +test(1751.2, capture.output(fwrite(DT,na="")), c("x","")) +test(1751.3, capture.output(fwrite(DT,na="NA")), c("\"x\"","NA")) +test(1751.4, fread({fwrite(DT, f<-tempfile());f}), DT) # the important thing +unlink(f) if ("package:nanotime" %in% search()) { DT = data.table(A=nanotime(tt<-c("2016-09-28T15:30:00.000000070Z", @@ -10931,14 +10933,14 @@ test(1777.19, fread("A,B,C\nC,D,4\n", verbose=TRUE), data.table(A="C",B="D",C=4L # unquoted fields containing \r, #2371 test(1778.1, fread("A,B,C\n0,,\n1,hello\rworld,2\n3,test,4\n", verbose=TRUE), - DT <- data.table(A=c(0L,1L,3L), B=c("","hello\rworld","test"), C=c(NA,2L,4L)), + DT <- data.table(A=c(0L,1L,3L), B=c(NA,"hello\rworld","test"), C=c(NA,2L,4L)), output="has been found.*common and ideal") fwrite(DT, f<-tempfile()) -test(1778.2, readLines(f), c("A,B,C", "0,\"\",", "1,\"hello", "world\",2", "3,test,4")) +test(1778.2, readLines(f), c("A,B,C", "0,,", "1,\"hello", "world\",2", "3,test,4")) # fwrite quotes the field containing \r ........... ^^ ............ 
^^ # and that reading back in gets us back to DT faithfully test(1778.3, fread(f), DT) -tt = setDT(read.csv(f, stringsAsFactors=FALSE)) +tt = setDT(read.csv(f, stringsAsFactors=FALSE, na.strings="")) tt[2, B:=gsub("\n","\r",B)] # base R changes the \r to a \n, so restore that test(1778.4, tt, DT) unlink(f) @@ -11198,8 +11200,8 @@ test(1834.1, dim(DT<-fread(testDir("grr.csv"), header=FALSE)), INT(2839, 12)) test(1834.2, DT[c(1,2,.N-1,.N), c(1,2,11,12)], data.table(V1="AAAAAAAAAA", V2=c("AAAAAAAA","AAAAAAAAAA","AAAAAAAAAA","AAAAAAAAAA"), - V11=c("AAAAAAAAAAAAAAAA","","AAAAAAAAAA","AAAA"), - V12=c("AAAAAAAAAAAAA","","AAAAAAA","AAA"))) + V11=c("AAAAAAAAAAAAAAAA",NA,"AAAAAAAAAA","AAAA"), + V12=c("AAAAAAAAAAAAA",NA,"AAAAAAA","AAA"))) # Create a file to test a sample jump being skipped due to format error. It will fail later in the read step because # this is a real error. Currently have not constructed an error for which nextGoodLine looks good, but in fact is not. @@ -11225,13 +11227,14 @@ test(1838, fread("default payment next month\n0.5524\n0.2483\n0.1157\n"), data.t # better writing and reading of NA in single column input, #2106 DT = data.table(a=c(4,NA,2,3.14,999,NA)) -fwrite(DT, f<-tempfile(), na="") # old default for na was always "" +fwrite(DT, f<-tempfile(), na="") # default value of na anyway test(1839.1, fread(f), data.table(a=c(4,NA,2,3.14,999,NA))) test(1839.2, fread(f, blank.lines.skip=TRUE), data.table(a=c(4,2,3.14,999))) test(1839.3, fread(f, fill=TRUE), data.table(a=c(4,NA,2,3.14,999,NA))) test(1839.4, fread(f, fill=TRUE, blank.lines.skip=TRUE), data.table(a=c(4,2,3.14,999))) -fwrite(DT, f) # new default sets na="NA" when ncol==1 -test(1839.5, fread(f), DT) +fwrite(DT, f, na="NA") # base R does not do this though, it writes ,, for NAs in numeric columns (as does fwrite) +test(1839.5, fread(f), data.table(a=c("4","NA","2","3.14","999","NA"))) +test(1839.6, fread(f, na="NA"), DT) # TODO: auto handle (unusual, even as written by R) "NA" in numeric columns
unlink(f) lines = c("DECLARATION OF INDEPENDENCE", @@ -11242,7 +11245,7 @@ lines = c("DECLARATION OF INDEPENDENCE", "That to secure these rights, Governments are instituted among Men,", "deriving their just powers from the consent of the governed.") txt = paste(lines, collapse="\n") -test(1839.6, fread(txt, sep=""), data.table("DECLARATION OF INDEPENDENCE"=lines[-1])) # fread should eventually be able auto-detect sep="" +test(1839.7, fread(txt, sep=""), data.table("DECLARATION OF INDEPENDENCE"=lines[-1])[4,1:=NA]) # TODO fread should be able auto-detect sep="" here # readLines behaviour, #1616 txt = 'a,b\n ab,cd,ce\n abcdef\n hjkli \n' # now auto detected as ncol 1 anyway @@ -11558,12 +11561,12 @@ test(1869.6, fread(testDir("colnames4096.csv"), verbose=TRUE)[,c(1,2,585,586)], data.table(Foo000=logical(), Bar001=logical(), Foo584=logical(), B=logical()), output = "Copying file in RAM.*file is very unusual.*ends abruptly.*multiple of 4096") test(1869.7, fread(testDir("onecol4096.csv"), verbose=TRUE)[c(1,2,245,246,249,255:.N),], - data.table(A=c("FooBarBazQux000","FooBarBazQux001","","FooBarBazQux245","","FooBarBazQux254","FooBarBazQux","FooBarBaz12","FooBarBazQux256","","","")), + data.table(A=c("FooBarBazQux000","FooBarBazQux001",NA,"FooBarBazQux245",NA,"FooBarBazQux254","FooBarBazQux","FooBarBaz12","FooBarBazQux256",NA,NA,NA)), output = "Copying file in RAM.*file is very unusual.*one single column, ends with 2 or more end-of-line.*and is a multiple of 4096") # better colname detection by comparing potential column names to the whole sample not just the first row of the sample, #2526 -test(1870.1, fread("A,100,200\n,300,400\n,500,600"), data.table(V1=c("A","",""), V2=c(100L,300L,500L), V3=c(200L,400L,600L))) -test(1870.2, fread("A,100,\n,,\n,500,600"), data.table(V1=c("A","",""), V2=c(100L,NA,500L), V3=c(NA,NA,600L))) +test(1870.1, fread("A,100,200\n,300,400\n,500,600"), data.table(V1=c("A",NA,NA), V2=c(100L,300L,500L), V3=c(200L,400L,600L))) +test(1870.2, 
fread("A,100,\n,,\n,500,600"), data.table(V1=c("A",NA,NA), V2=c(100L,NA,500L), V3=c(NA,NA,600L))) test(1870.3, fread("A,B,\n,,\n,500,3.4"), data.table(A=NA, B=c(NA,500L), V3=c(NA,3.4))) # nrows= now ignores errors after those nrows as expected and skip= determines first row for sure, #1267 @@ -11660,7 +11663,7 @@ x = sprintf("ABCDEFGHIJKLMNOPQRST%06d", 1:102184) x[51094]="" cat(x, file=f<-tempfile(), sep="\n") test(1874.1, fread(f,header=FALSE,verbose=TRUE)[c(1,51094,.N),], - data.table(V1=c("ABCDEFGHIJKLMNOPQRST000001","","ABCDEFGHIJKLMNOPQRST102184")), + data.table(V1=c("ABCDEFGHIJKLMNOPQRST000001",NA,"ABCDEFGHIJKLMNOPQRST102184")), output="jumps=[0..2)") # ensure jump 1 happened # # out-of-sample short lines in the first jump, not near the jump point @@ -11774,7 +11777,7 @@ test(1882.3, CJ(v, v, v), error="Cross product of elements provided to CJ() woul # no re-read for particular file, #2509 test(1883, fread(testDir("SA2-by-DJZ.csv"), verbose=TRUE, header=FALSE)[c(1,2,1381,.N),], - data.table(V1=c("Goulburn","","",""), V2=c("110018063","110018064","0&&&&&&&&","0@@@@@@@@"), V3=INT(3499,812,250796,7305367), V4=NA), + data.table(V1=c("Goulburn",NA,NA,NA), V2=c("110018063","110018064","0&&&&&&&&","0@@@@@@@@"), V3=INT(3499,812,250796,7305367), V4=NA), warning='Stopped early on line 1394.*First discarded non-empty line: <<"Dataset: 2011 Census of Population and Housing">>', output="0.000s.*Rereading 0 columns") @@ -11784,7 +11787,12 @@ test(1884, fread('"A","B"\n', sep=NULL), data.table('"A","B"'=logical())) # sep=' ' and blank.lines.skip, #2535 test(1885.1, fread(txt<-"a b 2\nc d 3\n\ne f 4\n", blank.lines.skip=TRUE), ans<-data.table(V1=c("a","c","e"), V2=c("b","d","f"), V3=2:4)) test(1885.2, fread(txt, blank.lines.skip=TRUE, fill=TRUE), ans) -test(1885.3, fread(txt, fill=TRUE), ans[c(1,2,NA,3),][3,1:2:=""]) # TODO when blank strings are filled as NA rather than "", this test will then fail and the := can be removed +test(1885.3, fread(txt, fill=TRUE), 
ans[c(1,2,NA,3),]) + +# na="" default, #2524 +test(1886.1, fread('A,B,C\n1,foo,4\n2,,5\n3,bar,6\n'), data.table(A=1:3, B=c("foo",NA,"bar"), C=4:6)) +test(1886.2, fread('A,B,C\n1,foo,4\n2,"",5\n3,bar,6\n'), data.table(A=1:3, B=c("foo","","bar"), C=4:6)) +test(1886.3, fread("A,B,C\n1,foo,bar\n2", fill=TRUE), data.table(A=1:2,B=c("foo",NA),C=c("bar",NA))) ################################### diff --git a/man/fread.Rd b/man/fread.Rd index 3b45a4772f..234c6dbbb6 100644 --- a/man/fread.Rd +++ b/man/fread.Rd @@ -10,7 +10,7 @@ } \usage{ fread(input, file, sep="auto", sep2="auto", dec=".", quote="\"", -nrows=Inf, header="auto", na.strings="NA", +nrows=Inf, header="auto", na.strings="", stringsAsFactors=FALSE, verbose=getOption("datatable.verbose"), skip="__auto__", select=NULL, drop=NULL, colClasses=NULL, integer64=getOption("datatable.integer64"), # default: "integer64" diff --git a/man/fwrite.Rd b/man/fwrite.Rd index 229264cdf8..65da84e3a0 100644 --- a/man/fwrite.Rd +++ b/man/fwrite.Rd @@ -10,8 +10,7 @@ This is new functionality as of Nov 2016. We may need to refine argument names a fwrite(x, file = "", append = FALSE, quote = "auto", sep = ",", sep2 = c("","|",""), eol = if (.Platform$OS.type=="windows") "\r\n" else "\n", - na = if (length(x)>1L) "" else "NA", dec = ".", - row.names = FALSE, col.names = TRUE, + na = "", dec = ".", row.names = FALSE, col.names = TRUE, qmethod = c("double","escape"), logical01 = getOption("datatable.logical01", TRUE), logicalAsInt = logical01, # deprecated diff --git a/src/fread.c b/src/fread.c index 53c9fbfb67..17061b50c0 100644 --- a/src/fread.c +++ b/src/fread.c @@ -272,7 +272,7 @@ static inline bool end_of_field(const char *ch) { static inline const char *end_NA_string(const char *fieldStart) { const char* const* nastr = NAstrings; const char *mostConsumed = fieldStart; // tests 1550* includes both 'na' and 'nan' in nastrings. Don't stop after 'na' if 'nan' can be consumed too. 
- while (*nastr) { + if (nastr) while (*nastr) { const char *ch1 = fieldStart; const char *ch2 = *nastr; while (*ch1==*ch2 && *ch2!='\0') { ch1++; ch2++; } @@ -919,7 +919,7 @@ static void parse_double_hexadecimal(FieldParseContext *ctx) } -/* Parse numbers 0 | 1 as boolean. */ +/* Parse numbers 0 | 1 as boolean and ,, as NA (fwrite's default) */ static void parse_bool_numeric(FieldParseContext *ctx) { const char *ch = *(ctx->ch); @@ -933,7 +933,7 @@ static void parse_bool_numeric(FieldParseContext *ctx) } } -/* Parse uppercase TRUE | FALSE as boolean. */ +/* Parse uppercase TRUE | FALSE | NA as boolean (as written by default by R's write.csv) */ static void parse_bool_uppercase(FieldParseContext *ctx) { const char *ch = *(ctx->ch); @@ -944,6 +944,10 @@ static void parse_bool_uppercase(FieldParseContext *ctx) } else if (ch[0]=='F' && ch[1]=='A' && ch[2]=='L' && ch[3]=='S' && ch[4]=='E') { *target = 0; *(ctx->ch) = ch + 5; + } else if (ch[0]=='N' && ch[1]=='A') { + // the default in R's write.csv + *target = NA_BOOL8; + *(ctx->ch) = ch + 2; } else { *target = NA_BOOL8; } @@ -1096,6 +1100,7 @@ int freadMain(freadMainArgs _args) { size_t nrowLimit = (size_t) args.nrowLimit; NAstrings = args.NAstrings; + if (NAstrings==NULL) STOP("Internal error: NAstrings is itself NULL.
When empty it should be pointer to NULL."); any_number_like_NAstrings = false; blank_is_a_NAstring = false; // if we know there are no nastrings which are numbers (like -999999) then in the number @@ -1105,6 +1110,8 @@ int freadMain(freadMainArgs _args) { while (*nastr) { if (**nastr == '\0') { blank_is_a_NAstring = true; + // if blank is the only one, as is the default, clear NAstrings so that doesn't have to be checked + if (nastr==NAstrings && nastr+1==NULL) NAstrings=NULL; nastr++; continue; } From d3fc8ede97396eaceb2c62376e2adf20285d40b6 Mon Sep 17 00:00:00 2001 From: Matt Dowle Date: Fri, 2 Mar 2018 15:41:49 -0800 Subject: [PATCH 02/14] nocov on dev util --- R/utils.R | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/R/utils.R b/R/utils.R index c2a5f232aa..51d53e77b7 100644 --- a/R/utils.R +++ b/R/utils.R @@ -69,5 +69,5 @@ vapply_1i <- function (x, fun, ..., use.names = TRUE) { vapply(X = x, FUN = fun, ..., FUN.VALUE = NA_integer_, USE.NAMES = use.names) } -more = function(f) system(paste("more",f)) +more = function(f) system(paste("more",f)) # nocov (just a dev helper) From 5f7b324be5d72cfee0eddb56f05fbaec2fde2c8a Mon Sep 17 00:00:00 2001 From: Matt Dowle Date: Fri, 2 Mar 2018 16:58:50 -0800 Subject: [PATCH 03/14] Added test for #2214 --- inst/tests/tests.Rraw | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw index 3043c69123..2ae2799c11 100644 --- a/inst/tests/tests.Rraw +++ b/inst/tests/tests.Rraw @@ -11794,6 +11794,11 @@ test(1886.1, fread('A,B,C\n1,foo,4\n2,,5\n3,bar,6\n'), data.table(A=1:3, B=c("fo test(1886.2, fread('A,B,C\n1,foo,4\n2,"",5\n3,bar,6\n'), data.table(A=1:3, B=c("foo","","bar"), C=4:6)) test(1886.3, fread("A,B,C\n1,foo,bar\n2", fill=TRUE), data.table(A=1:2,B=c("foo",NA),C=c("bar",NA))) +# preserving "" and NA_character_, #2214 +DT = data.table(chr = c(NA, "", "a"), num = c(NA, NA, 2L)) +test(1887.1, fread({fwrite(DT,f<-tempfile());f}), DT); unlink(f) +test(1887.2, 
capture.output(fwrite(DT)), c("chr,num", ",", "\"\"," , "a,2")) + ################################### # Add new tests above this line # From 09cadbeb26c97d5c391d263fff34ca6a0d68d4a6 Mon Sep 17 00:00:00 2001 From: Matt Dowle Date: Fri, 2 Mar 2018 17:22:14 -0800 Subject: [PATCH 04/14] Added test for #2217 --- inst/tests/tests.Rraw | 1 + 1 file changed, 1 insertion(+) diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw index 2ae2799c11..deffbbc68c 100644 --- a/inst/tests/tests.Rraw +++ b/inst/tests/tests.Rraw @@ -11798,6 +11798,7 @@ test(1886.3, fread("A,B,C\n1,foo,bar\n2", fill=TRUE), data.table(A=1:2,B=c("foo" DT = data.table(chr = c(NA, "", "a"), num = c(NA, NA, 2L)) test(1887.1, fread({fwrite(DT,f<-tempfile());f}), DT); unlink(f) test(1887.2, capture.output(fwrite(DT)), c("chr,num", ",", "\"\"," , "a,2")) +test(1887.3, fread('A,B\n1,"foo"\n2,\n3,""\n')$B, c("foo", NA, "")) # for issue #2217 ################################### From d58f73c65b095f6739d339ef5d68f6fd7baf37d8 Mon Sep 17 00:00:00 2001 From: Matt Dowle Date: Fri, 2 Mar 2018 18:57:03 -0800 Subject: [PATCH 05/14] Updated ?fread to resolve #2586 --- man/fread.Rd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/man/fread.Rd b/man/fread.Rd index 234c6dbbb6..810a1cd512 100644 --- a/man/fread.Rd +++ b/man/fread.Rd @@ -28,7 +28,7 @@ nThread=getDTthreads(), logical01=TRUE, autostart=NA \item{sep2}{ The separator \emph{within} columns. A \code{list} column will be returned where each cell is a vector of values. This is much faster using less working memory than \code{strsplit} afterwards or similar techniques. For each column \code{sep2} can be different and is the first character in the same set above [\code{,\\t |;}], other than \code{sep}, that exists inside each field outside quoted regions in the sample. NB: \code{sep2} is not yet implemented. } \item{nrows}{ The maximum number of rows to read. 
Unlike \code{read.table}, you do not need to set this to an estimate of the number of rows in the file for better speed because that is already automatically determined by \code{fread} almost instantly using the large sample of lines. `nrows=0` returns the column names and typed empty columns determined by the large sample; useful for a dry run of a large file or to quickly check format consistency of a set of files before starting to read any of them. } \item{header}{ Does the first data line contain column names? Defaults according to whether every non-empty field on the first data line is type character. If so, or TRUE is supplied, any empty column names are given a default name. } - \item{na.strings}{ A character vector of strings which are to be interpreted as \code{NA} values. By default \code{",,"} for columns read as type character is read as a blank string (\code{""}) and \code{",NA,"} is read as \code{NA}. Typical alternatives might be \code{na.strings=NULL} (no coercion to NA at all!) or perhaps \code{na.strings=c("NA","N/A","null")}. } + \item{na.strings}{ A character vector of strings which are to be interpreted as \code{NA} values. By default, \code{",,"} for columns of all types, including type `character` is read as \code{NA} for consistency. \code{,"",} is unambiguous and read as an empty string. To read \code{,NA,} as \code{NA}, set \code{na.strings="NA"}. To read \code{,,} as blank string \code{""}, set \code{na.strings=NULL}. When they occur in the file, the strings in \code{na.strings} should not appear quoted since that is how the string literal \code{,"NA",} is distinguished from \code{,NA,}, for example, when \code{na.strings="NA"}. } \item{file}{ File path, useful when we want to ensure that no shell commands will be executed. File path can also be provided to \code{input} argument. } \item{stringsAsFactors}{ Convert all character columns to factors? } \item{verbose}{ Be chatty and report timings? 
} From 6fbc9c1c91551c19621acf7f197f9b68e2869b58 Mon Sep 17 00:00:00 2001 From: Matt Dowle Date: Fri, 2 Mar 2018 23:21:40 -0800 Subject: [PATCH 06/14] print NA as <NA> --- NEWS.md | 2 ++ R/print.data.table.R | 2 +- inst/tests/tests.Rraw | 9 +++++++++ 3 files changed, 12 insertions(+), 1 deletion(-) diff --git a/NEWS.md b/NEWS.md index 2900243b47..ee976a0bc8 100644 --- a/NEWS.md +++ b/NEWS.md @@ -156,6 +156,8 @@ the behaviour of `base:::merge.data.frame()`. Thanks to @sritchie73 for reportin 35. `CJ()` now fails with proper error message when results would exceed max integer, [#2636](https://github.com/Rdatatable/data.table/issues/2636). +36. `NA` in character columns now display as `<NA>` just like base R to distinguish from `""` and `"NA"`. + #### NOTES 0. The license has been changed from GPL to MPL (Mozilla Public License). All contributors were consulted and approved. [PR#2456](https://github.com/Rdatatable/data.table/pull/2456) details the reasons for the change. diff --git a/R/print.data.table.R b/R/print.data.table.R index ae6a8249b5..ae9791e491 100644 --- a/R/print.data.table.R +++ b/R/print.data.table.R @@ -58,7 +58,7 @@ print.data.table <- function(x, topn=getOption("datatable.print.topn"), rn = seq_len(nrow(x)) printdots = FALSE } - toprint=format.data.table(toprint, ...) + toprint=format.data.table(toprint, na.encode=FALSE, ...) 
# na.encode=FALSE so that NA in character cols print as <NA> if ((!"bit64" %chin% loadedNamespaces()) && any(sapply(x,inherits,"integer64"))) require_bit64() # When we depend on R 3.2.0 (Apr 2015) we can use isNamespaceLoaded() added then, instead of %chin% above diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw index deffbbc68c..cd7e7cb726 100644 --- a/inst/tests/tests.Rraw +++ b/inst/tests/tests.Rraw @@ -11800,6 +11800,15 @@ test(1887.1, fread({fwrite(DT,f<-tempfile());f}), DT); unlink(f) test(1887.2, capture.output(fwrite(DT)), c("chr,num", ",", "\"\"," , "a,2")) test(1887.3, fread('A,B\n1,"foo"\n2,\n3,""\n')$B, c("foo", NA, "")) # for issue #2217 +# print(DT) should print NA in character columns using <NA> like base R to distinguish from "" and "NA" +DT = data.table(A=1:4, B=c("FOO","",NA,"NA")) +test(1888.1, print(DT), output=txt<-c(" A B", "1: 1 FOO", "2: 2 ", "3: 3 <NA>", "4: 4 NA")) +DF = as.data.frame(DT) +rownames(DF) = paste0(rownames(DF),":") +test(1888.2, print(DF), output=txt) +txt = 'A,B\n109,MT\n7,N\n11,NA\n41,NB\n60,ND\n1,""\n2,\n3,"NA"\n4,NA\n' +test(1888.3, print(fread(txt)), output="A B\n1: 109 MT\n2: 7 N\n3: 11 NA\n4: 41 NB\n5: 60 ND\n6: 1 \n7: 2 <NA>\n8: 3 NA\n9: 4 NA") + ################################### # Add new tests above this line # From 88674d2997df45ba7db8eb1dda89891edabee4d1 Mon Sep 17 00:00:00 2001 From: Matt Dowle Date: Mon, 5 Mar 2018 15:52:01 -0800 Subject: [PATCH 07/14] na.strings now getOption with no default change yet --- R/fread.R | 2 +- inst/tests/tests.Rraw | 84 ++++++++++++++++++++++--------------------- man/fread.Rd | 2 +- 3 files changed, 46 insertions(+), 42 deletions(-) diff --git a/R/fread.R b/R/fread.R index 98eafe0314..c619b2d25d 100644 --- a/R/fread.R +++ b/R/fread.R @@ -1,5 +1,5 @@ -fread <- 
function(input="",file,sep="auto",sep2="auto",dec=".",quote="\"",nrows=Inf,header="auto",na.strings="",stringsAsFactors=FALSE,verbose=getOption("datatable.verbose"),skip="__auto__",select=NULL,drop=NULL,colClasses=NULL,integer64=getOption("datatable.integer64"), col.names, check.names=FALSE, encoding="unknown", strip.white=TRUE, fill=FALSE, blank.lines.skip=FALSE, key=NULL, showProgress=interactive(),data.table=getOption("datatable.fread.datatable"),nThread=getDTthreads(),logical01=TRUE,autostart=NA) +fread <- function(input="",file,sep="auto",sep2="auto",dec=".",quote="\"",nrows=Inf,header="auto",na.strings=getOption("datatable.na.strings","NA"),stringsAsFactors=FALSE,verbose=getOption("datatable.verbose"),skip="__auto__",select=NULL,drop=NULL,colClasses=NULL,integer64=getOption("datatable.integer64"), col.names, check.names=FALSE, encoding="unknown", strip.white=TRUE, fill=FALSE, blank.lines.skip=FALSE, key=NULL, showProgress=interactive(),data.table=getOption("datatable.fread.datatable"),nThread=getDTthreads(),logical01=TRUE,autostart=NA) { if (is.null(sep)) sep="\n" # C level knows that \n means \r\n on Windows, for example else { diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw index cd7e7cb726..2685e94e87 100644 --- a/inst/tests/tests.Rraw +++ b/inst/tests/tests.Rraw @@ -2501,12 +2501,16 @@ DT[2,e:=+Inf] DT[3,e:=-Inf] DT[4,e:=NaN] # write.table writes NaN as NA, though, and all.equal considers NaN==NA. 
fread would read NaN as NaN if "NaN" was in file write.table(DT,f<-tempfile(),sep=",",row.names=FALSE,quote=FALSE) # na="NA" seems like a bad default for string columns here -test(880, fread(f,na.strings="NA"), as.data.table(read.csv(f,stringsAsFactors=FALSE))) -write.table(DT,f,sep=",",row.names=FALSE,quote=FALSE,na="") -test(881, fread(f), as.data.table(read.csv(f,stringsAsFactors=FALSE,na.strings=""))) -fwrite(DT,f) -test(882, fread(f), DT) -test(883, fread(f, na.strings=NULL), DT[3,d:=""]) +test(880, fread(f), as.data.table(read.csv(f,stringsAsFactors=FALSE))) +test(881, fread(f), DT) +# test that columns are not coerced if nastring=NULL +DT[3,d:="NA"] +test(882, fread(f,na.strings=NULL)[['d']], DT[['d']]) +DT[3,d:=NA_character_] +unlink(f) +write.table(DT,f<-tempfile(),sep=",",row.names=FALSE,quote=TRUE) +test(883, fread(f), as.data.table(read.csv(f,stringsAsFactors=FALSE))) +test(884, fread(f), DT) unlink(f) # Test short files. @@ -2540,7 +2544,7 @@ for (ne in seq_along(eols)) { cat(names(headDT),sep=",",file=f) # no \n at the end here for (i in seq_len(nr)) { cat(eol,file=f,append=TRUE) # on unix we simulate windows too. 
on windows \n will write \r\n (and \r\n will write \r\r\n) - write.table(headDT[i],file=f,quote=TRUE,sep=",",eol="",row.names=FALSE,col.names=FALSE,na="",append=TRUE) + write.table(headDT[i],file=f,quote=FALSE,sep=",",eol="",row.names=FALSE,col.names=FALSE,append=TRUE) # loop approach is to get no \n after last line } testIDtail = nr/100 + nc/1000 + ne/10000 @@ -2589,13 +2593,13 @@ test(900.2, fread(f), as.data.table(read.table(f))[,V5:=as.logical(V5)]) f = testDir("1206FUT.txt") # a CRLF line ending file (DOS) test(901.1, DT<-fread(f,strip.white=FALSE), setDT(read.table(f,sep="\t",header=TRUE,colClasses=as.vector(sapply(DT,class))))) -test(901.2, DT<-fread(f), setDT(read.table(f,sep="\t",header=TRUE,colClasses=as.vector(sapply(DT,class)),strip.white=TRUE,na.strings=""))) +test(901.2, DT<-fread(f), setDT(read.table(f,sep="\t",header=TRUE,colClasses=as.vector(sapply(DT,class)),strip.white=TRUE))) # Test the coerce of column 23 to character on line 179 due to the 'A' for the first time. # As from v1.9.8 the columns are guessed better and there is no longer a warning. Test 899 tests the warning. 
# Columns 'Cancelled' and 'Diverted' seem boolean (so logical01=TRUE good default for those) but Month just happens to be all-Jan f = testDir("2008head.csv") -test(902, fread(f,logical01=FALSE), as.data.table(read.csv(f,stringsAsFactors=FALSE,na.strings=""))) +test(902, fread(f,logical01=FALSE), as.data.table(read.csv(f,stringsAsFactors=FALSE))) test(903, fread("A,B\n1,3,foo,5\n2,4,barbaz,6"), data.table(A=1:2, B=3:4, V3=c("foo","barbaz"), V4=5:6), warning="Detected 2 column names but.*4.*Added 2 extra default column names at the end") test(904, fread("A,B,C,D\n1,3,foo,5\n2,4,barbaz,6"), DT<-data.table(A=1:2,B=3:4,C=c("foo","barbaz"),D=5:6)) # ok @@ -2707,7 +2711,7 @@ test(945, DT[,b:=a+1], data.table(a=numeric(),b=numeric())) # fread blank column names get default names test(946, fread('A,B,,D\n1,3,foo,5\n2,4,bar,6\n'), data.table(A=1:2,B=3:4,c("foo","bar"),D=5:6)) -test(947, fread('0,2,,4\n1,3,foo,5\n2,4,bar,6\n'), data.table(0:2,2:4,c(NA,"foo","bar"),4:6)) +test(947, fread('0,2,,4\n1,3,foo,5\n2,4,bar,6\n'), data.table(0:2,2:4,c("","foo","bar"),4:6)) test(948, fread('A,B,C\nD,E,F\n',header=TRUE), data.table(A="D",B="E",C="F")) test(949, fread('A,B,\nD,E,F\n',header=TRUE), data.table(A="D",B="E",V3="F")) @@ -5185,12 +5189,12 @@ bla <- data.table(x=c(1,1,2,2), y=c(1,1,1,1)) test(1342, unique(bla)[, bla := 2L], data.table(x=c(1,2),y=1,bla=2L)) # blank and NA fields in logical columns -test(1343.1, fread("A,B\n1,TRUE\n2,\n3,False"), data.table(A=1:3, B=c("TRUE",NA,"False"))) -test(1343.2, fread("A,B\n1,True\n2,\n3,false"), data.table(A=1:3, B=c("True",NA,"false"))) +test(1343.1, fread("A,B\n1,TRUE\n2,\n3,False"), data.table(A=1:3, B=c("TRUE","","False"))) +test(1343.2, fread("A,B\n1,True\n2,\n3,false"), data.table(A=1:3, B=c("True","","false"))) test(1343.3, fread("A,B\n1,TRUE\n2,\n3,FALSE"), data.table(A=1:3, B=c(TRUE,NA,FALSE))) test(1343.4, fread("A,B\n1,True\n2,\n3,False"), data.table(A=1:3, B=c(TRUE,NA,FALSE))) test(1343.5, fread("A,B\n1,true\n2,\n3,false"), 
data.table(A=1:3, B=c(TRUE,NA,FALSE))) -test(1343.6, fread("A,B\n1,true\n2,NA\n3,"), data.table(A=1:3, B=c("true","NA",NA))) +test(1343.6, fread("A,B\n1,true\n2,NA\n3,"), data.table(A=1:3, B=c(TRUE,NA,NA))) test(1344.1, fread("A,B\n1,2\n0,3\n,1\n", logical01=FALSE), data.table(A=c(1L,0L,NA), B=c(2L,3L,1L))) test(1344.2, fread("A,B\n1,2\n0,3\n,1\n", logical01=TRUE), data.table(A=c(TRUE,FALSE,NA), B=c(2L,3L,1L))) @@ -6184,8 +6188,8 @@ if ("package:bit64" %in% search()) { test(1449, fread(testDir("quoted_multiline.csv"))[c(1,43:44),c(1,22:24),with=FALSE], data.table(GPMLHTLN=as.integer64(c("3308386085360","3440245203140","1305220146734")), BLYBZ = c(0L,4L,6L), - ZBJBLOAJAQI = c("LHCYS AYE ZLEMYA IFU HEI JG FEYE",NA,NA), - JKCRUUBAVQ = c(NA,".\\YAPCNXJ\\004570_850034_757\\VWBZSS_848482_600874_487_PEKT-6-KQTVIL-7_30\\IRVQT\\HUZWLBSJYHZ\\XFWPXQ-WSPJHC-00-0770000855383.KKZ",NA))) + ZBJBLOAJAQI = c("LHCYS AYE ZLEMYA IFU HEI JG FEYE","",""), + JKCRUUBAVQ = c("",".\\YAPCNXJ\\004570_850034_757\\VWBZSS_848482_600874_487_PEKT-6-KQTVIL-7_30\\IRVQT\\HUZWLBSJYHZ\\XFWPXQ-WSPJHC-00-0770000855383.KKZ",""))) } # Fix for #927 @@ -6751,7 +6755,7 @@ if ("package:bit64" %in% search()) { test(1500.3, fread(ll, na=NULL), data.table(V1=x, V2=y)) - x = c("12345678901234", rep("", 178), "0.5") + x = c("12345678901234", rep("NA", 178), "0.5") y = sample(letters, length(x), TRUE) ll = paste(x,y, sep=",", collapse="\n") test(1500.4, fread(ll), data.table(V1=suppressWarnings(as.numeric(x)), V2=y)) @@ -7350,7 +7354,7 @@ test(1551.4, fread(str), data.table(V1=c("2","\"\"foo"), V2=c("3","bar"))) str = 'L1\tsome\tunquoted\tstuff\nL2\tsome\t"half" quoted\tstuff\nL3\tthis\t"should work"\tok though' test(1551.5, fread(str), data.table(L1 = c("L2", "L3"), some = c("some", "this"), unquoted = c("\"half\" quoted", "should work"), stuff = c("stuff", "ok though"))) #1095 -rhs = read.table(testDir("issue_1095_fread.txt"), sep=",", comment.char="", stringsAsFactors=FALSE, quote="", strip.white=TRUE, 
na.strings="") +rhs = read.table(testDir("issue_1095_fread.txt"), sep=",", comment.char="", stringsAsFactors=FALSE, quote="", strip.white=TRUE) test(1551.6, fread(testDir("issue_1095_fread.txt"), logical01=FALSE), setDT(rhs)) # FR #1314 rest of na.strings issue @@ -7620,14 +7624,14 @@ X = fread("a|b|c|d this|is|row|2 this|NA|NA|3 this|is|row|4", stringsAsFactors = TRUE) -test(1577.1, X[3, b]=="NA", TRUE) -test(1577.2, sort(levels(X$b)), sort(c("NA","is"))) # locales could mean different sort orders, so use sort() twice to be sure +test(1577.1, is.na(X[3, b]), TRUE) +test(1577.2, levels(X$b), "is") X = fread("a|b|c|d this|NA|row|1 this|NA|row|2 this|NA|NA|3 this|NA|row|4", colClasses="character", stringsAsFactors = TRUE) -test(1577.3, levels(X$b), "NA") +test(1577.3, levels(X$b), character(0)) # FR #530, skip blank lines input = "Header not 2 columns\n\n1,3\n2,4" @@ -7788,7 +7792,7 @@ options(datatable.optimize=optim) # Fixed a minor bug in fread when blank.lines.skip=TRUE f1 <- function(x, f=TRUE, b=FALSE) fread(x, fill=f, blank.lines.skip=b, data.table=FALSE, logical01=FALSE) -f2 <- function(x, f=TRUE, b=FALSE) read.table(x, fill=f, blank.lines.skip=b, sep=",", header=TRUE, stringsAsFactors=FALSE, na.strings="") +f2 <- function(x, f=TRUE, b=FALSE) read.table(x, fill=f, blank.lines.skip=b, sep=",", header=TRUE, stringsAsFactors=FALSE) test(1584.1, f1(testDir("fread_blank.txt"), f=FALSE, b=TRUE), f2(testDir("fread_blank.txt"), f=FALSE, b=TRUE)) test(1584.2, f1(testDir("fread_blank2.txt"), f=FALSE, b=TRUE), f2(testDir("fread_blank2.txt"), f=FALSE, b=TRUE)) test(1584.3, f1(testDir("fread_blank3.txt"), f=FALSE, b=TRUE), f2(testDir("fread_blank3.txt"), f=FALSE, b=TRUE)) @@ -7958,9 +7962,7 @@ DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9) test(1605, DT[order(-x, "D")], error="Column 2 is length 1 which differs") # fix for #1503, fread's fill argument polishing -test(1606.1, fread("2\n1,a,b", fill=TRUE), data.table(V1=2:1, V2=c(NA,"a"), V3=c(NA,"b"))) 
-test(1606.2, fread("2\n1,a,b", fill=TRUE, na.strings=NULL), data.table(V1=2:1, V2=c("","a"), V3=c("","b"))) -test(1606.3, fread("2\n1,a,b", fill=TRUE, na.strings="NA"), data.table(V1=2:1, V2=c("","a"), V3=c("","b"))) +test(1606, fread("2\n1,a,b", fill=TRUE), data.table(V1=2:1, V2=c("","a"), V3=c("","b"))) # fix for #1476 dt = data.table(resp=c(1:5)) @@ -9490,7 +9492,7 @@ dt = data.table(x=1:2, y=c(NA,"a")) f = tempfile() test(1676.1, fwrite(dt, f, na=NULL), error="is not TRUE") fwrite(dt, f, na=NA) -test(1676.2, fread(f), data.table(x=1:2, y=c("NA", "a"))) +test(1676.2, fread(f), data.table(x=1:2, y=c(NA, "a"))) unlink(f) # duplicate names in foverlaps #1730 @@ -10933,14 +10935,14 @@ test(1777.19, fread("A,B,C\nC,D,4\n", verbose=TRUE), data.table(A="C",B="D",C=4L # unquoted fields containing \r, #2371 test(1778.1, fread("A,B,C\n0,,\n1,hello\rworld,2\n3,test,4\n", verbose=TRUE), - DT <- data.table(A=c(0L,1L,3L), B=c(NA,"hello\rworld","test"), C=c(NA,2L,4L)), + DT <- data.table(A=c(0L,1L,3L), B=c("","hello\rworld","test"), C=c(NA,2L,4L)), output="has been found.*common and ideal") fwrite(DT, f<-tempfile()) -test(1778.2, readLines(f), c("A,B,C", "0,,", "1,\"hello", "world\",2", "3,test,4")) +test(1778.2, readLines(f), c("A,B,C", "0,\"\",", "1,\"hello", "world\",2", "3,test,4")) # fwrite quotes the field containing \r ........... ^^ ............ 
^^ # and that reading back in gets us back to DT faithfully test(1778.3, fread(f), DT) -tt = setDT(read.csv(f, stringsAsFactors=FALSE, na.strings="")) +tt = setDT(read.csv(f, stringsAsFactors=FALSE)) tt[2, B:=gsub("\n","\r",B)] # base R changes the \r to a \n, so restore that test(1778.4, tt, DT) unlink(f) @@ -11200,8 +11202,8 @@ test(1834.1, dim(DT<-fread(testDir("grr.csv"), header=FALSE)), INT(2839, 12)) test(1834.2, DT[c(1,2,.N-1,.N), c(1,2,11,12)], data.table(V1="AAAAAAAAAA", V2=c("AAAAAAAA","AAAAAAAAAA","AAAAAAAAAA","AAAAAAAAAA"), - V11=c("AAAAAAAAAAAAAAAA",NA,"AAAAAAAAAA","AAAA"), - V12=c("AAAAAAAAAAAAA",NA,"AAAAAAA","AAA"))) + V11=c("AAAAAAAAAAAAAAAA","","AAAAAAAAAA","AAAA"), + V12=c("AAAAAAAAAAAAA","","AAAAAAA","AAA"))) # Create a file to test a sample jump being skipped due to format error. It will fail later in the read step because # this is a real error. Currently have not constructed an error for which nextGoodLine looks good, but in fact is not. @@ -11227,14 +11229,14 @@ test(1838, fread("default payment next month\n0.5524\n0.2483\n0.1157\n"), data.t # better writing and reading of NA in single column input, #2106 DT = data.table(a=c(4,NA,2,3.14,999,NA)) -fwrite(DT, f<-tempfile(), na="") # default value of na anyway +fwrite(DT, f<-tempfile(), na="") test(1839.1, fread(f), data.table(a=c(4,NA,2,3.14,999,NA))) test(1839.2, fread(f, blank.lines.skip=TRUE), data.table(a=c(4,2,3.14,999))) test(1839.3, fread(f, fill=TRUE), data.table(a=c(4,NA,2,3.14,999,NA))) test(1839.4, fread(f, fill=TRUE, blank.lines.skip=TRUE), data.table(a=c(4,2,3.14,999))) fwrite(DT, f, na="NA") # base R does not do this though, it writes ,, for NAs in numeric columns (as does fwrite) -test(1839.5, fread(f), data.table(a=c("4","NA","2","3.14","999","NA"))) -test(1839.6, fread(f, na="NA"), DT) # TOOD: auto handle (unusual, even as written by R) "NA" in numeric columns +test(1839.5, fread(f, na.strings=""), data.table(a=c("4","NA","2","3.14","999","NA"))) +test(1839.6, fread(f, 
na.strings="NA"), DT) # TOOD: auto handle (unusual, even as written by R) "NA" in numeric columns unlink(f) lines = c("DECLARATION OF INDEPENDENCE", @@ -11245,7 +11247,7 @@ lines = c("DECLARATION OF INDEPENDENCE", "That to secure these rights, Governments are instituted among Men,", "deriving their just powers from the consent of the governed.") txt = paste(lines, collapse="\n") -test(1839.7, fread(txt, sep=""), data.table("DECLARATION OF INDEPENDENCE"=lines[-1])[4,1:=NA]) # TODO fread should be able auto-detect sep="" here +test(1839.6, fread(txt, sep=""), data.table("DECLARATION OF INDEPENDENCE"=lines[-1])) # fread should eventually be able auto-detect sep="" # readLines behaviour, #1616 txt = 'a,b\n ab,cd,ce\n abcdef\n hjkli \n' # now auto detected as ncol 1 anyway @@ -11561,12 +11563,12 @@ test(1869.6, fread(testDir("colnames4096.csv"), verbose=TRUE)[,c(1,2,585,586)], data.table(Foo000=logical(), Bar001=logical(), Foo584=logical(), B=logical()), output = "Copying file in RAM.*file is very unusual.*ends abruptly.*multiple of 4096") test(1869.7, fread(testDir("onecol4096.csv"), verbose=TRUE)[c(1,2,245,246,249,255:.N),], - data.table(A=c("FooBarBazQux000","FooBarBazQux001",NA,"FooBarBazQux245",NA,"FooBarBazQux254","FooBarBazQux","FooBarBaz12","FooBarBazQux256",NA,NA,NA)), + data.table(A=c("FooBarBazQux000","FooBarBazQux001","","FooBarBazQux245","","FooBarBazQux254","FooBarBazQux","FooBarBaz12","FooBarBazQux256","","","")), output = "Copying file in RAM.*file is very unusual.*one single column, ends with 2 or more end-of-line.*and is a multiple of 4096") # better colname detection by comparing potential column names to the whole sample not just the first row of the sample, #2526 -test(1870.1, fread("A,100,200\n,300,400\n,500,600"), data.table(V1=c("A",NA,NA), V2=c(100L,300L,500L), V3=c(200L,400L,600L))) -test(1870.2, fread("A,100,\n,,\n,500,600"), data.table(V1=c("A",NA,NA), V2=c(100L,NA,500L), V3=c(NA,NA,600L))) +test(1870.1, 
fread("A,100,200\n,300,400\n,500,600"), data.table(V1=c("A","",""), V2=c(100L,300L,500L), V3=c(200L,400L,600L))) +test(1870.2, fread("A,100,\n,,\n,500,600"), data.table(V1=c("A","",""), V2=c(100L,NA,500L), V3=c(NA,NA,600L))) test(1870.3, fread("A,B,\n,,\n,500,3.4"), data.table(A=NA, B=c(NA,500L), V3=c(NA,3.4))) # nrows= now ignores errors after those nrows as expected and skip= determines first row for sure, #1267 @@ -11663,7 +11665,7 @@ x = sprintf("ABCDEFGHIJKLMNOPQRST%06d", 1:102184) x[51094]="" cat(x, file=f<-tempfile(), sep="\n") test(1874.1, fread(f,header=FALSE,verbose=TRUE)[c(1,51094,.N),], - data.table(V1=c("ABCDEFGHIJKLMNOPQRST000001",NA,"ABCDEFGHIJKLMNOPQRST102184")), + data.table(V1=c("ABCDEFGHIJKLMNOPQRST000001","","ABCDEFGHIJKLMNOPQRST102184")), output="jumps=[0..2)") # ensure jump 1 happened # # out-of-sample short lines in the first jump, not near the jump point @@ -11777,7 +11779,7 @@ test(1882.3, CJ(v, v, v), error="Cross product of elements provided to CJ() woul # no re-read for particular file, #2509 test(1883, fread(testDir("SA2-by-DJZ.csv"), verbose=TRUE, header=FALSE)[c(1,2,1381,.N),], - data.table(V1=c("Goulburn",NA,NA,NA), V2=c("110018063","110018064","0&&&&&&&&","0@@@@@@@@"), V3=INT(3499,812,250796,7305367), V4=NA), + data.table(V1=c("Goulburn","","",""), V2=c("110018063","110018064","0&&&&&&&&","0@@@@@@@@"), V3=INT(3499,812,250796,7305367), V4=NA), warning='Stopped early on line 1394.*First discarded non-empty line: <<"Dataset: 2011 Census of Population and Housing">>', output="0.000s.*Rereading 0 columns") @@ -11787,7 +11789,9 @@ test(1884, fread('"A","B"\n', sep=NULL), data.table('"A","B"'=logical())) # sep=' ' and blank.lines.skip, #2535 test(1885.1, fread(txt<-"a b 2\nc d 3\n\ne f 4\n", blank.lines.skip=TRUE), ans<-data.table(V1=c("a","c","e"), V2=c("b","d","f"), V3=2:4)) test(1885.2, fread(txt, blank.lines.skip=TRUE, fill=TRUE), ans) -test(1885.3, fread(txt, fill=TRUE), ans[c(1,2,NA,3),]) +test(1885.3, fread(txt, fill=TRUE), 
ans[c(1,2,NA,3),][3,1:2:=""]) +test(1885.4, fread(txt, fill=TRUE, na.strings=""), ans[c(1,2,NA,3),]) + # na="" default, #2524 test(1886.1, fread('A,B,C\n1,foo,4\n2,,5\n3,bar,6\n'), data.table(A=1:3, B=c("foo",NA,"bar"), C=4:6)) diff --git a/man/fread.Rd b/man/fread.Rd index 810a1cd512..9fbdb2eb4b 100644 --- a/man/fread.Rd +++ b/man/fread.Rd @@ -10,7 +10,7 @@ } \usage{ fread(input, file, sep="auto", sep2="auto", dec=".", quote="\"", -nrows=Inf, header="auto", na.strings="", +nrows=Inf, header="auto", na.strings=getOption("datatable.na.strings","NA"), stringsAsFactors=FALSE, verbose=getOption("datatable.verbose"), skip="__auto__", select=NULL, drop=NULL, colClasses=NULL, integer64=getOption("datatable.integer64"), # default: "integer64" From bf04b2edaf2e0eeb14d7e2a32f3d85c780af0973 Mon Sep 17 00:00:00 2001 From: Matt Dowle Date: Mon, 5 Mar 2018 16:04:19 -0800 Subject: [PATCH 08/14] New tests need na.strings= as default change is now postponed --- inst/tests/tests.Rraw | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw index d6853b1433..97aab31ee3 100644 --- a/inst/tests/tests.Rraw +++ b/inst/tests/tests.Rraw @@ -11792,17 +11792,21 @@ test(1885.2, fread(txt, blank.lines.skip=TRUE, fill=TRUE), ans) test(1885.3, fread(txt, fill=TRUE), ans[c(1,2,NA,3),][3,1:2:=""]) test(1885.4, fread(txt, fill=TRUE, na.strings=""), ans[c(1,2,NA,3),]) +# file detected as no header automatically +# (TOOD: undoubling double quotes #1109, #1299) but otherwise, auto mode correct +test(1886, fread(testDir("quoted_no_header.csv"))[c(1,.N),list(V1,V6)], data.table(V1=c("John","Joan \"\"the bone\"\", Anne"), V6=INT(8075,123))) # na="" default, #2524 -test(1886.1, fread('A,B,C\n1,foo,4\n2,,5\n3,bar,6\n'), data.table(A=1:3, B=c("foo",NA,"bar"), C=4:6)) -test(1886.2, fread('A,B,C\n1,foo,4\n2,"",5\n3,bar,6\n'), data.table(A=1:3, B=c("foo","","bar"), C=4:6)) -test(1886.3, fread("A,B,C\n1,foo,bar\n2", fill=TRUE), 
data.table(A=1:2,B=c("foo",NA),C=c("bar",NA))) +test(1886.1, fread('A,B,C\n1,foo,4\n2,,5\n3,bar,6\n', na.strings=""), data.table(A=1:3, B=c("foo",NA,"bar"), C=4:6)) +test(1886.2, fread('A,B,C\n1,foo,4\n2,"",5\n3,bar,6\n', na.strings=""), data.table(A=1:3, B=c("foo","","bar"), C=4:6)) +test(1886.3, fread("A,B,C\n1,foo,bar\n2", fill=TRUE, na.strings=""), data.table(A=1:2,B=c("foo",NA),C=c("bar",NA))) +test(1886.4, fread("A,B,C\n1,foo,bar\n2", fill=TRUE, na.strings="NA"), data.table(A=1:2,B=c("foo",""),C=c("bar",""))) # preserving "" and NA_character_, #2214 DT = data.table(chr = c(NA, "", "a"), num = c(NA, NA, 2L)) -test(1887.1, fread({fwrite(DT,f<-tempfile());f}), DT); unlink(f) +test(1887.1, fread({fwrite(DT,f<-tempfile());f}, na.strings=""), DT); unlink(f) test(1887.2, capture.output(fwrite(DT)), c("chr,num", ",", "\"\"," , "a,2")) -test(1887.3, fread('A,B\n1,"foo"\n2,\n3,""\n')$B, c("foo", NA, "")) # for issue #2217 +test(1887.3, fread('A,B\n1,"foo"\n2,\n3,""\n', na.strings="")$B, c("foo", NA, "")) # for issue #2217 # print(DT) should print NA in character columns using <NA> like base R to distinguish from "" and "NA" DT = data.table(A=1:4, B=c("FOO","",NA,"NA")) @@ -11811,11 +11815,7 @@ DF = as.data.frame(DT) rownames(DF) = paste0(rownames(DF),":") test(1888.2, print(DF), output=txt) txt = 'A,B\n109,MT\n7,N\n11,NA\n41,NB\n60,ND\n1,""\n2,\n3,"NA"\n4,NA\n' -test(1888.3, print(fread(txt)), output="A B\n1: 109 MT\n2: 7 N\n3: 11 NA\n4: 41 NB\n5: 60 ND\n6: 1 \n7: 2 <NA>\n8: 3 NA\n9: 4 NA") - -# file detected as no header automatically -# (TOOD: undoubling double quotes #1109, #1299) but otherwise, auto mode correct -test(1886, fread(testDir("quoted_no_header.csv"))[c(1,.N),list(V1,V6)], data.table(V1=c("John","Joan \"\"the bone\"\", Anne"), V6=INT(8075,123))) +test(1888.3, print(fread(txt,na.strings="")), output="A B\n1: 109 MT\n2: 7 N\n3: 11 NA\n4: 41 NB\n5: 60 ND\n6: 1 \n7: 2 <NA>\n8: 3 NA\n9: 4 NA") ################################### From 
30d1c0bbe3bd6e953e59c740f449283b6df08641 Mon Sep 17 00:00:00 2001 From: Matt Dowle Date: Mon, 5 Mar 2018 17:32:00 -0800 Subject: [PATCH 09/14] Breaking changes section added to NEWS. fread(logical01=getOption) too. --- NEWS.md | 21 ++++++++++++++++----- R/fread.R | 2 +- man/fread.Rd | 2 +- 3 files changed, 18 insertions(+), 7 deletions(-) diff --git a/NEWS.md b/NEWS.md index ee976a0bc8..b5d162a3a0 100644 --- a/NEWS.md +++ b/NEWS.md @@ -3,6 +3,18 @@ ### Changes in v1.10.5 ( in development ) +#### POTENTIALLY BREAKING CHANGES + +1. `fread()`'s `na.strings=` argument : + ``` + na.strings="NA" # was + getOption("datatable.na.strings", "NA") # this release; i.e. no change yet + getOption("datatable.na.strings", "") # future release + ``` +This option controls how `,,` is read in character columns. It does not affect numeric columns which read `,,` as `NA` regardless. We would like `,,`=>`NA` for consistency with numeric types, and `,"",`=>empty string to be the standard default for `fwrite/fread` character columns so that `fread(fwrite(DT))==DT` without needing any change to any parameters. `fwrite` has never written `NA` as `"NA"`, by default it already writes `,,`. The use of R's `getOption()` allows data.table users to move forward early, or restore old behaviour when the default's default is changed in future. + +2. `fread` now reads a column of all 0's and 1's as `logical` rather than `integer`, for convenience to avoid needing to change the type afterwards or use `colClasses`. The old behaviour can be restored with `options(datatable.logical01=FALSE)`. We felt this default change was ok to make because in all operations there should be no difference: R treats `logical` and `integer` the same. If this change does cause a problem, the option is provided to restore old behaviour while you update your code. Similarly, `fwrite` now writes `logical` columns as `0/1` by default, controlled by the same option. 
`0/1` is smaller and faster than `"TRUE"/"FALSE"`, which can make a significant difference to space and time the more `logical` columns there are. Further, a column of `TRUE/FALSE`s is ok, as well as `True/False`s and `true/false`s, but mixing styles (e.g. `TRUE/false`) is not and will be read as type `character`. + #### NEW FEATURES 1. `fread()`: @@ -17,17 +29,16 @@ * `dec=','` is now implemented directly so there is no dependency on locale. The options `datatable.fread.dec.experiment` and `datatable.fread.dec.locale` have been removed. * `\\r\\r\\n` line endings are now handled such as produced by `base::download.file()` when it doubles up `\\r`. Other rare line endings (`\\r` and `\\n\\r`) are now more robust. * Mixed line endings are now handled; e.g. a file formed by concatenating a Unix file and a Windows file so that some lines end with `\\n` while others end with `\\r\\n`. - * Improved automatic detection of whether the first row is column names by comparing the types of the fields on the first and second row. + * Improved automatic detection of whether the first row is column names by comparing the types of the fields on the first row against the column types ascertained by the 10,000 rows sample (or `colClasses` if provided). If a numeric column has a string value at the top, then column names are deemed present. * Detects GB-18030 and UTF-16 encodings and in verbose mode prints a message about BOM detection. * Detects and ignores trailing ^Z end-of-file control character sometimes created on MS DOS/Windows, [#1612](https://github.com/Rdatatable/data.table/issues/1612). Thanks to Gergely Daróczi for reporting and providing a file. - * Added option `logical01` to read a column of only `0`s and `1`s as `logical`, default `TRUE` for convenience in most cases. 
The large sample of rows throughout the file means that `fread` will be confident that the column really does just contain `0`s and `1`s, enabling and encouraging this convenient and efficient choice to save needing conversion afterwards or setting `colClasses` manually. In R, `logical` is `integer` anyway and can be treated as such in calculations. Further, it is no longer allowed to have mixed-case literals within a single column; i.e., a column of `TRUE/FALSE`s is ok, as well as `True/False`s and `true/false`s, but mixing different styles together is not. * Added ability to recognize and parse hexadecimal floating point numbers, as used for example in Java. Thanks for @scottstanfield [#2316](https://github.com/Rdatatable/data.table/issues/2316) for the report. * Now handles floating-point NaN values in a wide variety of formats, including `NaN`, `sNaN`, `1.#QNAN`, `NaN1234`, `#NUM!` and others, [#1800](https://github.com/Rdatatable/data.table/issues/1800). Thanks to Jori Liesenborgs for highlighting and the PR. * If negative numbers are passed to `select=` the out-of-range error now suggests `drop=` instead, [#2423](https://github.com/Rdatatable/data.table/issues/2423). Thanks to Michael Chirico for the suggestion. - * `sep=NULL` or `sep=""` (i.e., no column separator) can now be used to specify single column input reliably like `base::readLines`, [#1616](https://github.com/Rdatatable/data.table/issues/1616). `sep='\\n'` still works (even on Windows where line ending is actually `\\r\\n`) but `NULL` or `""` are now documented and recommended. Thanks to Dmitriy Selivanov for the pull request and many others for comments. As before, `sep=NA` is not valid; use the default `"auto"` for automatic separator detection. `sep='\\n'` may be deprecated in future. - * Single-column input with blank lines is now valid and the blank lines are significant (meaning an NA in the single column). 
The blank lines are significant even at the very end, which may be surprising on first glance. The change is so that `fread(fwrite(DT))==DT` for single-column inputs containing NA which are written as blank. There is no change when `ncol>1` (i.e., input stops with detailed warning at the first blank line) because a blank line when `ncol>1` is invalid input due to no separators present instead of `ncol-1` separators. Thanks to @skanskan, Michael Chirico, @franknarf1 and Pasha for the testing and discussions, [#2106](https://github.com/Rdatatable/data.table/issues/2106). + * `sep=NULL` or `sep=""` (i.e., no column separator) can now be used to specify single column input reliably like `base::readLines`, [#1616](https://github.com/Rdatatable/data.table/issues/1616). `sep='\\n'` still works (even on Windows where line ending is actually `\\r\\n`) but `NULL` or `""` are now documented and recommended. Thanks to Dmitriy Selivanov for the pull request and many others for comments. As before, `sep=NA` is not valid; use the default `"auto"` for automatic separator detection. `sep='\\n'` is now deprecated and in future will start to warn when used. + * Single-column input with blank lines is now valid and the blank lines are significant (representing `NA`). The blank lines are significant even at the very end, which may be surprising on first glance. The change is so that `fread(fwrite(DT))==DT` for single-column inputs containing `NA` which are written as blank. There is no change when `ncol>1`; i.e., input stops with detailed warning at the first blank line, because a blank line when `ncol>1` is invalid input due to no separators being present. Thanks to @skanskan, Michael Chirico, @franknarf1 and Pasha for the testing and discussions, [#2106](https://github.com/Rdatatable/data.table/issues/2106). * Too few column names are now auto filled with default column names, with warning, [#1625](https://github.com/Rdatatable/data.table/issues/1625). 
If there is just one missing column name it is guessed to be for the first column (row names or an index), otherwise the column names are filled at the end. Similarly, too many column names now automatically sets `fill=TRUE`, with warning. - * `skip=` and `nrow=` are more reliable and no longer affected by invalid lines outside the range specified. Thanks to Ziyad Saeed and Kyle Chung for reporting, [#1267](https://github.com/Rdatatable/data.table/issues/1267). Tests added. + * `skip=` and `nrow=` are more reliable and are no longer affected by invalid lines outside the range specified. Thanks to Ziyad Saeed and Kyle Chung for reporting, [#1267](https://github.com/Rdatatable/data.table/issues/1267). * Ram disk (`/dev/shm`) is no longer used for the output of system command input. Although faster when it worked, it was causing too many device full errors; e.g., [#1139](https://github.com/Rdatatable/data.table/issues/1139) and [zUMIs/19](https://github.com/sdparekh/zUMIs/issues/19). Thanks to Kyle Chung for reporting. Standard `tempdir()` is now used. If you wish to use ram disk, set TEMPDIR to `/dev/shm`; see `?tempdir`. * Detecting whether a very long input string is a file name or data is now much faster, [#2531](https://github.com/Rdatatable/data.table/issues/2531). Many thanks to @javrucebo for the detailed report, benchmarks and suggestions. 
* Many thanks to @yaakovfeldman, Guillermo Ponce, Arun Srinivasan, Hugh Parsonage, Mark Klik, Pasha Stetsenko, Mahyar K, Tom Crockett, @cnoelke, @qinjs, @etienne-s, Mark Danese, Avraham Adler, @franknarf1, @MichaelChirico, @tdhock, Luke Tierney for testing dev and reporting these regressions before release to CRAN: [#2070](https://github.com/Rdatatable/data.table/issues/2070), [#2073](https://github.com/Rdatatable/data.table/issues/2073), [#2087](https://github.com/Rdatatable/data.table/issues/2087), [#2091](https://github.com/Rdatatable/data.table/issues/2091), [#2107](https://github.com/Rdatatable/data.table/issues/2107), [fst#50](https://github.com/fstpackage/fst/issues/50#issuecomment-294287846), [#2118](https://github.com/Rdatatable/data.table/issues/2118), [#2092](https://github.com/Rdatatable/data.table/issues/2092), [#1888](https://github.com/Rdatatable/data.table/issues/1888), [#2123](https://github.com/Rdatatable/data.table/issues/2123), [#2167](https://github.com/Rdatatable/data.table/issues/2167), [#2194](https://github.com/Rdatatable/data.table/issues/2194), [#2238](https://github.com/Rdatatable/data.table/issues/2238), [#2228](https://github.com/Rdatatable/data.table/issues/2228), [#1464](https://github.com/Rdatatable/data.table/issues/1464), [#2201](https://github.com/Rdatatable/data.table/issues/2201), [#2287](https://github.com/Rdatatable/data.table/issues/2287), [#2299](https://github.com/Rdatatable/data.table/issues/2299), [#2285](https://github.com/Rdatatable/data.table/issues/2285), [#2251](https://github.com/Rdatatable/data.table/issues/2251), [#2347](https://github.com/Rdatatable/data.table/issues/2347), [#2222](https://github.com/Rdatatable/data.table/issues/2222), [#2352](https://github.com/Rdatatable/data.table/issues/2352), [#2246](https://github.com/Rdatatable/data.table/issues/2246), [#2370](https://github.com/Rdatatable/data.table/issues/2370), [#2371](https://github.com/Rdatatable/data.table/issues/2371), 
[#2404](https://github.com/Rdatatable/data.table/issues/2404), [#2196](https://github.com/Rdatatable/data.table/issues/2196), [#2322](https://github.com/Rdatatable/data.table/issues/2322), [#2453](https://github.com/Rdatatable/data.table/issues/2453), [#2446](https://github.com/Rdatatable/data.table/issues/2446), [#2464](https://github.com/Rdatatable/data.table/issues/2464), [#2457](https://github.com/Rdatatable/data.table/issues/2457), [#1895](https://github.com/Rdatatable/data.table/issues/1895), [#2481](https://github.com/Rdatatable/data.table/pull/2481), [#2499](https://github.com/Rdatatable/data.table/issues/2499), [#2516](https://github.com/Rdatatable/data.table/issues/2516), [#2520](https://github.com/Rdatatable/data.table/issues/2520), [#2512](https://github.com/Rdatatable/data.table/issues/2512), [#2523](https://github.com/Rdatatable/data.table/issues/2523), [#2542](https://github.com/Rdatatable/data.table/issues/2542), [#2526](https://github.com/Rdatatable/data.table/issues/2526), [#2518](https://github.com/Rdatatable/data.table/issues/2518), [#2515](https://github.com/Rdatatable/data.table/issues/2515), [#1671](https://github.com/Rdatatable/data.table/issues/1671), [#2267](https://github.com/Rdatatable/data.table/issues/2267), [#2561](https://github.com/Rdatatable/data.table/issues/2561), [#2625](https://github.com/Rdatatable/data.table/issues/2625), [#2265](https://github.com/Rdatatable/data.table/issues/2265), [#2548](https://github.com/Rdatatable/data.table/issues/2548), [#2535](https://github.com/Rdatatable/data.table/issues/2535) diff --git a/R/fread.R b/R/fread.R index c619b2d25d..3a0302d9bf 100644 --- a/R/fread.R +++ b/R/fread.R @@ -1,5 +1,5 @@ -fread <- function(input="",file,sep="auto",sep2="auto",dec=".",quote="\"",nrows=Inf,header="auto",na.strings=getOption("datatable.na.strings","NA"),stringsAsFactors=FALSE,verbose=getOption("datatable.verbose"),skip="__auto__",select=NULL,drop=NULL,colClasses=NULL,integer64=getOption("datatable.integer64"), 
col.names, check.names=FALSE, encoding="unknown", strip.white=TRUE, fill=FALSE, blank.lines.skip=FALSE, key=NULL, showProgress=interactive(),data.table=getOption("datatable.fread.datatable"),nThread=getDTthreads(),logical01=TRUE,autostart=NA) +fread <- function(input="",file,sep="auto",sep2="auto",dec=".",quote="\"",nrows=Inf,header="auto",na.strings=getOption("datatable.na.strings","NA"),stringsAsFactors=FALSE,verbose=getOption("datatable.verbose"),skip="__auto__",select=NULL,drop=NULL,colClasses=NULL,integer64=getOption("datatable.integer64"), col.names, check.names=FALSE, encoding="unknown", strip.white=TRUE, fill=FALSE, blank.lines.skip=FALSE, key=NULL, showProgress=interactive(),data.table=getOption("datatable.fread.datatable"),nThread=getDTthreads(),logical01=getOption("datatable.logical01", TRUE),autostart=NA) { if (is.null(sep)) sep="\n" # C level knows that \n means \r\n on Windows, for example else { diff --git a/man/fread.Rd b/man/fread.Rd index 9fbdb2eb4b..bff542fb4f 100644 --- a/man/fread.Rd +++ b/man/fread.Rd @@ -19,7 +19,7 @@ check.names=FALSE, encoding="unknown", strip.white=TRUE, fill=FALSE, blank.lines.skip=FALSE, key=NULL, showProgress=interactive(), data.table=getOption("datatable.fread.datatable"), -nThread=getDTthreads(), logical01=TRUE, autostart=NA +nThread=getDTthreads(), logical01=getOption("datatable.logical01", TRUE), autostart=NA ) } \arguments{ From 90d42f92a2a141f642e775fab6c019f0e8cd7bc6 Mon Sep 17 00:00:00 2001 From: Matt Dowle Date: Fri, 16 Mar 2018 23:31:32 -0700 Subject: [PATCH 10/14] Reverted logical01 to FALSE (old default, no change) thanks to review comments and reflected at the top of NEWS. 
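For reviewers, a minimal sketch of how the option-backed default in this commit resolves, using base R only. `read_flag` is a hypothetical stand-in for the `logical01=` argument of `fread`/`fwrite` and is not part of data.table; the only assumption is that `getOption(name, default)` falls back to the supplied default when the option is unset.

```r
# Hypothetical stand-in: mirrors logical01=getOption("datatable.logical01", FALSE)
read_flag <- function(logical01 = getOption("datatable.logical01", FALSE)) {
  logical01
}

old <- options(datatable.logical01 = NULL)    # start with the option unset
unset_value <- read_flag()                    # falls back to FALSE, this release's default

options(datatable.logical01 = TRUE)           # opt in early to the intended future default
opted_in <- read_flag()
explicit_arg <- read_flag(logical01 = FALSE)  # an explicit argument overrides the option

options(old)                                  # restore whatever was set before
```

Passing the argument explicitly in each call, as in the last case, is the route the NEWS notice recommends over leaving the option pinned to the old value.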
--- NEWS.md | 21 +++++++++++++++------ R/fread.R | 2 +- R/fwrite.R | 2 +- R/onLoad.R | 7 +++---- inst/tests/tests.Rraw | 32 +++++++++++++++----------------- man/fread.Rd | 13 ++++++++----- man/fwrite.Rd | 2 +- 7 files changed, 44 insertions(+), 35 deletions(-) diff --git a/NEWS.md b/NEWS.md index efc8e9c5ce..620378c72c 100644 --- a/NEWS.md +++ b/NEWS.md @@ -3,17 +3,25 @@ ### Changes in v1.10.5 ( in development ) -#### POTENTIALLY BREAKING CHANGES +#### NOTICE OF INTENDED FUTURE POTENTIAL BREAKING CHANGES 1. `fread()`'s `na.strings=` argument : ``` - na.strings="NA" # was - getOption("datatable.na.strings", "NA") # this release; i.e. no change yet + "NA" # old default + getOption("datatable.na.strings", "NA") # this release; i.e. the same; no change yet getOption("datatable.na.strings", "") # future release ``` -This option controls how `,,` is read in character columns. It does not affect numeric columns which read `,,` as `NA` regardless. We would like `,,`=>`NA` for consistency with numeric types, and `,"",`=>empty string to be the standard default for `fwrite/fread` character columns so that `fread(fwrite(DT))==DT` without needing any change to any parameters. `fwrite` has never written `NA` as `"NA"`, by default it already writes `,,`. The use of R's `getOption()` allows data.table users to move forward early, or restore old behaviour when the default's default is changed in future. +This option controls how `,,` is read in character columns. It does not affect numeric columns which read `,,` as `NA` regardless. We would like `,,`=>`NA` for consistency with numeric types, and `,"",`=>empty string to be the standard default for `fwrite/fread` character columns so that `fread(fwrite(DT))==DT` without needing any change to any parameters. `fwrite` has never written `NA` as `"NA"`; it already writes `,,` by default. 
The use of R's `getOption()` allows users to move forward now, using `options(datatable.na.strings="")`, or restore old behaviour when the default's default is changed in future, using `options(datatable.na.strings="NA")`. -2. `fread` now reads a column of all 0's and 1's as `logical` rather than `integer`, for convenience to avoid needing to change the type afterwards or use `colClasses`. The old behaviour can be restored with `options(datatable.logical01=FALSE)`. We felt this default change was ok to make because in all operations there should be no difference: R treats `logical` and `integer` the same. If this change does cause a problem, the option is provided to restore old behaviour while you update your code. Similarly, `fwrite` now writes `logical` columns as `0/1` by default, controlled by the same option. `0/1` is smaller and faster than `"TRUE"/"FALSE"`, which can make a significant difference to space and time the more `logical` columns there are. Further, a column of `TRUE/FALSE`s is ok, as well as `True/False`s and `true/false`s, but mixing styles (e.g. `TRUE/false`) is not and will be read as type `character`. +2. `fread()` and `fwrite()`'s `logical01=` argument : + ``` + logical01 = FALSE # old default + getOption("datatable.logical01", FALSE) # this release; i.e. the same; no change yet + getOption("datatable.logical01", TRUE) # future release + ``` +This option controls whether a column of all 0's and 1's is read as `integer`, or `logical` directly to avoid needing to change the type afterwards to `logical` or use `colClasses`. `0/1` is smaller and faster than `"TRUE"/"FALSE"`, which can make a significant difference to space and time the more `logical` columns there are.
When the default's default changes to `TRUE` for `fread` we do not expect much impact since all arithmetic operators that are currently receiving 0's and 1's as type `integer` (think `sum()`) but instead could receive `logical`, would return exactly the same result on the 0's and 1's as `logical` type. However, code that is manipulating column types using `is.integer` or `is.logical` on `fread`'s result, could require change. It could be painful if `DT[(logical_column)]` (i.e. `DT[logical_column==TRUE]`) changed behaviour due to `logical_column` no longer being type `logical` but `integer`. But that is not the change proposed. The change is the other way around; i.e., a previously `integer` column holding only 0's and 1's would now be type `logical`. Since it's that way around, we believe the scope for breakage is limited. We think a lot of code is converting 0/1 integer columns to logical anyway, either using `colClasses=` or afterwards with an assign. For `fwrite`, the level of breakage depends on the consumer of the output file. We believe `0/1` is a better more standard default choice to move to. See notes below about improvements to `fread`'s sampling for type guessing, and automatic rereading in the rare cases of out-of-sample type surprises. + +These options are meant for temporary use to aid your migration. You are not meant to set them to the old default and then not migrate your code that is dependent on the default. Either set the argument explicitly to the old value in all calls, or change the code to cope with the new default. In a few years we will start to remove the options, warning you if you are using them, and return to a simple default. See the history of NEWS and NEWS.0 for past migrations that have, generally speaking, been successfully managed in this way. For example, at the end of NOTES in this version's release notes, is a note about usage of `datatable.old.unique.by.key` now warning, as you were warned it would do over a year ago. 
#### NEW FEATURES @@ -41,11 +49,12 @@ This option controls how `,,` is read in character columns. It does not affect n * `skip=` and `nrow=` are more reliable and are no longer affected by invalid lines outside the range specified. Thanks to Ziyad Saeed and Kyle Chung for reporting, [#1267](https://github.com/Rdatatable/data.table/issues/1267). * Ram disk (`/dev/shm`) is no longer used for the output of system command input. Although faster when it worked, it was causing too many device full errors; e.g., [#1139](https://github.com/Rdatatable/data.table/issues/1139) and [zUMIs/19](https://github.com/sdparekh/zUMIs/issues/19). Thanks to Kyle Chung for reporting. Standard `tempdir()` is now used. If you wish to use ram disk, set TEMPDIR to `/dev/shm`; see `?tempdir`. * Detecting whether a very long input string is a file name or data is now much faster, [#2531](https://github.com/Rdatatable/data.table/issues/2531). Many thanks to @javrucebo for the detailed report, benchmarks and suggestions. + * A column of `TRUE/FALSE`s is ok, as well as `True/False`s and `true/false`s, but mixing styles (e.g. `TRUE/false`) is not and will be read as type `character`. * Many thanks to @yaakovfeldman, Guillermo Ponce, Arun Srinivasan, Hugh Parsonage, Mark Klik, Pasha Stetsenko, Mahyar K, Tom Crockett, @cnoelke, @qinjs, @etienne-s, Mark Danese, Avraham Adler, @franknarf1, @MichaelChirico, @tdhock, Luke Tierney for testing dev and reporting these regressions before release to CRAN: #2070, #2073, #2087, #2091, #2107, #2118, #2092, #1888, #2123, #2167, #2194, #2238, #2228, #1464, #2201, #2287, #2299, #2285, #2251, #2347, #2222, #2352, #2246, #2370, #2371, #2404, #2196, #2322, #2453, #2446, #2464, #2457, #1895, #2481, #2499, #2516, #2520, #2512, #2523, #2542, #2526, #2518, #2515, #1671, #2267, #2561, #2625, #2265, #2548, #2535 2. 
`fwrite()`: * empty strings are now always quoted (`,"",`) to distinguish them from `NA` which by default is still empty (`,,`) but can be changed using `na=` as before. If `na=` is provided and `quote=` is the default `'auto'` then `quote=` is set to `TRUE` so that if the `na=` value occurs in the data, it can be distinguished from `NA`. Thanks to Ethan Welty for the request [#2214](https://github.com/Rdatatable/data.table/issues/2214) and Pasha for the code change and tests, [#2215](https://github.com/Rdatatable/data.table/issues/2215). - * `logicalAsInt` has been renamed `logical01` and the default changed from `FALSE` to `TRUE`, both changes for consistency with `fread` (see item above). The old name `logicalAsInt` continues to work but is now deprecated. The previous default can easily be restored (to enable you to postpone changing your code) by setting `options("datatable.logical01" = FALSE)`. + * `logical01` has been added and the old name `logicalAsInt` retained. Please move to the new name when convenient for you. The old argument name (`logicalAsInt`) will slowly be deprecated over the next few years. The default is unchanged: `FALSE`, so `logical` is still written as `"TRUE"`/`"FALSE"` in full by default. We intend to change the default's default in future to `TRUE`; see the notice at the top of these release notes. 3. Added helpful message when subsetting by a logical column without wrapping it in parentheses, [#1844](https://github.com/Rdatatable/data.table/issues/1844). Thanks @dracodoc for the suggestion and @MichaelChirico for the PR. 
diff --git a/R/fread.R b/R/fread.R index 3a0302d9bf..f6e7d774e2 100644 --- a/R/fread.R +++ b/R/fread.R @@ -1,5 +1,5 @@ -fread <- function(input="",file,sep="auto",sep2="auto",dec=".",quote="\"",nrows=Inf,header="auto",na.strings=getOption("datatable.na.strings","NA"),stringsAsFactors=FALSE,verbose=getOption("datatable.verbose"),skip="__auto__",select=NULL,drop=NULL,colClasses=NULL,integer64=getOption("datatable.integer64"), col.names, check.names=FALSE, encoding="unknown", strip.white=TRUE, fill=FALSE, blank.lines.skip=FALSE, key=NULL, showProgress=interactive(),data.table=getOption("datatable.fread.datatable"),nThread=getDTthreads(),logical01=getOption("datatable.logical01", TRUE),autostart=NA) +fread <- function(input="",file,sep="auto",sep2="auto",dec=".",quote="\"",nrows=Inf,header="auto",na.strings=getOption("datatable.na.strings","NA"),stringsAsFactors=FALSE,verbose=getOption("datatable.verbose",FALSE),skip="__auto__",select=NULL,drop=NULL,colClasses=NULL,integer64=getOption("datatable.integer64","integer64"), col.names, check.names=FALSE, encoding="unknown", strip.white=TRUE, fill=FALSE, blank.lines.skip=FALSE, key=NULL, showProgress=interactive(), data.table=getOption("datatable.fread.datatable",TRUE), nThread=getDTthreads(), logical01=getOption("datatable.logical01", FALSE), autostart=NA) { if (is.null(sep)) sep="\n" # C level knows that \n means \r\n on Windows, for example else { diff --git a/R/fwrite.R b/R/fwrite.R index 3373709e10..914c23c2cd 100644 --- a/R/fwrite.R +++ b/R/fwrite.R @@ -2,7 +2,7 @@ fwrite <- function(x, file="", append=FALSE, quote="auto", sep=",", sep2=c("","|",""), eol=if (.Platform$OS.type=="windows") "\r\n" else "\n", na="", dec=".", row.names=FALSE, col.names=TRUE, qmethod=c("double","escape"), - logical01=getOption("datatable.logical01", TRUE), + logical01=getOption("datatable.logical01", FALSE), # due to change to TRUE; see NEWS logicalAsInt=logical01, dateTimeAs = c("ISO","squash","epoch","write.csv"), buffMB=8, 
nThread=getDTthreads(), diff --git a/R/onLoad.R b/R/onLoad.R index 6690b604e5..f92d66003e 100644 --- a/R/onLoad.R +++ b/R/onLoad.R @@ -29,6 +29,8 @@ lockBinding("rbind.data.frame",baseenv()) } # Set options for the speed boost in v1.8.0 by avoiding 'default' arg of getOption(,default=) + # In fread and fwrite we have moved back to using getOption's default argument since it is unlikely fread and fwrite will be called in a loop many times, plus they + # are relatively heavy functions where the overhead in getOption() would not be noticed. It's only really [.data.table where getOption default bit. # TODO: submit improvement to .Internal(getOption(x)) in base::getOption to return NULL when option not set, to avoid (relatively slow) 'x %in% names(options())' there. opts = c("datatable.verbose"="FALSE", # datatable. "datatable.nomatch"="NA_integer_", # datatable. @@ -43,13 +45,10 @@ "datatable.dfdispatchwarn"="TRUE", # not a function argument "datatable.warnredundantby"="TRUE", # not a function argument "datatable.alloccol"="1024L", # argument 'n' of alloc.col. Over-allocate 1024 spare column slots - "datatable.integer64"="'integer64'", # datatable. integer64|double|character "datatable.auto.index"="TRUE", # DT[col=="val"] to auto add index so 2nd time faster "datatable.use.index"="TRUE", # global switch to address #1422 - "datatable.fread.datatable"="TRUE", "datatable.prettyprint.char" = NULL, # FR #1091 - "datatable.old.unique.by.key" = "FALSE", # TODO: change warnings in duplicated.R to error on or after Jan 2019 then remove in Jan 2020. - "datatable.logical01" = "TRUE" # fwrite/fread to revert to FALSE. TODO: warn in next release and remove after 1 year + "datatable.old.unique.by.key" = "FALSE" # TODO: change warnings in duplicated.R to error on or after Jan 2019 then remove in Jan 2020. 
) for (i in setdiff(names(opts),names(options()))) { eval(parse(text=paste("options(",i,"=",opts[i],")",sep=""))) diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw index b550106d07..4bfc28c672 100644 --- a/inst/tests/tests.Rraw +++ b/inst/tests/tests.Rraw @@ -2455,7 +2455,7 @@ if (test_bit64) { # getwd() has been set by test.data.table() to the location of this tests.Rraw file. Test files should be in the same directory. f = testDir("ch11b.dat") # http://www.stats.ox.ac.uk/pub/datasets/csb/ch11b.dat test(900.1, fread(f, logical01=FALSE), as.data.table(read.table(f))) -test(900.2, fread(f), as.data.table(read.table(f))[,V5:=as.logical(V5)]) +test(900.2, fread(f, logical01=TRUE), as.data.table(read.table(f))[,V5:=as.logical(V5)]) f = testDir("1206FUT.txt") # a CRLF line ending file (DOS) test(901.1, DT<-fread(f,strip.white=FALSE), setDT(read.table(f,sep="\t",header=TRUE,colClasses=as.vector(sapply(DT,class))))) @@ -2759,14 +2759,14 @@ test(1009, DT[,list(mean(a), sum(a)),by=b], data.table(b=c(1,2),V1=c(NA,0),V2=c( # an fread error shouldn't hold a lock on the file on Windows # TODO: now that these are warnings and not errors, we need another way to trigger a STOP() inside fread.c. options(warn=2) isn't enough. 
cat('A,B\n1,2\n3\n5,6\n', file=(f<-tempfile())) -test(1010.1, fread(f), ans<-data.table(A=TRUE, B=2L), warning=(txt<-"Stopped early on line 3.*Expected 2 fields but found 1.*fill.*TRUE.*<<3>>")) -test(1010.2, fread(f), ans, warning=txt) +test(1010.1, fread(f,logical01=TRUE), ans<-data.table(A=TRUE, B=2L), warning=(txt<-"Stopped early on line 3.*Expected 2 fields but found 1.*fill.*TRUE.*<<3>>")) +test(1010.2, fread(f,logical01=TRUE), ans, warning=txt) cat('7\n8,9',file=f,append=TRUE) # that append works after error test(1010.3, fread(f,fill=TRUE), data.table(A=INT(1,3,5,7,8), B=INT(2,NA,6,NA,9))) -test(1010.4, fread(f), ans, warning=txt) +test(1010.4, fread(f,logical01=TRUE), ans, warning=txt) cat('A,B\n1,2\n3\n5,6\n', file=f) # that overwrite works after error test(1010.5, fread(f,fill=TRUE), data.table(A=INT(1,3,5), B=INT(2,NA,6))) -test(1010.6, fread(f), ans, warning=txt) +test(1010.6, fread(f,logical01=TRUE), ans, warning=txt) unlink(f) # that file can be removed after error test(1010.7, !file.exists(f)) @@ -3052,7 +3052,8 @@ b = rbind(a, data.table(z=2,x=1)) test(1080, b$z, c(1,2,3,2)) # mid row logical detection -test(1081, fread("A,B,C\n1,T,2\n"), data.table(A=TRUE,B="T",C=2L)) +test(1081.1, fread("A,B,C\n1,T,2\n",logical01=TRUE), data.table(A=TRUE,B="T",C=2L)) +test(1081.2, fread("A,B,C\n1,T,2\n",logical01=FALSE), data.table(A=1L,B="T",C=2L)) # cartesian join answer's key should contain only the columns considered in binary search. Fixes #2677 set.seed(45) @@ -10253,8 +10254,8 @@ if (test_nanotime) { # check too many fields error from ,\n line ending highlighted in #2044 test(1753.1, fread("X,Y\n1,2\n3,4\n5,6"), data.table(X=INT(1,3,5),Y=INT(2,4,6))) -test(1753.2, fread("X,Y\n1,2\n3,4,\n5,6"), ans<-data.table(X=TRUE,Y=2L), warning="Stopped.*line 3. Expected 2 fields but found 3.*discarded.*<<3,4,>>") -test(1753.3, fread("X,Y\n1,2\n3,4,7\n5,6"), ans, warning="Stopped.*line 3. 
Expected 2 fields but found 3.*discarded.*<<3,4,7>>") +test(1753.2, fread("X,Y\n1,2\n3,4,\n5,6",logical01=TRUE), ans<-data.table(X=TRUE,Y=2L), warning="Stopped.*line 3. Expected 2 fields but found 3.*discarded.*<<3,4,>>") +test(1753.3, fread("X,Y\n1,2\n3,4,7\n5,6",logical01=TRUE), ans, warning="Stopped.*line 3. Expected 2 fields but found 3.*discarded.*<<3,4,7>>") # issue 2051 where a quoted field contains ", New quote rule detection handles it. test(1753.4, fread(testDir("issue_2051.csv"))[2,grep("^Our.*tool$",COLUMN50)], 1L) @@ -10274,7 +10275,7 @@ test(1753.4, fread(testDir("issue_2051.csv"))[2,grep("^Our.*tool$",COLUMN50)], 1 test(1754, fread(testDir("allchar.csv"))[c(1,2,17575,17576),col2], c("AAN","BAN","YZZ","ZZZ")) # unescaped embedded quotes from here: http://stackoverflow.com/questions/42939866/fread-multiple-separators-in-a-string -test(1755, fread(testDir("unescaped.csv")), +test(1755, fread(testDir("unescaped.csv"), logical01=TRUE), data.table(No =c(FALSE,TRUE), Comment=c('he said:"wonderful."', 'The problem is: reading table, and also "a problem, yes." keep going on.'), Type =c('A','A'))) @@ -10704,7 +10705,7 @@ test(1808.2, fread("A,B\r1,2\r3,4\r"), data.table(A=c(1L,3L),B=c(2L,4L))) cat("A,B\r1,2\r3,4",file=f<-tempfile()) test(1808.3, fread(f), data.table(A=c(1L,3L),B=c(2L,4L))) unlink(f) -test(1808.4, fread("A,B\r1,3\r\r\r2,4\r"), data.table(A=TRUE, B=3L), warning="Discarded single-line footer: <<2,4>>") +test(1808.4, fread("A,B\r1,3\r\r\r2,4\r", logical01=TRUE), data.table(A=TRUE, B=3L), warning="Discarded single-line footer: <<2,4>>") test(1808.5, fread("A,B\r4,3\r\r \r2,4\r"), data.table(A=4L, B=3L), warning="Discarded single-line footer: <<2,4>>") test(1808.6, fread("A,B\r1,3\r\r \r2,4\r", blank.lines.skip=TRUE), data.table(A=1:2, B=3:4)) test(1808.7, fread("A,B\r1,3\r\r \r2,4\r", fill=TRUE), data.table(A=c(1L,NA,NA,2L), B=c(3L,NA,NA,4L))) @@ -10874,8 +10875,8 @@ test(1832.2, any(grepl("^Column writers.* [.][.][.] 
", capture.output(fwrite(DT, unlink(f) # ensure explicitly setting select to default value doesn't error, #2007 -test(1833, fread('V1,V2\n1,2', select = NULL), - data.table(V1 = TRUE, V2 = 2L)) +test(1833, fread('V1,V2\n5,6', select = NULL), + data.table(V1 = 5L, V2 = 6L)) # fread file for which nextGoodLine in sampling struggles, #2404 # Every line in the first 100 has a quoted field containing sep, so the quote rule was being @@ -11424,11 +11425,8 @@ test(1879.5, DT[0:5], DT) DT = data.table(A=rep("True", 2200), B="FALSE", C='0') DT[111, LETTERS[1:3] := .("fread", "is", "faithful")] fwrite(DT, f<-tempfile()) -test(1879.6, fread(f, verbose=TRUE), DT, - output=paste("Column 1.*bumped from 'bool8' to 'string'", - "Column 2.*bumped from 'bool8' to 'string'", - "Column 3.*bumped from 'bool8' to 'string'", - sep = '.*')) +test(1879.6, fread(f, verbose=TRUE, logical01=TRUE), DT, + output="Column 1.*bumped from 'bool8' to 'string'.*\nColumn 2.*bumped from 'bool8' to 'string'.*\nColumn 3.*bumped from 'bool8' to 'string'") unlink(f) # Fix duplicated names arising in merge when by.x in names(y), PR#2631, PR#2653 diff --git a/man/fread.Rd b/man/fread.Rd index 2067d0b321..e6fff9bda0 100644 --- a/man/fread.Rd +++ b/man/fread.Rd @@ -10,16 +10,19 @@ } \usage{ fread(input, file, sep="auto", sep2="auto", dec=".", quote="\"", -nrows=Inf, header="auto", na.strings=getOption("datatable.na.strings","NA"), -stringsAsFactors=FALSE, verbose=getOption("datatable.verbose"), +nrows=Inf, header="auto", +na.strings=getOption("datatable.na.strings","NA"), # due to change to ""; see NEWS +stringsAsFactors=FALSE, verbose=getOption("datatable.verbose", FALSE), skip="__auto__", select=NULL, drop=NULL, colClasses=NULL, -integer64=getOption("datatable.integer64"), # default: "integer64" +integer64=getOption("datatable.integer64", "integer64"), col.names, check.names=FALSE, encoding="unknown", strip.white=TRUE, fill=FALSE, blank.lines.skip=FALSE, key=NULL, showProgress=interactive(), 
-data.table=getOption("datatable.fread.datatable"), -nThread=getDTthreads(), logical01=getOption("datatable.logical01", TRUE), autostart=NA +data.table=getOption("datatable.fread.datatable", TRUE), +nThread=getDTthreads(), +logical01=getOption("datatable.logical01", FALSE), # due to change to TRUE; see NEWS +autostart=NA ) } \arguments{ diff --git a/man/fwrite.Rd b/man/fwrite.Rd index 65da84e3a0..3fa8f7b38d 100644 --- a/man/fwrite.Rd +++ b/man/fwrite.Rd @@ -12,7 +12,7 @@ fwrite(x, file = "", append = FALSE, quote = "auto", eol = if (.Platform$OS.type=="windows") "\r\n" else "\n", na = "", dec = ".", row.names = FALSE, col.names = TRUE, qmethod = c("double","escape"), - logical01 = getOption("datatable.logical01", TRUE), + logical01 = getOption("datatable.logical01", FALSE), # due to change to TRUE; see NEWS logicalAsInt = logical01, # deprecated dateTimeAs = c("ISO","squash","epoch","write.csv"), buffMB = 8L, nThread = getDTthreads(), From b67921578953c5d3ff1781c724a9f1ce987d78b0 Mon Sep 17 00:00:00 2001 From: Matt Dowle Date: Sat, 17 Mar 2018 00:11:59 -0700 Subject: [PATCH 11/14] Coverage --- inst/tests/tests.Rraw | 3 +++ 1 file changed, 3 insertions(+) diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw index 4bfc28c672..9b958f2e36 100644 --- a/inst/tests/tests.Rraw +++ b/inst/tests/tests.Rraw @@ -9073,6 +9073,9 @@ test(1658.28, fwrite(data.table(a=1)[NULL,]), error="ncol(x) > 0L is not TRUE") # 0.0 written as 0, but TODO #2398, probably related to the 2 lines after l==0 missing coverage in writeFloat64 test(1658.29, fwrite(data.table(id=c("A","B","C"), v=c(1.1,0.0,9.9))), output="id,v\nA,1.1\nB,0\nC,9.9") +# logical NA as "NA", instead of the default na="" which writes all types including in character column as ,, consistently. 
+test(1658.30, fwrite(data.table(id=1:3,bool=c(TRUE,NA,FALSE)),na="NA"), output="\"id\",\"bool\"\n1,TRUE\n2,NA\n3,FALSE") + ## End fwrite tests # tests for #679, inrange(), FR #707 From 48146082cc7b1ae9e619780d580f518ff5604478 Mon Sep 17 00:00:00 2001 From: Matt Dowle Date: Sat, 17 Mar 2018 01:32:10 -0700 Subject: [PATCH 12/14] Link added to getOption 100x speedup submitted to R-core --- R/onLoad.R | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/R/onLoad.R b/R/onLoad.R index f92d66003e..d7ffc1c5da 100644 --- a/R/onLoad.R +++ b/R/onLoad.R @@ -31,7 +31,7 @@ # Set options for the speed boost in v1.8.0 by avoiding 'default' arg of getOption(,default=) # In fread and fwrite we have moved back to using getOption's default argument since it is unlikely fread and fwrite will be called in a loop many times, plus they # are relatively heavy functions where the overhead in getOption() would not be noticed. It's only really [.data.table where getOption default bit. - # TODO: submit improvement to .Internal(getOption(x)) in base::getOption to return NULL when option not set, to avoid (relatively slow) 'x %in% names(options())' there. + # Improvement to base::getOption() now submitted (100x; 5s down to 0.05s): https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17394 opts = c("datatable.verbose"="FALSE", # datatable. "datatable.nomatch"="NA_integer_", # datatable. "datatable.optimize"="Inf", # datatable. From 2d2267cfff6afcff1dd90809ef4190cc1348cb89 Mon Sep 17 00:00:00 2001 From: Matt Dowle Date: Sat, 17 Mar 2018 01:49:19 -0700 Subject: [PATCH 13/14] New test covers what it was supposed to now. 
---
 inst/tests/tests.Rraw | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw
index 9b958f2e36..6c722b7d8c 100644
--- a/inst/tests/tests.Rraw
+++ b/inst/tests/tests.Rraw
@@ -9073,8 +9073,8 @@ test(1658.28, fwrite(data.table(a=1)[NULL,]), error="ncol(x) > 0L is not TRUE")
 # 0.0 written as 0, but TODO #2398, probably related to the 2 lines after l==0 missing coverage in writeFloat64
 test(1658.29, fwrite(data.table(id=c("A","B","C"), v=c(1.1,0.0,9.9))), output="id,v\nA,1.1\nB,0\nC,9.9")
-# logical NA as "NA", instead of the default na="" which writes all types including in character column as ,, consistently.
-test(1658.30, fwrite(data.table(id=1:3,bool=c(TRUE,NA,FALSE)),na="NA"), output="\"id\",\"bool\"\n1,TRUE\n2,NA\n3,FALSE")
+# logical NA as "NA" when logical01=TRUE, instead of the default na="" which writes all types including in character column as ,, consistently.
+test(1658.30, fwrite(data.table(id=1:3,bool=c(TRUE,NA,FALSE)),na="NA",logical01=TRUE), output="\"id\",\"bool\"\n1,1\n2,NA\n3,0")
 ## End fwrite tests

From 1b035ac4b2f658f60e7077607e3d83ea4b33a1c3 Mon Sep 17 00:00:00 2001
From: Matt Dowle
Date: Tue, 20 Mar 2018 10:52:50 -0700
Subject: [PATCH 14/14] NEW item only. Added PR link and embellished wording.

---
 NEWS.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/NEWS.md b/NEWS.md
index 620378c72c..537ed18966 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -11,7 +11,7 @@
 getOption("datatable.na.strings", "NA") # this release; i.e. the same; no change yet
 getOption("datatable.na.strings", "") # future release
 ```
-This option controls how `,,` is read in character columns. It does not affect numeric columns which read `,,` as `NA` regardless. We would like `,,`=>`NA` for consistency with numeric types, and `,"",`=>empty string to be the standard default for `fwrite/fread` character columns so that `fread(fwrite(DT))==DT` without needing any change to any parameters.
`fwrite` has never written `NA` as `"NA"`; it already writes `,,` by default. The use of R's `getOption()` allows users to move forward now, using `options(datatable.fread.na.strings="")`, or restore old behaviour when the default's default is changed in future, using `options(datatable.fread.na.strings="NA")`.
+This option controls how `,,` is read in character columns. It does not affect numeric columns which read `,,` as `NA` regardless. We would like `,,`=>`NA` for consistency with numeric types, and `,"",`=>empty string to be the standard default for `fwrite/fread` character columns so that `fread(fwrite(DT))==DT` without needing any change to any parameters. `fwrite` has never written `NA` as `"NA"` in case `"NA"` is a valid string in the data; e.g., 2 character id columns sometimes do. Instead, `fwrite` has always written `,,` by default for an `<NA>` in a character columns. The use of R's `getOption()` allows users to move forward now, using `options(datatable.fread.na.strings="")`, or restore old behaviour when the default's default is changed in future, using `options(datatable.fread.na.strings="NA")`.

 2. `fread()` and `fwrite()`'s `logical01=` argument :
 ```
@@ -21,7 +21,7 @@ This option controls how `,,` is read in character columns. It does not affect n
 ```
 This option controls whether a column of all 0's and 1's is read as `integer`, or `logical` directly to avoid needing to change the type afterwards to `logical` or use `colClasses`. `0/1` is smaller and faster than `"TRUE"/"FALSE"`, which can make a significant difference to space and time the more `logical` columns there are. When the default's default changes to `TRUE` for `fread` we do not expect much impact since all arithmetic operators that are currently receiving 0's and 1's as type `integer` (think `sum()`) but instead could receive `logical`, would return exactly the same result on the 0's and 1's as `logical` type.
However, code that is manipulating column types using `is.integer` or `is.logical` on `fread`'s result, could require change. It could be painful if `DT[(logical_column)]` (i.e. `DT[logical_column==TRUE]`) changed behaviour due to `logical_column` no longer being type `logical` but `integer`. But that is not the change proposed. The change is the other way around; i.e., a previously `integer` column holding only 0's and 1's would now be type `logical`. Since it's that way around, we believe the scope for breakage is limited. We think a lot of code is converting 0/1 integer columns to logical anyway, either using `colClasses=` or afterwards with an assign. For `fwrite`, the level of breakage depends on the consumer of the output file. We believe `0/1` is a better more standard default choice to move to. See notes below about improvements to `fread`'s sampling for type guessing, and automatic rereading in the rare cases of out-of-sample type surprises.

-These options are meant for temporary use to aid your migration. You are not meant to set them to the old default and then not migrate your code that is dependent on the default. Either set the argument explicitly to the old value in all calls, or change the code to cope with the new default. In a few years we will start to remove the options, warning you if you are using them, and return to a simple default. See the history of NEWS and NEWS.0 for past migrations that have, generally speaking, been successfully managed in this way. For example, at the end of NOTES in this version's release notes, is a note about usage of `datatable.old.unique.by.key` now warning, as you were warned it would do over a year ago.
+These options are meant for temporary use to aid your migration, [#2652](https://github.com/Rdatatable/data.table/pull/2652). You are not meant to set them to the old default and then not migrate your code that is dependent on the default.
Either set the argument explicitly so your code is not dependent on the default, or change the code to cope with the new default. Over the next few years we will slowly start to remove these options, warning you if you are using them, and return to a simple default. See the history of NEWS and NEWS.0 for past migrations that have, generally speaking, been successfully managed in this way. For example, at the end of NOTES for this version (below in this file) is a note about the usage of `datatable.old.unique.by.key` now warning, as you were warned it would do over a year ago. When that change was introduced, the default was changed and that option provided an option to restore the old behaviour. These `fread`/`fwrite` changes are even more cautious and not even changing the default's default yet. Giving you extra warning by way of this notice to move forward. And giving you a chance to object.

 #### NEW FEATURES