Merged
2 changes: 2 additions & 0 deletions NEWS.md
@@ -10,6 +10,8 @@

2. `mean(na.rm=TRUE)` by group is now GForce optimized, [#4849](https://github.com/Rdatatable/data.table/issues/4849). Thanks to the [h2oai/db-benchmark](https://github.com/h2oai/db-benchmark) project for spotting this issue. The 1 billion row example in the issue shows 48s reduced to 14s. The optimization also applies to type `integer64` resulting in a difference to the `bit64::mean.integer64` method: `data.table` returns a `double` result whereas `bit64` rounds the mean to the nearest integer.

3. `fwrite()` now writes UTF-8 or native csv files by specifying the `encoding=` argument, [#1770](https://github.com/Rdatatable/data.table/pull/1770). Thanks to @shrektan for the request and the PR.
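
   For example, a minimal sketch of the new argument in use (assuming a build of `data.table` that includes this PR):

   ```r
   library(data.table)
   dt = data.table(x = c("fa\u00e7ile", "plain"))  # one non-ASCII string
   f = tempfile(fileext = ".csv")
   fwrite(dt, f, encoding = "UTF-8", bom = TRUE)   # re-encode strings to UTF-8 on write
   fwrite(dt, f, encoding = "")                    # default: write raw bytes as-is
   ```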

## BUG FIXES

1. `by=.EACHI` when `i` is keyed but `on=` different columns than `i`'s key could create an invalidly keyed result, [#4603](https://github.com/Rdatatable/data.table/issues/4603) [#4911](https://github.com/Rdatatable/data.table/issues/4911). Thanks to @myoung3 and @adamaltmejd for reporting, and @ColeMiller1 for the PR. An invalid key is where a `data.table` is marked as sorted by the key columns but the data is not sorted by those columns, leading to incorrect results from subsequent queries.
8 changes: 6 additions & 2 deletions R/fwrite.R
@@ -11,8 +11,12 @@ fwrite = function(x, file="", append=FALSE, quote="auto",
compress = c("auto", "none", "gzip"),
yaml = FALSE,
bom = FALSE,
verbose=getOption("datatable.verbose", FALSE)) {
verbose=getOption("datatable.verbose", FALSE),
encoding = "") {
**Member:** nit: could have signature `encoding = c("", "UTF-8", "native")` and use `match.arg()` instead of the check below

**Member (Author):** The `encoding` arg of `fread()` doesn't use this signature either. The reason is that the empty string doesn't seem to work well with `match.arg()`:

```r
f <- function(x = c("", "a", "b")) {
  match.arg(x)
}
f()
#> [1] ""
f("a")
#> [1] "a"
f("b")
#> [1] "b"
f("") # this throws
#> Error in match.arg(x): 'arg' should be one of "", "a", "b"
```

Created on 2020-10-30 by the reprex package (v0.3.0)
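A manual check, as the PR does with `%chin%`, sidesteps the `match.arg()` problem; a base-R sketch of the same idiom (the function name `f` is illustrative, not part of the package):

```r
f <- function(encoding = "") {
  # mirror the PR's validation: exactly one of "", "UTF-8", "native"
  if (length(encoding) != 1L || !encoding %in% c("", "UTF-8", "native"))
    stop("Argument 'encoding' must be '', 'UTF-8' or 'native'.")
  encoding
}
f("")       # the empty-string default passes, unlike with match.arg()
f("UTF-8")
```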


na = as.character(na[1L]) # fix for #1725
if (length(encoding) != 1L || !encoding %chin% c("", "UTF-8", "native")) {
stop("Argument 'encoding' must be '', 'UTF-8' or 'native'.")
}
if (missing(qmethod)) qmethod = qmethod[1L]
if (missing(compress)) compress = compress[1L]
if (missing(dateTimeAs)) { dateTimeAs = dateTimeAs[1L] }
@@ -108,7 +112,7 @@ fwrite = function(x, file="", append=FALSE, quote="auto",
file = enc2native(file) # CfwriteR cannot handle UTF-8 if that is not the native encoding, see #3078.
.Call(CfwriteR, x, file, sep, sep2, eol, na, dec, quote, qmethod=="escape", append,
row.names, col.names, logical01, scipen, dateTimeAs, buffMB, nThread,
showProgress, is_gzip, bom, yaml, verbose)
showProgress, is_gzip, bom, yaml, verbose, encoding)
invisible()
}

17 changes: 17 additions & 0 deletions inst/tests/tests.Rraw
@@ -17302,3 +17302,20 @@ test(2165.5, X[Y, on=.(A), x.B, by=.EACHI], data.table(A=2:1, x.B=c(2L,NA)))
# missing j was caught in groupingsets but not cube, leading to unexpected error message, #4282
DT = data.table(a=1)
test(2166, cube(DT, by='a'), error="Argument 'j' is required")

# fwrite support encoding "native" and "UTF-8", #1770
latin1 = "fa\xE7ile"
Encoding(latin1) = "latin1"
utf8 = iconv(latin1, "latin1", "UTF-8")
text = c(latin1, utf8, "aaaaaaaa")
dt = data.table(A = text, B = as.factor(text))
dt2 = data.table(A = text, B = text)
csvfile = tempfile(fileext = ".csv")
fwrite(dt, csvfile, encoding = "UTF-8", bom = TRUE)
test(2167.1, fread(csvfile, encoding = "UTF-8"), dt2)
if (identical(text, enc2native(text))) { # ensure native encoding can represent latin1 strings
fwrite(dt, csvfile, encoding = "native")
test(2167.2, fread(csvfile), dt2)
}
test(2167.3, fwrite(dt, csvfile, encoding="nativ"), error="Argument 'encoding' must be")
unlink(csvfile)
4 changes: 3 additions & 1 deletion man/fwrite.Rd
@@ -19,7 +19,8 @@ fwrite(x, file = "", append = FALSE, quote = "auto",
compress = c("auto", "none", "gzip"),
yaml = FALSE,
bom = FALSE,
verbose = getOption("datatable.verbose", FALSE))
verbose = getOption("datatable.verbose", FALSE),
encoding = "")
}
\arguments{
\item{x}{Any \code{list} of same length vectors; e.g. \code{data.frame} and \code{data.table}. If \code{matrix}, it gets internally coerced to \code{data.table} preserving col names but not row names}
@@ -59,6 +60,7 @@ fwrite(x, file = "", append = FALSE, quote = "auto",
\item{yaml}{If \code{TRUE}, \code{fwrite} will output a CSVY file, that is, a CSV file with metadata stored as a YAML header, using \code{\link[yaml]{as.yaml}}. See \code{Details}. }
\item{bom}{If \code{TRUE} a BOM (Byte Order Mark) sequence (EF BB BF) is added at the beginning of the file; format 'UTF-8 with BOM'.}
\item{verbose}{Be chatty and report timings?}
\item{encoding}{ The encoding of the strings written to the CSV file. Default is \code{""}, which means writing raw bytes without considering their encoding. Other possible options are \code{"UTF-8"} and \code{"native"}. }
**Member:** if this matches the behavior of `write.table` (or another base function), we should mention that here

**Member (Author):** It's a little bit different, and the `fileEncoding` section of `base::write.table()` doesn't seem very clear to me.

}
\details{
\code{fwrite} began as a community contribution with \href{https://github.com/Rdatatable/data.table/pull/1613}{pull request #1613} by Otto Seiskari. This gave Matt Dowle the impetus to specialize the numeric formatting and to parallelize: \url{https://www.h2o.ai/blog/fast-csv-writing-for-r/}. Final items were tracked in \href{https://github.com/Rdatatable/data.table/issues/1664}{issue #1664} such as automatic quoting, \code{bit64::integer64} support, decimal/scientific formatting exactly matching \code{write.csv} between 2.225074e-308 and 1.797693e+308 to 15 significant figures, \code{row.names}, dates (between 0000-03-01 and 9999-12-31), times and \code{sep2} for \code{list} columns where each cell can itself be a vector.
17 changes: 13 additions & 4 deletions src/fwriteR.c
@@ -5,18 +5,23 @@
#define DATETIMEAS_EPOCH 2
#define DATETIMEAS_WRITECSV 3

static bool utf8=false;
static bool native=false;
#define TO_UTF8(s) (utf8 && NEED2UTF8(s))
**Member:** nit: `TO_UTF8` reads to me like "convert `s` to UTF-8" but IINM it's a logical check. Unfortunately the most logical choice is `NEED2UTF8`, which is taken... `REALLY_NEED2UTF8`? 😛 Nothing comes to mind unfortunately, worth considering some more though.

#define TO_NATIVE(s) (native && (s)!=NA_STRING && !IS_ASCII(s))
**@MichaelChirico (Member), Oct 29, 2020:** @jangorecki I think you recommended putting `IS_ASCII` last, right? We should do that in `data.table.h` then too, right (2nd, before `IS_UTF8`)?

#define NEED2UTF8(s) !(IS_ASCII(s) || (s)==NA_STRING || IS_UTF8(s))

#define ENCODED_CHAR(s) (TO_UTF8(s) ? translateCharUTF8(s) : (TO_NATIVE(s) ? translateChar(s) : CHAR(s)))
**@MichaelChirico (Member), Oct 29, 2020:** note that `TO_UTF8` and `TO_NATIVE` both check `IS_ASCII` and `s==NA_STRING`, so I think there's some redundancy here.

How about

`s == NA_STRING ? CHAR(s) : ( IS_ASCII(s) ? CHAR(s) : ( native ? translateChar(s) : IS_UTF8(s) ? translateCharUTF8(s) : CHAR(s) ) )`

or

`s == NA_STRING || IS_ASCII(s) ? CHAR(s) : ( IS_UTF8(s) ? (utf8 ? CHAR(s) : translateChar(s)) : (utf8 ? translateCharUTF8(s) : translateChar(s)) )`

based on this truth table (the other 22/32 combinations are impossible because utf8 ➡️ !native and IS_ASCII ➡️ !IS_NATIVE, and vice versa):

(collapsed: truth table of all possible `IS_ASCII`, `IS_UTF8`, `s==NA_STRING`, `utf8`, `native` values)

**Member (Author):** Yes, you're right, but on second thought I still prefer the current version. The reasons are:

  1. In terms of speed, the current version doesn't do anything unnecessary. For the default encoding value `""`, checking two boolean values should be much faster than checking `NA_STRING` and `IS_ASCII`.

  2. In terms of clarity, although there's a small redundancy, the current version is easier to understand.


static char sep2; // '\0' if there are no list columns. Otherwise, the within-column separator.
static bool logical01=true; // should logicals be written as 0|1 or true|false. Needed by list column writer too in case a cell is a logical vector.
static int dateTimeAs=0; // 0=ISO(yyyy-mm-dd), 1=squash(yyyymmdd), 2=epoch, 3=write.csv
static const char *sep2start, *sep2end;
// sep2 is in main fwrite.c so that writeString can quote other fields if sep2 is present in them
// if there are no list columns, set sep2=='\0'

// Non-agnostic helpers ...

const char *getString(SEXP *col, int64_t row) { // TODO: inline for use in fwrite.c
SEXP x = col[row];
return x==NA_STRING ? NULL : CHAR(x);
return x==NA_STRING ? NULL : ENCODED_CHAR(x);
}

int getStringLen(SEXP *col, int64_t row) {
@@ -45,7 +50,7 @@ int getMaxCategLen(SEXP col) {
const char *getCategString(SEXP col, int64_t row) {
// the only writer that needs to have the header of the SEXP column, to get to the levels
int x = INTEGER(col)[row];
return x==NA_INTEGER ? NULL : CHAR(STRING_ELT(getAttrib(col, R_LevelsSymbol), x-1));
return x==NA_INTEGER ? NULL : ENCODED_CHAR(STRING_ELT(getAttrib(col, R_LevelsSymbol), x-1));
}

writer_fun_t funs[] = {
@@ -164,10 +169,12 @@ SEXP fwriteR(
SEXP is_gzip_Arg,
SEXP bom_Arg,
SEXP yaml_Arg,
SEXP verbose_Arg
SEXP verbose_Arg,
SEXP encoding_Arg
)
{
if (!isNewList(DF)) error(_("fwrite must be passed an object of type list; e.g. data.frame, data.table"));

fwriteMainArgs args = {0}; // {0} to quieten valgrind's uninitialized, #4639
args.is_gzip = LOGICAL(is_gzip_Arg)[0];
args.bom = LOGICAL(bom_Arg)[0];
@@ -224,6 +231,8 @@
dateTimeAs = INTEGER(dateTimeAs_Arg)[0];
logical01 = LOGICAL(logical01_Arg)[0];
args.scipen = INTEGER(scipen_Arg)[0];
utf8 = !strcmp(CHAR(STRING_ELT(encoding_Arg, 0)), "UTF-8");
native = !strcmp(CHAR(STRING_ELT(encoding_Arg, 0)), "native");

int firstListColumn = 0;
for (int j=0; j<args.ncol; j++) {