Merged
Commits
28 commits
73f229b
add support for native parsing of iso8601 timestamps
May 20, 2020
1a833b7
also add ISO date parser
May 20, 2020
e7f8f92
tz offset parsing, return POSIXct/IDate columns to R
May 21, 2020
a4c9c79
add NEWS item, mention in ?fread
May 21, 2020
4b48f01
initial assay
May 21, 2020
ae26400
fix broken old tests
May 21, 2020
aa58ab4
coverage tests
May 21, 2020
656224c
API for colClasses
May 21, 2020
a97d967
use proper type in freadLookups
May 21, 2020
9e6b409
Merge branch 'master' into fread-iso8601
mattdowle Jun 29, 2020
af2718c
coverage, and news item tweak
mattdowle Jun 29, 2020
eaf2d57
as.POSIXct comments, extra tests, one test fail is correct for now
mattdowle Jul 2, 2020
e95f420
flexibility to parse date-alike columns as POSIXct manually
Jul 2, 2020
ebcc5fd
fix regression
Jul 2, 2020
597d261
fixed for real this time
Jul 2, 2020
da49cd9
force tz on new test
Jul 2, 2020
d7f29f9
added option to restore old behaviour, and tweaked news item
mattdowle Jul 11, 2020
00959f3
Restored as.POSIXct and covered it to ensure that's what's called whe…
mattdowle Jul 11, 2020
9300fe0
Merge branch 'master' of github.com:Rdatatable/data.table into fread-…
Jul 11, 2020
b6bc5cd
further tweak NEWS item
Jul 11, 2020
0034385
further tweak of NEWS item
Jul 11, 2020
e89d159
restore unset TZ in test 2124; test 2150.15 now correctly fails
mattdowle Jul 13, 2020
baa3c12
parse datetime only when Z or offset is present; 3 tests left to fix
mattdowle Jul 14, 2020
4102825
news item update
mattdowle Jul 14, 2020
8a9c19c
split test 2150.12 into two
mattdowle Jul 14, 2020
6efff33
pass 2150.13; date-only mixed with UTC-marked datetime
mattdowle Jul 14, 2020
bcd8995
colClasses='POSIXct' on date-only should bump to character and as.POS…
mattdowle Jul 14, 2020
318a605
cover POSIXct in select= on a date-only
mattdowle Jul 14, 2020
2 changes: 2 additions & 0 deletions .dev/.bash_aliases
@@ -5,6 +5,8 @@
# git config --global difftool.prompt false
alias gd='git difftool &> /dev/null'
alias gdm='git difftool master &> /dev/null'
# If meld has scrolling issues, turn off GTK animation (which I don't need anyway):
# https://gitlab.gnome.org/GNOME/meld/-/issues/479#note_866040

alias Rdevel='~/build/R-devel/bin/R --vanilla'
alias Rdevel-strict-gcc='~/build/R-devel-strict-gcc/bin/R --vanilla'
5 changes: 5 additions & 0 deletions NEWS.md
@@ -4,6 +4,11 @@

# data.table [v1.12.9](https://github.com/Rdatatable/data.table/milestone/19) (in development)

## POTENTIALLY BREAKING CHANGES

1. `fread` now supports native parsing of `%Y-%m-%d`, and [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) `%Y-%m-%dT%H:%M:%OS%z`, [#4464](https://github.com/Rdatatable/data.table/pull/4464). Dates are returned as `data.table`'s `integer`-backed `IDate` class (see `?IDate`), and datetimes are returned as `POSIXct` provided either `Z` or the offset from `UTC` is present; e.g. `fwrite()` outputs UTC by default including the final `Z`. `IDate` inherits from R's `Date` and is identical other than it uses the `integer` type where (oddly) R uses the `double` type for dates (8 bytes instead of 4). Since this is a potentially breaking change, i.e. existing code may depend on dates and datetimes being read as type character as before, a temporary option is provided to restore the old behaviour should you need it: `options(datatable.old.fread.datetime.character=TRUE)`. However, in most cases, we expect existing code to still work with no changes. For example, calls already using `colClasses="POSIXct"` will now use the faster parser if the `Z` is present, otherwise R's `as.POSIXct` will be used as before which interprets datetimes that are missing the UTC marker to be in the local timezone.
The minor version number is bumped from 12 to 13, i.e. `v1.13.0`, where the `.0` conveys 'be-aware' as is common practice. As with any new feature, there may be bugs to fix and changes to defaults required in future. In addition to convenience, `fread` is now significantly faster in the presence of dates, and UTC-marked datetimes.
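
The offset handling described in item 1 can be pictured with a small C sketch. It is an illustration only, not the parser added to `fread` in this PR: `parse_utc_offset` is a hypothetical helper, and its bounds (hours up to 23, minutes up to 59) are assumptions chosen so that the suffixes treated as invalid in tests 2150.07 to 2150.10 (`+3600`, `+36`, `+XXX`, `+00:XX`) are rejected. It accepts `Z`, `+HH`, `+HHMM` and `+HH:MM` with either sign and reports the offset in seconds east of UTC; a caller would subtract that value from the parsed wall-clock time to obtain UTC.

```c
#include <ctype.h>
#include <stdbool.h>

// Hypothetical helper, NOT taken from fread.c.
// Parses a UTC marker that must span the whole string `s`:
//   "Z", "+HH", "+HHMM" or "+HH:MM" (a leading '-' is also accepted).
// On success stores the offset in seconds east of UTC and returns true.
static bool parse_utc_offset(const char *s, int *offset_sec) {
  if (s[0] == 'Z' && s[1] == '\0') {
    *offset_sec = 0;
    return true;
  }
  if (s[0] != '+' && s[0] != '-') return false;
  const int sign = (s[0] == '-') ? -1 : 1;
  s++;

  // Two mandatory hour digits.
  if (!isdigit((unsigned char)s[0]) || !isdigit((unsigned char)s[1])) return false;
  const int hh = (s[0] - '0') * 10 + (s[1] - '0');
  s += 2;

  // Optional minutes, with or without a ':' separator.
  int mm = 0;
  if (*s != '\0') {
    if (*s == ':') s++;
    if (!isdigit((unsigned char)s[0]) || !isdigit((unsigned char)s[1])) return false;
    mm = (s[0] - '0') * 10 + (s[1] - '0');
    s += 2;
    if (*s != '\0') return false;  // trailing characters are not allowed
  }

  // Assumed bounds; the tests only show that +36 and +3600 are rejected.
  if (hh > 23 || mm > 59) return false;

  *offset_sec = sign * (hh * 3600 + mm * 60);
  return true;
}
```

With this sketch, `+0115` yields 4500 seconds, which lines up with test 2150.05 where the expected values are `times - 4500`.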

## NEW FEATURES

1. `%chin%` and `chmatch(x, table)` are faster when `x` is length 1, `table` is long, and `x` occurs near the start of `table`. Thanks to Michael Chirico for the suggestion, [#4117](https://github.com/Rdatatable/data.table/pull/4117#discussion_r358378409).
5 changes: 4 additions & 1 deletion R/fread.R
@@ -295,7 +295,10 @@ yaml=FALSE, autostart=NA, tmpdir=tempdir())
"complex" = as.complex(v),
"raw" = as_raw(v), # Internal implementation
"Date" = as.Date(v),
"POSIXct" = as.POSIXct(v),
"POSIXct" = as.POSIXct(v), # test 2150.14 covers this by setting the option to restore old behaviour. Otherwise types that
# are recognized by freadR.c (e.g. POSIXct; #4464) result in user-override-bump at C level before reading so do not reach this switch
# see https://github.com/Rdatatable/data.table/pull/4464#discussion_r447275278.
# Aside: as(v,"POSIXct") fails with error in R so has to be caught explicitly above
# finally:
methods::as(v, new_class))
},
28 changes: 17 additions & 11 deletions R/test.data.table.R
@@ -50,11 +50,13 @@ test.data.table = function(script="tests.Rraw", verbose=FALSE, pkg=".", silent=F
}
fn = setNames(file.path(fulldir, fn), file.path(subdir, fn))

# These environment variables are restored to their previous state (including not defined) after sourcing test script
oldEnv = Sys.getenv(c("_R_CHECK_LENGTH_1_LOGIC2_", "TZ"), unset=NA_character_)
# From R 3.6.0 onwards, we can check that && and || are using only length-1 logicals (in the test suite)
# rather than relying on x && y being equivalent to x[[1L]] && y[[1L]] silently.
orig__R_CHECK_LENGTH_1_LOGIC2_ = Sys.getenv("_R_CHECK_LENGTH_1_LOGIC2_", unset = NA_character_)
Sys.setenv("_R_CHECK_LENGTH_1_LOGIC2_" = TRUE)
# This environment variable is restored to its previous state (including not defined) after sourcing test script
# TZ is not changed here so that tests run under the user's timezone. But we save and restore it here anyway just in case
# the test script stops early during a test that changes TZ (e.g. 2124 referred to in PR #4464).

oldRNG = suppressWarnings(RNGversion("3.5.0"))
# sample method changed in R 3.6 to remove bias; see #3431 for links and notes
@@ -81,7 +83,8 @@ test.data.table = function(script="tests.Rraw", verbose=FALSE, pkg=".", silent=F
warnPartialMatchArgs = base::getRversion()>="3.6.0", # ensure we don't rely on partial argument matching in internal code, #3664; >=3.6.0 for #3865
warnPartialMatchAttr = TRUE,
warnPartialMatchDollar = TRUE,
width = max(getOption('width'), 80L) # some tests (e.g. 1066, 1293) rely on capturing output that will be garbled with small width
width = max(getOption('width'), 80L), # some tests (e.g. 1066, 1293) rely on capturing output that will be garbled with small width
datatable.old.fread.datetime.character = FALSE
)

cat("getDTthreads(verbose=TRUE):\n") # for tracing on CRAN; output to log before anything is attempted
@@ -115,10 +118,11 @@ test.data.table = function(script="tests.Rraw", verbose=FALSE, pkg=".", silent=F
err = try(sys.source(fn, envir=env), silent=silent)

options(oldOptions)
if (is.na(orig__R_CHECK_LENGTH_1_LOGIC2_)) {
Sys.unsetenv("_R_CHECK_LENGTH_1_LOGIC2_")
} else {
Sys.setenv("_R_CHECK_LENGTH_1_LOGIC2_" = orig__R_CHECK_LENGTH_1_LOGIC2_) # nocov
for (i in seq_along(oldEnv)) { # loop over indices so oldEnv[i] and names(oldEnv)[i] refer to the same variable
if (is.na(oldEnv[i]))
Sys.unsetenv(names(oldEnv)[i])
else
do.call("Sys.setenv", as.list(oldEnv[i])) # nocov
}
# Sys.setlocale("LC_CTYPE", oldlocale)
suppressWarnings(do.call("RNGkind",as.list(oldRNG)))
@@ -129,14 +133,16 @@ test.data.table = function(script="tests.Rraw", verbose=FALSE, pkg=".", silent=F
# of those 13 line and give a better chance of seeing more of the output before it. Having said that, CRAN
# does show the full file output these days, so the 13 line limit no longer bites so much. It still bit recently
# when receiving output of R CMD check sent over email, though.
tz = Sys.getenv("TZ", unset=NA)
cat("\n", date(), # so we can tell exactly when these tests ran on CRAN to double-check the result is up to date
" endian==", .Platform$endian,
", sizeof(long double)==", .Machine$sizeof.longdouble,
", sizeof(pointer)==", .Machine$sizeof.pointer,
", TZ=", suppressWarnings(Sys.timezone()),
", locale='", Sys.getlocale(), "'",
", l10n_info()='", paste0(names(l10n_info()), "=", l10n_info(), collapse="; "), "'",
", getDTthreads()='", paste0(gsub("[ ][ ]+","==",gsub("^[ ]+","",capture.output(invisible(getDTthreads(verbose=TRUE))))), collapse="; "), "'",
", TZ==", if (is.na(tz)) "unset" else paste0("'",tz,"'"),
", Sys.timezone()=='", suppressWarnings(Sys.timezone()), "'",
", Sys.getlocale()=='", Sys.getlocale(), "'",
", l10n_info()=='", paste0(names(l10n_info()), "=", l10n_info(), collapse="; "), "'",
", getDTthreads()=='", paste0(gsub("[ ][ ]+","==",gsub("^[ ]+","",capture.output(invisible(getDTthreads(verbose=TRUE))))), collapse="; "), "'",
"\n", sep="")

if (inherits(err,"try-error")) {
74 changes: 71 additions & 3 deletions inst/tests/tests.Rraw
@@ -8645,6 +8645,8 @@ if (test_R.utils) {
# fix for #1573
ans1 = fread(testDir("issue_1573_fill.txt"), fill=TRUE, na.strings="")
ans2 = setDT(read.table(testDir("issue_1573_fill.txt"), header=TRUE, fill=TRUE, stringsAsFactors=FALSE, na.strings=""))
date_cols = c('SD2', 'SD3', 'SD4')
ans2[ , (date_cols) := lapply(.SD, as.IDate), .SDcols = date_cols]
test(1622.1, ans1, ans2)
test(1622.2, ans1, fread(testDir("issue_1573_fill.txt"), fill=TRUE, sep=" ", na.strings=""))

@@ -10756,7 +10758,9 @@ test(1743.08, sapply(fread("a,b,c\n2017-01-01,1,1+3i", colClasses=c("Date", "int
test(1743.09, sapply(fread("a,b,c\n2017-01-01,1,1+3i", colClasses=c("Date", "integer", "complex")), class), c(a="Date", b="integer", c="complex"))
test(1743.10, sapply(fread("a,b,c,d\n2017-01-01,1,1+3i,05", colClasses=c("Date", "integer", "complex", NA)), class), c(a="Date",b="integer",c="complex",d="integer"))
test(1743.11, sapply(fread("a,b,c,d\n2017-01-01,1,1+3i,05", colClasses=c("Date", "integer", "complex", "raw")), class), c(a="Date",b="integer",c="complex",d="raw"))
test(1743.12, x = vapply(fread("a,b\n2015-01-01,2015-01-01", colClasses = c(NA, "IDate")), inherits, what = "IDate", FUN.VALUE = logical(1)), y = c(a=FALSE, b=TRUE))
test(1743.121, sapply(fread("a,b\n2015-01-01,2015-01-01", colClasses=c(NA,"IDate")), inherits, what="IDate"), c(a=TRUE, b=TRUE))
test(1743.122, fread("a,b\n2015-01-01,2015-01-01", colClasses=c("POSIXct","Date")), data.table(a=as.POSIXct("2015-01-01"), b=as.Date("2015-01-01")))
test(1743.123, fread("a,b\n1+3i,2015-01-01", colClasses=c(NA,"IDate")), data.table(a="1+3i", b=as.IDate("2015-01-01")))

## Attempts to impose incompatible colClasses is a warning (not an error)
## and does not change the value of the columns
@@ -16611,12 +16615,13 @@ dt = data.table(SomeNumberA=c(1,1,1),SomeNumberB=c(1,1,1))
test(2123, dt[, .(.N, TotalA=sum(SomeNumberA), TotalB=sum(SomeNumberB)), by=SomeNumberA], data.table(SomeNumberA=1, N=3L, TotalA=1, TotalB=3))

# system timezone is not usually UTC, so as.ITime.POSIXct shouldn't assume so, #4085
oldtz=Sys.getenv('TZ')
oldtz=Sys.getenv('TZ', unset=NA)
Sys.setenv(TZ='Asia/Jakarta') # UTC+7
t0 = as.POSIXct('2019-10-01')
test(2124.1, format(as.ITime(t0)), '00:00:00')
test(2124.2, format(as.IDate(t0)), '2019-10-01')
Sys.setenv(TZ=oldtz)
if (is.na(oldtz)) Sys.unsetenv("TZ") else Sys.setenv(TZ=oldtz)
# careful to unset because TZ="" means UTC whereas unset TZ means local

# trunc.cols in print.data.table, #4074
old_width = options("width" = 40)
@@ -17014,3 +17019,66 @@ setkey(dt, a)
dt2 <- shallow(dt)
setnames(dt2, 'a', 'A')
test(2149.3, key(dt), 'a')

# native reading of [-]?[0-9]+[-][0-9]{2}[-][0-9]{2} dates and
# <date>[T ][0-9]{2}[:][0-9]{2}[:][0-9]{2}(?:[.][0-9]+)?(?:Z|[+-][0-9]{2}[:]?[0-9]{2})? timestamps
dates = as.IDate(c(9610, 19109, 19643, 20385, -1413, 9847, 4116, -11145, -2327, 1760))
times = .POSIXct(tz = 'UTC', c(
937402277.067304, -626563403.382897, -506636228.039861, -2066740882.02417,
-2398617863.28256, -1054008563.60793, 1535199547.55902, 2075410085.54399,
1201364458.72486, 939956943.690777
))
DT = data.table(dates, times)
tmp = tempfile()
## ISO8601 format (%FT%TZ) by default
fwrite(DT, tmp)
test(2150.01, fread(tmp), DT) # defaults for fwrite/fread simple and preserving
fwrite(DT, tmp, dateTimeAs='write.csv') # writes the UTC times as-is not local because the time column has tzone=="UTC", but without the Z marker
test(2150.021, sapply(fread(tmp), typeof), c(dates="integer", times="character")) # as before v1.13.0, datetime with missing timezone read as character
oldtz = Sys.getenv("TZ", unset=NA)
Sys.setenv(TZ="UTC")
# as before v1.13.0, dispatches to as.POSIXct() which interprets as local time, so we need to set TZ here to get the original UTC times from the write.csv version
tt = fread(tmp, colClasses=list(POSIXct="times"))
test(2150.022, attr(tt$times, "tzone"), "") # as.POSIXct puts "" on the result (testing the write.csv version here with missing tzone)
setattr(tt$times, "tzone", "UTC")
test(2150.023, tt, DT)
if (is.na(oldtz)) Sys.unsetenv("TZ") else Sys.setenv(TZ=oldtz)
fwrite(copy(DT)[ , times := format(times, '%FT%T+00:00')], tmp)
test(2150.03, fread(tmp), DT)
fwrite(copy(DT)[ , times := format(times, '%FT%T+0000')], tmp)
test(2150.04, fread(tmp), DT)
fwrite(copy(DT)[ , times := format(times, '%FT%T+0115')], tmp)
test(2150.05, fread(tmp), copy(DT)[ , times := times - 4500])
fwrite(copy(DT)[ , times := format(times, '%FT%T+01')], tmp)
test(2150.06, fread(tmp), copy(DT)[ , times := times - 3600])
## invalid tz specifiers
fwrite(copy(DT)[ , times := format(times, '%FT%T+3600')], tmp)
test(2150.07, fread(tmp), copy(DT)[ , times := format(times, '%FT%T+3600')])
fwrite(copy(DT)[ , times := format(times, '%FT%T+36')], tmp)
test(2150.08, fread(tmp), copy(DT)[ , times := format(times, '%FT%T+36')])
fwrite(copy(DT)[ , times := format(times, '%FT%T+XXX')], tmp)
test(2150.09, fread(tmp), copy(DT)[ , times := format(times, '%FT%T+XXX')])
fwrite(copy(DT)[ , times := format(times, '%FT%T+00:XX')], tmp)
test(2150.10, fread(tmp), copy(DT)[ , times := format(times, '%FT%T+00:XX')])
# allow colClasses='POSIXct' to force YMD column to read as POSIXct
test(2150.11,fread("a,b\n2015-01-01,2015-01-01", colClasses="POSIXct"), # local time for backwards compatibility
data.table(a=as.POSIXct("2015-01-01"), b=as.POSIXct("2015-01-01")))
test(2150.12,fread("a,b\n2015-01-01,2015-01-01", select=c(a="Date",b="POSIXct")), # select colClasses form, for coverage
data.table(a=as.Date("2015-01-01"), b=as.POSIXct("2015-01-01")))
test(2150.13, fread("a,b\n2015-01-01,1.1\n2015-01-02 01:02:03,1.2"), # no Z so as character as before v1.13.0
data.table(a=c("2015-01-01","2015-01-02 01:02:03"), b=c(1.1, 1.2)))
# some rows are date-only, some rows UTC-timestamp --> read the date-only in UTC too
test(2150.14, fread("a,b\n2015-01-01,1.1\n2015-01-02T01:02:03Z,1.2"),
data.table(a = .POSIXct(1420070400 + c(0, 90123), tz="UTC"), b = c(1.1, 1.2)))
old = options(datatable.old.fread.datetime.character=TRUE)
test(2150.15, fread("a,b,c\n2015-01-01,2015-01-02,2015-01-03T01:02:03Z"),
data.table(a="2015-01-01", b="2015-01-02", c="2015-01-03T01:02:03Z"))
test(2150.16, fread("a,b,c\n2015-01-01,2015-01-02,2015-01-03 01:02:03", colClasses=c("Date","IDate","POSIXct")),
ans<-data.table(a=as.Date("2015-01-01"), b=as.IDate("2015-01-02"), c=as.POSIXct("2015-01-03 01:02:03")))
ans_print = capture.output(print(ans))
options(datatable.old.fread.datetime.character=NULL)
test(2150.17, fread("a,b,c\n2015-01-01,2015-01-02,2015-01-03 01:02:03", colClasses=c("Date","IDate","POSIXct")),
ans, output=ans_print)
test(2150.18, fread("a,b,c\n2015-01-01,2015-01-02,2015-01-03 01:02:03", colClasses=c("Date",NA,NA)),
data.table(a=as.Date("2015-01-01"), b=as.IDate("2015-01-02"), c="2015-01-03 01:02:03"), output=ans_print)
options(old)
6 changes: 3 additions & 3 deletions man/fread.Rd
@@ -2,11 +2,11 @@
\alias{fread}
\title{ Fast and friendly file finagler }
\description{
Similar to \code{read.table} but faster and more convenient. All controls such as \code{sep}, \code{colClasses} and \code{nrows} are automatically detected. \code{bit64::integer64} types are also detected and read directly without needing to read as character before converting.
Similar to \code{read.table} but faster and more convenient. All controls such as \code{sep}, \code{colClasses} and \code{nrows} are automatically detected.

Dates are read as character currently. They can be converted afterwards using the excellent \code{fasttime} package or standard base functions.
\code{bit64::integer64}, \code{\link{IDate}}, and \code{\link{POSIXct}} types are also detected and read directly without needing to read as character before converting.

`fread` is for \emph{regular} delimited files; i.e., where every row has the same number of columns. In future, secondary separator (\code{sep2}) may be specified \emph{within} each column. Such columns will be read as type \code{list} where each cell is itself a vector.
\code{fread} is for \emph{regular} delimited files; i.e., where every row has the same number of columns. In future, secondary separator (\code{sep2}) may be specified \emph{within} each column. Such columns will be read as type \code{list} where each cell is itself a vector.
}
\usage{
fread(input, file, text, cmd, sep="auto", sep2="auto", dec=".", quote="\"",
4 changes: 4 additions & 0 deletions src/data.table.h
@@ -77,6 +77,8 @@ extern SEXP char_ITime;
extern SEXP char_IDate;
extern SEXP char_Date;
extern SEXP char_POSIXct;
extern SEXP char_POSIXt;
extern SEXP char_UTC;
extern SEXP char_nanotime;
extern SEXP char_lens;
extern SEXP char_indices;
@@ -97,6 +99,8 @@ extern SEXP sym_verbose;
extern SEXP SelfRefSymbol;
extern SEXP sym_inherits;
extern SEXP sym_datatable_locked;
extern SEXP sym_tzone;
extern SEXP sym_old_fread_datetime_character;
extern double NA_INT64_D;
extern long long NA_INT64_LL;
extern Rcomplex NA_CPLX; // initialized in init.c; see there for comments