From 220c10cecf928d1ac94f24ddb510c0a3ea98dc45 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?V=C3=A1clav=20Tlap=C3=A1k?= <55213630+tlapak@users.noreply.github.com> Date: Wed, 16 Jun 2021 21:18:40 +0200 Subject: [PATCH 1/4] Fix segfaults --- src/fread.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/src/fread.c b/src/fread.c index da3e0e18ea..d00cffcd1f 100644 --- a/src/fread.c +++ b/src/fread.c @@ -2361,6 +2361,9 @@ int freadMain(freadMainArgs _args) { if (j+fieldsRemaining != ncol) break; checkedNumberOfFields = true; } + if (thisType <= -NUMTYPE) { + break; // Improperly quoted char field needs to be healed below, other columns will be filled #5041 and #4774 + } #pragma omp critical { joldType = type[j]; // fetch shared value again in case another thread bumped it while I was waiting. @@ -2369,8 +2372,8 @@ int freadMain(freadMainArgs _args) { if (verbose) { char temp[1001]; int len = snprintf(temp, 1000, - _("Column %d (\"%.*s\") bumped from '%s' to '%s' due to <<%.*s>> on row %"PRIu64"\n"), - j+1, colNames[j].len, colNamesAnchor + colNames[j].off, + _("Column %d%s%.*s%s bumped from '%s' to '%s' due to <<%.*s>> on row %"PRIu64"\n"), + j+1, colNames?" <<":"", colNames?(colNames[j].len):0, colNames?(colNamesAnchor+colNames[j].off):"", colNames?">>":"", typeName[abs(joldType)], typeName[abs(thisType)], (int)(tch-fieldStart), fieldStart, (uint64_t)(ctx.DTi+myNrow)); if (len > 1000) len = 1000; From c066dbb47e51f4214d63091df26ac51bd2e54545 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?V=C3=A1clav=20Tlap=C3=A1k?= <55213630+tlapak@users.noreply.github.com> Date: Wed, 16 Jun 2021 21:55:58 +0200 Subject: [PATCH 2/4] Add tests --- inst/tests/tests.Rraw | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw index 5b21448fa6..7ba12ae924 100644 --- a/inst/tests/tests.Rraw +++ b/inst/tests/tests.Rraw @@ -17728,3 +17728,14 @@ if (test_bit64) { test(2193.2, X[Y, `:=`(y=i.y), on="x", by=.EACHI], data.table(x=1:3, y=as.integer64(10L,20L,NA))) } +# Improperly quoted character field with fill = TRUE would lead to segfault before, #4774 and #5041 +test(2194, + fread(paste0(paste(rep(c('a; b'), 100), collapse = '\n'), c('\n"a" 2;b')), fill = TRUE, quote = '\"'), + data.table(a = c(rep('a', 99), '"a" 2'), b = rep('b', 100)), warning = 'Found and resolved improper quoting') + +# Verbose output would segfault when no header present, out-of-sample type error and fill = TRUE, similar to #4644 +# Test vaildity may depend on sampling behaviour of fread since the type bump needs to occur out of sample to trigger the segfault +sampleText = paste0(paste(rep(c('1; 2'), 100), collapse = '\n'), c('\n"a";2\n1; 2\n'), paste(rep(c('1; 2'), 100), collapse = '\n')) +test(2195, fread(sampleText, fill = TRUE, quote = '\"', verbose = TRUE, header = FALSE), + fread(sampleText, fill = TRUE, quote = '\"', verbose = FALSE, header = FALSE), + output = 'Column 1 bumped') From 934c01a463d0e9a425ad896db0223e48a5090576 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?V=C3=A1clav=20Tlap=C3=A1k?= <55213630+tlapak@users.noreply.github.com> Date: Wed, 16 Jun 2021 22:13:04 +0200 Subject: [PATCH 3/4] Add news items --- NEWS.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/NEWS.md b/NEWS.md index 8c3693dbc4..8d1e83d14e 100644 --- a/NEWS.md +++ b/NEWS.md @@ -133,6 +133,10 @@ 19. A fix to `as.Date(c("", ...))` in R 4.0.3, [17909](https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17909), has been backported to `data.table::as.IDate()` so that it too now returns `NA` for the first item when it is blank, even in older versions of R back to 3.1.0, rather than the incorrect error `character string is not in a standard unambiguous format`, [#4676](https://github.com/Rdatatable/data.table/issues/4676). Thanks to Arun Srinivasan for reporting, and Michael Chirico both for the `data.table` PR and for submitting the patch to R that was accepted and included in R 4.0.3. +20. `fread(file, fill = TRUE)` could segfault if `file` contained an improperly quoted character field, [#4774](https://github.com/Rdatatable/data.table/issues/4774) [#5041](https://github.com/Rdatatable/data.table/issues/5041). Thanks to @AndeolEvain and @e-nascimento for reporting and to Václav Tlapák for the PR. + +21. `fread(file, fill = TRUE, verbose = TRUE)` would segfault if `file` did not contain column names and contained a column where the initial type guess turned out to be wrong, e.g. a column of mostly integers with a character field somewhere in the middle. The segfault would occur when creating the verbose output about the type bump, [](https://github.com/Rdatatable/data.table/pull/). Thanks to Václav Tlapák for the PR. + ## NOTES 1. New feature 29 in v1.12.4 (Oct 2019) introduced zero-copy coercion. Our thinking is that requiring you to get the type right in the case of `0` (type double) vs `0L` (type integer) is too inconvenient for you the user. So such coercions happen in `data.table` automatically without warning. Thanks to zero-copy coercion there is no speed penalty, even when calling `set()` many times in a loop, so there's no speed penalty to warn you about either. However, we believe that assigning a character value such as `"2"` into an integer column is more likely to be a user mistake that you would like to be warned about. The type difference (character vs integer) may be the only clue that you have selected the wrong column, or typed the wrong variable to be assigned to that column. For this reason we view character to numeric-like coercion differently and will warn about it. If it is correct, then the warning is intended to nudge you to wrap the RHS with `as.()` so that it is clear to readers of your code that a coercion from character to that type is intended. For example : From 431e02762c2598af870a953c449bed84ccee4fec Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?V=C3=A1clav=20Tlap=C3=A1k?= <55213630+tlapak@users.noreply.github.com> Date: Wed, 16 Jun 2021 22:29:34 +0200 Subject: [PATCH 4/4] reference to PR --- NEWS.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/NEWS.md b/NEWS.md index 8d1e83d14e..5c48ae62fb 100644 --- a/NEWS.md +++ b/NEWS.md @@ -135,7 +135,7 @@ 20. `fread(file, fill = TRUE)` could segfault if `file` contained an improperly quoted character field, [#4774](https://github.com/Rdatatable/data.table/issues/4774) [#5041](https://github.com/Rdatatable/data.table/issues/5041). Thanks to @AndeolEvain and @e-nascimento for reporting and to Václav Tlapák for the PR. -21. `fread(file, fill = TRUE, verbose = TRUE)` would segfault if `file` did not contain column names and contained a column where the initial type guess turned out to be wrong, e.g. a column of mostly integers with a character field somewhere in the middle. The segfault would occur when creating the verbose output about the type bump, [](https://github.com/Rdatatable/data.table/pull/). Thanks to Václav Tlapák for the PR. +21. `fread(file, fill = TRUE, verbose = TRUE)` would segfault if `file` did not contain column names and contained a column where the initial type guess turned out to be wrong, e.g. a column of mostly integers with a character field somewhere in the middle. The segfault would occur when creating the verbose output about the type bump, [5046](https://github.com/Rdatatable/data.table/pull/5046). Thanks to Václav Tlapák for the PR. ## NOTES