From 61a3c8ce5b557f6165de004cc2bdc48ed6c5c256 Mon Sep 17 00:00:00 2001 From: Toby Dylan Hocking Date: Sat, 26 Sep 2020 07:10:08 -0700 Subject: [PATCH 1/7] bump variable when all data are missing --- inst/tests/tests.Rraw | 5 ++--- src/fmelt.c | 8 +++----- 2 files changed, 5 insertions(+), 8 deletions(-) diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw index 058c7db3b2..024e1e2610 100644 --- a/inst/tests/tests.Rraw +++ b/inst/tests/tests.Rraw @@ -3172,9 +3172,9 @@ Sep,33.5,19.4,15.7,11.9,0,100.8,100.8,0,12.7,12.7,0,174.1") x[, c("y1","z1"):=NA] test(1037.405, dim(melt(x, measure.vars=patterns("^y", "^z"))), INT(4,5)) test(1037.406, dim(ans<-melt(x, measure.vars=patterns("^y", "^z"), na.rm=TRUE)), INT(2,5)) - test(1037.407, ans$variable, factor(c("1","1"))) + test(1037.407, ans$variable, factor(c("2","2"), c("1", "2"))) test(1037.408, dim(ans<-melt(x, measure.vars=patterns("^y", "^z"), na.rm=TRUE, variable.factor=FALSE)), INT(2,5)) - test(1037.409, ans$variable, c("1","1")) + test(1037.409, ans$variable, c("2","2")) test(1037.410, melt(data.table(NULL), verbose=TRUE), data.table(NULL), output="ncol(data) is 0. Nothing to melt") @@ -17137,4 +17137,3 @@ test(2153.3, ans<-DT[, .(list(.NGRP)), by=x], data.table(x=1:2, V1=list(2L,2L))) test(2153.4, address(ans$V1[[1L]]), address(ans$V1[[2L]])) # .NGRP doesn't change group to group so the same object can be referenced many times unlike .N and .GRP test(2153.5, DT[, .(list(c(0L,.N,0L))), by=x], # c() here will create new object so this is ok anyway; i.e. address(.N) is not present in j's result data.table(x=1:2, V1=list(c(0L,1L,0L), c(0L,2L,0L)))) - diff --git a/src/fmelt.c b/src/fmelt.c index 22a4ac1fc5..3a84afd635 100644 --- a/src/fmelt.c +++ b/src/fmelt.c @@ -538,7 +538,6 @@ SEXP getvarcols(SEXP DT, SEXP dtnames, Rboolean varfactor, Rboolean verbose, str } else { for (int j=0, ansloc=0, level=1; jlmax; ++j) { const int thislen = data->narm ? length(VECTOR_ELT(data->naidx, j)) : data->nrow; - if (thislen==0) continue; // so as not to bump level char buff[20]; snprintf(buff, 20, "%d", level++); SEXP str = PROTECT(mkChar(buff)); @@ -546,11 +545,11 @@ SEXP getvarcols(SEXP DT, SEXP dtnames, Rboolean varfactor, Rboolean verbose, str UNPROTECT(1); } } - } else { + } else {// varfactor==TRUE SET_VECTOR_ELT(ansvars, 0, target=allocVector(INTSXP, data->totlen)); SEXP levels; int *td = INTEGER(target); - if (data->lvalues == 1) { + if (data->lvalues == 1) {//single output column. SEXP thisvaluecols = VECTOR_ELT(data->valuecols, 0); int len = length(thisvaluecols); levels = PROTECT(allocVector(STRSXP, len)); protecti++; @@ -573,12 +572,11 @@ SEXP getvarcols(SEXP DT, SEXP dtnames, Rboolean varfactor, Rboolean verbose, str const int thislen = data->narm ? length(VECTOR_ELT(data->naidx, j)) : data->nrow; for (int k=0; klmax)); protecti++; for (int j=0, ansloc=0; jlmax; ++j) { const int thislen = data->narm ? length(VECTOR_ELT(data->naidx, j)) : data->nrow; - if (thislen==0) continue; // so as not to bump level char buff[20]; snprintf(buff, 20, "%d", nlevel+1); SET_STRING_ELT(levels, nlevel++, mkChar(buff)); // generate levels = 1:nlevels From e535423475793ac2a0b13ff2e94c7bf21eb4c8f6 Mon Sep 17 00:00:00 2001 From: Toby Dylan Hocking Date: Sun, 27 Sep 2020 22:30:33 -0700 Subject: [PATCH 2/7] bugfix for melt with na.rm=T and list for measure.vars --- NEWS.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/NEWS.md b/NEWS.md index 308d427250..42859117b0 100644 --- a/NEWS.md +++ b/NEWS.md @@ -14,6 +14,8 @@ 3. Operating on columns of type `list`, e.g. `dt[, listCol[[1]], by=id]`, suffered a performance regression in v1.13.0, [#4646](https://github.com/Rdatatable/data.table/issues/4646) [#4658](https://github.com/Rdatatable/data.table/issues/4658). Thanks to @fabiocs8 and @sandoronodi for the detailed reports, and to Cole Miller for substantial debugging, investigation and proposals at C level which enabled the root cause to be fixed. +4. When using melt with na.rm=TRUE and a list for measure.vars, variable was not incremented for groups with all missing values, which caused inconsistent results between na.rm=TRUE and FALSE, [#4455](https://github.com/Rdatatable/data.table/issues/4455). Thanks to @tdhock for reporting and fixing. + ## NOTES 1. `bit64` v4.0.2 and `bit` v4.0.3, both released on 30th July, broke `data.table`'s tests. It seems that reverse dependency testing of `bit64` (i.e. testing of the packages which use `bit64`) did not include `data.table` because `data.table` suggests `bit64` but does not depend on it. Like other packages on our `Suggest` list, we test `data.table` works with `bit64` in our tests. In testing of our own reverse dependencies (packages which use `data.table`) we do include packages which suggest `data.table`, although it appears it is not CRAN policy to do so. We have requested that CRAN policy be improved to include suggests in reverse dependency testing. From 999c91b99a482212bc69258f617b60cfcd0a2714 Mon Sep 17 00:00:00 2001 From: Toby Dylan Hocking Date: Mon, 28 Sep 2020 21:44:01 -0700 Subject: [PATCH 3/7] start with melt --- NEWS.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/NEWS.md b/NEWS.md index 8fd5f9fd31..5ebd01a6ab 100644 --- a/NEWS.md +++ b/NEWS.md @@ -16,7 +16,7 @@ 4. `fread("1.2\n", colClasses='integer')` would segfault when creating the warning message due to no column names in the output, [#4644](https://github.com/Rdatatable/data.table/issues/4644). It now warns with `Attempt to override column 1 of inherent type 'float64' down to 'int32' ignored.` When column names are present, the warning message includes the name as before; i.e., `fread("A\n1.2\n", colClasses='integer')` produces `Attempt to override column 1 <> of inherent type 'float64' down to 'int32' ignored.`. Thanks to Kun Ren for reporting. -4. When using melt with na.rm=TRUE and a list for measure.vars, variable was not incremented for groups with all missing values, which caused inconsistent results between na.rm=TRUE and FALSE, [#4455](https://github.com/Rdatatable/data.table/issues/4455). Thanks to @tdhock for reporting and fixing. +4. `melt` with a list for `measure.vars` would output `variable` inconsistently between `na.rm=TRUE` and `FALSE`, [#4455](https://github.com/Rdatatable/data.table/issues/4455). Thanks to @tdhock for reporting and fixing. ## NOTES From c1fb10f590e590e746cd35d6826e98e77151ab5e Mon Sep 17 00:00:00 2001 From: Toby Dylan Hocking Date: Wed, 30 Sep 2020 22:36:45 -0700 Subject: [PATCH 4/7] document values in variable column --- man/melt.data.table.Rd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/man/melt.data.table.Rd b/man/melt.data.table.Rd index a9d69b5f66..b73f4643e5 100644 --- a/man/melt.data.table.Rd +++ b/man/melt.data.table.Rd @@ -31,7 +31,7 @@ non-measure columns will be assigned to it. If integer, must be positive; see De } For convenience/clarity in the case of multiple \code{melt}ed columns, resulting column names can be supplied as names to the elements \code{measure.vars} (in the \code{list} and \code{patterns} usages). See also \code{Examples}. } -\item{variable.name}{name for the measured variable names column. The default name is \code{'variable'}.} +\item{variable.name}{name (default \code{'variable'}) of output column containing information about which input column(s) were melted. If \code{measure.vars} is an integer/character vector, then each entry of this column contains the name of a melted column from \code{data}. If \code{measure.vars} is a list of integer/character vectors, then each entry of this column contains an integer indicating an index/position in each of those vectors.} \item{value.name}{name for the molten data values column(s). The default name is \code{'value'}. Multiple names can be provided here for the case when \code{measure.vars} is a \code{list}, though note well that the names provided in \code{measure.vars} take precedence. } \item{na.rm}{If \code{TRUE}, \code{NA} values will be removed from the molten data.} From 25d9dfe0f820dbb63e5568bee3ace0b65588f81e Mon Sep 17 00:00:00 2001 From: Toby Dylan Hocking Date: Fri, 2 Oct 2020 22:21:10 -0700 Subject: [PATCH 5/7] remove factor renumbering code which is never used anymore --- src/fmelt.c | 7 ------- 1 file changed, 7 deletions(-) diff --git a/src/fmelt.c b/src/fmelt.c index 3a84afd635..39b93d2e81 100644 --- a/src/fmelt.c +++ b/src/fmelt.c @@ -582,13 +582,6 @@ SEXP getvarcols(SEXP DT, SEXP dtnames, Rboolean varfactor, Rboolean verbose, str SET_STRING_ELT(levels, nlevel++, mkChar(buff)); // generate levels = 1:nlevels for (int k=0; klmax) { - // data->narm is true and there are some all-NA items causing at least one 'if (thislen==0) continue' above - // shrink the levels - SEXP newlevels = PROTECT(allocVector(STRSXP, nlevel)); protecti++; - for (int i=0; i Date: Sat, 3 Oct 2020 08:15:37 -0700 Subject: [PATCH 6/7] test to increase code coverage --- inst/tests/tests.Rraw | 3 +++ 1 file changed, 3 insertions(+) diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw index 63d8c5b8bf..02d6b0014c 100644 --- a/inst/tests/tests.Rraw +++ b/inst/tests/tests.Rraw @@ -3014,6 +3014,9 @@ test(1034, as.data.table(x<-as.character(sample(letters, 5))), data.table(V1=x)) error="Unknown 'id.vars' type raw") test(1035.012, melt(DT, id.vars=1:3, measure.vars=as.raw(0)), error="Unknown 'measure.vars' type raw") + test(1035.013, melt(data.table(a=1, b=1), id.vars=c(1,1)), data.table(a=1, a.1=1, variable=factor("b"), value=1)) + test(1035.014, melt(data.table(a1=1, b1=1, b2=2), na.rm=TRUE, measure.vars=list(a="a1", b=c("b1","b2"))), data.table(variable=factor(1,c("1","2")), a=1, b=1)) + test(1035.015, melt(data.table(a=1+2i, b=1), id.vars="a"), error="Unknown column type 'complex' for column 'a' in 'data'") ans1 = cbind(DT[, c(1,2,8), with=FALSE], variable=factor("l_1")) ans1[, value := DT$l_1] From 53f1f0e3cd6da390adffce818827d5e1964044b0 Mon Sep 17 00:00:00 2001 From: Matt Dowle Date: Fri, 7 May 2021 15:09:19 -0600 Subject: [PATCH 7/7] moved news item up --- NEWS.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/NEWS.md b/NEWS.md index bd4895e000..549ce093e3 100644 --- a/NEWS.md +++ b/NEWS.md @@ -78,6 +78,8 @@ 11. Assigning a set of 2 or more all-NA values to a factor column could segfault, [#4824](https://github.com/Rdatatable/data.table/issues/4824). Thanks to @clerousset for reporting and @shrektan for fixing. +12. `melt` with a list for `measure.vars` would output `variable` inconsistently between `na.rm=TRUE` and `FALSE`, [#4455](https://github.com/Rdatatable/data.table/issues/4455). Thanks to @tdhock for reporting and fixing. + ## NOTES 1. New feature 29 in v1.12.4 (Oct 2019) introduced zero-copy coercion. Our thinking is that requiring you to get the type right in the case of `0` (type double) vs `0L` (type integer) is too inconvenient for you the user. So such coercions happen in `data.table` automatically without warning. Thanks to zero-copy coercion there is no speed penalty, even when calling `set()` many times in a loop, so there's no speed penalty to warn you about either. However, we believe that assigning a character value such as `"2"` into an integer column is more likely to be a user mistake that you would like to be warned about. The type difference (character vs integer) may be the only clue that you have selected the wrong column, or typed the wrong variable to be assigned to that column. For this reason we view character to numeric-like coercion differently and will warn about it. If it is correct, then the warning is intended to nudge you to wrap the RHS with `as.()` so that it is clear to readers of your code that a coercion from character to that type is intended. For example : @@ -170,8 +172,6 @@ 5. `dplyr::mutate(setDT(as.list(1:64)), V1=11)` threw error `can't set ALTREP truelength`, [#4734](https://github.com/Rdatatable/data.table/issues/4734). Thanks to @etryn for the reproducible example, and to Cole Miller for refinements. -4. `melt` with a list for `measure.vars` would output `variable` inconsistently between `na.rm=TRUE` and `FALSE`, [#4455](https://github.com/Rdatatable/data.table/issues/4455). Thanks to @tdhock for reporting and fixing. - ## NOTES 1. `bit64` v4.0.2 and `bit` v4.0.3, both released on 30th July, correctly broke `data.table`'s tests. Like other packages on our `Suggest` list, we check `data.table` works with `bit64` in our tests. The first break was because `all.equal` always returned `TRUE` in previous versions of `bit64`. Now that `all.equal` works for `integer64`, the incorrect test comparison was revealed. If you use `bit64`, or `nanotime` which uses `bit64`, it is highly recommended to upgrade to the latest `bit64` version. Thanks to Cole Miller for the PR to accommodate `bit64`'s update.