setkey changing data (not just sorting) #2540

@patrickhowerter

I have found a case where setkey can actually change the underlying rows of data (more than just sorting). It is as if the structure that indexes the rows for each column vector gets out of sync.

The case happens when:

  1. I update columns with the syntax dt[, c('col1', 'col2') := somecolumn] rather than dt[, .(col1 = somecolumn, col2 = somecolumn)].
  2. I then run setkey on the columns that were updated in step 1.

Please see the reproducible example:

library(data.table)

# set up some dummy data
a <- c('A', 'B', 'D', 'C')
b <- as.numeric(c(20160101,20160131, 20160102 ))
ab <- CJ(a=a, b=b, sorted = FALSE)
c <- as.numeric(c(20170101,20170131, 20170102 ))

ab2 <- CJ(a = a, b = c, sorted = FALSE)
ab <- rbindlist(list(ab, ab2))

# set up the test data.table that will give us strange results
test <- data.table(a = ab$a)
# this assignment appears to be what triggers the issue
test[, c('astart', 'aend') := as.integer(ab$b)]

# once we set the keys, some unique records are removed and some are duplicated
setkey(test, a, astart, aend)

# duplicated data
ab[(a == "A") & (b == 20160101)] # there was one row
test[(a == "A") & (astart == 20160101)] # now there are two rows?

# some of the rows have been removed
ab[(a == "A") & (b == 20170101)] # there was one row
test[(a == "A") & (astart == 20170101)] # now there are no rows where a == "A"?
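
One possible explanation, offered as an assumption rather than a confirmed diagnosis, is that the multi-column assignment leaves astart and aend pointing at the same vector in memory, so setkey reorders that shared vector more than once. The sketch below checks for this with data.table's address() and shows a possible workaround that gives each column its own copy. The helper shares_memory and the small data set are hypothetical and only for illustration; the printed result of the first check may differ between data.table versions.

```r
library(data.table)

# Hypothetical helper (not part of data.table): TRUE when two columns of a
# data.table point at the same underlying vector in memory.
shares_memory <- function(dt, col1, col2) {
  identical(address(dt[[col1]]), address(dt[[col2]]))
}

x <- as.integer(c(20160102, 20160101, 20170101, 20160131))

# The multi-column assignment from the report. If the assumption above holds,
# affected versions would print TRUE here; other versions may print FALSE.
dt <- data.table(a = c("B", "A", "A", "C"))
dt[, c("astart", "aend") := x]
print(shares_memory(dt, "astart", "aend"))

# Possible workaround: assign each column separately, with an explicit copy,
# so the two columns cannot share memory before the key is set.
safe <- data.table(a = c("B", "A", "A", "C"))
safe[, astart := x]
safe[, aend := copy(x)]

before <- copy(safe)           # snapshot of the rows before sorting
setkey(safe, a, astart, aend)  # should only reorder rows, never change them
print(safe)
```

If the two columns do not share memory, the keyed table should contain exactly the rows of the snapshot, only reordered.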

# Output of sessionInfo()

R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8
[8] LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] shiny_1.0.5 mdo_0.3.3 data.table_1.10.4-3

loaded via a namespace (and not attached):
[1] Rcpp_0.12.14 compiler_3.4.2 bindr_0.1 tools_3.4.2 xts_0.10-0 digest_0.6.12 bit_1.1-12 evaluate_0.10.1 lubridate_1.7.1 jsonlite_1.5 tibble_1.3.4 lattice_0.20-35
[13] ff_2.2-13 pkgconfig_2.0.1 rlang_0.1.4 fastmatch_1.1-0 rstudioapi_0.7 yaml_2.1.15 bindrcpp_0.2 dplyr_0.7.4 stringr_1.2.0 knitr_1.17 htmlwidgets_0.9 rprojroot_1.2
[25] DT_0.2 grid_3.4.2 glue_1.2.0 R6_2.2.2 bookdown_0.5 rmarkdown_1.8 magrittr_1.5 backports_1.1.1 htmltools_0.3.6 rsconnect_0.8.5 assertthat_0.2.0 mime_0.5
[37] xtable_1.8-2 httpuv_1.3.5 stringi_1.1.6 zoo_1.8-0
