Segfault merging data.tables with keyed NA_character_ columns #5070

@JorisChau

Issue

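Joining two data.tables via dt1[dt2, on = ...], where one data.table (or both) has a keyed column containing only NA_character_ values, produces a segfault and crashes the R session. The crash is triggered when columns are subsequently selected from the join result.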

Reproducible example

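library(data.table)

# x2 contains only NA_character_
dt1 <- data.table(x1 = rep(letters[1:4], each = 3), x2 = NA_character_)
dt2 <- data.table(x1 = letters[1:3])

# key dt1 on the all-NA column
setkey(dt1, x2)

dt3 <- dt1[dt2, on = "x1"]
dt3[, .(x1, x2)]  # segfaults here (see the traceback below)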

Running the above code under valgrind with the current data.table development version (1.14.1) produces the following output:

==10795== Use of uninitialised value of size 8
==10795==    at 0x4FB7910: LEVELS (in /usr/lib/R/lib/libR.so)
==10795==    by 0x101824AA: issorted (in /home/jchau/R/x86_64-pc-linux-gnu-library/4.0/data.table/libs/datatable.so)
==10795==    by 0x4F352AB: ??? (in /usr/lib/R/lib/libR.so)
==10795==    by 0x4F7540B: ??? (in /usr/lib/R/lib/libR.so)
==10795==    by 0x4F7F66F: Rf_eval (in /usr/lib/R/lib/libR.so)
==10795==    by 0x4F8148E: ??? (in /usr/lib/R/lib/libR.so)
==10795==    by 0x4F82256: Rf_applyClosure (in /usr/lib/R/lib/libR.so)
==10795==    by 0x4F76908: ??? (in /usr/lib/R/lib/libR.so)
==10795==    by 0x4F7F66F: Rf_eval (in /usr/lib/R/lib/libR.so)
==10795==    by 0x4F8148E: ??? (in /usr/lib/R/lib/libR.so)
==10795==    by 0x4F82256: Rf_applyClosure (in /usr/lib/R/lib/libR.so)
==10795==    by 0x4FC5362: ??? (in /usr/lib/R/lib/libR.so)
==10795==  Uninitialised value was created by a heap allocation
==10795==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==10795==    by 0x4FBE353: ??? (in /usr/lib/R/lib/libR.so)
==10795==    by 0x4FBFE81: Rf_allocVector3 (in /usr/lib/R/lib/libR.so)
==10795==    by 0x4F892F7: R_bcEncode (in /usr/lib/R/lib/libR.so)
==10795==    by 0x50215C6: ??? (in /usr/lib/R/lib/libR.so)
==10795==    by 0x502163F: ??? (in /usr/lib/R/lib/libR.so)
==10795==    by 0x502095C: ??? (in /usr/lib/R/lib/libR.so)
==10795==    by 0x501FCA9: ??? (in /usr/lib/R/lib/libR.so)
==10795==    by 0x5021A2D: R_Unserialize (in /usr/lib/R/lib/libR.so)
==10795==    by 0x5022DC9: ??? (in /usr/lib/R/lib/libR.so)
==10795==    by 0x5023200: ??? (in /usr/lib/R/lib/libR.so)
==10795==    by 0x4F7FBF5: Rf_eval (in /usr/lib/R/lib/libR.so)
==10795== 
==10795== Invalid read of size 2
==10795==    at 0x4FB7910: LEVELS (in /usr/lib/R/lib/libR.so)
==10795==    by 0x101824AA: issorted (in /home/jchau/R/x86_64-pc-linux-gnu-library/4.0/data.table/libs/datatable.so)
==10795==    by 0x4F352AB: ??? (in /usr/lib/R/lib/libR.so)
==10795==    by 0x4F7540B: ??? (in /usr/lib/R/lib/libR.so)
==10795==    by 0x4F7F66F: Rf_eval (in /usr/lib/R/lib/libR.so)
==10795==    by 0x4F8148E: ??? (in /usr/lib/R/lib/libR.so)
==10795==    by 0x4F82256: Rf_applyClosure (in /usr/lib/R/lib/libR.so)
==10795==    by 0x4F76908: ??? (in /usr/lib/R/lib/libR.so)
==10795==    by 0x4F7F66F: Rf_eval (in /usr/lib/R/lib/libR.so)
==10795==    by 0x4F8148E: ??? (in /usr/lib/R/lib/libR.so)
==10795==    by 0x4F82256: Rf_applyClosure (in /usr/lib/R/lib/libR.so)
==10795==    by 0x4FC5362: ??? (in /usr/lib/R/lib/libR.so)
==10795==  Address 0x1000000010001 is not stack'd, malloc'd or (recently) free'd
==10795== 

 *** caught segfault ***
address (nil), cause 'unknown'

Traceback:
 1: is.sorted(jval, by = key(x))
 2: `[.data.table`(dt3, , .(x1, x2))
 3: dt3[, .(x1, x2)]
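
The traceback shows the crash inside is.sorted(jval, by = key(x)), which suggests the join result still carries a key. Below is a minimal sketch for inspecting the join result without running the crashing column selection. It only reads attributes of dt3, and it assumes that the join itself completes, as the traceback indicates. No output is shown because it has not been verified on the affected version.

```r
library(data.table)

dt1 <- data.table(x1 = rep(letters[1:4], each = 3), x2 = NA_character_)
dt2 <- data.table(x1 = letters[1:3])
setkey(dt1, x2)

dt3 <- dt1[dt2, on = "x1"]

# Attribute-only inspection: none of these calls subsets dt3 with `[`.
key(dt3)     # key columns recorded on the join result, if any
haskey(dt3)  # whether dt3 carries a key at all
names(dt3)   # column names of the join result
nrow(dt3)    # each of a, b, c matches 3 rows in dt1
```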

Session info

R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.1

loaded via a namespace (and not attached):
[1] compiler_4.0.2

Note

An explicit merge() call works as expected:

library(data.table)

dt1 <- data.table(x1 = rep(letters[1:4], each = 3), x2 = NA_character_)
dt2 <- data.table(x1 = letters[1:3])
  
setkey(dt1, x2)

dt3 <- merge(dt1, dt2, by = "x1")
dt3[, .(x1, x2)]

#>    x1 x2
#> 1:  a <NA>
#> 2:  a <NA>
#> 3:  a <NA>
#> 4:  b <NA>
#> 5:  b <NA>
#> 6:  b <NA>
#> 7:  c <NA>
#> 8:  c <NA>
#> 9:  c <NA>
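
One possible reason merge() behaves differently: with its default sort = TRUE, merge() keys the result by the by columns, so the result is keyed on x1 rather than on the all-NA column x2. The sketch below checks that, and also tries a hypothetical workaround of dropping the key from a copy of dt1 before the join. The workaround is an assumption that has not been verified against 1.14.1, and the names merged, dt1_unkeyed and joined are illustrative only.

```r
library(data.table)

dt1 <- data.table(x1 = rep(letters[1:4], each = 3), x2 = NA_character_)
dt2 <- data.table(x1 = letters[1:3])
setkey(dt1, x2)

# merge() with the default sort = TRUE sets the key of its result
# to the `by` columns, here x1.
merged <- merge(dt1, dt2, by = "x1")
key(merged)

# Hypothetical workaround (assumption, not verified on 1.14.1):
# remove the key from a copy of dt1, then run the same join.
dt1_unkeyed <- copy(dt1)
setkey(dt1_unkeyed, NULL)  # setkey(x, NULL) removes the key
joined <- dt1_unkeyed[dt2, on = "x1"]
key(joined)
```

Working on a copy leaves the original keyed dt1 untouched, so the two code paths can be compared side by side.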

Labels

joins, openmp, segfault