Skip to content

[BUG] dcast.data.table leads to inconsistent keying with NA #2202

@Galileo-Galilei

Description

@Galileo-Galilei

Hello all,

this is my very first post on Github and I want to thank the creators of data.table for their amazing job!

I currently use data.table to deal with huge datasets (50 datasets of 15 millions rows / dozens of columns each) and I noticed a strange behavior of dcast.data.table which should likely be considered as a bug (although it could just be a misunderstanding of data.table’s internals).

Many thanks to those who will take the time to understand and address my problem,

Galileo

P.S : I looked for a similar issue on SO and Github, but did not encountered it, so I decided to open a new issue. please let me know if a similar bug has already been reported.

P.S.2 : The data.table version I use is the "data.table_1.10.4" which seems to be the latest on CRAN.


Bug description

dcast data.table marks as « keyed » the cast data.table, while it handles NA differently than setkey. This leads to inconsistent results when further merging.

To my understanding, and related to this SO answer, setkey considers NA as large negative integer and sorts them consistently with base:: sort(x,na.last=FALSE). This seems to be the desired behaviour of data.table
dcast.data.table seems to not sort the NA.

Minimal Reproducible Example

First, I create the dataset :

# first create a toy dataset
toy_data = structure(list(id = structure(c(1L, 1L, 1L, 1L, NA, NA),
                                         .Label = "123456", class = "factor"),
                          factor_var = structure(c(1L, 1L, 1L, 1L, 1L, 1L),
                                                 .Label = "U", class = "factor"),
                          num_var1 = c(0,300, 600, 500, 0, 800),
                          num_var2 = c(0,15, 50, 30, 0, 50)),
                     .Names = c("id", "factor_var", "num_var1", "num_var2"),
                     row.names = c(NA, -6L),
                     class = c("data.table","data.frame"))

#      id factor_var num_var1 num_var2
# 1: 123456          U        0        0
# 2: 123456          U      300       15
# 3: 123456          U      600       50
# 4: 123456          U      500       30
# 5:    NA          U        0        0
# 6:    NA          U      800       50

#REMARK1 : The dataset is NOT KEYED!
key(toy_data_agg )
#REMARK2 : The dataset the factor has only one level here, but not in my real data so It is very unlikely that it is the cause of the error

I want to aggregate some values by id, since it is the only relevant information :

# Aggregate by id
toy_data_agg = toy_data[,.(num_var1 = sum(num_var1),
                           num_var2 = sum(num_var2)),
                        by=.(id,factor_var)]   
#      id factor_var num_var1 num_var2
# 1: 123456          U     1400       95
# 2:    NA          U      800       50

#REMARK1 : The dataset is NOT KEYED!
key(toy_data_agg )
# REMARK2 : NA appears in last !

I want to cast the data.table to turn rows as columns (in the real dataset, I have much more variables and levels in factor_var) :

# cast by factor variable
toy_data_cast = dcast.data.table(data = toy_data_agg,
                                 formula = id ~ factor_var,
                                 value.var = c("num_var1","num_var2"))
#      id num_var1_U num_var2_U
# 1: 123456       1400         95
# 2:    NA        800         50
#REMARK1 : The dataset is now KEYED, without any warning!
key(toy_data_cast)
# REMARK2 : NA appears in LAST! THIS SEEMS INCONSISTENT WITH KEYING NA? SEE BELOW.

This is where the bug lies : the dataset is marked as « keyed » while this is not coherent with setkey.

setkey(toy_data_cast,id) 
# Warning message:
# In setkeyv(x, cols, verbose = verbose, physical = physical) :
# Already keyed by this key but had invalid row order, key rebuilt. If you didn't go under the hood 
# please let datatable-help know so the root cause can be fixed.

toy_data_cast
#      id num_var1_U num_var2_U
# 1:    NA        800         50
# 2: 123456       1400         95
# REMARK : NA appears in FIRST!

Metadata

Metadata

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions