set changes original data (as.data.table on DESeq/GRanges objects) #3230

@NikdAK

I ran into a real pitfall (arguably a bug) while tracking down wrong results in my analysis.
Modifying a data.table by reference (for example with setkey) also alters the original object it was created from with as.data.table. This happens for objects like DESeqResults and GRanges.

According to the vignette:

as.data.table methods returns a copy of original data

But this is apparently not true.
Here is a minimal example:

library(data.table)
library(DESeq2)
dds <- makeExampleDESeqDataSet(betaSD=1) #some data
dds <- dds[1:3] #keep only three genes to minimize the example

rownames(dds) <- c("C","B","A") #random order of gene names
dds <- DESeq(dds)
res <- results(dds)

DT <- as.data.table(res) #this should create a copy
DT[, name := rownames(res)] #add the gene names as a column for clarity


DT[,.(name, baseMean,padj)] #print reduced versions
#   name   baseMean       padj
# 1:    C  11.594847 0.01511568
# 2:    B 605.910995 0.99999893
# 3:    A   3.010566 0.99999893

res[,c("baseMean", "padj")]
# DataFrame with 3 rows and 2 columns
#          baseMean               padj
#         <numeric>          <numeric>
# C   11.59484650746 0.0151156828069631
# B 605.910995247964  0.999998925569034
# A 3.01056578221606  0.999998925569034

Now using a set function on DT also changes the values in the original res object.

setkey(DT, "name")

DT[,.(name, baseMean,padj)]
#    name   baseMean       padj
# 1:    A   3.010566 0.99999893
# 2:    B 605.910995 0.99999893
# 3:    C  11.594847 0.01511568

res[,c("baseMean", "padj")]
# DataFrame with 3 rows and 2 columns
#           baseMean               padj
#          <numeric>          <numeric>
# C 3.01056578221606  0.999998925569034
# B 605.910995247964  0.999998925569034
# A   11.59484650746 0.0151156828069631

Notice that the values for genes A and C are swapped. This is probably because setkey sorts the column values in place, while the rownames of res stay in their original order.
Any downstream analysis that uses the original res will therefore be completely wrong.
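The by-reference mechanism can be illustrated without DESeq2. This is only a sketch with a made-up toy table: it does not reproduce the as.data.table path itself, it just shows that setkey reorders whatever memory the table points to, and that copy() breaks the link. The names alias and snapshot are chosen for illustration.

```r
library(data.table)

# Toy table standing in for the DESeq2 results (values are made up).
DT <- data.table(name = c("C", "B", "A"), value = c(1, 2, 3))

alias    <- DT        # plain assignment: no copy, both names refer to one object
snapshot <- copy(DT)  # copy(): an independent deep copy

setkey(alias, name)   # sorts the rows in place, by reference

DT$name        # reordered as well, since DT and alias are the same object
snapshot$name  # still in the original order C, B, A
```

If as.data.table hands back columns that still point to the memory of res, the same in-place sort would explain the swapped values above.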

I am aware that there are fixes for this like:

DT <- copy(as.data.table(res))
DT <- data.table(as.data.frame(res))

but my main concern is that this behaviour is not obvious and is very dangerous for downstream work.
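One way to make the sharing visible is to compare column memory addresses with data.table's address() function. The helper shares_columns below is hypothetical (not part of data.table), and the plain list src is only a stand-in for a DESeqResults or GRanges object. Whether as.data.table shares memory with its input may depend on the input class and the data.table version, so the sketch does not assume a result for that call.

```r
library(data.table)

# Hypothetical helper: TRUE if any column of x lives at the same
# memory address as a column (or list element) of y.
shares_columns <- function(x, y) {
  addr_x <- vapply(seq_along(x), function(j) address(x[[j]]), character(1))
  addr_y <- vapply(seq_along(y), function(j) address(y[[j]]), character(1))
  any(addr_x %in% addr_y)
}

# Made-up stand-in for the original object.
src <- list(name = c("C", "B", "A"), value = c(1, 2, 3))

DT   <- as.data.table(src)
safe <- copy(DT)  # explicit deep copy

shares_columns(DT, src)    # may be TRUE or FALSE depending on the version
shares_columns(safe, src)  # FALSE: copy() allocates new columns
```

A check like this could be run once after conversion, before any set* function or := touches the table.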

This has already been reported in some form, but nothing has changed:
data.table issue
GRanges copy

I love using data.table, it is simply amazing.
Hopefully you can address this issue.

Thank you!
