I noticed a real pitfall (bug) while examining wrong results in my analysis.
Modifying a data.table by reference also alters the original object from which it was created with as.data.table. This happens for objects like DESeqResults and GRanges.
According to the vignette:
as.data.table methods returns a copy of original data
But this is apparently not true.
Here is a minimal example:
library(data.table)
library(DESeq2)
dds <- makeExampleDESeqDataSet(betaSD=1) # some example data
dds <- dds[1:3] # keep only the first three genes to minimize the example
rownames(dds) <- c("C","B","A") # gene names in non-alphabetical order
dds <- DESeq(dds)
res <- results(dds)
DT <- as.data.table(res) # this should create a copy
DT[, name := rownames(res)] # add the gene names as a column for clarity
DT[,.(name, baseMean,padj)] # print a reduced version
# name baseMean padj
# 1: C 11.594847 0.01511568
# 2: B 605.910995 0.99999893
# 3: A 3.010566 0.99999893
res[,c("baseMean", "padj")]
# DataFrame with 3 rows and 2 columns
# baseMean padj
# <numeric> <numeric>
# C 11.59484650746 0.0151156828069631
# B 605.910995247964 0.999998925569034
# A 3.01056578221606 0.999998925569034
Now using a set* function on DT also changes the values in the original res object.
setkey(DT, "name")
DT[,.(name, baseMean,padj)]
# name baseMean padj
# 1: A 3.010566 0.99999893
# 2: B 605.910995 0.99999893
# 3: C 11.594847 0.01511568
res[,c("baseMean", "padj")]
# DataFrame with 3 rows and 2 columns
# baseMean padj
# <numeric> <numeric>
# C 3.01056578221606 0.999998925569034
# B 605.910995247964 0.999998925569034
# A 11.59484650746 0.0151156828069631
Notice that the values for genes A and C are swapped. This is probably because setkey reorders the column values by reference, while the rownames of res stay in their original order!
Therefore any analysis using the original res will be completely wrong.
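A quick way to check whether two objects share column memory is to compare column addresses. The following is only a minimal sketch: shares_memory is a hypothetical helper (not part of data.table), and it is demonstrated on plain data.tables so that it runs without DESeq2. I assume, but have not verified, that it would work the same way when called as shares_memory(DT, res).

```r
library(data.table)

# Hypothetical helper (not part of data.table): for every column name
# present in both objects, report whether the two columns are the very
# same vector in memory.
shares_memory <- function(dt, src, cols = intersect(names(dt), names(src))) {
  vapply(cols, function(col) address(dt[[col]]) == address(src[[col]]),
         logical(1))
}

orig  <- data.table(name = c("C", "B", "A"), baseMean = c(11.6, 605.9, 3.0))
alias <- orig        # plain assignment: no copy, same columns
deep  <- copy(orig)  # deep copy: new column vectors

shares_memory(alias, orig) # TRUE for every column
shares_memory(deep, orig)  # FALSE for every column
```

If any column comes back TRUE, a by-reference operation such as setkey on one object can silently change the other.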
I am aware that there are workarounds for this, like:
DT <- copy(as.data.table(res))
DT <- data.table(as.data.frame(res))
but my main issue is that this behaviour is not obvious and very dangerous for downstream work.
This has been reported before, but nothing has changed.
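To illustrate why an explicit copy() protects the source, here is a small sketch on a toy data.table (the values are made up and it does not involve DESeq2): sorting the copy by reference leaves the source untouched.

```r
library(data.table)

# Toy stand-in for the real results; values are illustrative only.
orig <- data.table(name = c("C", "B", "A"), baseMean = c(11.6, 605.9, 3.0))

safe <- copy(orig)  # deep copy, fully detached from orig
setkey(safe, name)  # reorders safe by reference

safe$name # "A" "B" "C"
orig$name # "C" "B" "A" (unchanged)
```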
data.table issue
GRanges copy
I love using data.table, it is simply amazing.
Hopefully you can address this issue.
Thank you!