I have been using fwrite on a large AWS instance, detailed below:
Instance: m4.4xlarge
RAM: 64 GB
Threads: 16
Cost: $0.862 hourly
I notice that when writing out numeric values, the written output is incorrect. When the same data is coerced to the integer class, the output appears to be correct. The problem does not seem to occur with setDTthreads(1). However, errors start to appear with setDTthreads(2), albeit fewer than with setDTthreads(16). I have detailed the two scenarios below.
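A smaller reproduction might help narrow down the thread count at which mismatches begin. The sketch below is only an illustration: `count_roundtrip_mismatches` is a hypothetical helper (not part of data.table), and the data size and thread counts are arbitrary assumptions that may be too small to trigger the problem.

```r
library(data.table)

# Hypothetical helper, not part of data.table: write `dt` using a given
# number of threads, read the file back, and count the cells that differ.
count_roundtrip_mismatches <- function(dt, threads) {
  path <- tempfile(fileext = ".csv")
  old <- getDTthreads()
  # Restore the previous thread setting and remove the temp file on exit
  on.exit({ setDTthreads(old); unlink(path) })
  setDTthreads(threads)
  fwrite(dt, path)
  back <- fread(path)
  # Assumption: a badly corrupted file may change shape; report NA in that case
  if (nrow(back) != nrow(dt) || ncol(back) != ncol(dt)) return(NA_real_)
  sum(mapply(function(a, b) sum(a != b, na.rm = TRUE), dt, back))
}

set.seed(1)
small <- data.table(ID = as.numeric(1:1e5), V1 = as.numeric(rpois(1e5, 1e8)))
# One mismatch count per thread setting
sapply(c(1, 2), function(th) count_roundtrip_mismatches(small, th))
```

A non-zero count for any thread setting would point at the write or read step rather than at the test data itself.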
Writing numeric output with fwrite
# Set assumptions
library(data.table) # data.table 1.10.0
# Also tested on 1.10.1 (IN DEVELOPMENT, built 2016-12-09 03:17:34 UTC)
N=1e8
set.seed(1)
# Make test data
# Time repeats 0:49 over all N rows (explicit rather than relying on recycling)
check <- data.table(ID=1:N, Time=rep(0:49, length.out=N))
check[, paste0("V",1:5) := lapply(rep(N, 5), rpois, .N)]
sapply(check, class)
# Modify type of test data
check[, names(check) := lapply(.SD, function(x) x+0.1)]
sapply(check, class)
check[, names(check) := lapply(.SD, function(x) x-0.1)]
# Write data
fwrite(check, "check.csv")
rm(check)
gc()
# Read written data
checkWrite <- fread("check.csv")
sapply(checkWrite, class)
# Tests
checkWrite[, .N, by=.(Time)]
# Time N
# 1: 0 2000000
# 2: 1 1986674
# 3: 2 1985449
# 4: 3 1984278
# 5: 4 1984690
# ---
#44725: 99983607 1
#44726: 99980978 1
#44727: 99646303 1
#44728: 99649539 1
#44729: 99653656 1
checkWrite[V1 != round(V1,0)]
# ID Time V1 V2 V3 V4 V5
# 1: 449 48 97664.38965 99985524 99998767 100009754 100010494
# 2: 2509 8 97643.65527 99996565 99999968 99992995 99974462
# 3: 2751 0 23.84322 99996050 100007735 100008407 100005063
# 4: 6701 0 23.84024 99998435 99993597 100000237 99999963
# 5: 8101 0 781300.35938 99991390 99998825 100005920 99994596
# ---
#22599: 99968305 4 47.67970 99988021 100012376 99996538 100012572
#22600: 99970551 0 23.84325 99996674 99995685 100014894 100013937
#22601: 99973932 31 11.92025 100012133 99990838 99993337 99984441
#22602: 99975660 9 47.68194 99985576 99992718 99982681 99991041
#22603: 99977051 0 47.68710 99998134 100004288 100007002 99990199
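The V1 filter above covers only one column. As a rough sketch, a per-column count of non-whole values could show whether the other columns are affected too. `count_non_whole` is a hypothetical helper name, and it assumes every numeric column is expected to hold whole numbers, as in this test data.

```r
library(data.table)

# Hypothetical helper: for each numeric column, count values that are not
# whole numbers (NA values are ignored).
count_non_whole <- function(dt) {
  num_cols <- names(dt)[sapply(dt, is.numeric)]
  dt[, lapply(.SD, function(x) sum(x != round(x), na.rm = TRUE)),
     .SDcols = num_cols]
}

# Tiny illustration: column `a` has one non-whole value, column `b` has none
demo <- data.table(a = c(1, 2.5, 3), b = c(10L, 20L, 30L))
count_non_whole(demo)
```

Applied to the data read back from the CSV, this would be called as `count_non_whole(checkWrite)`.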
Writing integer output with fwrite
# Set assumptions
library(data.table) # data.table 1.10.0
N=1e8
set.seed(1)
# Make test data
# Time repeats 0:49 over all N rows (explicit rather than relying on recycling)
check <- data.table(ID=1:N, Time=rep(0:49, length.out=N))
check[, paste0("V",1:5) := lapply(rep(N, 5), rpois, .N)]
sapply(check, class)
# Modify type of test data
check[, names(check) := lapply(.SD, function(x) x+0.1)]
sapply(check, class)
check[, names(check) := lapply(.SD, function(x) x-0.1)]
# Convert classes back to integer
check[, names(check) := lapply(.SD, as.integer)]
sapply(check, class)
# Write data
fwrite(check, "check.csv")
rm(check)
gc()
# Read written data
checkWrite <- fread("check.csv")
sapply(checkWrite, class)
# Tests
checkWrite[, .N, by=.(Time)]
# Time N
# 1: 0 2000000
# 2: 1 2000000
# 3: 2 2000000
# 4: 3 4000000
# 5: 5 2000000
# ---
# 46: 45 2000000
# 47: 46 2000000
# 48: 47 2000000
# 49: 48 2000000
# 50: 49 2000000
checkWrite[V1 != round(V1,0)]
# Empty data.table (0 rows) of 7 cols: ID,Time,V1,V2,V3,V4...
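One caveat about the integer scenario: `as.integer` truncates toward zero, so a value that lands just below a whole number after the `+0.1` and `-0.1` steps is converted to the next lower integer. That appears consistent with Time 3 showing 4000000 rows and Time 4 being absent in the output above, though this is an inference from the printed output rather than something verified on the full data. A minimal illustration, which suggests rounding before converting:

```r
x <- (4 + 0.1) - 0.1   # mimics the x+0.1 then x-0.1 steps for Time == 4
x == 4                 # FALSE: x is slightly below 4
as.integer(x)          # 3, because as.integer truncates toward zero
as.integer(round(x))   # 4
```

Using `as.integer(round(x))` in the conversion step would keep the integer scenario comparable with the original data.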