[Bug] fwrite in large environments (tables up to 100M rows) #1968

@mgahan

I have been using fwrite in a large AWS environment detailed below:

Instance: m4.4xlarge
RAM: 64gb
Threads: 16
Cost: $0.862 hourly

I notice that when fwrite writes out columns of class numeric, the written output is incorrect. When the same data is coerced to integer, the output appears to be correct. The problem does not appear with setDTthreads(1). However, errors start to creep in with setDTthreads(2), albeit fewer than with setDTthreads(16). I have detailed the two scenarios below.

Writing numeric output with fwrite

# Set assumptions
library(data.table) # data.table 1.10.0 
# Also tested for 1.10.1 IN DEVELOPMENT built 
# 2016-12-09 03:17:34 UTC
N=1e8
set.seed(1)

# Make test data
check <- data.table(ID=1:N, Time=0:49)
check[, paste0("V",1:5) := lapply(rep(N, 5), rpois, .N)]
sapply(check, class)

# Modify type of test data
check[, names(check) := lapply(.SD, function(x) x+0.1)]
sapply(check, class)
check[, names(check) := lapply(.SD, function(x) x-0.1)]

# Write data
fwrite(check, "check.csv")
rm(check)
gc()

# Read written data
checkWrite <- fread("check.csv")
sapply(checkWrite, class)
	
# Tests
checkWrite[, .N, by=.(Time)]
	
#           Time       N
#    1:        0 2000000
#    2:        1 1986674
#    3:        2 1985449
#    4:        3 1984278
#    5:        4 1984690
#   ---                 
#44725: 99983607       1
#44726: 99980978       1
#44727: 99646303       1
#44728: 99649539       1
#44729: 99653656       1
	
checkWrite[V1 != round(V1,0)]

#             ID Time           V1        V2        V3        V4        V5
#    1:      449   48  97664.38965  99985524  99998767 100009754 100010494
#    2:     2509    8  97643.65527  99996565  99999968  99992995  99974462
#    3:     2751    0     23.84322  99996050 100007735 100008407 100005063
#    4:     6701    0     23.84024  99998435  99993597 100000237  99999963
#    5:     8101    0 781300.35938  99991390  99998825 100005920  99994596
#   ---                                                                   
#22599: 99968305    4     47.67970  99988021 100012376  99996538 100012572
#22600: 99970551    0     23.84325  99996674  99995685 100014894 100013937
#22601: 99973932   31     11.92025 100012133  99990838  99993337  99984441
#22602: 99975660    9     47.68194  99985576  99992718  99982681  99991041
#22603: 99977051    0     47.68710  99998134 100004288 100007002  99990199
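
The Time grouping above returns 44,729 distinct values where only 50 are expected, which might mean that fields were shifted or merged in the file itself rather than misread by fread. A minimal sketch for checking that, assuming the file contains only unquoted numeric fields: `bad_field_counts` is a hypothetical helper (not part of data.table) that uses `utils::count.fields` to report lines whose number of comma-separated fields differs from the expected count.

```r
# Hypothetical helper, not part of data.table.
# Returns the line numbers of `path` whose comma-separated field count
# differs from `expected`. Assumes no quoted fields and no comment lines.
bad_field_counts <- function(path, expected) {
  n <- utils::count.fields(path, sep = ",", quote = "", comment.char = "",
                           blank.lines.skip = FALSE)
  which(n != expected)
}

# Usage on the file written above (7 columns: ID, Time, V1..V5).
# On a 100M-row file this scan is slow; it is only meant as a diagnostic.
if (file.exists("check.csv")) {
  bad <- bad_field_counts("check.csv", expected = 7L)
  print(length(bad))
  print(head(bad))
}
```

If every line has 7 fields, the corruption would be inside individual field values, which the `V1 != round(V1,0)` test above already points at.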

Writing integer output with fwrite

# Set assumptions
library(data.table) # data.table 1.10.0
N=1e8
set.seed(1)

# Make test data
check <- data.table(ID=1:N, Time=0:49)
check[, paste0("V",1:5) := lapply(rep(N, 5), rpois, .N)]
sapply(check, class)

# Modify type of test data
check[, names(check) := lapply(.SD, function(x) x+0.1)]
sapply(check, class)
check[, names(check) := lapply(.SD, function(x) x-0.1)]

# Convert classes back to integer
check[, names(check) := lapply(.SD, as.integer)]
sapply(check, class)

# Write data
fwrite(check, "check.csv")
rm(check)
gc()
	
# Read written data
checkWrite <- fread("check.csv")
sapply(checkWrite, class)
	
# Tests
checkWrite[, .N, by=.(Time)]
	
#    Time       N
# 1:    0 2000000
# 2:    1 2000000
# 3:    2 2000000
# 4:    3 4000000
# 5:    5 2000000
#   ---   	
# 46:   45 2000000
# 47:   46 2000000
# 48:   47 2000000
# 49:   48 2000000
# 50:   49 2000000
	
checkWrite[V1 != round(V1,0)]
	
# Empty data.table (0 rows) of 7 cols: ID,Time,V1,V2,V3,V4...
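
The thread-count comparison described at the top can be sketched as a round-trip check. This is only an illustration: `roundtrip_check` is a hypothetical helper (not part of data.table), and the table here is much smaller than the 1e8 rows used above, so it may well not trigger the problem unless N is increased.

```r
library(data.table)

# Hypothetical helper, not part of data.table.
# Writes `dt` with the given number of threads, reads it back, and reports
# whether the row count survived and how many rows differ in any column.
roundtrip_check <- function(dt, threads) {
  path <- tempfile(fileext = ".csv")
  old <- getDTthreads()
  on.exit({ setDTthreads(old); unlink(path) })
  setDTthreads(threads)
  fwrite(dt, path)
  back <- fread(path)
  same_nrow <- nrow(back) == nrow(dt)
  mismatched <- NA_integer_
  if (same_nrow && ncol(back) == ncol(dt)) {
    # Compare as double so that fread's integer columns match numeric input
    differs <- Map(function(a, b) as.numeric(a) != as.numeric(b), dt, back)
    mismatched <- sum(Reduce(`|`, differs))
  }
  list(threads = threads, same_nrow = same_nrow, mismatched_rows = mismatched)
}

# Reduced-size test data with the same shape as above, stored as numeric
set.seed(1)
N <- 1e5  # assumption: increase toward 1e8 to approach the reported scenario
small <- data.table(ID = as.numeric(1:N), Time = as.numeric(0:49))
small[, paste0("V", 1:5) := lapply(1:5, function(i) as.numeric(rpois(N, 1e8)))]

for (th in c(1L, 2L, 16L)) {
  print(roundtrip_check(small, th))
}
```

Comparing against the in-memory table avoids relying on the Time and V1 spot checks, and running it per thread count shows directly whether the mismatch count grows with the number of threads.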
