I'm brand new to fread and data.table. I'm trying out fread as (hopefully) a faster alternative to read.csv for two large sets of data from the US Department of Education (about 320 and 270 MB each). My dataset can be downloaded as a zip file here: http://nces.ed.gov/ipeds/deltacostproject/download/IPEDS_Analytics_DCP_87_12_CSV.zip (110 MB). The zip file contains two csv files. For this MRE, I'm working with delta_public_87_99.csv.
Given the csv in the working directory, this MRE reliably causes R to crash on my machine:
library(data.table)
sessionInfo()
ipeds1 <- 'delta_public_87_99.csv'
colclasses <- c(
rep('numeric', 5),
rep('character', 5),
rep('numeric', 964))
#thing <- read.csv(ipeds1, colClasses = colclasses)
thing <- fread(ipeds1, colClasses = colclasses, verbose=TRUE)
Here's the output from fread:
# Input contains no \n. Taking this to be a filename to open
# File opened, filesize is 0.294936 GB.
# Memory mapping ... ok
# Detected eol as \r\n (CRLF) in that order, the Windows standard.
# Positioned on line 1 after skip or autostart
# This line is the autostart and not blank so searching up for the last non-blank ... line 1
# Detecting sep ... ','
# Detected 124595570 columns. Longest stretch was from line 2 to line 2
# Starting data input on line 2 (either column names or first row of data). First 10 characters: -434973,19
# Some fields on line 2 are not type character (or are empty). Treating as a data row and using default column names.
At this point, memory use starts to grow dramatically. Around 6-8 GB the R session is aborted: "R encountered a fatal error. The session was terminated."
Output from sessionInfo():
# R version 3.2.0 (2015-04-16)
# Platform: x86_64-apple-darwin13.4.0 (64-bit)
# Running under: OS X 10.10.3 (Yosemite)
#
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] data.table_1.9.5
#
# loaded via a namespace (and not attached):
# [1] tools_3.2.0 chron_2.3-45
Looking for similar problems, I found issue #1035, "fread fails if whitespace before first character." However, when I inspect the file with readLines, I don't see any leading whitespace in my data file.
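A check of this kind can be sketched as follows. This is an illustration only, not necessarily the exact commands used: `has_leading_whitespace` is a hypothetical helper name (it is not part of data.table), and reading just the first 5 lines is an arbitrary choice made so the whole file is not loaded.

```r
# Hypothetical helper (not part of data.table): for each of the first `n`
# lines of a file, report whether the line begins with a whitespace character.
has_leading_whitespace <- function(path, n = 5) {
  # Read only the first n lines so a ~300 MB file is not loaded in full.
  first_lines <- readLines(path, n = n, warn = FALSE)
  # TRUE where a line starts with a space, tab, or other whitespace.
  grepl('^[[:space:]]', first_lines)
}

# Same file name as in the MRE; assumed to be in the working directory.
ipeds1 <- 'delta_public_87_99.csv'
if (file.exists(ipeds1)) {
  print(has_leading_whitespace(ipeds1))
}
```

If every element of the result is FALSE, none of the inspected lines starts with whitespace, which would suggest that issue #1035 is not what is happening here.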
Since I'm new to fread and data.table, I'm not sure whether I'm missing something basic, so for now I am posting this as a [Support] question rather than a bug report.