fread crashes on files with mixed Windows and Unix line endings #1183

@dhicks

I'm brand new to fread and data.table. I'm trying out fread as (hopefully) a faster alternative to read.csv for two large data files from the US Department of Education (about 320 MB and 270 MB). The data can be downloaded as a zip file here: http://nces.ed.gov/ipeds/deltacostproject/download/IPEDS_Analytics_DCP_87_12_CSV.zip (110 MB). The zip file contains two csv files; for this MRE, I'm working with delta_public_87_99.csv.

Given the csv in the working directory, this MRE reliably causes R to crash on my machine:

library(data.table)

sessionInfo()

ipeds1 <- 'delta_public_87_99.csv'

# Column classes: 5 numeric, 5 character, then 964 numeric (974 in total)
colclasses <- c(
    rep('numeric', 5),
    rep('character', 5),
    rep('numeric', 964))

#thing <- read.csv(ipeds1, colClasses = colclasses)
thing <- fread(ipeds1, colClasses = colclasses, verbose=TRUE)

Here's the output from fread:

# Input contains no \n. Taking this to be a filename to open
# File opened, filesize is 0.294936 GB.
# Memory mapping ... ok
# Detected eol as \r\n (CRLF) in that order, the Windows standard.
# Positioned on line 1 after skip or autostart
# This line is the autostart and not blank so searching up for the last non-blank ... line 1
# Detecting sep ... ','
# Detected 124595570 columns. Longest stretch was from line 2 to line 2
# Starting data input on line 2 (either column names or first row of data). First 10 characters: -434973,19
# Some fields on line 2 are not type character (or are empty). Treating as a data row and using default column names.

At this point, memory use starts to grow dramatically. At around 6-8 GB, the R session aborts with the message: "R encountered a fatal error. The session was terminated."
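
The issue title points at mixed Windows and Unix line endings, and the verbose output would be consistent with that: fread detects \r\n as the eol, then reports 124595570 columns with the longest stretch running from line 2 to line 2, as if everything after the first line were treated as one line. One way to check is to count the line-ending styles in the raw bytes. The sketch below uses base R only; `count_eol` is a hypothetical helper name, not part of data.table, and it assumes the file fits in memory.

```r
# Hypothetical helper (not part of data.table): count line-ending styles
# by inspecting the raw bytes of a file. Assumes the file fits in memory.
count_eol <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  cr <- which(bytes == as.raw(0x0d))  # positions of carriage returns (\r)
  lf <- which(bytes == as.raw(0x0a))  # positions of line feeds (\n)
  crlf <- sum((cr + 1L) %in% lf)      # \r immediately followed by \n
  c(crlf    = crlf,
    lf_only = length(lf) - crlf,      # \n not preceded by \r
    cr_only = length(cr) - crlf)      # \r not followed by \n
}

# Usage with the file from this report:
# count_eol('delta_public_87_99.csv')
```

If more than one of the three counts is non-zero, the file mixes line-ending styles.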

Output from sessionInfo():

# R version 3.2.0 (2015-04-16)
# Platform: x86_64-apple-darwin13.4.0 (64-bit)
# Running under: OS X 10.10.3 (Yosemite)
# 
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] data.table_1.9.5
# 
# loaded via a namespace (and not attached):
# [1] tools_3.2.0  chron_2.3-45

Looking for similar problems, I found issue #1035, "fread fails if whitespace before first character." However, inspecting the file with readLines, I don't see any leading whitespace in my data file.

Since I'm new to fread and data.table, I'm not sure if I might be missing something basic, so for now I am posing this as a [Support] rather than a bug report.
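
If the file does turn out to mix \r\n and \n line endings, as the issue title suggests, a possible workaround is to normalize the endings before calling fread. The following is a minimal sketch in base R, not a confirmed fix: `normalize_eol` is a hypothetical helper name, it assumes the file fits in memory, and it only rewrites \r\n pairs as \n (a lone \r is left untouched). Whether this avoids the crash on this particular file is untested.

```r
# Hypothetical helper (not part of data.table): write a copy of a file
# with every \r\n pair replaced by a single \n. Lone \r bytes are kept.
normalize_eol <- function(path, out = tempfile(fileext = ".csv")) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  is_cr <- bytes == as.raw(0x0d)
  is_lf <- bytes == as.raw(0x0a)
  # TRUE where a \r is immediately followed by \n (first half of a CRLF pair)
  cr_before_lf <- is_cr & c(is_lf[-1], FALSE)
  # Drop those \r bytes, leaving only the \n of each pair
  writeBin(bytes[!cr_before_lf], out)
  out
}

# Usage with the objects from the MRE above:
# thing <- fread(normalize_eol(ipeds1), colClasses = colclasses, verbose = TRUE)
```

Working on raw bytes rather than readLines keeps every other byte of the file unchanged, so only the line terminators differ between the original and the copy.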
