[Request] allow tmpDir to be supplied as argument: fread can run out of tmpfs space on unix during preprocessing #1139

@everdark

Hi,

Recently I've run into an issue with large compressed files where fread stops working because tmpfs runs out of space. Currently (in the master branch), fread on a unix system uses tmpfs (/dev/shm) whenever it exists, so the size of tmpfs limits how large a file fread can read before any preprocessing can be done. The problem is more severe when several files are loaded in parallel for speed, say, mclapply(input_list, fread, mc.cores=4), where input_list may be something like

"zcat file1.gz | grep blabla | ..."
"zcat file2.gz | grep blabla | ..."
...

Each gz file can be several GB uncompressed. I don't need all of the data in my analysis, and preprocessing could significantly reduce the size of each file. However, the preprocessing requires each file to be uncompressed to disk in the first place, which occupies all the space available in tmpfs. (There are, of course, several workarounds for this kind of situation, but it would be great to address it directly in one R function call, which is the fread call under discussion.)
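One workaround of the kind mentioned above can be sketched in shell: run the preprocessing outside R and write the reduced output to a disk-backed directory, then point fread at the resulting file. This is only a minimal sketch under stated assumptions. The helper name `preprocess_to_disk`, its argument order, and the `fread_` file prefix are hypothetical and are not part of fread or data.table.

```shell
#!/bin/sh
# Hypothetical helper (not part of data.table): decompress a gzip file,
# filter it with grep, and store the reduced result in a directory chosen
# by the caller, so that tmpfs (/dev/shm) is never used.
# Usage: preprocess_to_disk <input.gz> <grep-pattern> <output-dir>
preprocess_to_disk() {
    in_file=$1
    pattern=$2
    out_dir=$3

    # Create a uniquely named output file inside the chosen directory.
    out_file=$(mktemp "$out_dir/fread_XXXXXX") || return 1

    # gzip -dc is used here as a portable equivalent of zcat.
    # The pipeline streams, so only the filtered lines are written out.
    gzip -dc "$in_file" | grep -- "$pattern" > "$out_file"

    # Print the path so the caller can hand it to fread.
    printf '%s\n' "$out_file"
}
```

For example, `preprocess_to_disk file1.gz blabla /data` prints the path of the filtered file under /data, which can then be read in R with fread on that path. The file is not removed automatically, so it has to be deleted once it has been read.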

It would therefore be nice to have a user-supplied argument that forces the temporary file into a location other than tmpfs on unix systems. For example, dat <- fread("zcat file.gz", tmpDir="/data"). Performance may be a bit worse because of disk I/O, but the raw data will no longer be limited by the size of tmpfs, which is usually far smaller than any disk device at hand. (On my machine I have 8 GB in tmpfs and that's it.)

A possible minor change that would fix this issue on unix is to modify fread.R as in everdark@4aaa745.

I have only tested it on my local machine, where it works fine. There could be ramifications that I haven't taken into account in this simple modification, so I created this request to open the discussion. :) Has anybody else encountered this tmpfs out-of-space issue?
