fread with large csv (44 GB) takes a lot of RAM in latest data.table dev version #2073

@geponce

Hi,

Hardware and software:
Server: Dell R930, 4x Intel Xeon E7-8870 v3 (2.1GHz, 45M cache, 9.6GT/s QPI, Turbo, HT, 18C/36T), 1TB of RAM
OS: Red Hat 7.1
R-version: 3.3.2
data.table version: 1.10.5 built 2017-03-21

I'm loading a csv file (44 GB, 872505 rows x 12785 cols). It loads very fast, in about 1 minute 30 seconds, using 144 threads (72 physical cores across the 4 processors, with hyperthreading enabled to make it a 144-core box).

The main issue is that once the DT is loaded, the amount of memory in use is far larger than the size of the csv file. In this case the 44 GB csv (written with fwrite; the same data saved with saveRDS and compress=FALSE produces an 84 GB file) is using ~356 GB of RAM.

Here is the output using "verbose=TRUE":
Allocating 12785 column slots (12785 - 0 dropped)
madvise sequential: ok
Reading data with 1440 jump points and 144 threads
Read 95.7% of 858881 estimated rows
Read 872505 rows x 12785 columns from 43.772GB file in 1 mins 33.736 secs of wall clock time (affected by other apps running)
0.000s ( 0%) Memory map
0.070s ( 0%) sep, ncol and header detection
26.227s ( 28%) Column type detection using 34832 sample rows from 1440 jump points
0.614s ( 1%) Allocation of 3683116 rows x 12785 cols (350.838GB) in RAM
0.000s ( 0%) madvise sequential
66.825s ( 71%) Reading data
93.736s Total
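
The allocation line in this log can be checked with a little arithmetic. The sketch below is only a rough check, not anything fread itself runs: it assumes 8 bytes per cell (as for numeric columns) and that "GB" in the log means 2^30 bytes. Neither is stated in the output, although together they reproduce the reported figure. The row and column counts are copied from the log; the helper name `table_gib` is made up for illustration.

```python
# Figures copied from the verbose output above.
ALLOCATED_ROWS = 3_683_116   # "Allocation of 3683116 rows x 12785 cols"
ROWS_READ = 872_505          # "Read 872505 rows x 12785 columns"
NCOL = 12_785

# Assumptions (not stated in the log): 8 bytes per cell, 1 GB = 2**30 bytes.
BYTES_PER_CELL = 8
GIB = 2**30


def table_gib(rows, cols, bytes_per_cell=BYTES_PER_CELL):
    """Size of a dense rows x cols table in GB (2**30 bytes)."""
    return rows * cols * bytes_per_cell / GIB


allocated_gib = table_gib(ALLOCATED_ROWS, NCOL)   # what was reserved up front
needed_gib = table_gib(ROWS_READ, NCOL)           # what the rows actually read need
overallocation = ALLOCATED_ROWS / ROWS_READ       # allocated rows per row read

print(f"allocated: {allocated_gib:.3f} GB")
print(f"needed:    {needed_gib:.3f} GB")
print(f"ratio:     {overallocation:.2f}x")
```

Under those assumptions the allocated size matches the 350.838GB reported in the log, while the rows actually read would need roughly 83 GB, close to the 84 GB uncompressed RDS file. The allocation is for about 4.2 times as many rows as were read, which would be consistent with the ~356 GB of RAM in use coming from the up-front row allocation.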

This looks similar to an issue that sometimes arises when working with the parallel package, where one rsession is launched per core when using functions like "mclapply". See the Rsessions listed in this screenshot:

[Screenshot: list of Rsessions]

If I do "rm(DT)", RAM goes back to its initial state and the "Rsessions" are removed.

I already tried e.g. "setDTthreads(20)", and it still uses the same amount of RAM.

By the way, if the file is loaded with the non-parallel version of "fread", memory use only gets up to ~106 GB.

Guillermo
