fread segfault on file larger than 1GB #4775

@agarwal-i

[Minimal reproducible example]

> library(data.table)
data.table 1.13.2 using 2 threads (see ?getDTthreads).  Latest news: r-datatable.com

> test=fread("test.txt.gz", verbose=T, header=F)
  omp_get_num_procs()            5
  R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
  R_DATATABLE_NUM_THREADS        unset
  R_DATATABLE_THROTTLE           unset (default 1024)
  omp_get_thread_limit()         2147483647
  omp_get_max_threads()          5
  OMP_THREAD_LIMIT               unset
  OMP_NUM_THREADS                unset
  RestoreAfterFork               true
  data.table is using 2 threads with throttle==1024. See ?setDTthreads.
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 2 threads (omp_get_max_threads()=5, nth=2)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file /tmp/RtmpauSFQC/file238133bcee68
  File opened, size = 1.190GB (1277743763 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<chr1	11819	11820	AGT	G	A	downs>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=0x9  with 100 lines of 15 fields using quote rule 0
  Detected 15 columns on line 1. This line is either column names or first data row. Line starts as: <<chr1	11819	11820	AGT	G	A	downs>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 15
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to false
  Number of sampling jump points = 100 because (1277743762 bytes from row 1 to eof) / (2 * 12931 jump0size) == 49406
  Type codes (jump 000)    : C55CCCCCCCCC77C  Quote rule 0
  Type codes (jump 100)    : C55CCCCCCCCC77C  Quote rule 0
  =====
  Sampled 10052 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 1277743762
  Line length: mean=128.33 sd=16.15 min=97 max=189
  Estimated number of rows: 1277743762 / 128.33 = 9956806
  Initial alloc = 13172616 rows (9956806 + 32%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : C55CCCCCCCCC77C
[10] Allocate memory for the datatable
  Allocating 15 column slots (15 - 0 dropped) with 13172616 rows
[11] Read the data
  jumps=[0..1218), chunk_size=1049050, total_size=1277743762
|--------------------------------------------------|
|=
 *** caught segfault ***
address 0x64, cause 'memory not mapped'

Traceback:
 1: fread("test.txt.gz", verbose = T, header = F)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

Output of sessionInfo()

R version 3.5.3 (2019-03-11)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS: R-3.5.3/lib/libRblas.so
LAPACK: R-3.5.3/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.13.2

loaded via a namespace (and not attached):
[1] compiler_3.5.3
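
For anyone triaging this, the sizing figures in the verbose output can be re-derived by hand, which suggests the row estimate and initial allocation were computed as intended before the crash in step [11]. The sketch below only replays the arithmetic printed in the log; the variable names are illustrative, and the truncation of the allocation to a whole number of rows is an assumption on my part, not something taken from the data.table source.

```r
# Values copied from the verbose fread output above.
total_bytes <- 1277743762   # bytes from first data row to end of last row
jump0size   <- 12931        # from step [07]
line_mean   <- 128.33
line_sd     <- 16.15
line_min    <- 97
est_rows    <- 9956806      # "Estimated number of rows" as printed
n_jumps     <- 1218         # from "jumps=[0..1218)"

# Step [07]: bytes / (2 * jump0size), which the log reports as 49406.
jump_ratio <- total_bytes %/% (2 * jump0size)

# Step [07]: initial allocation = bytes / max(mean - 2*sd, min),
# clamped between 1.1 * est_rows and 2.0 * est_rows.
bytes_per_row <- max(line_mean - 2 * line_sd, line_min)  # min (97) wins here
alloc <- total_bytes / bytes_per_row
alloc <- min(max(alloc, 1.1 * est_rows), 2.0 * est_rows)
alloc <- floor(alloc)       # assumption: truncated to whole rows

# Headroom over the estimate, as a percentage.
growth_pct <- round(100 * (alloc / est_rows - 1))

# Step [11]: chunk size if the total is split evenly across the jumps.
chunk_size <- total_bytes %/% n_jumps

print(c(jump_ratio = jump_ratio, alloc = alloc,
        growth_pct = growth_pct, chunk_size = chunk_size))
```

These match the logged values (49406, 13172616 rows, +32%, chunk_size=1049050), so the sampling and allocation steps look internally consistent and the fault appears to lie in the read phase itself.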
