fread ignores colClasses factor assignment

I'm using the fread function of `data.table` to read in a large csv file (5 million records of 28 variables) as efficiently as possible on a laptop with just 4GB RAM.  Many of my variables look `numeric` but they are actually `factors` (various id numbers).  In order to avoid incorrect interpretation of these variables when reading into R and also because I have read this further improves the speed of the import, I individually specified `colClasses()` for each variable in the call to `fread`.  However my assignments are ignored, see below:

``` R
# Function to convert dates of a specific format to date:
setAs("character","myDate", function(from) as.Date(from, format="%d/%m/%Y") )
setClass('myDate')

# Read in the data with column classes and NAs defined:
amrdt = fread("amrdata.csv", na.strings=c("NA", "", " "),
              colClasses=c(
                "isolate.key"="factor",
                "laboratory.id"="factor",
                "patient.id"= "factor",
                "patient.nhs.no"= "factor",
                "patient.dob"= "myDate",
                "patient.age"= "integer",
                "patient.age.in.months"= "integer",
                "patient.sex"= "factor",
                "patient.postcode"= "factor",
                "sender.code"= "factor",
                "ams.sender.code"= "factor",
                "care.trust"= "factor",
                "speciality"= "factor",
                "practice.code"= "factor",
                "lan"= "factor",
                "specimen.date"= "myDate",
                "specimen.site.id"= "factor",
                "organism.id"= "factor",
                "antibiotic.id"= "factor",
                "result.code"= "factor",
                "zone.size"= "integer",
                "mic"= "integer",
                "local.authority"= "factor",
                "repeat.exclusion"= "integer",
                "nhs.region.name"= "factor",
                "yyyy"= "integer",
                "mm"= "integer",
                "dd"= "integer"))

# Shorten the variable names:
> setnames(amrdt, c("iid", "labid", "pid", "nhsno",
                  "dob", "age", "agemon", "sex", "ppostcode",
                  "sender", "amsender", "trust", "speciality", "practice", "lan",
                  "specdate", "spectype", "organism", "antib", "result", "zone", "mic",
                  "localauthority", "repid", "nhsregion", "year", "month", "day"))

```

Here are the results of `fread`. Seemingly `fread` has defaulted to automatic interpretation of colClasses:

``` R
str(amrdt)
# Classes ‘data.table’ and 'data.frame':    5000000 obs. of  28 variables:
#  $ iid           : chr  "1" "2" "3" "4" ...
#  $ labid         : chr  "32520" "32520" "32520" "32520" ...
#  $ pid           : chr  "A425581             " "A425581             " "A425581             " "A425581             " ...
#  $ nhsno         : chr  "4250559920" "4250559920" "4250559920" "4250559920" ...
#  $ dob           : chr  "1942-10-09" "1942-10-09" "1942-10-09" "1942-10-09" ...
#  $ age           : int  68 68 68 68 68 68 68 68 68 3 ...
#  $ agemon        : int  NA NA NA NA NA NA NA NA NA NA ...
#  $ sex           : chr  "F" "F" "F" "F" ...
#  $ ppostcode     : chr  "DH1 3QQ" "DH1 3QQ" "DH1 3QQ" "DH1 3QQ" ...
#  $ sender        : chr  "ARLNDU    " "ARLNDU    " "ARLNDU    " "ARLNDU    " ...
#  $ amsender      : chr  "RLNDU     " "RLNDU     " "RLNDU     " "RLNDU     " ...
#  $ trust         : chr  "5ND       " "5ND       " "5ND       " "5ND       " ...
#  $ speciality    : chr  "180" "180" "180" "180" ...
#  $ practice      : chr  NA NA NA NA ...
#  $ lan           : chr  "11M028966" "11M028966" "11M028966" "11M028966" ...
#  $ specdate      : chr  "2011-03-10" "2011-03-10" "2011-03-10" "2011-03-10" ...
#  $ spectype      : chr  "T0X000" "T0X000" "T0X000" "T0X000" ...
#  $ organism      : chr  "2542.0000" "2542.0000" "2542.0000" "2542.0000" ...
#  $ antib         : chr  "AUG" "CTX" "CIP" "ERY" ...
#  $ result        : chr  "S" "S" "R" "S" ...
#  $ zone          : num  0 0 0 0 0 0 0 0 0 0 ...
#  $ mic           : chr  NA NA NA NA ...
#  $ localauthority: chr  "00EJ" "00EJ" "00EJ" "00EJ" ...
#  $ repid         : int  0 0 0 0 0 0 0 0 0 0 ...
#  $ nhsregion     : chr  "North East" "North East" "North East" "North East" ...
#  $ year          : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
#  $ month         : int  3 3 3 3 3 3 3 3 3 3 ...
#  $ day           : int  10 10 10 10 10 10 10 10 10 7 ...
#  - attr(*, ".internal.selfref")=<externalptr> 

# They are definitely not factors:
is.factor(amrdt$result)
# [1] FALSE

# Here is example data from the head and tail:
head(amrdt)
#    iid labid                  pid      nhsno        dob age agemon sex ppostcode
#1:   1 32520 A425581              1234567890 1900-01-01  110     NA   F   XX1 1YY
#2:   2 32520 A425581              1234567890 1900-01-01  110     NA   F   XX1 1YY
#3:   3 32520 A425581              1234567890 1900-01-01  110     NA   F   XX1 1YY
#4:   4 32520 A425581              1234567890 1900-01-01  110     NA   F   XX1 1YY
#5:   5 32520 A425581              1234567890 1900-01-01  110     NA   F   XX1 1YY
#6:   6 32520 A425581              1234567890 1900-01-01  110     NA   F   XX1 1YY
#        sender   amsender      trust speciality practice       lan   specdate
#1: ARLNDU     RLNDU      5ND               180       NA 11M028966 2011-03-10
#2: ARLNDU     RLNDU      5ND               180       NA 11M028966 2011-03-10
#3: ARLNDU     RLNDU      5ND               180       NA 11M028966 2011-03-10
#4: ARLNDU     RLNDU      5ND               180       NA 11M028966 2011-03-10
#5: ARLNDU     RLNDU      5ND               180       NA 11M028966 2011-03-10
#6: ARLNDU     RLNDU      5ND               180       NA 11M028966 2011-03-10
#    spectype  organism antib result zone mic localauthority repid  nhsregion year
#1:   T0X000 2542.0000   AUG      S    0  NA           00EJ     0 North East 2011
#2:   T0X000 2542.0000   CTX      S    0  NA           00EJ     0 North East 2011
#3:   T0X000 2542.0000   CIP      R    0  NA           00EJ     0 North East 2011
#4:   T0X000 2542.0000   ERY      S    0  NA           00EJ     0 North East 2011
#5:   T0X000 2542.0000   OXA      S    0  NA           00EJ     0 North East 2011
#6:   T0X000 2542.0000   PEN      S    0  NA           00EJ     0 North East 2011
#    month day
#1:     3  10
#2:     3  10
#3:     3  10
#4:     3  10
#5:     3  10
#6:     3  10

tail(amrdt)
#        iid  labid                  pid nhsno        dob age agemon sex ppostcode
#1: 5369819 610740 ZM08395925               1 1950-01-01  61     NA   F        NA
#2: 5369820 610740 ZM08395925               1 1950-01-01  61     NA   F        NA
#3: 5369821 610740 ZM08395925               1 1950-01-01  61     NA   F        NA
#4: 5369822 610740 ZM08395925               1 1950-01-01  61     NA   F        NA
#5: 5369823 610740 ZM08395925               1 1950-01-01  61     NA   F        NA
#6: 5369824 610740 ZM08395925               1 1950-01-01  61     NA   F        NA
#        sender   amsender      trust speciality practice            lan
#1: GM85030    M85030     5M1               600       NA M.11.2021147.S
#2: GM85030    M85030     5M1               600       NA M.11.2021147.S
#3: GM85030    M85030     5M1               600       NA M.11.2021147.S
#4: GM85030    M85030     5M1               600       NA M.11.2021147.S
#5: GM85030    M85030     5M1               600       NA M.11.2021147.S
#6: GM85030    M85030     5M1               600       NA M.11.2021147.S
#      specdate spectype  organism antib result zone mic localauthority repid
#1: 2011-03-31   T7X100 1571.0010   MER      S    0  NA           00CN     0
#2: 2011-03-31   T7X100 1571.0010   NAL      R    0  NA           00CN     0
#3: 2011-03-31   T7X100 1571.0010 NITRO      S    0  NA           00CN     0
#4: 2011-03-31   T7X100 1571.0010   PIP      R    0  NA           00CN     0
#5: 2011-03-31   T7X100 1571.0010  TEMO      S    0  NA           00CN     0
#6: 2011-03-31   T7X100 1571.0010   TMP      R    0  NA           00CN     0
#        nhsregion year month day
#1: West Midlands 2011     3  31
#2: West Midlands 2011     3  31
#3: West Midlands 2011     3  31
#4: West Midlands 2011     3  31
#5: West Midlands 2011     3  31
#6: West Midlands 2011     3  31
```

I can convert everything to a factor with the following code:

``` R
dtnew <- amrdt[, lapply(.SD, as.factor), by=iid]
```

... but this is very slow, copies the data (which I want to avoid) and I would need to adjust the `colClasses()` for those few columns that are not supposed to be factors.  

From reading other posts it seems this behaviour has something to do with "factor" not being a basic type column class?  In any case it would be enormously useful if fread could accept factors and other non-basic types of column class (such as the date type that I created above). 

As a (related) aside, a feature that would be really useful is if one could convert an ffdf object directly to a `data.table`.  The reason I have read in a `.csv` file with the above code is because I couldn't figure out a way to do this (so wrote to `.csv` with `write.csv.ffdf`, which took about 30 mins to write to a 1.002 GB file and then read in the `.csv` with `fread`, which took 3 mins 12 secs).  If it were possible to convert directly from ffdf to a data.table with fread, without dumping into a `.csv` first, that would be a significant time saving.

Many thanks for your help.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fread ignores colClasses factor assignment #721

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

fread ignores colClasses factor assignment #721

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions