Skip to content

fread ignores colClasses factor assignment #721

@AmyMikhail

Description

@AmyMikhail

I'm using the fread function of data.table to read in a large csv file (5 million records of 28 variables) as efficiently as possible on a laptop with just 4GB RAM. Many of my variables look numeric but they are actually factors (various id numbers). In order to avoid incorrect interpretation of these variables when reading into R and also because I have read this further improves the speed of the import, I individually specified colClasses() for each variable in the call to fread. However my assignments are ignored, see below:

# Function to convert dates of a specific format to date:
setAs("character","myDate", function(from) as.Date(from, format="%d/%m/%Y") )
setClass('myDate')

# Read in the data with column classes and NAs defined:
amrdt = fread("amrdata.csv", na.strings=c("NA", "", " "),
              colClasses=c(
                "isolate.key"="factor",
                "laboratory.id"="factor",
                "patient.id"= "factor",
                "patient.nhs.no"= "factor",
                "patient.dob"= "myDate",
                "patient.age"= "integer",
                "patient.age.in.months"= "integer",
                "patient.sex"= "factor",
                "patient.postcode"= "factor",
                "sender.code"= "factor",
                "ams.sender.code"= "factor",
                "care.trust"= "factor",
                "speciality"= "factor",
                "practice.code"= "factor",
                "lan"= "factor",
                "specimen.date"= "myDate",
                "specimen.site.id"= "factor",
                "organism.id"= "factor",
                "antibiotic.id"= "factor",
                "result.code"= "factor",
                "zone.size"= "integer",
                "mic"= "integer",
                "local.authority"= "factor",
                "repeat.exclusion"= "integer",
                "nhs.region.name"= "factor",
                "yyyy"= "integer",
                "mm"= "integer",
                "dd"= "integer"))

# Shorten the variable names:
> setnames(amrdt, c("iid", "labid", "pid", "nhsno",
                  "dob", "age", "agemon", "sex", "ppostcode",
                  "sender", "amsender", "trust", "speciality", "practice", "lan",
                  "specdate", "spectype", "organism", "antib", "result", "zone", "mic",
                  "localauthority", "repid", "nhsregion", "year", "month", "day"))

Here are the results of fread. Seemingly fread has defaulted to automatic interpretation of colClasses:

str(amrdt)
# Classes ‘data.table’ and 'data.frame':    5000000 obs. of  28 variables:
#  $ iid           : chr  "1" "2" "3" "4" ...
#  $ labid         : chr  "32520" "32520" "32520" "32520" ...
#  $ pid           : chr  "A425581             " "A425581             " "A425581             " "A425581             " ...
#  $ nhsno         : chr  "4250559920" "4250559920" "4250559920" "4250559920" ...
#  $ dob           : chr  "1942-10-09" "1942-10-09" "1942-10-09" "1942-10-09" ...
#  $ age           : int  68 68 68 68 68 68 68 68 68 3 ...
#  $ agemon        : int  NA NA NA NA NA NA NA NA NA NA ...
#  $ sex           : chr  "F" "F" "F" "F" ...
#  $ ppostcode     : chr  "DH1 3QQ" "DH1 3QQ" "DH1 3QQ" "DH1 3QQ" ...
#  $ sender        : chr  "ARLNDU    " "ARLNDU    " "ARLNDU    " "ARLNDU    " ...
#  $ amsender      : chr  "RLNDU     " "RLNDU     " "RLNDU     " "RLNDU     " ...
#  $ trust         : chr  "5ND       " "5ND       " "5ND       " "5ND       " ...
#  $ speciality    : chr  "180" "180" "180" "180" ...
#  $ practice      : chr  NA NA NA NA ...
#  $ lan           : chr  "11M028966" "11M028966" "11M028966" "11M028966" ...
#  $ specdate      : chr  "2011-03-10" "2011-03-10" "2011-03-10" "2011-03-10" ...
#  $ spectype      : chr  "T0X000" "T0X000" "T0X000" "T0X000" ...
#  $ organism      : chr  "2542.0000" "2542.0000" "2542.0000" "2542.0000" ...
#  $ antib         : chr  "AUG" "CTX" "CIP" "ERY" ...
#  $ result        : chr  "S" "S" "R" "S" ...
#  $ zone          : num  0 0 0 0 0 0 0 0 0 0 ...
#  $ mic           : chr  NA NA NA NA ...
#  $ localauthority: chr  "00EJ" "00EJ" "00EJ" "00EJ" ...
#  $ repid         : int  0 0 0 0 0 0 0 0 0 0 ...
#  $ nhsregion     : chr  "North East" "North East" "North East" "North East" ...
#  $ year          : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
#  $ month         : int  3 3 3 3 3 3 3 3 3 3 ...
#  $ day           : int  10 10 10 10 10 10 10 10 10 7 ...
#  - attr(*, ".internal.selfref")=<externalptr> 

# They are definitely not factors:
is.factor(amrdt$result)
# [1] FALSE

# Here is example data from the head and tail:
head(amrdt)
#    iid labid                  pid      nhsno        dob age agemon sex ppostcode
#1:   1 32520 A425581              1234567890 1900-01-01  110     NA   F   XX1 1YY
#2:   2 32520 A425581              1234567890 1900-01-01  110     NA   F   XX1 1YY
#3:   3 32520 A425581              1234567890 1900-01-01  110     NA   F   XX1 1YY
#4:   4 32520 A425581              1234567890 1900-01-01  110     NA   F   XX1 1YY
#5:   5 32520 A425581              1234567890 1900-01-01  110     NA   F   XX1 1YY
#6:   6 32520 A425581              1234567890 1900-01-01  110     NA   F   XX1 1YY
#        sender   amsender      trust speciality practice       lan   specdate
#1: ARLNDU     RLNDU      5ND               180       NA 11M028966 2011-03-10
#2: ARLNDU     RLNDU      5ND               180       NA 11M028966 2011-03-10
#3: ARLNDU     RLNDU      5ND               180       NA 11M028966 2011-03-10
#4: ARLNDU     RLNDU      5ND               180       NA 11M028966 2011-03-10
#5: ARLNDU     RLNDU      5ND               180       NA 11M028966 2011-03-10
#6: ARLNDU     RLNDU      5ND               180       NA 11M028966 2011-03-10
#    spectype  organism antib result zone mic localauthority repid  nhsregion year
#1:   T0X000 2542.0000   AUG      S    0  NA           00EJ     0 North East 2011
#2:   T0X000 2542.0000   CTX      S    0  NA           00EJ     0 North East 2011
#3:   T0X000 2542.0000   CIP      R    0  NA           00EJ     0 North East 2011
#4:   T0X000 2542.0000   ERY      S    0  NA           00EJ     0 North East 2011
#5:   T0X000 2542.0000   OXA      S    0  NA           00EJ     0 North East 2011
#6:   T0X000 2542.0000   PEN      S    0  NA           00EJ     0 North East 2011
#    month day
#1:     3  10
#2:     3  10
#3:     3  10
#4:     3  10
#5:     3  10
#6:     3  10

tail(amrdt)
#        iid  labid                  pid nhsno        dob age agemon sex ppostcode
#1: 5369819 610740 ZM08395925               1 1950-01-01  61     NA   F        NA
#2: 5369820 610740 ZM08395925               1 1950-01-01  61     NA   F        NA
#3: 5369821 610740 ZM08395925               1 1950-01-01  61     NA   F        NA
#4: 5369822 610740 ZM08395925               1 1950-01-01  61     NA   F        NA
#5: 5369823 610740 ZM08395925               1 1950-01-01  61     NA   F        NA
#6: 5369824 610740 ZM08395925               1 1950-01-01  61     NA   F        NA
#        sender   amsender      trust speciality practice            lan
#1: GM85030    M85030     5M1               600       NA M.11.2021147.S
#2: GM85030    M85030     5M1               600       NA M.11.2021147.S
#3: GM85030    M85030     5M1               600       NA M.11.2021147.S
#4: GM85030    M85030     5M1               600       NA M.11.2021147.S
#5: GM85030    M85030     5M1               600       NA M.11.2021147.S
#6: GM85030    M85030     5M1               600       NA M.11.2021147.S
#      specdate spectype  organism antib result zone mic localauthority repid
#1: 2011-03-31   T7X100 1571.0010   MER      S    0  NA           00CN     0
#2: 2011-03-31   T7X100 1571.0010   NAL      R    0  NA           00CN     0
#3: 2011-03-31   T7X100 1571.0010 NITRO      S    0  NA           00CN     0
#4: 2011-03-31   T7X100 1571.0010   PIP      R    0  NA           00CN     0
#5: 2011-03-31   T7X100 1571.0010  TEMO      S    0  NA           00CN     0
#6: 2011-03-31   T7X100 1571.0010   TMP      R    0  NA           00CN     0
#        nhsregion year month day
#1: West Midlands 2011     3  31
#2: West Midlands 2011     3  31
#3: West Midlands 2011     3  31
#4: West Midlands 2011     3  31
#5: West Midlands 2011     3  31
#6: West Midlands 2011     3  31

I can convert everything to a factor with the following code:

dtnew <- amrdt[, lapply(.SD, as.factor), by=iid]

... but this is very slow, copies the data (which I want to avoid) and I would need to adjust the colClasses() for those few columns that are not supposed to be factors.

From reading other posts it seems this behaviour has something to do with "factor" not being a basic type column class? In any case it would be enormously useful if fread could accept factors and other non-basic types of column class (such as the date type that I created above).

As a (related) aside, a feature that would be really useful is if one could convert an ffdf object directly to a data.table. The reason I have read in a .csv file with the above code is because I couldn't figure out a way to do this (so wrote to .csv with write.csv.ffdf, which took about 30 mins to write to a 1.002 GB file and then read in the .csv with fread, which took 3 mins 12 secs). If it were possible to convert directly from ffdf to a data.table with fread, without dumping into a .csv first, that would be a significant time saving.

Many thanks for your help.

Metadata

Metadata

Labels

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions