I'm using the fread function of data.table to read in a large csv file (5 million records of 28 variables) as efficiently as possible on a laptop with just 4GB RAM. Many of my variables look numeric but they are actually factors (various id numbers). In order to avoid incorrect interpretation of these variables when reading into R and also because I have read this further improves the speed of the import, I individually specified colClasses() for each variable in the call to fread. However my assignments are ignored, see below:
# Function to convert dates of a specific format to date:
setAs("character","myDate", function(from) as.Date(from, format="%d/%m/%Y") )
setClass('myDate')
# Read in the data with column classes and NAs defined:
amrdt = fread("amrdata.csv", na.strings=c("NA", "", " "),
colClasses=c(
"isolate.key"="factor",
"laboratory.id"="factor",
"patient.id"= "factor",
"patient.nhs.no"= "factor",
"patient.dob"= "myDate",
"patient.age"= "integer",
"patient.age.in.months"= "integer",
"patient.sex"= "factor",
"patient.postcode"= "factor",
"sender.code"= "factor",
"ams.sender.code"= "factor",
"care.trust"= "factor",
"speciality"= "factor",
"practice.code"= "factor",
"lan"= "factor",
"specimen.date"= "myDate",
"specimen.site.id"= "factor",
"organism.id"= "factor",
"antibiotic.id"= "factor",
"result.code"= "factor",
"zone.size"= "integer",
"mic"= "integer",
"local.authority"= "factor",
"repeat.exclusion"= "integer",
"nhs.region.name"= "factor",
"yyyy"= "integer",
"mm"= "integer",
"dd"= "integer"))
# Shorten the variable names:
> setnames(amrdt, c("iid", "labid", "pid", "nhsno",
"dob", "age", "agemon", "sex", "ppostcode",
"sender", "amsender", "trust", "speciality", "practice", "lan",
"specdate", "spectype", "organism", "antib", "result", "zone", "mic",
"localauthority", "repid", "nhsregion", "year", "month", "day"))
Here are the results of fread. Seemingly fread has defaulted to automatic interpretation of colClasses:
str(amrdt)
# Classes ‘data.table’ and 'data.frame': 5000000 obs. of 28 variables:
# $ iid : chr "1" "2" "3" "4" ...
# $ labid : chr "32520" "32520" "32520" "32520" ...
# $ pid : chr "A425581 " "A425581 " "A425581 " "A425581 " ...
# $ nhsno : chr "4250559920" "4250559920" "4250559920" "4250559920" ...
# $ dob : chr "1942-10-09" "1942-10-09" "1942-10-09" "1942-10-09" ...
# $ age : int 68 68 68 68 68 68 68 68 68 3 ...
# $ agemon : int NA NA NA NA NA NA NA NA NA NA ...
# $ sex : chr "F" "F" "F" "F" ...
# $ ppostcode : chr "DH1 3QQ" "DH1 3QQ" "DH1 3QQ" "DH1 3QQ" ...
# $ sender : chr "ARLNDU " "ARLNDU " "ARLNDU " "ARLNDU " ...
# $ amsender : chr "RLNDU " "RLNDU " "RLNDU " "RLNDU " ...
# $ trust : chr "5ND " "5ND " "5ND " "5ND " ...
# $ speciality : chr "180" "180" "180" "180" ...
# $ practice : chr NA NA NA NA ...
# $ lan : chr "11M028966" "11M028966" "11M028966" "11M028966" ...
# $ specdate : chr "2011-03-10" "2011-03-10" "2011-03-10" "2011-03-10" ...
# $ spectype : chr "T0X000" "T0X000" "T0X000" "T0X000" ...
# $ organism : chr "2542.0000" "2542.0000" "2542.0000" "2542.0000" ...
# $ antib : chr "AUG" "CTX" "CIP" "ERY" ...
# $ result : chr "S" "S" "R" "S" ...
# $ zone : num 0 0 0 0 0 0 0 0 0 0 ...
# $ mic : chr NA NA NA NA ...
# $ localauthority: chr "00EJ" "00EJ" "00EJ" "00EJ" ...
# $ repid : int 0 0 0 0 0 0 0 0 0 0 ...
# $ nhsregion : chr "North East" "North East" "North East" "North East" ...
# $ year : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
# $ month : int 3 3 3 3 3 3 3 3 3 3 ...
# $ day : int 10 10 10 10 10 10 10 10 10 7 ...
# - attr(*, ".internal.selfref")=<externalptr>
# They are definitely not factors:
is.factor(amrdt$result)
# [1] FALSE
# Here is example data from the head and tail:
head(amrdt)
# iid labid pid nhsno dob age agemon sex ppostcode
#1: 1 32520 A425581 1234567890 1900-01-01 110 NA F XX1 1YY
#2: 2 32520 A425581 1234567890 1900-01-01 110 NA F XX1 1YY
#3: 3 32520 A425581 1234567890 1900-01-01 110 NA F XX1 1YY
#4: 4 32520 A425581 1234567890 1900-01-01 110 NA F XX1 1YY
#5: 5 32520 A425581 1234567890 1900-01-01 110 NA F XX1 1YY
#6: 6 32520 A425581 1234567890 1900-01-01 110 NA F XX1 1YY
# sender amsender trust speciality practice lan specdate
#1: ARLNDU RLNDU 5ND 180 NA 11M028966 2011-03-10
#2: ARLNDU RLNDU 5ND 180 NA 11M028966 2011-03-10
#3: ARLNDU RLNDU 5ND 180 NA 11M028966 2011-03-10
#4: ARLNDU RLNDU 5ND 180 NA 11M028966 2011-03-10
#5: ARLNDU RLNDU 5ND 180 NA 11M028966 2011-03-10
#6: ARLNDU RLNDU 5ND 180 NA 11M028966 2011-03-10
# spectype organism antib result zone mic localauthority repid nhsregion year
#1: T0X000 2542.0000 AUG S 0 NA 00EJ 0 North East 2011
#2: T0X000 2542.0000 CTX S 0 NA 00EJ 0 North East 2011
#3: T0X000 2542.0000 CIP R 0 NA 00EJ 0 North East 2011
#4: T0X000 2542.0000 ERY S 0 NA 00EJ 0 North East 2011
#5: T0X000 2542.0000 OXA S 0 NA 00EJ 0 North East 2011
#6: T0X000 2542.0000 PEN S 0 NA 00EJ 0 North East 2011
# month day
#1: 3 10
#2: 3 10
#3: 3 10
#4: 3 10
#5: 3 10
#6: 3 10
tail(amrdt)
# iid labid pid nhsno dob age agemon sex ppostcode
#1: 5369819 610740 ZM08395925 1 1950-01-01 61 NA F NA
#2: 5369820 610740 ZM08395925 1 1950-01-01 61 NA F NA
#3: 5369821 610740 ZM08395925 1 1950-01-01 61 NA F NA
#4: 5369822 610740 ZM08395925 1 1950-01-01 61 NA F NA
#5: 5369823 610740 ZM08395925 1 1950-01-01 61 NA F NA
#6: 5369824 610740 ZM08395925 1 1950-01-01 61 NA F NA
# sender amsender trust speciality practice lan
#1: GM85030 M85030 5M1 600 NA M.11.2021147.S
#2: GM85030 M85030 5M1 600 NA M.11.2021147.S
#3: GM85030 M85030 5M1 600 NA M.11.2021147.S
#4: GM85030 M85030 5M1 600 NA M.11.2021147.S
#5: GM85030 M85030 5M1 600 NA M.11.2021147.S
#6: GM85030 M85030 5M1 600 NA M.11.2021147.S
# specdate spectype organism antib result zone mic localauthority repid
#1: 2011-03-31 T7X100 1571.0010 MER S 0 NA 00CN 0
#2: 2011-03-31 T7X100 1571.0010 NAL R 0 NA 00CN 0
#3: 2011-03-31 T7X100 1571.0010 NITRO S 0 NA 00CN 0
#4: 2011-03-31 T7X100 1571.0010 PIP R 0 NA 00CN 0
#5: 2011-03-31 T7X100 1571.0010 TEMO S 0 NA 00CN 0
#6: 2011-03-31 T7X100 1571.0010 TMP R 0 NA 00CN 0
# nhsregion year month day
#1: West Midlands 2011 3 31
#2: West Midlands 2011 3 31
#3: West Midlands 2011 3 31
#4: West Midlands 2011 3 31
#5: West Midlands 2011 3 31
#6: West Midlands 2011 3 31
I can convert everything to a factor with the following code:
dtnew <- amrdt[, lapply(.SD, as.factor), by=iid]
... but this is very slow, copies the data (which I want to avoid) and I would need to adjust the colClasses() for those few columns that are not supposed to be factors.
From reading other posts it seems this behaviour has something to do with "factor" not being a basic type column class? In any case it would be enormously useful if fread could accept factors and other non-basic types of column class (such as the date type that I created above).
As a (related) aside, a feature that would be really useful is if one could convert an ffdf object directly to a data.table. The reason I have read in a .csv file with the above code is because I couldn't figure out a way to do this (so wrote to .csv with write.csv.ffdf, which took about 30 mins to write to a 1.002 GB file and then read in the .csv with fread, which took 3 mins 12 secs). If it were possible to convert directly from ffdf to a data.table with fread, without dumping into a .csv first, that would be a significant time saving.
Many thanks for your help.
I'm using the fread function of
data.tableto read in a large csv file (5 million records of 28 variables) as efficiently as possible on a laptop with just 4GB RAM. Many of my variables looknumericbut they are actuallyfactors(various id numbers). In order to avoid incorrect interpretation of these variables when reading into R and also because I have read this further improves the speed of the import, I individually specifiedcolClasses()for each variable in the call tofread. However my assignments are ignored, see below:Here are the results of
fread. Seeminglyfreadhas defaulted to automatic interpretation of colClasses:I can convert everything to a factor with the following code:
... but this is very slow, copies the data (which I want to avoid) and I would need to adjust the
colClasses()for those few columns that are not supposed to be factors.From reading other posts it seems this behaviour has something to do with "factor" not being a basic type column class? In any case it would be enormously useful if fread could accept factors and other non-basic types of column class (such as the date type that I created above).
As a (related) aside, a feature that would be really useful is if one could convert an ffdf object directly to a
data.table. The reason I have read in a.csvfile with the above code is because I couldn't figure out a way to do this (so wrote to.csvwithwrite.csv.ffdf, which took about 30 mins to write to a 1.002 GB file and then read in the.csvwithfread, which took 3 mins 12 secs). If it were possible to convert directly from ffdf to a data.table with fread, without dumping into a.csvfirst, that would be a significant time saving.Many thanks for your help.