I'm working with a series of files, one of which begins with the UTF-8 BOM (the bytes 0xEF 0xBB 0xBF).
As noted here, the default behavior of read.csv is now to detect and delete the BOM. Unfortunately, for me at least, fread seems to have converted the three bytes into a space.
Fortunately, strip.white removes this space before returning the data.table; unfortunately, my file also has lots of important trailing white space, so I need to set strip.white = FALSE, which leaves the stray leading space in the data.
Here's a link to the file I'm working with (caveat clickor: it's a scary executable link, and also non-trivial size, ~80 MB. For whatever reason they decided to "zip" the file with an executable. My only word of reassurance is that you can tell it's a US government website): http://lbstat.dpi.wi.gov/sites/default/files/imce/lbstat/exe/11STAFF.exe
To see the BOM, run:
> r <- readBin("11STAFF.txt", raw(), file.info("11STAFF.txt")$size)
> r[1:10]
[1] ef bb bf 30 30 30 30 36 37 31
> r[1] == as.raw(0xef)
[1] TRUE
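The check above reads the whole ~80 MB file into memory just to look at the first few bytes. A lighter sketch reads only the first three bytes and compares them with the BOM; has_utf8_bom is a hypothetical helper name, not part of base R or data.table:

```r
# Hypothetical helper: TRUE if the file starts with the UTF-8 BOM (EF BB BF).
has_utf8_bom <- function(path) {
  # Read at most 3 bytes; a shorter file simply returns fewer bytes.
  first_bytes <- readBin(path, what = "raw", n = 3L)
  identical(first_bytes, as.raw(c(0xef, 0xbb, 0xbf)))
}
```

Calling has_utf8_bom("11STAFF.txt") should then return TRUE for the file discussed here.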
Here's some relevant output from fread with verbose = TRUE:
> fread("11STAFF.txt", sep = "^", header = FALSE, verbose = TRUE)
...
First 10 characters: 0000671
That is, it has treated the first 3 bytes (the BOM) as a single space. With strip.white = TRUE, this space disappears in the output.
I compare this to the behavior of read.csv (also a nuisance to use because the file is on the large side):
> read.csv("11STAFF.txt", sep = "^", header = FALSE, stringsAsFactors = FALSE)$V1[1]
[1] "000067182Abel Nancy FW19554 2011R187 70 70 45880 21809 1 00070007030020530050KGKG1616N100 Abbotsford Sch Dist Abbotsford Elementary 61010Clark County 04PO Box A Abbotsford WI 54405-0901 510 W Hemlock St Abbotsford WI 54405 Abbotsford WI54405-0901Abbotsford WI54405 715-223-4281 Gary Gunderson NNN "
That is, read.csv seems to have deleted the BOM and kept the trailing white space. Just a shame that it's so slow.
For now, I've simply added deleting the BOM to my clean-up routine alluded to here, but it seems like fread should match the behavior of read.csv here.
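As a rough sketch of that kind of clean-up step (not necessarily what my routine does, and strip_utf8_bom and the output file name are hypothetical), one option is to write a BOM-free copy of the file at the byte level, so that trailing white space is untouched, and point fread at the copy:

```r
# Hypothetical helper: copy path_in to path_out, dropping a leading UTF-8 BOM
# if one is present. Works on raw bytes, so white space is left untouched.
strip_utf8_bom <- function(path_in, path_out) {
  bytes <- readBin(path_in, what = "raw", n = file.info(path_in)$size)
  bom <- as.raw(c(0xef, 0xbb, 0xbf))
  if (length(bytes) >= 3L && identical(bytes[1:3], bom)) {
    bytes <- bytes[-(1:3)]
  }
  writeBin(bytes, path_out)
  invisible(path_out)
}

# Assumed usage (requires data.table; the output file name is made up):
# strip_utf8_bom("11STAFF.txt", "11STAFF_nobom.txt")
# data.table::fread("11STAFF_nobom.txt", sep = "^", header = FALSE,
#                   strip.white = FALSE)
```

This reads the whole file into memory once, which is tolerable at ~80 MB but would need a chunked copy for much larger files.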