Feature Request: finer control of `strip.white` in `fread`? Dealing with BOM #1465

I'm working with a series of files, one of which begins with the UTF-8 BOM, i.e. the three bytes `0xef` `0xbb` `0xbf`.

As noted [here](http://stackoverflow.com/questions/21624796/read-the-text-file-with-bom-in-r), the default behavior of `read.csv` is now to detect and delete the BOM. Unfortunately, for me at least, `fread` seems to have converted the three characters into a space.

Fortunately, `strip.white` removes this before returning the `data.table`; unfortunately, my file also has lots of important trailing white space, so I need to set `strip.white = FALSE`, which leaves the stray leading character in the output.

Here's a link to the file I'm working with (_caveat clickor_: it's a scary executable link, and also non-trivial size, ~80 MB. For whatever reason they decided to "zip" the file with an executable. My only word of reassurance is that you can tell it's a [US government website](http://lbstat.dpi.wi.gov/lbstat_newasr)): http://lbstat.dpi.wi.gov/sites/default/files/imce/lbstat/exe/11STAFF.exe

To see the BOM, run:

```
> r <- readBin("11STAFF.txt", raw(), file.info("11STAFF.txt")$size)
> r[1:10]
 [1] ef bb bf 30 30 30 30 36 37 31
> r[1] == as.raw(0xef)
[1] TRUE
```
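
Reading the whole ~80 MB file is not needed just to check for the BOM; the first three bytes are enough. A minimal sketch, where `has_utf8_bom` is a hypothetical helper name (not part of base R or `data.table`) and the demo uses throwaway temp files rather than the real `11STAFF.txt`:

```r
# Hypothetical helper (not from base R or data.table): TRUE if the file
# starts with the three-byte UTF-8 BOM, ef bb bf.
has_utf8_bom <- function(path) {
  head_bytes <- readBin(path, what = "raw", n = 3L)  # read at most 3 bytes
  identical(head_bytes, as.raw(c(0xef, 0xbb, 0xbf)))
}

# Demo on throwaway files: one written with a BOM, one without
with_bom <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xef, 0xbb, 0xbf)), charToRaw("0000671\n")), with_bom)
no_bom <- tempfile(fileext = ".txt")
writeBin(charToRaw("0000671\n"), no_bom)

has_utf8_bom(with_bom)  # TRUE
has_utf8_bom(no_bom)    # FALSE
```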

Here's some relevant output from `fread` with `verbose = TRUE`:

```
> fread("11STAFF.txt", sep = "^", header = FALSE, verbose = TRUE)
...
First 10 characters: ﻿0000671
```

That is, it has treated the first 3 characters as being a space. With `strip.white = TRUE`, this space disappears in the output.

I compare this to the behavior of `read.csv` (also a nuisance to use because the file is on the large side):

```
> read.csv("11STAFF.txt", sep = "^", header = FALSE, stringsAsFactors = FALSE)$V1[1]
[1] "000067182Abel                Nancy           FW19554    2011R187  70 70  45880  21809            1  00070007030020530050KGKG1616N100              Abbotsford Sch Dist           Abbotsford Elementary         61010Clark County                  04PO Box A                      Abbotsford WI  54405-0901                                   510 W Hemlock St              Abbotsford WI  54405                                        Abbotsford       WI54405-0901Abbotsford       WI54405     715-223-4281      Gary Gunderson                                    NNN                                                  "
```

That is, `read.csv` seems to have deleted the BOM _and_ kept the trailing white space. Just a shame that it's so slow.
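
In case the default BOM handling does not kick in on a given setup, base R also accepts the encoding name `"UTF-8-BOM"` for reading, which asks the connection to drop a leading BOM. A small sketch on a made-up one-line file (the contents are invented for illustration and are not taken from `11STAFF.txt`):

```r
# Write a tiny demo file: UTF-8 BOM, then one record with trailing spaces
demo <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xef, 0xbb, 0xbf)), charToRaw("0000671 Abel   \n")), demo)

# Same arguments as the read.csv call above, plus an explicit encoding
df <- read.csv(demo, sep = "^", header = FALSE, stringsAsFactors = FALSE,
               fileEncoding = "UTF-8-BOM")

# The field should start with "0000671" and keep its trailing white space
df$V1[1]
```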

For now, I've simply added a BOM-removal step to my clean-up routine alluded to [here](http://stackoverflow.com/questions/34214859/removing-nul-characters-within-r/34215105#34215105), but it seems like `fread` should match the behavior of `read.csv` here.
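
For anyone hitting the same thing, the workaround can be sketched roughly as follows. This is an illustration only, not the actual clean-up routine: `strip_utf8_bom` is a hypothetical helper name, and it assumes the file fits comfortably in memory as a raw vector.

```r
# Hypothetical clean-up helper: copy `path` to `out`, dropping a leading
# UTF-8 BOM (ef bb bf) if present. All other bytes are written unchanged,
# so trailing white space is preserved.
strip_utf8_bom <- function(path, out = tempfile(fileext = ".txt")) {
  bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  bom <- as.raw(c(0xef, 0xbb, 0xbf))
  if (length(bytes) >= 3L && identical(bytes[1:3], bom)) {
    bytes <- bytes[-(1:3)]
  }
  writeBin(bytes, out)
  out  # return the path of the cleaned copy
}

# Demo on a made-up one-line file (not the real 11STAFF.txt)
src <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xef, 0xbb, 0xbf)), charToRaw("0000671 Abel   \n")), src)
clean <- strip_utf8_bom(src)
first_line <- readLines(clean, n = 1L)
```

The cleaned path can then be handed to `fread` with the arguments used above, e.g. `fread(strip_utf8_bom("11STAFF.txt"), sep = "^", header = FALSE, strip.white = FALSE)`, so the trailing white space survives and no BOM reaches the parser.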

