Support UTF-16 encoded files in fread #2560

@danielsjf

I am trying to read a file in which every other byte is '00' (NUL). The file also starts with two bytes that seem to specify only the encoding. I haven't been able to read this file with fread. See the full example (with reprex) in this Stack Overflow post:
https://stackoverflow.com/questions/48169100/reading-a-tsv-with-specific-encoding-initial-two-bytes-and-utf-8-afterwards-an

When I googled the issue I found similar problems:
https://q-a-assistant.info/computer-internet-technology/r-data-table-error-in-fread-embedded-nul-in-string-0-0-0-000/264557
https://stackoverflow.com/questions/31701365/error-with-fread-in-r-embedded-nul-in-string-0
https://stackoverflow.com/questions/22643372/embedded-nul-in-string-error-when-importing-csv-with-fread?lq=1

But the workarounds suggested there are not sufficient:

  • The first one is just to open the file in Excel and save it again. Since the data I use is downloaded automatically from another source (without user interaction), I don't have this option. Even if I wanted to do it that way, the file is longer than Excel's limit of roughly 1M rows, so it would not work for the entire file.
  • The subsequent ones use a Linux command, and when I try to use it inside the fread call, it doesn't work (the person from the first post has the same issue).

Would it be possible to skip the NUL values, just as the base functions do? See readLines (skipNul) or read.table (skipNul).
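
For illustration, here is a minimal sketch of a possible base R workaround, assuming the file is UTF-16LE (which the leading FF FE bytes and the alternating 00 bytes suggest). It decodes the file through a base connection and passes the decoded lines to fread. The sample data, column names and variable names are made up for this example, and it requires data.table to be installed. It is a workaround, not how fread itself handles the encoding.

```r
# Build a tiny UTF-16LE sample file: BOM (FF FE) followed by tab-separated rows.
# The content is hypothetical and only stands in for the real file.
sample_text <- "year\tvalue\n2017\t1\n2018\t2\n"
path <- tempfile(fileext = ".txt")
bytes <- c(
  as.raw(c(0xff, 0xfe)),
  iconv(sample_text, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
)
writeBin(bytes, path)

# Let a base R connection decode UTF-16LE, so no NUL bytes reach fread.
con <- file(path, open = "r", encoding = "UTF-16LE")
lines <- readLines(con)
close(con)

# Hand the decoded text to fread as a single newline-separated string.
dt <- data.table::fread(paste(lines, collapse = "\n"), sep = "\t")
print(dt)
```

This reads the whole file into memory as character lines first, so it gives up much of fread's speed and is probably only practical when the file fits comfortably in RAM.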

This is how the file shows up in a hex editor:
(screenshot of the hex editor view)

First 100 bytes of the file: test_file.txt
It's actually a TSV file, but GitHub doesn't allow attaching that format.

# Reprex

file <- 'test_file.txt'

# fread from data.table is not able to read the file
tmp <- data.table::fread(file, nrows = 2)
#> Error in data.table::fread(file, nrows = 2): embedded nul in string: 'ÿþy\0e\0a\0r\0'

# It also doesn't work with sed, potentially since I'm on Windows
tmp <- data.table::fread(paste0("sed 's/\\0//g' '", file, "'"), nrows = 2)
#> Error in data.table::fread(paste0("sed 's/\\0//g' '", file, "'"), nrows = 2): embedded nul in string: 'ÿþy\0e\0a\0r\0'

# Output of sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_3.4.1 magrittr_1.5 tools_3.4.1 yaml_2.1.15 stringi_1.1.6 data.table_1.10.4-3
[7] stringr_1.2.0
