The message from fread() should mark the text as the declared encoding #4747

@shrektan

On Windows, when the text is UTF-8 encoded and a message printed by fread() includes part of that text, the message is displayed as garbled characters. I believe the cause is that we don't mark the text with the declared encoding, "UTF-8".
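
The suspected mechanism can be sketched in base R. This is only an illustration of how encoding marks work, under the assumption that the text quoted in the message reaches R without a declared encoding; it does not reproduce fread()'s internals, and the variable names are made up for the example.

```r
# Sketch only: illustrates encoding marks in base R, not fread()'s internals.
# Build the three-character footer text from escapes so the script stays ASCII.
s <- enc2utf8(paste0("\u4e2d\u6587", "3"))
Encoding(s)                          # "UTF-8"

# The underlying UTF-8 bytes.
bytes <- charToRaw(s)
bytes                                # e4 b8 ad e6 96 87 33

# rawToChar() returns a string with no declared encoding, so R treats the
# bytes as being in the native encoding when printing.
unmarked <- rawToChar(bytes)
Encoding(unmarked)                   # "unknown"

# Declaring the encoding changes only the mark, not the bytes.
marked <- unmarked
Encoding(marked) <- "UTF-8"
Encoding(marked)                     # "UTF-8"
identical(charToRaw(marked), bytes)  # TRUE
```

In a session whose native encoding is not UTF-8 (such as the code page 936 locale shown in the session info), printing `unmarked` would be expected to show garbled characters, while `marked` should display correctly.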

A reproducible example on Windows

Code

txt <- "A,B\n中文1,中文2\n中文3"
txt <- enc2utf8(txt)
data.table::fread(text = txt, encoding = 'UTF-8')

Output

       A     B
1: 中文1 中文2
Warning message:
In data.table::fread(text = txt, encoding = "UTF-8") :
  Discarded single-line footer: <<涓枃3>>

In contrast, natively encoded text is displayed correctly:

Code

txt <- "A,B\n中文1,中文2\n中文3"
data.table::fread(text = txt)

Output

       A     B
1: 中文1 中文2
Warning message:
In data.table::fread(text = txt) : Discarded single-line footer: <<中文3>>

Session info

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936  LC_CTYPE=Chinese (Simplified)_China.936   
[3] LC_MONETARY=Chinese (Simplified)_China.936 LC_NUMERIC=C                              
[5] LC_TIME=Chinese (Simplified)_China.936    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] magrittr_1.5      data.table_1.13.0

Another example on Mac

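# A string containing a non-ASCII character (\xE7 is "ç" in latin1)
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
txt <- sprintf("A,B\n%s,%s\n%s", x, x, x)
# Declare the combined text as UTF-8 and read it with the matching encoding
Encoding(txt) <- "UTF-8"

data.table::fread(text = txt, encoding = 'UTF-8')

# Convert the text to latin1 and read it as Latin-1
txt2 <- iconv(txt, "UTF-8", "latin1")
data.table::fread(text = txt2, encoding = 'Latin-1')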
