data.table breaks when there're millions of Chinese characters

related to #2566 

I found a serious issue: when a column contains millions of non-ASCII strings in a non-UTF-8 encoding (see the example below), calling `setkey()` on that column makes `data.table` extremely slow and eventually throws the error `'translateCharUTF8' must be called on a CHARSXP`.

**After that error, every subsequent `data.table` function call fails with another error: `Internal error: savetl_init checks failed`.**

 I will investigate and report more details later. Hopefully, I can file a PR to fix this.

![image](https://user-images.githubusercontent.com/8368933/37463643-de14d83c-2890-11e8-8260-0a85d973812c.png)


# Example

NOTE: you have to run this on __a Windows machine with GB2312 as the default encoding (i.e., a Simplified Chinese Windows machine)__; otherwise it won't reproduce. Also, if it doesn't fail the first time, run it a second time. I've tried this on several machines in my office, and I'm quite confident it's reproducible.

```r
library(data.table)
dt <- data.table(
  x = sample(
    c("公允价值变动损益", "红利收入", "价差收入", "其他业务支出", "资产减值损失"),
    1e7,
    TRUE
  ),
  z = 1
)
setkey(dt, x)
```
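Since the example depends on the session's native encoding, a quick check like the following may help confirm the precondition before running it. This is only a sketch using base R; the variable names are illustrative, and the values it prints depend on the machine's locale.

```r
# Report the native encoding of the current session.
# On the affected machines the "UTF-8 locale" line should read FALSE.
info <- l10n_info()
cat("UTF-8 locale:", isTRUE(info[["UTF-8"]]), "\n")
cat("LC_CTYPE:", Sys.getlocale("LC_CTYPE"), "\n")

# Inspect the declared encoding of two of the sample strings.
# "unknown" means the string carries no encoding mark and is treated
# as being in the native encoding.
x <- c("公允价值变动损益", "红利收入")
enc_before <- Encoding(x)            # marks as parsed in this session
enc_after <- Encoding(enc2utf8(x))   # marks after conversion to UTF-8
print(enc_before)
print(enc_after)
```

If the session already uses UTF-8 as its native encoding, the strings probably need no conversion, which would be consistent with the example not failing on other machines.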

# session info

```
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C                                                   
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
 [1] digest_0.6.15         crayon_1.3.4          withr_2.1.1           rprojroot_1.3-2       assertthat_0.2.0     
 [6] R6_2.2.2              backports_1.1.2       magrittr_1.5          rlang_0.2.0           cli_1.0.0            
[11] rstudioapi_0.7.0-9000 testthat_1.0.2.9000   devtools_1.13.3.9000  desc_1.1.1            tools_3.4.3          
[16] pkgload_0.0.0.9000    yaml_2.1.16           compiler_3.4.3        pkgbuild_0.0.0.9000   memoise_1.1.0        
[21] usethis_1.3.0   
```

# UPDATES

- I'm quite confident that `ENC2UTF8` is very slow for millions of strings, but I'm still not sure whether it causes the issue. Moreover, I can't see why it would be slow, because R itself seems to implement `enc2utf8` [in a similar way](https://github.com/wch/r-source/blob/44d54d6f848468a7353d99cc9be0255105185975/src/main/util.c#L1837). **EDIT:** `enc2utf8()` also takes a long time (17s) to convert 1e7 strings, so this is not the issue.
- I suspect it's related to `savetl()` and `savetl_end()`. I'm not familiar with how R's global string pool works, but I suspect that the UTF-8 strings created by `data.table` get released when `gc()` runs (which would explain why it only occurs when the number of strings is large). If a string gets released and `savetl_end()` then tries to modify the `truelength` of a string that no longer exists...
- It appears to be related to GC and unprotected `SEXP`s; a `PROTECT` may be needed ... Basically confirmed...
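The two checks discussed in the updates above can be sketched in R as follows. The first part times `enc2utf8()` on its own, without `data.table`. The second part runs `setkey()` with `gctorture(TRUE)`, which forces a garbage collection at every allocation, so a missing protection might show up on a much smaller input. The helper `setkey_under_gctorture()` and the sizes are assumptions made for illustration, and there is no guarantee that this reproduces the failure.

```r
# Sample strings taken from the example above.
words <- c("公允价值变动损益", "红利收入", "价差收入", "其他业务支出", "资产减值损失")

# 1. Time the encoding conversion alone, without data.table involved.
x <- sample(words, 1e6, TRUE)  # raise to 1e7 to match the report
elapsed <- system.time(y <- enc2utf8(x))[["elapsed"]]
cat("enc2utf8() on", length(x), "strings:", elapsed, "seconds\n")

# 2. Run setkey() while R collects garbage at every allocation.
# Hypothetical helper: returns the error message, "no error",
# or NULL when data.table is not installed.
setkey_under_gctorture <- function(n = 1e3) {
  if (!requireNamespace("data.table", quietly = TRUE)) return(NULL)
  dt <- data.table::data.table(x = sample(words, n, TRUE), z = 1)
  old <- gctorture(TRUE)               # returns the previous setting
  on.exit(gctorture(old), add = TRUE)  # always restore it
  tryCatch({
    data.table::setkey(dt, x)
    "no error"
  }, error = function(e) conditionMessage(e))
}
print(setkey_under_gctorture())
```

Running under `gctorture` is very slow, which is why the helper defaults to a small `n`. As with the original example, this is only meaningful on a machine whose native encoding is not UTF-8.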

