Skip to content

prevent the utf8 string from being collected by the garbage collector in forder()#2675

Closed
shrektan wants to merge 1 commit intoRdatatable:masterfrom
shrektan:fix#2674
Closed

prevent the utf8 string from being collected by the garbage collector in forder()#2675
shrektan wants to merge 1 commit intoRdatatable:masterfrom
shrektan:fix#2674

Conversation

@shrektan
Copy link
Copy Markdown
Member

closes #2674

If somehow the garbage collector was triggered during sorting (like there're millions of non-ASCII characters), it leads to the collapse of data.table (see #2674 for details) because it assumes there're converted UTF-8 chars in the global string pool. This PR tries to fix this issue.

In addition, it brings performance enhancement in the case of millions non-ASCII chars, because now it only needs to be converted to UTF-8 once. Before this PR, the strings need to be converted twice in csort_pre() and csort() respectively, which may be a big issue for a large character vector (for example, on my computer, enc2utf8() takes about 20s for a 1e7 length Chinese character).

This PR is unfinished because it needs:

  • a new test
  • entries in NEWS.md

I will add them later if I can be sure that there's no need for further modification on

https://github.com/Rdatatable/data.table/blob/master/src/forder.c#L1227

or any other places.

@codecov-io
Copy link
Copy Markdown

codecov-io commented Mar 15, 2018

Codecov Report

Merging #2675 into master will increase coverage by 0.56%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2675      +/-   ##
==========================================
+ Coverage    93.2%   93.76%   +0.56%     
==========================================
  Files          61       61              
  Lines       12169    13594    +1425     
==========================================
+ Hits        11342    12747    +1405     
- Misses        827      847      +20
Impacted Files Coverage Δ
src/forder.c 98.46% <100%> (+0.57%) ⬆️
src/rbindlist.c 87.43% <0%> (-1.5%) ⬇️
src/freadR.c 88.85% <0%> (-0.72%) ⬇️
R/timetaken.R 100% <0%> (ø) ⬆️
R/test.data.table.R 100% <0%> (ø) ⬆️
R/setkey.R 93.9% <0%> (+0.04%) ⬆️
src/bmerge.c 97.8% <0%> (+0.09%) ⬆️
R/duplicated.R 89.36% <0%> (+0.11%) ⬆️
src/fread.c 98.26% <0%> (+0.29%) ⬆️
src/fwriteR.c 95.23% <0%> (+0.32%) ⬆️
... and 11 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 620e707...6e855c7. Read the comment docs.

@mattdowle
Copy link
Copy Markdown
Member

mattdowle commented Mar 15, 2018

Oh. Wow. Really great find. Did it take you long to find it?!

The question now is what about R itself: https://github.com/wch/r-source/blob/trunk/src/main/radixsort.c#L1133. It doesn't use ENC2UTF8 there, so my first guess is that R itself is ok. Our code was ported to R before ENC2UTF8 was added to data.table in dev, I guess.

Looking at the source tar.gz of data.table 1.10.4-3 on CRAN, I do see ENC2UTF8 but only used in StrCmp and StrCmp2, and its result is not saved. It's only used in the way you've pointed out, in data.table dev, IIUC, on first glance.

So is this a data.table-1.10.5 only problem do you think? Have you seen it in R itself without data.table loaded, or in data.table_1.10.4-3 or earlier?

@mattdowle
Copy link
Copy Markdown
Member

mattdowle commented Mar 15, 2018

I was going to add the test and NEWS item, as you've done the hard work already. But it's a branch in your repo so I don't think I can, easily at least. You're a project member and trusted, so please create branches directly in the main project so we can work on the same branch together.

Is there any slow down when all the strings are ASCII? The added allocVector seems to be populated even when all strings are ASCII.

@shrektan
Copy link
Copy Markdown
Member Author

I forgot the fact that I have the right to open a new branch here. I've created a new PR #2678 based on a branch in the main project so that you can edit it easily. For further comments please see #2678.

@shrektan shrektan closed this Mar 16, 2018
@shrektan shrektan reopened this Mar 23, 2018
@shrektan shrektan closed this Mar 23, 2018
@MichaelChirico
Copy link
Copy Markdown
Member

MichaelChirico commented Mar 23, 2018 via email

@shrektan shrektan added the encoding issues related to Encoding label Sep 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

encoding issues related to Encoding

Projects

None yet

Development

Successfully merging this pull request may close these issues.

data.table breaks when there're millions of Chinese characters

4 participants