prevent the utf8 string from being collected by the garbage collector in forder() by shrektan · Pull Request #2675 · Rdatatable/data.table

shrektan · 2018-03-15T18:17:02Z

If somehow the garbage collector was triggered during sorting (like there're millions of non-ASCII characters), it leads to the collapse of data.table (see #2674 for details) because it assumes there're converted UTF-8 chars in the global string pool. This PR tries to fix this issue.

In addition, it brings performance enhancement in the case of millions non-ASCII chars, because now it only needs to be converted to UTF-8 once. Before this PR, the strings need to be converted twice in csort_pre() and csort() respectively, which may be a big issue for a large character vector (for example, on my computer, enc2utf8() takes about 20s for a 1e7 length Chinese character).

This PR is unfinished because it needs:

a new test
entries in NEWS.md

I will add them later if I can be sure that there's no need for further modification on

https://github.com/Rdatatable/data.table/blob/master/src/forder.c#L1227

or any other places.

… in forder()

codecov-io · 2018-03-15T18:39:11Z

Codecov Report

Merging #2675 into master will increase coverage by 0.56%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2675      +/-   ##
==========================================
+ Coverage    93.2%   93.76%   +0.56%     
==========================================
  Files          61       61              
  Lines       12169    13594    +1425     
==========================================
+ Hits        11342    12747    +1405     
- Misses        827      847      +20

Impacted Files	Coverage Δ
src/forder.c	`98.46% <100%> (+0.57%)`	⬆️
src/rbindlist.c	`87.43% <0%> (-1.5%)`	⬇️
src/freadR.c	`88.85% <0%> (-0.72%)`	⬇️
R/timetaken.R	`100% <0%> (ø)`	⬆️
R/test.data.table.R	`100% <0%> (ø)`	⬆️
R/setkey.R	`93.9% <0%> (+0.04%)`	⬆️
src/bmerge.c	`97.8% <0%> (+0.09%)`	⬆️
R/duplicated.R	`89.36% <0%> (+0.11%)`	⬆️
src/fread.c	`98.26% <0%> (+0.29%)`	⬆️
src/fwriteR.c	`95.23% <0%> (+0.32%)`	⬆️
... and 11 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 620e707...6e855c7. Read the comment docs.

mattdowle · 2018-03-15T19:56:10Z

Oh. Wow. Really great find. Did it take you long to find it?!

The question now is what about R itself: https://github.com/wch/r-source/blob/trunk/src/main/radixsort.c#L1133. It doesn't use ENC2UTF8 there, so my first guess is that R itself is ok. Our code was ported to R before ENC2UTF8 was added to data.table in dev, I guess.

Looking at the source tar.gz of data.table 1.10.4-3 on CRAN, I do see ENC2UTF8 but only used in StrCmp and StrCmp2, and its result is not saved. It's only used in the way you've pointed out, in data.table dev, IIUC, on first glance.

So is this a data.table-1.10.5 only problem do you think? Have you seen it in R itself without data.table loaded, or in data.table_1.10.4-3 or earlier?

mattdowle · 2018-03-15T20:20:48Z

I was going to add the test and NEWS item, as you've done the hard work already. But it's a branch in your repo so I don't think I can, easily at least. You're a project member and trusted, so please create branches directly in the main project so we can work on the same branch together.

Is there any slow down when all the strings are ASCII? The added allocVector seems to be populated even when all strings are ASCII.

shrektan · 2018-03-16T01:53:04Z

I forgot the fact that I have the right to open a new branch here. I've created a new PR #2678 based on a branch in the main project so that you can edit it easily. For further comments please see #2678.

MichaelChirico · 2018-03-23T02:01:20Z

just throwing it out there -- I've had a problem I can't reproduce where some Chinese characters are considered distinct by by=, but recognized as identical by keyby=. this issue looks like it might solve that (but since I can't figure out how to reproduce reliably, it'll be hard to tell)

…

On Fri, Mar 23, 2018, 9:49 AM Xianying Tan ***@***.***> wrote: Closed #2675 <#2675>. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#2675 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AHQQdUJayFtWrq8OSuTrREY8FwVYzTj7ks5thFSxgaJpZM4Ssph6> .

prevent the utf8 string from being collected by the garbage collector…

6e855c7

… in forder()

mattdowle added this to the v1.10.6 milestone Mar 15, 2018

shrektan mentioned this pull request Mar 16, 2018

prevent the utf8 string from being collected by the garbage collector in forder() #2678

Merged

4 tasks

shrektan closed this Mar 16, 2018

shrektan reopened this Mar 23, 2018

shrektan closed this Mar 23, 2018

shrektan added the encoding issues related to Encoding label Sep 9, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

prevent the utf8 string from being collected by the garbage collector in forder()#2675

prevent the utf8 string from being collected by the garbage collector in forder()#2675
shrektan wants to merge 1 commit intoRdatatable:masterfrom
shrektan:fix#2674

shrektan commented Mar 15, 2018

Uh oh!

codecov-io commented Mar 15, 2018 •

edited

Loading

Uh oh!

mattdowle commented Mar 15, 2018 •

edited

Loading

Uh oh!

mattdowle commented Mar 15, 2018 •

edited

Loading

Uh oh!

shrektan commented Mar 16, 2018

Uh oh!

MichaelChirico commented Mar 23, 2018 via email

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

shrektan commented Mar 15, 2018

Uh oh!

codecov-io commented Mar 15, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

mattdowle commented Mar 15, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mattdowle commented Mar 15, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

shrektan commented Mar 16, 2018

Uh oh!

MichaelChirico commented Mar 23, 2018 via email

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov-io commented Mar 15, 2018 •

edited

Loading

mattdowle commented Mar 15, 2018 •

edited

Loading

mattdowle commented Mar 15, 2018 •

edited

Loading