prevent the utf8 string from being collected by the garbage collector in forder() #2675
shrektan wants to merge 1 commit into Rdatatable:master
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2675 +/- ##
==========================================
+ Coverage 93.2% 93.76% +0.56%
==========================================
Files 61 61
Lines 12169 13594 +1425
==========================================
+ Hits 11342 12747 +1405
- Misses 827 847 +20
Continue to review full report at Codecov.
Oh. Wow. Really great find. Did it take you long to find it?! The question now is what about R itself: https://github.com/wch/r-source/blob/trunk/src/main/radixsort.c#L1133. It doesn't use ENC2UTF8 there, so my first guess is that R itself is ok. Our code was ported to R before ENC2UTF8 was added to data.table in dev, I guess. Looking at the source tar.gz of data.table 1.10.4-3 on CRAN, I do see ENC2UTF8, but it is only used in StrCmp and StrCmp2, and its result is not saved. It's only used in the way you've pointed out in data.table dev, IIUC, on first glance. So is this a data.table-1.10.5-only problem, do you think? Have you seen it in R itself without data.table loaded, or in data.table 1.10.4-3 or earlier?
I was going to add the test and NEWS item, as you've done the hard work already. But it's a branch in your repo, so I don't think I can, easily at least. You're a project member and trusted, so please create branches directly in the main project so we can work on the same branch together. Is there any slow down when all the strings are ASCII?
Just throwing it out there: I've had a problem I can't reproduce where some Chinese characters are considered distinct by by=, but recognized as identical by keyby=. This issue looks like it might solve that (but since I can't figure out how to reproduce it reliably, it'll be hard to tell).
…On Fri, Mar 23, 2018, 9:49 AM Xianying Tan wrote:
Closed #2675 <#2675>.
closes #2674
If the garbage collector is triggered during sorting (for example, when there are millions of non-ASCII strings), data.table crashes (see #2674 for details), because the code assumes the converted UTF-8 strings are still in the global string pool. This PR tries to fix that. In addition, it improves performance in the case of millions of non-ASCII strings, because each string now only needs to be converted to UTF-8 once. Before this PR, the strings were converted twice, in csort_pre() and csort() respectively, which can be a big cost for a large character vector (for example, on my computer, enc2utf8() takes about 20s for a character vector of 1e7 Chinese strings). This PR is unfinished because it still needs:
I will add them later, once I can be sure that there's no need for further modification at
https://github.com/Rdatatable/data.table/blob/master/src/forder.c#L1227
or anywhere else.