Conversation
```diff
 /* alloc_csort_otmp(n) is called from forder for either n=nrow if 1st column,
    or n=maxgrpn if onwards columns */
-for(i=0; i<n; i++) csort_otmp[i] = (x[i] == NA_STRING) ? NA_INTEGER : -TRUELENGTH(ENC2UTF8(x[i]));
+for(i=0; i<n; i++) csort_otmp[i] = (x[i] == NA_STRING) ? NA_INTEGER : -TRUELENGTH(x[i]);
```
@mattdowle It's the first time I've realized that the
As for performance, it improves significantly when there are lots of non-ASCII characters:

```r
library(data.table)
nonascii_string <- function(n, utf8 = TRUE) {
  x <- c("公允价值变动损益", "红利收入", "价差收入", "其他业务支出", "资产减值损失")
  if (isTRUE(utf8)) x <- enc2utf8(x)
  sample(x, n, TRUE)
}
# ascii, one key column
tmp <- data.table(x = sample(letters, 1e8, TRUE))
system.time(setkey(tmp, x))
# ascii, two key columns
tmp <- data.table(x = sample(letters, 1e8, TRUE), y = sample(letters, 1e8, TRUE))
system.time(setkey(tmp, y, x))
# utf8, one key column
tmp <- data.table(x = nonascii_string(1e7))
system.time(setkey(tmp, x))
# utf8, two key columns
tmp <- data.table(x = nonascii_string(1e7), y = nonascii_string(1e7))
system.time(setkey(tmp, y, x))
# native encoding
tmp <- data.table(x = nonascii_string(1e5, FALSE))
system.time(setkey(tmp, x))
```
Codecov Report

```
@@           Coverage Diff            @@
##           master   #2678     +/-  ##
========================================
- Coverage   93.32%   93.31%   -0.01%
========================================
  Files          61       61
  Lines       12225    12237     +12
========================================
+ Hits        11409    11419     +10
- Misses        816      818      +2
```

Continue to review full report at Codecov.
Line 1227 in 4d8545e

```r
library(data.table)
utf8_strings <- enc2utf8(c("红利收入", "价差收入"))
native_strings <- enc2native(utf8_strings)
mixed_strings <- c(utf8_strings, native_strings)
DT <- data.table(x = mixed_strings, y = 1)
DT[, .N, by = .(x, y)]
#           x y N
# 1: 红利收入 1 2
# 2: 价差收入 1 2
DT[, .N, by = .(y, x)]
#    y        x N
# 1: 1 红利收入 1
# 2: 1 价差收入 1
# 3: 1 红利收入 1
# 4: 1 价差收入 1
```
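The duplicate groups arise because R treats differently-encoded copies of the same string as equal when comparing values, while they remain distinct entries (distinct CHARSXPs) in R's string pool. A minimal sketch of that mismatch (the `iconv` call assumes the characters are representable in latin1):

```r
u <- enc2utf8("fa\u00e7ile")                   # "façile", marked UTF-8
l <- iconv(u, from = "UTF-8", to = "latin1")   # same text, marked latin1

Encoding(u)   # "UTF-8"
Encoding(l)   # "latin1"
u == l        # TRUE: R translates encodings before comparing values
# Yet they are separate entries in the string pool, so any grouping keyed
# on the pool entry (as forder's TRUELENGTH trick is) can split them into
# two groups unless both are first converted to a common encoding.
```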
@mattdowle I've pushed more commits to fix the cases of grouping (a.k.a.
Thanks for all this! It's looking great to me. According to codecov,
I guess adding an example that uses the
I can't understand the failure from the error log I downloaded, because the following code gives the correct answer:

```r
utf8_strings <- c("\u00e7ile", "fa\u00e7ile", "El. pa\u00c5\u00a1tas", "\u00a1tas", "\u00de")
latin1_strings <- iconv(utf8_strings, from = "UTF-8", to = "latin1")
mixed_strings <- c(utf8_strings, latin1_strings)
DT <- data.table(x = mixed_strings, y = c(latin1_strings, utf8_strings), z = 1)
nrow(DT[, .N, by = .(z, x, y)])
# 5
```

EDIT: Got it. Yes, it indeed fails on the x32 version of R... I'm investigating it now...

EDIT2: Should have been fixed now. Also, the
mattdowle left a comment
Very nice. Thanks for the good comments. I read it a few times; indeed a better, cleaner approach.
Closes #2674
This PR replaces PR #2675 (see comments there).

If the garbage collector is somehow triggered during sorting (for example, when there are millions of non-ASCII characters), `data.table` collapses (see #2674 for details), because it assumes the converted UTF-8 strings are still in the global string pool. This PR fixes that issue.

In addition, it improves performance when there are millions of non-ASCII characters, because each string now needs to be converted to UTF-8 only once. Before this PR, the strings were converted twice, in `csort_pre()` and `csort()` respectively, which can be a big cost for a large character vector (for example, on my computer, `enc2utf8()` takes about 20s for a Chinese character vector of length 1e7).

TODO

`by = .(x, y)` should return the same rows as `by = .(y, x)` (see comment below)
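The TODO item can be phrased as a small consistency check (a hedged sketch: group counts must not depend on the order of the grouping columns once mixed encodings are normalized):

```r
library(data.table)
u <- enc2utf8("fa\u00e7ile")
l <- iconv(u, from = "UTF-8", to = "latin1")   # same text, different encoding mark
DT <- data.table(x = c(u, l), y = c(l, u))
# With the fix, both key orders should collapse the mixed encodings into
# the same set of groups, so the group counts agree:
nrow(DT[, .N, by = .(x, y)]) == nrow(DT[, .N, by = .(y, x)])
```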