Conversation
```diff
 /* alloc_csort_otmp(n) is called from forder for either n=nrow if 1st column,
    or n=maxgrpn if onwards columns */
-for(i=0; i<n; i++) csort_otmp[i] = (x[i] == NA_STRING) ? NA_INTEGER : -TRUELENGTH(ENC2UTF8(x[i]));
+for(i=0; i<n; i++) csort_otmp[i] = (x[i] == NA_STRING) ? NA_INTEGER : -TRUELENGTH(x[i]);
```
@mattdowle It's the first time I've realized that the
As for performance, it improves significantly when there are lots of non-ASCII characters:

```r
library(data.table)
nonascii_string <- function(n, utf8 = TRUE) {
  x <- c("公允价值变动损益", "红利收入", "价差收入", "其他业务支出", "资产减值损失")
  if (isTRUE(utf8)) x <- enc2utf8(x)
  sample(x, n, TRUE)
}
# ascii, one key column
tmp <- data.table(x = sample(letters, 1e8, TRUE))
system.time(setkey(tmp, x))
# ascii, two key columns
tmp <- data.table(x = sample(letters, 1e8, TRUE), y = sample(letters, 1e8, TRUE))
system.time(setkey(tmp, y, x))
# utf8, one key column
tmp <- data.table(x = nonascii_string(1e7))
system.time(setkey(tmp, x))
# utf8, two key columns
tmp <- data.table(x = nonascii_string(1e7), y = nonascii_string(1e7))
system.time(setkey(tmp, y, x))
# native encoding
tmp <- data.table(x = nonascii_string(1e5, FALSE))
system.time(setkey(tmp, x))
```
Codecov Report

```
@@           Coverage Diff            @@
##           master   #2678     +/-  ##
========================================
- Coverage   93.32%   93.31%   -0.01%
========================================
  Files          61       61
  Lines       12225    12237     +12
========================================
+ Hits        11409    11419     +10
- Misses        816      818      +2
```

Continue to review full report at Codecov.
Line 1227 in 4d8545e

```r
library(data.table)
utf8_strings <- enc2utf8(c("红利收入", "价差收入"))
native_strings <- enc2native(utf8_strings)
mixed_strings <- c(utf8_strings, native_strings)
DT <- data.table(x = mixed_strings, y = 1)
DT[, .N, by = .(x, y)]
#           x y N
# 1: 红利收入 1 2
# 2: 价差收入 1 2
DT[, .N, by = .(y, x)]
#    y        x N
# 1: 1 红利收入 1
# 2: 1 价差收入 1
# 3: 1 红利收入 1
# 4: 1 价差收入 1
```
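The duplicate groups arise because R treats differently-encoded copies of the same string as equal when comparing values, while they remain distinct entries (distinct CHARSXPs) in R's string pool. A minimal sketch of that mismatch (the `iconv` call assumes the characters are representable in latin1):

```r
u <- enc2utf8("fa\u00e7ile")                   # "façile", marked UTF-8
l <- iconv(u, from = "UTF-8", to = "latin1")   # same text, marked latin1

Encoding(u)   # "UTF-8"
Encoding(l)   # "latin1"
u == l        # TRUE: R translates encodings before comparing values
# Yet they are separate entries in the string pool, so any grouping keyed
# on the pool entry (as forder's TRUELENGTH trick is) can split them into
# two groups unless both are first converted to a common encoding.
```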
@mattdowle I've pushed more commits to fix the cases of grouping (a.k.a.
Thanks for all this! It's looking great to me. According to codecov,
I guess adding an example that uses the
I can't understand the failure from the error log I downloaded, because the following code gives the correct answer:

```r
utf8_strings <- c("\u00e7ile", "fa\u00e7ile", "El. pa\u00c5\u00a1tas", "\u00a1tas", "\u00de")
latin1_strings <- iconv(utf8_strings, from = "UTF-8", to = "latin1")
mixed_strings <- c(utf8_strings, latin1_strings)
DT <- data.table(x = mixed_strings, y = c(latin1_strings, utf8_strings), z = 1)
nrow(DT[, .N, by = .(z, x, y)])
# 5
```

EDIT: Got it. Yes, it indeed fails on the x32 version of R... I'm investigating it now...

EDIT2: Should have been fixed now. Also, the
mattdowle left a comment
Very nice. Thanks for the good comments. I read it a few times; indeed a better, cleaner approach.
Closes #2674
This PR replaces PR #2675 (see comments there).

If the garbage collector is somehow triggered during sorting (for example, when there are millions of non-ASCII characters), `data.table` collapses (see #2674 for details), because it assumes the converted UTF-8 strings are still in the global string pool. This PR fixes that issue.

In addition, it improves performance when there are millions of non-ASCII characters, because each string now needs to be converted to UTF-8 only once. Before this PR, the strings were converted twice, in `csort_pre()` and `csort()` respectively, which can be a big cost for a large character vector (for example, on my computer, `enc2utf8()` takes about 20s for a Chinese character vector of length 1e7).

TODO

`by = .(x, y)` should return the same rows as `by = .(y, x)` (see comment below)
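The TODO item can be phrased as a small consistency check (a hedged sketch: group counts must not depend on the order of the grouping columns once mixed encodings are normalized):

```r
library(data.table)
u <- enc2utf8("fa\u00e7ile")
l <- iconv(u, from = "UTF-8", to = "latin1")   # same text, different encoding mark
DT <- data.table(x = c(u, l), y = c(l, u))
# With the fix, both key orders should collapse the mixed encodings into
# the same set of groups, so the group counts agree:
nrow(DT[, .N, by = .(x, y)]) == nrow(DT[, .N, by = .(y, x)])
```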