Splitting this out from #4346 so that discussion of this topic can happen here.
I would like to propose that forderv default to retGrp=TRUE, which means secondary indices would carry the group information (the 'starts' attribute) as well. Indices would be somewhat heavier as a result, but this opens up more possibilities for avoiding expensive re-computation. One of many examples:
# TODO: could check/reuse secondary indices, but we need 'starts' attribute as well!
See also #2947.
I made a small benchmark...
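For reference, here is a minimal sketch of what retGrp=TRUE adds to the result. It relies on the internal, unexported data.table:::forderv (the same function the benchmark uses), so attribute names and behaviour may differ between data.table versions. The toy data is made up for illustration.

```r
library(data.table)
forderv = data.table:::forderv  # internal function, not part of the public API

# Toy data, invented for this illustration
DT = data.table(V1 = c(3L, 1L, 2L, 1L, 3L, 3L))

o  = forderv(DT, by="V1", sort=TRUE, retGrp=FALSE)  # order only
og = forderv(DT, by="V1", sort=TRUE, retGrp=TRUE)   # order + group info

# The ordering permutation is the same in both cases
same_order = identical(as.integer(o), as.integer(og))

# Extra attributes carried by the retGrp=TRUE result
starts  = attr(og, "starts")   # where each group begins in the ordered data
maxgrpn = attr(og, "maxgrpn")  # size of the largest group

# Derived without sorting again
n_groups    = length(starts)
group_sizes = diff(c(starts, nrow(DT) + 1L))
```

If an index kept these attributes, questions such as "how many groups are there" or "how large is each group" could be answered from the stored result rather than by calling forderv again.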
tl;dr
The timing differences in the benchmark are significant. My conclusion is that we should not make this the default, but rather keep this information whenever the user has already computed it, for example when calling unique. In that case there is no extra performance cost, and the information does not have to be re-computed later. It could be computed when calling setindex.
Each comment describes a different factor that was varied.
library(data.table)
set.seed(108)
forderv = data.table:::forderv
N = 1e8
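## th: number of threads (use one of the two)
setDTthreads(40L)
setDTthreads(1L)
## unqn: number of unique values (use one of the two)
DT = data.table(V1 = sample(N, N, FALSE))  # all rows unique
DT = data.table(V1 = sample(1:2, N, TRUE)) # only 2 groups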
## fun: order vs order+groups
system.time(o <- forderv(DT, by="V1", sort=TRUE, retGrp=FALSE))
system.time(p <- forderv(DT, by="V1", sort=TRUE, retGrp=TRUE))
I got the following timings:
d = fread("
th,unqn,fun,sec
40,1e8,o,0.851
40,1e8,og,1.759
40,2,o,0.244
40,2,og,0.253
1,1e8,o,4.901
1,1e8,og,5.630
1,2,o,1.061
1,2,og,1.075
")
cube(d, by=c("th","unqn"), j=sprintf("%.2f%%", mean(sec[fun=="o"]/sec[fun=="og"])*100))
#    th  unqn     V1
# 1: 40 1e+08 48.38%
# 2: 40 2e+00 96.44%
# 3:  1 1e+08 87.05%
# 4:  1 2e+00 98.70%
# 5: 40    NA 72.41%
# 6:  1    NA 92.87%
# 7: NA 1e+08 67.72%
# 8: NA 2e+00 97.57%
# 9: NA    NA 82.64%
On average, computing the order without groups takes 82.6% of the time that order+groups takes.
The number of unique values (number of groups) matters: 97.6% with 2 groups vs 67.7% with all rows unique. So with only 2 groups the difference is negligible, but when all rows are unique, order alone takes on average 67.7% of the order+groups time.
The number of threads matters too: 72.4% with 40 threads vs 92.9% with 1 thread.
With 40 threads and all rows unique, computing order+groups is about twice as slow as computing order alone (1.759 s vs 0.851 s). With 1 thread and the same data it is only around 15% slower (5.630 s vs 4.901 s).
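The same timings can also be read the other way round, as the slowdown of order+groups relative to order alone. A small sketch using only the numbers from the timings table above (the column name slowdown is introduced here for illustration):

```r
library(data.table)

# Timings copied from the table above
d = fread("th,unqn,fun,sec
40,1e8,o,0.851
40,1e8,og,1.759
40,2,o,0.244
40,2,og,0.253
1,1e8,o,4.901
1,1e8,og,5.630
1,2,o,1.061
1,2,og,1.075
")

# Ratio > 1 means order+groups is slower than order alone
s = d[, .(slowdown = sec[fun=="og"] / sec[fun=="o"]), by=.(th, unqn)]
print(s)
```

A slowdown close to 1 means the group information comes almost for free. The worst case in this benchmark is 40 threads with all rows unique.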
Regarding memory, the number of threads is no longer a factor.
With all rows unique, the result takes twice as much memory, while with 2 groups it takes almost no extra memory.
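The memory claim can be checked with a rough sketch like the one below. It assumes that object.size accounts for attributes and that the group information is stored as attributes on the returned integer vector. N is reduced from the benchmark's 1e8 so that it runs quickly, and size_ratio is a helper name made up for this example.

```r
library(data.table)
forderv = data.table:::forderv  # internal function, not part of the public API
set.seed(108)

N = 1e6  # smaller than the benchmark's 1e8, for a quick check

uniq = data.table(V1 = sample(N, N, FALSE))   # all rows unique
two  = data.table(V1 = sample(1:2, N, TRUE))  # only 2 groups

# Hypothetical helper: size of order+groups relative to order alone
size_ratio = function(DT) {
  o  = forderv(DT, by="V1", sort=TRUE, retGrp=FALSE)
  og = forderv(DT, by="V1", sort=TRUE, retGrp=TRUE)
  as.numeric(object.size(og)) / as.numeric(object.size(o))
}

ratio_uniq = size_ratio(uniq)  # expected to be close to 2
ratio_two  = size_ratio(two)   # expected to be close to 1
```

The intuition is that the order itself is one integer per row, while the group starts add one integer per group: N extra integers when every row is its own group, and only 2 when there are 2 groups.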