computing index could find groups as well

Splitting this out from #4346 so that the discussion here can focus on this topic only.
I would like to propose that `forderv` default to `retGrp=TRUE`, which means secondary indices would carry the group information as an attribute as well. Computing an index would become a little heavier, but it opens up more possibilities to avoid expensive recomputation later. One of many examples:
```
# TODO: could check/reuse secondary indices, but we need 'starts' attribute as well!
```
See also #2947.
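
For context, here is a minimal sketch of what `retGrp=TRUE` adds to the result. It assumes the internal `forderv` is accessed via `:::` (so it is not a stable API and may differ between versions), and the toy data is made up purely for illustration.

```r
library(data.table)
forderv = data.table:::forderv   # internal function, may change between versions

DT = data.table(V1 = c(2L, 1L, 2L, 1L, 3L))   # toy data, for illustration only

o = forderv(DT, by="V1", sort=TRUE, retGrp=FALSE)   # order only
p = forderv(DT, by="V1", sort=TRUE, retGrp=TRUE)    # order + group information

as.vector(p)         # the ordering permutation, same values as as.vector(o)
attr(p, "starts")    # position in the ordered data where each group begins
attr(p, "maxgrpn")   # size of the largest group
```

The `starts` attribute is what the TODO comment above refers to: an index stored without it cannot be reused where group boundaries are needed.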

----

I ran a small benchmark.

tl;dr
----

The differences in the timings below are significant. My conclusion is that we should not make `retGrp=TRUE` the default, but rather keep this information whenever the user computes it anyway, for example when calling `unique`. In that case there is no extra performance cost, and the information does not have to be recomputed later. It could be computed when calling `setindex`.
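
To illustrate why keeping the group information is useful, here is a rough sketch of how unique values and group sizes can be read off an order vector that already carries `starts`, without sorting again. This is only an illustration on made-up data, not how data.table implements `unique` internally. It also assumes the input is not already sorted (in that case `forderv` returns a zero-length order vector, which this sketch does not handle).

```r
library(data.table)
forderv = data.table:::forderv   # internal function, may change between versions

DT = data.table(V1 = c(2L, 1L, 2L, 1L, 3L))   # toy data, deliberately unsorted

# computed once, e.g. at the time the user asks for unique values
o = forderv(DT, by="V1", sort=TRUE, retGrp=TRUE)
starts = attr(o, "starts")

# row number (in DT) of the first row of each group, in sorted group order
first_rows = o[starts]
uniq_vals = DT$V1[first_rows]

# group sizes follow from the distance between consecutive group starts
grp_sizes = diff(c(starts, nrow(DT) + 1L))
```

Both `uniq_vals` and `grp_sizes` come from the stored attribute alone, which is the recomputation the proposal tries to avoid.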

----

Each comment in the script marks a different factor that was varied.

```r
library(data.table)
set.seed(108)
forderv = data.table:::forderv
N = 1e8

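## th: number of threads (run one of the two lines)
setDTthreads(40L)
setDTthreads(1L)

## unqn: number of unique values (run one of the two lines)
DT = data.table(V1 = sample(N, N, FALSE))   # all N values unique
DT = data.table(V1 = sample(1:2, N, TRUE))  # only 2 unique values

## fun: order only ("o") vs order+groups ("og")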
system.time(o <- forderv(DT, by="V1", sort=TRUE, retGrp=FALSE))
system.time(p <- forderv(DT, by="V1", sort=TRUE, retGrp=TRUE))
```
and got the following timings (in seconds):
```r
d = fread("
th,unqn,fun,sec
40,1e8,o,0.851
40,1e8,og,1.759
40,2,o,0.244
40,2,og,0.253
1,1e8,o,4.901
1,1e8,og,5.630
1,2,o,1.061
1,2,og,1.075
")
cube(d, by=c("th","unqn"), j=sprintf("%.2f%%", mean(sec[fun=="o"]/sec[fun=="og"])*100))
#      th  unqn     V1
#1:    40 1e+08 48.38%
#2:    40 2e+00 96.44%
#3:     1 1e+08 87.05%
#4:     1 2e+00 98.70%
#5:    40    NA 72.41%
#6:     1    NA 92.87%
#7:    NA 1e+08 67.72%
#8:    NA 2e+00 97.57%
#9:    NA    NA 82.64%
```
On average, finding `order` without `groups` takes about 83% of the time that `order+groups` takes.
The number of unique values (number of groups) matters: about 98% with 2 groups vs about 68% with all rows unique. So with only 2 groups the difference is not significant, but when all rows are unique, `order` alone takes on average only about 68% of the `order+groups` time.
The number of threads matters as well: about 72% with 40 threads vs about 93% with 1 thread.
With 40 threads and all unique rows, computing `order+groups` is about twice as slow as computing just `order`. With 1 thread and all unique rows it is only about 15% slower.

Regarding memory, the number of threads is no longer a factor.
With all unique rows the result takes about twice as much memory, because the `starts` attribute holds one integer per group, while with 2 groups it takes almost no extra memory.
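
The memory claim can be checked with a small sketch along these lines. It uses a much smaller `N` than the benchmark so that it runs quickly, and `object.size` as a rough measure. The variable names are made up for this example.

```r
library(data.table)
forderv = data.table:::forderv   # internal function, may change between versions
set.seed(108)
N = 1e5L                         # much smaller than in the benchmark above

DT_unique = data.table(V1 = sample(N, N, FALSE))   # all rows unique
DT_two = data.table(V1 = sample(1:2, N, TRUE))     # only 2 groups

# size of the order+groups result relative to the order-only result
size_ratio = function(DT) {
  o = forderv(DT, by="V1", sort=TRUE, retGrp=FALSE)
  p = forderv(DT, by="V1", sort=TRUE, retGrp=TRUE)
  as.numeric(object.size(p)) / as.numeric(object.size(o))
}

ratio_unique = size_ratio(DT_unique)   # close to 2: 'starts' is as long as the order vector
ratio_two = size_ratio(DT_two)         # close to 1: 'starts' has only 2 elements
```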

