sum & mean group-by operations not optimised for factor types #2458

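It seems that group-by operations are not optimised when the by-variable is a factor. In principle, a group-by on a factor can be faster, because a factor is represented by integer codes from 1 to n, where n is the number of groups, so each code can index an array of per-group accumulators directly instead of going through hashing or sorting.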
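To illustrate the idea, here is a minimal Python sketch (standard library only) contrasting a general hashed group-by sum with one that exploits factor-style codes in 1..n. The function names `group_sum_hashed`, `group_sum_by_codes` and `group_mean_by_codes` are hypothetical and chosen for illustration; this is not how data.table or any other library implements it.

```python
from collections import defaultdict


def group_sum_hashed(keys, values):
    """General group-by sum: one hash lookup per row."""
    totals = defaultdict(float)
    for k, v in zip(keys, values):
        totals[k] += v
    return dict(totals)


def group_sum_by_codes(codes, values, n_levels):
    """Group-by sum assuming codes are integers in 1..n_levels (like a factor).

    The code is used directly as an index into a preallocated list,
    so no hashing or sorting is needed.
    """
    totals = [0.0] * n_levels
    for c, v in zip(codes, values):
        totals[c - 1] += v  # shift 1-based code to 0-based index
    return totals


def group_mean_by_codes(codes, values, n_levels):
    """Group-by mean under the same assumption; unused levels give None."""
    sums = [0.0] * n_levels
    counts = [0] * n_levels
    for c, v in zip(codes, values):
        sums[c - 1] += v
        counts[c - 1] += 1
    return [s / n if n else None for s, n in zip(sums, counts)]


# Toy data: three levels, six rows.
codes = [1, 3, 2, 3, 1, 3]
values = [10.0, 1.0, 5.0, 2.0, 20.0, 3.0]

sums = group_sum_by_codes(codes, values, n_levels=3)
means = group_mean_by_codes(codes, values, n_levels=3)
hashed = group_sum_hashed(codes, values)
```

The code-indexed version needs only one pass and an accumulator array of length n, which is the kind of shortcut a factor by-variable makes possible.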

See the discussion of the Python implementation at http://wesmckinney.com/blog/mastering-high-performance-data-algorithms-i-group-by/
and the discussion of the Julia implementation at https://www.codementor.io/zhuojiadai/an-empirical-study-of-group-by-strategies-in-julia-dagnosell

**Update - Benchmarks**

Here are some [benchmarks](https://gist.github.com/xiaodaigh/127d65f09c49d4ec16b807ed856c97eb) showing that a factor used as the group-by variable doesn't seem to receive any extra optimisation.

I ran two benchmarks: one with 2.5 million groups and one with 100 million groups. For each, I ran one version where the group-by variable is an integer and one where it is a factor. In both benchmarks, the factor version was no faster than the integer version. One can probably confirm from the source code that a sum grouped by a factor has no dedicated optimisation.
