sum & mean group by operation not optimised for factor types #2458

@xiaodaigh

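It seems that group-by operations are not optimised when the by-variable is a factor. A group-by on a factor can in principle be faster, because a factor is stored as integer codes from 1 to n, where n is the number of groups. Those codes can index directly into an array of per-group accumulators, with no hashing or sorting of the keys.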

See Python implementation discussion http://wesmckinney.com/blog/mastering-high-performance-data-algorithms-i-group-by/
and Julia implementation discussion https://www.codementor.io/zhuojiadai/an-empirical-study-of-group-by-strategies-in-julia-dagnosell
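
To make the idea above concrete, here is a minimal sketch in Python (not data.table's actual implementation, and not code from either linked post). It assumes the factor is available as a list of integer codes in 1..n; the function names `group_sum_hashed`, `group_sum_by_codes` and `group_mean_by_codes` are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def group_sum_hashed(keys, values):
    """General group-by sum: every key goes through a hash-table lookup."""
    totals = defaultdict(float)
    for k, v in zip(keys, values):
        totals[k] += v
    return dict(totals)


def group_sum_by_codes(codes, values, n_levels):
    """Group-by sum for factor-like keys (assumed: integer codes in 1..n_levels).

    The code is used directly as an array index, so no hashing is needed.
    """
    totals = [0.0] * n_levels
    for c, v in zip(codes, values):
        totals[c - 1] += v  # codes are 1-based, list indices are 0-based
    return totals


def group_mean_by_codes(codes, values, n_levels):
    """Group-by mean using the same direct indexing, with a count per level."""
    totals = [0.0] * n_levels
    counts = [0] * n_levels
    for c, v in zip(codes, values):
        totals[c - 1] += v
        counts[c - 1] += 1
    # Levels with no rows get NaN rather than raising a division error.
    return [t / n if n else float("nan") for t, n in zip(totals, counts)]


codes = [1, 3, 2, 1, 3]                    # factor codes, 3 levels
values = [10.0, 20.0, 30.0, 40.0, 50.0]

sums_by_code = group_sum_by_codes(codes, values, n_levels=3)
means_by_code = group_mean_by_codes(codes, values, n_levels=3)
sums_hashed = group_sum_hashed(codes, values)
```

Both approaches give the same per-group sums. The difference is that the code-indexed version does one array write per row, whereas the hashed version has to hash and probe for every key.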

Update - Benchmarks

Here are some benchmarks showing that a group-by on a factor does not seem to receive any extra optimisation.

I ran two benchmarks: one where the group-by variable has 2.5 million groups, and one where it has 100 million groups. For each, I ran the group-by once with an integer by-variable and once with a factor by-variable. In both cases the factor version was no faster than the integer version. One can probably confirm by looking at the source code that a group-by sum on a factor has no dedicated optimisation.
