A small note on the order of cube result #3179

@Henrik-P

In the Description in ?cube, we find that [cube]

Reflects SQLs GROUPING SETS

Then, the Value section is rather vague:

A data.table with various aggregates

...but the References point to the PostgreSQL documentation, section 7.2.4, GROUPING SETS, CUBE, and ROLLUP.

There we find that:

CUBE ( a, b, c )

is equivalent to

GROUPING SETS (
    ( a, b, c ),
    ( a, b    ),
    ( a,    c ),
    ( a       ),
    (    b, c ),
    (    b    ),
    (       c ),
    (         )
)

Here the first grouping variable varies slowest and the last varies fastest. Disclaimer: I don't have access to PostgreSQL, so I can't confirm that the rows of an actual query result follow this order ;)
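To make the documented order concrete, here is a minimal base R sketch that enumerates the grouping sets of CUBE in the order listed above. The helper name `cube_sets_sql_order` is hypothetical and not part of data.table or PostgreSQL; it only assumes that the list in the PostgreSQL docs is a binary counter in which the first variable carries the highest bit and a set bit means "column aggregated out".

```r
# Hypothetical helper (not part of data.table): list the grouping sets of
# CUBE(vars) in the order shown in the PostgreSQL documentation.
cube_sets_sql_order <- function(vars) {
  n <- length(vars)
  # Bit weights: the first variable gets the highest bit, the last the lowest.
  weights <- 2^(rev(seq_len(n)) - 1L)
  lapply(seq_len(2^n) - 1L, function(id) {
    # TRUE where the bit for that variable is set, i.e. the column is dropped.
    dropped <- (id %/% weights) %% 2 == 1
    vars[!dropped]
  })
}

sets <- cube_sets_sql_order(c("a", "b", "c"))
vapply(sets, paste, character(1), collapse = ", ")
# [1] "a, b, c" "a, b"    "a, c"    "a"       "b, c"    "b"       "c"       ""
```

Counting from 0 to 7 this way drops `c` first, then `b`, then `b` and `c`, and so on, which is why the first variable varies slowest.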

In cube it is the other way around: the last grouping variable in by varies slowest and the first varies fastest, which corresponds to:

GROUPING SETS (
    ( a, b, c ),
    (    b, c ),
    ( a,    c ),
    (       c ),
    ( a, b    ),
    (    b    ),
    ( a       ),
    (         )
)

I don't claim that PostgreSQL is "right", but given that the Value section in ?cube does not specify the row order, and that the help text instead refers to the PostgreSQL docs, the order of the cube output may be considered inconsistent with its own reference (but again, note my disclaimer above). In addition, I find the PostgreSQL ordering more intuitive, though that is perhaps a matter of taste.


An example to illustrate the order of cube:

library(data.table)

set.seed(1)
d <- data.table(a = rep(1:2, each = 8),
                b = rep(1:2, each = 4),
                c = rep(1:2, each = 2),
                val = sample(0:1, 16, replace = TRUE))

all.equal(
  cube(d, j = sum(val), by = c("a", "b", "c")),
  groupingsets(d, j = sum(val), by = c("a", "b", "c"),
               sets = list(c("a", "b", "c"),
                           c(     "b", "c"),
                           c("a",      "c"),
                           c(          "c"),
                           c("a", "b"     ),
                           c(     "b"     ),
                           c("a"          ),
                           character()))
)
# [1] TRUE
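As a possible workaround, the sets can be passed to groupingsets in the PostgreSQL order explicitly. This is only a sketch: it relies on groupingsets returning its blocks in the order of `sets`, which the comparison above suggests but which, as far as I can tell, ?groupingsets does not promise. The name `sql_sets` and the constant `val` column are made up for the illustration.

```r
library(data.table)

# Same layout as the example data, with constant values so that only the
# row order of the result matters.
d <- data.table(a = rep(1:2, each = 8),
                b = rep(1:2, each = 4, length.out = 16),
                c = rep(1:2, each = 2, length.out = 16),
                val = 1L)

# Grouping sets written out in the order listed in the PostgreSQL docs.
sql_sets <- list(c("a", "b", "c"),
                 c("a", "b"     ),
                 c("a",      "c"),
                 c("a"          ),
                 c(     "b", "c"),
                 c(     "b"     ),
                 c(          "c"),
                 character())

res <- groupingsets(d, j = sum(val), by = c("a", "b", "c"),
                    sets = sql_sets, id = TRUE)

# Order in which the grouping ids appear in the result
unique(res$grouping)
```

If the assumption holds, the grouping ids in `res` appear in increasing order instead of the interleaved order that cube produces.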

Update

After adding id = TRUE, it is obvious that the grouping counter is in fact based on the PostgreSQL order, so in the current output order it is non-consecutive (0, 4, 2, 6, 1, 5, 3, 7). Somewhat odd. It seems to me that the output could just as well follow the PostgreSQL order right away.

cube(d, j = sum(val), by = c("a", "b", "c"), id = TRUE)
    grouping  a  b  c V1
 1:        0  1  1  1  0 # ( a, b, c )
 2:        0  1  1  2  2
 3:        0  1  2  1  1
 4:        0  1  2  2  2
 5:        0  2  1  1  1
 6:        0  2  1  2  0
 7:        0  2  2  1  1
 8:        0  2  2  2  1
 9:        4 NA  1  1  1 # (    b, c )
10:        4 NA  1  2  2
11:        4 NA  2  1  2
12:        4 NA  2  2  3
13:        2  1 NA  1  1 # ( a,    c )
14:        2  1 NA  2  4
15:        2  2 NA  1  2
16:        2  2 NA  2  1
17:        6 NA NA  1  3 # (       c )
18:        6 NA NA  2  5
19:        1  1  1 NA  2 # ( a, b    )
20:        1  1  2 NA  3
21:        1  2  1 NA  1
22:        1  2  2 NA  2
23:        5 NA  1 NA  3 # (    b    )
24:        5 NA  2 NA  5
25:        3  1 NA NA  5 # ( a       )
26:        3  2 NA NA  3
27:        7 NA NA NA  8 # (         )
 grouping  a  b  c V1
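Since the grouping column already encodes the PostgreSQL order, another way to get that order today is to sort the cube result on it. A minimal sketch, using made-up constant values in `val` and the hypothetical name `res_sql`; it assumes the ids behave as in the output above (0 for the full set up to 7 for the grand total):

```r
library(data.table)

d <- data.table(a = rep(1:2, each = 8),
                b = rep(1:2, each = 4, length.out = 16),
                c = rep(1:2, each = 2, length.out = 16),
                val = 1L)   # constant values; only the row order matters here

res <- cube(d, j = sum(val), by = c("a", "b", "c"), id = TRUE)

# Sort on the id column only; rows within one grouping set keep their
# relative order because the sort is stable.
res_sql <- res[order(grouping)]

unique(res_sql$grouping)
```

This leaves the aggregates untouched and only rearranges the blocks, so it could serve as a stopgap while the default order is discussed.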
