Skip to content

[C++] Switching the column ordering of "segment key" and "hash key" #34908

@icexelloss

Description

@icexelloss

Describe the enhancement requested

Recently, this PR changes the ordering output columns of aggregation to
(hash)keys + segment_keys + aggregates
(https://github.com/apache/arrow/blob/main/cpp/src/arrow/acero/aggregate_node.cc#L654)

I wondering it makes more sense to change this to
segment_keys + (hash)keys + aggregates

My reasoning is as follows:
This only really affects the result in the case of segmented aggregation so I will solely reason about this in the context of segmented aggregation (in non-segmented aggregation, segment_keys is empty so the result wouldn't change regardless).

In segmented aggregation, the hash key is used within each segment (i.e., we don't combine grouped from different segments together) so in a way, the hash key feels like "second key", i.e., data is first split up by segments, then by hash keys. And therefore, it feels more natural to order the output columns: primary key (segment keys) + secondary key (hash keys) + aggregates.

To give an example of time ordered aggregations (an application of segmented aggregations). where segment key = "time" and hash key = "id", we almost always wanted the output to be "time, id" instead of "id, time" because the full stream is time ordered so it makes sense to make that the first column.

@westonpace I wonder what is your thoughts on switching this up? If you feel strongly about keeping the current order I think we might still be able to tweak this at the substrait level with emit output to get the order we want, but I figured it's probably easier / more consistent if we just switch up the default ordering.

I am happy to make the change if that sounds ok but want to discuss first.

Component(s)

C++

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions