[C++][Compute][Acero] Poor aggregate performance when there is a large number of batches on the build side

### Describe the bug, including details regarding any error messages, version, and platform.

I am running a performance test on a plan as described below. This execution plan is fairly straightforward, involving only two source nodes, one hash join node, and one aggregation node.

```ast
         aggr
           +
         join
     +-----+------+
  source_0     source_1
  (probe)      (build)
```

However, I noticed that the performance is far below my expectations. So, I made the following table for further analysis.

In this table, the horizontal axis represents the number of batches generated on the build side, while the vertical axis denotes the number of batches produced on the probe side, with each batch size being `1<<15`.


| time(in seconds) | build 1 | build 2 | build 4 | build 16 | build 32 | build 64 |
| ---------------- | ------- | ------- | ------- | -------- | -------- | -------- |
| probe 1          | 0.03    | 0.04    | 0.06    | 0.03     | 7.6      | 269.7    |
| probe 2          | 0.04    | 0.04    | 0.06    | 7.7      | 59.1     | 176.0    |
| probe 4          | 0.05    | 0.06    | 0.06    | 7.7      | 56.1     | 229.9    |
| probe 16         | 0.11    | 0.12    | 0.12    | 3.2      | 32.2     | 145.8    |
| probe 32         | 0.19    | 0.19    | 0.20    | 3.1      | 32.2     | 107.5    |
| probe 64         | 0.36    | 0.36    | 0.36    | 3.4      | 44.9     | 134.9    |
| probe 256        | 1.33    | 1.33    | 1.34    | 5.5      | 30.7     | 197.3    |
| probe 1024       | 5.2     | 5.3     | 5.2     | 5.3      | 17.7     | 50.9     |


More information:
1. I run these tests on a Intel x86_64 machine with about 100 cores.

2. I have noticed that in scenarios where the execution time exceeds 10 seconds, the CPU utilization is very low, and in the most of time, only one CPU core is being used.

3. I found that the `arrow/compute/row/grouper.cc:ConsumeImpl` method is quite time-consuming. (map phase)

4. The process of generating data also consumes a considerable amount of time, but in the tests mentioned above, the time spent on data generation does not account for a significant portion.

5. The distribution of data is also certain to affect the execution time, but I have not conducted any relevant verification.

6. In some scenarios, the execution time is longer even with a smaller amount of data, which should be considered unreasonable.

7. I tested on both version 19.0.1 and the main branch, and no significant differences were observed.(this table is based on v19.0.1)
 
8. If we remove the aggregate node, everything becomes OK, so I guess the main problem is the hash aggr node; If we replace `hash_count` in the test code below with `hash_sum` or another hash aggregation function, the results were almost unanimous.

### Component(s)

C++


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[C++][Compute][Acero] Poor aggregate performance when there is a large number of batches on the build side #45847

Describe the bug, including details regarding any error messages, version, and platform.

Component(s)

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

time(in seconds)	build 1	build 2	build 4	build 16	build 32	build 64
probe 1	0.03	0.04	0.06	0.03	7.6	269.7
probe 2	0.04	0.04	0.06	7.7	59.1	176.0
probe 4	0.05	0.06	0.06	7.7	56.1	229.9
probe 16	0.11	0.12	0.12	3.2	32.2	145.8
probe 32	0.19	0.19	0.20	3.1	32.2	107.5
probe 64	0.36	0.36	0.36	3.4	44.9	134.9
probe 256	1.33	1.33	1.34	5.5	30.7	197.3
probe 1024	5.2	5.3	5.2	5.3	17.7	50.9

[C++][Compute][Acero] Poor aggregate performance when there is a large number of batches on the build side #45847

Description

Describe the bug, including details regarding any error messages, version, and platform.

Component(s)

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions