perf: Optimize `multi_group_by` when there are a lot of unique groups #17592

rluvaton · 2025-09-16T11:47:56Z

Which issue does this PR close?

N/A

Rationale for this change

I want grouping to be fast when there are many columns to group by and many unique groups.

What changes are included in this PR?

This optimization is fairly simple:
if the row indices to append are contiguous (i.e. `append_row_indices[i] + 1 == append_row_indices[i + 1]`), we call an optimized function for that case.

The optimized function copies all the data in a single pass, which should be much faster than appending item by item.
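As a rough illustration of the idea (not the code in this PR), the contiguity check and the bulk copy can be sketched on plain slices. The names `is_contiguous` and `append_rows` are hypothetical, and a `Vec<i64>` stands in for the real column builders:

```rust
/// Returns true when every index is exactly one greater than its predecessor.
/// An empty or single-element list counts as contiguous.
fn is_contiguous(indices: &[usize]) -> bool {
    indices.windows(2).all(|w| w[0] + 1 == w[1])
}

/// Appends `values[i]` for each `i` in `indices` to `out`.
/// Hypothetical helper: the real builders operate on Arrow arrays.
fn append_rows(out: &mut Vec<i64>, values: &[i64], indices: &[usize]) {
    match indices.first() {
        // Fast path: the rows form one run, so a single bulk copy is enough.
        Some(&start) if is_contiguous(indices) => {
            out.extend_from_slice(&values[start..start + indices.len()]);
        }
        // Fallback: copy item by item (also covers the empty case).
        _ => out.extend(indices.iter().map(|&i| values[i])),
    }
}
```

For example, appending indices `[1, 2, 3]` takes the bulk-copy path, while `[4, 0]` falls back to the per-item loop.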

I did not implement this for byte views for now, as I don't think it would be very beneficial: there is no single chunk of data that we can copy once and be done with.

I also added fuzz tests for grouping on multiple columns, where each column I tested has a different optimized implementation. All rows in the test are unique, so we assert that we get the same output back.
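The general shape of such a check can be sketched as a differential test: run the per-item path and the slice path on the same random input and compare. This is only a sketch under assumptions, not the fuzz test added in the PR. The helpers `append_one_by_one`, `append_slice`, and `paths_agree` are hypothetical, and a small hand-rolled generator replaces a real random-number crate:

```rust
/// Tiny deterministic pseudo-random generator (an LCG), used only so the
/// sketch needs no external crates.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

/// Reference path: copy the requested rows one at a time.
fn append_one_by_one(values: &[u64], rows: &[usize]) -> Vec<u64> {
    rows.iter().map(|&r| values[r]).collect()
}

/// Optimized path: copy one contiguous run in a single step.
fn append_slice(values: &[u64], start: usize, len: usize) -> Vec<u64> {
    values[start..start + len].to_vec()
}

/// Runs `iterations` random cases and returns true if both paths always agree.
fn paths_agree(seed: u64, iterations: usize) -> bool {
    let mut rng = Lcg(seed);
    (0..iterations).all(|_| {
        let n = 1 + (rng.next_u64() % 64) as usize;
        let values: Vec<u64> = (0..n).map(|_| rng.next_u64()).collect();
        // Pick a random contiguous run [start, start + len) inside the input.
        let start = (rng.next_u64() % n as u64) as usize;
        let len = (rng.next_u64() % (n - start + 1) as u64) as usize;
        let rows: Vec<usize> = (start..start + len).collect();
        append_one_by_one(&values, &rows) == append_slice(&values, start, len)
    })
}
```

Deliberately breaking `append_slice` (for example, an off-by-one on `start`) should make `paths_agree` return false, which is the same idea as the manual mutation testing mentioned in this description.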

Are these changes tested?

Yes, and I also tried inserting bugs (manual mutation testing) to check that the tests are solid.

Are there any user-facing changes?

Nope

rluvaton · 2025-09-16T11:52:26Z

@alamb Are there any benchmarks that you can run that use that flow?

I only see the following, but it benchmarks bytes view, which is the only non-optimized case:
https://github.com/apache/datafusion/blob/49d49fd92dddf55bfb22787fea17dda1a698dc4d/datafusion/physical-plan/benches/aggregate_vectorized.rs

I see that the original PR that added it ran ClickBench, but I don't know if ClickBench has a lot of unique groups:

Avoid RowConverter for multi column grouping (10% faster clickbench queries) #12269

alamb · 2025-09-16T12:32:29Z

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize-for-unique-groups (62ebc86) to 49d49fd diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

alamb · 2025-09-16T12:33:22Z

> @alamb Are there any benchmarks that you can run that use that flow?
>
> I only see the following, but it benchmarks bytes view, which is the only non-optimized case: https://github.com/apache/datafusion/blob/49d49fd92dddf55bfb22787fea17dda1a698dc4d/datafusion/physical-plan/benches/aggregate_vectorized.rs
>
> I see that the original PR that added it ran ClickBench, but I don't know if ClickBench has a lot of unique groups:
>
> Avoid RowConverter for multi column grouping (10% faster clickbench queries) #12269

Several of the ClickBench queries have many distinct groups so hopefully that will cover it. I kicked off the run and will check back shortly.

Thank you @rluvaton -- this PR sounds quite cool.

alamb · 2025-09-16T13:31:24Z

🤖: Benchmark completed

Details

Comparing HEAD and optimize-for-unique-groups
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ optimize-for-unique-groups ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  2722.03 ms │                 2794.77 ms │ no change │
│ QQuery 1     │  1409.38 ms │                 1440.73 ms │ no change │
│ QQuery 2     │  2566.38 ms │                 2651.74 ms │ no change │
│ QQuery 3     │  1129.65 ms │                 1175.43 ms │ no change │
│ QQuery 4     │  2241.61 ms │                 2290.29 ms │ no change │
│ QQuery 5     │ 27576.04 ms │                27538.89 ms │ no change │
│ QQuery 6     │  4173.47 ms │                 4199.45 ms │ no change │
│ QQuery 7     │  3618.47 ms │                 3530.42 ms │ no change │
└──────────────┴─────────────┴────────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                         ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                         │ 45437.03ms │
│ Total Time (optimize-for-unique-groups)   │ 45621.73ms │
│ Average Time (HEAD)                       │  5679.63ms │
│ Average Time (optimize-for-unique-groups) │  5702.72ms │
│ Queries Faster                            │          0 │
│ Queries Slower                            │          0 │
│ Queries with No Change                    │          8 │
│ Queries with Failure                      │          0 │
└───────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ optimize-for-unique-groups ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.28 ms │                    2.55 ms │  1.12x slower │
│ QQuery 1     │    52.32 ms │                   53.72 ms │     no change │
│ QQuery 2     │   139.49 ms │                  135.95 ms │     no change │
│ QQuery 3     │   164.84 ms │                  168.49 ms │     no change │
│ QQuery 4     │  1197.55 ms │                 1144.75 ms │     no change │
│ QQuery 5     │  1637.89 ms │                 1537.58 ms │ +1.07x faster │
│ QQuery 6     │     2.29 ms │                    2.33 ms │     no change │
│ QQuery 7     │    57.34 ms │                   56.23 ms │     no change │
│ QQuery 8     │  1673.75 ms │                 1570.60 ms │ +1.07x faster │
│ QQuery 9     │  1908.51 ms │                 1898.76 ms │     no change │
│ QQuery 10    │   385.82 ms │                  401.86 ms │     no change │
│ QQuery 11    │   450.09 ms │                  443.56 ms │     no change │
│ QQuery 12    │  1517.20 ms │                 1468.64 ms │     no change │
│ QQuery 13    │  2314.56 ms │                 2240.68 ms │     no change │
│ QQuery 14    │  1366.00 ms │                 1311.77 ms │     no change │
│ QQuery 15    │  1383.56 ms │                 1285.60 ms │ +1.08x faster │
│ QQuery 16    │  2786.81 ms │                 2755.46 ms │     no change │
│ QQuery 17    │  2773.44 ms │                 2707.80 ms │     no change │
│ QQuery 18    │  5778.64 ms │                 5174.39 ms │ +1.12x faster │
│ QQuery 19    │   131.26 ms │                  126.86 ms │     no change │
│ QQuery 20    │  2066.77 ms │                 1955.42 ms │ +1.06x faster │
│ QQuery 21    │  2401.34 ms │                 2302.42 ms │     no change │
│ QQuery 22    │  6141.70 ms │                 3904.40 ms │ +1.57x faster │
│ QQuery 23    │ 12899.35 ms │                12909.65 ms │     no change │
│ QQuery 24    │   228.55 ms │                  217.73 ms │     no change │
│ QQuery 25    │   505.93 ms │                  517.77 ms │     no change │
│ QQuery 26    │   220.93 ms │                  214.14 ms │     no change │
│ QQuery 27    │  2869.58 ms │                 2812.20 ms │     no change │
│ QQuery 28    │ 23375.69 ms │                22868.16 ms │     no change │
│ QQuery 29    │   977.07 ms │                  969.30 ms │     no change │
│ QQuery 30    │  1471.08 ms │                 1351.87 ms │ +1.09x faster │
│ QQuery 31    │  1438.58 ms │                 1327.67 ms │ +1.08x faster │
│ QQuery 32    │  4834.82 ms │                 5019.46 ms │     no change │
│ QQuery 33    │  6235.27 ms │                 5806.49 ms │ +1.07x faster │
│ QQuery 34    │  6331.78 ms │                 5857.93 ms │ +1.08x faster │
│ QQuery 35    │  2288.27 ms │                 2069.91 ms │ +1.11x faster │
│ QQuery 36    │   126.56 ms │                  119.79 ms │ +1.06x faster │
│ QQuery 37    │    54.04 ms │                   54.19 ms │     no change │
│ QQuery 38    │   125.03 ms │                  121.10 ms │     no change │
│ QQuery 39    │   207.64 ms │                  195.97 ms │ +1.06x faster │
│ QQuery 40    │    45.03 ms │                   45.48 ms │     no change │
│ QQuery 41    │    43.67 ms │                   41.54 ms │     no change │
│ QQuery 42    │    35.11 ms │                   31.34 ms │ +1.12x faster │
└──────────────┴─────────────┴────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary                         ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (HEAD)                         │ 100647.45ms │
│ Total Time (optimize-for-unique-groups)   │  95201.52ms │
│ Average Time (HEAD)                       │   2340.64ms │
│ Average Time (optimize-for-unique-groups) │   2213.99ms │
│ Queries Faster                            │          14 │
│ Queries Slower                            │           1 │
│ Queries with No Change                    │          28 │
│ Queries with Failure                      │           0 │
└───────────────────────────────────────────┴─────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ optimize-for-unique-groups ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 179.03 ms │                  169.45 ms │ +1.06x faster │
│ QQuery 2     │  26.49 ms │                   25.82 ms │     no change │
│ QQuery 3     │  45.92 ms │                   45.74 ms │     no change │
│ QQuery 4     │  26.90 ms │                   26.87 ms │     no change │
│ QQuery 5     │  75.99 ms │                   76.58 ms │     no change │
│ QQuery 6     │  19.58 ms │                   19.50 ms │     no change │
│ QQuery 7     │ 150.54 ms │                  158.20 ms │  1.05x slower │
│ QQuery 8     │  34.10 ms │                   31.76 ms │ +1.07x faster │
│ QQuery 9     │  88.95 ms │                   89.12 ms │     no change │
│ QQuery 10    │  59.43 ms │                   59.19 ms │     no change │
│ QQuery 11    │  41.45 ms │                   41.67 ms │     no change │
│ QQuery 12    │  50.83 ms │                   51.81 ms │     no change │
│ QQuery 13    │  48.98 ms │                   46.55 ms │     no change │
│ QQuery 14    │  13.59 ms │                   14.11 ms │     no change │
│ QQuery 15    │  24.65 ms │                   24.83 ms │     no change │
│ QQuery 16    │  24.46 ms │                   25.42 ms │     no change │
│ QQuery 17    │ 150.75 ms │                  162.26 ms │  1.08x slower │
│ QQuery 18    │ 332.04 ms │                  337.59 ms │     no change │
│ QQuery 19    │  37.18 ms │                   36.85 ms │     no change │
│ QQuery 20    │  50.63 ms │                   51.92 ms │     no change │
│ QQuery 21    │ 230.26 ms │                  231.37 ms │     no change │
│ QQuery 22    │  20.40 ms │                   20.51 ms │     no change │
└──────────────┴───────────┴────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                         ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                         │ 1732.14ms │
│ Total Time (optimize-for-unique-groups)   │ 1747.09ms │
│ Average Time (HEAD)                       │   78.73ms │
│ Average Time (optimize-for-unique-groups) │   79.41ms │
│ Queries Faster                            │         2 │
│ Queries Slower                            │         2 │
│ Queries with No Change                    │        18 │
│ Queries with Failure                      │         0 │
└───────────────────────────────────────────┴───────────┘

alamb · 2025-09-16T13:31:28Z

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize-for-unique-groups (00b5f07) to 49d49fd diff using: clickbench_extended
Results will be posted here when complete

alamb · 2025-09-16T13:50:06Z

🤖: Benchmark completed

Details

Comparing HEAD and optimize-for-unique-groups
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ optimize-for-unique-groups ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  2701.79 ms │                 2755.58 ms │ no change │
│ QQuery 1     │  1428.12 ms │                 1423.31 ms │ no change │
│ QQuery 2     │  2565.84 ms │                 2612.89 ms │ no change │
│ QQuery 3     │  1178.71 ms │                 1218.91 ms │ no change │
│ QQuery 4     │  2229.11 ms │                 2294.06 ms │ no change │
│ QQuery 5     │ 28010.48 ms │                27439.24 ms │ no change │
│ QQuery 6     │  4203.36 ms │                 4244.84 ms │ no change │
│ QQuery 7     │  3458.36 ms │                 3372.46 ms │ no change │
└──────────────┴─────────────┴────────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                         ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                         │ 45775.76ms │
│ Total Time (optimize-for-unique-groups)   │ 45361.29ms │
│ Average Time (HEAD)                       │  5721.97ms │
│ Average Time (optimize-for-unique-groups) │  5670.16ms │
│ Queries Faster                            │          0 │
│ Queries Slower                            │          0 │
│ Queries with No Change                    │          8 │
│ Queries with Failure                      │          0 │
└───────────────────────────────────────────┴────────────┘

rluvaton · 2025-09-16T19:33:37Z

@alamb I added fuzz tests, and this PR is ready for review

rluvaton · 2025-09-16T19:34:03Z

@jayzhan211 would love your review as well

rluvaton · 2025-09-16T19:53:45Z

I might be missing something, but the benchmark showed that q22 is much faster now:

Benchmark clickbench_partitioned.json

...

│ QQuery 22    │  6141.70 ms │                 3904.40 ms │ +1.57x faster │

But q22 has a single group-by column, so it should not go through the path I just changed:

SELECT
    SearchPhrase,
    MIN(URL),
    MIN(Title),
    COUNT(*) AS c,
    COUNT(DISTINCT UserID)
FROM hits
WHERE
    Title LIKE '%Google%' AND
    URL NOT LIKE '%.google.%' AND
    SearchPhrase <> ''
GROUP BY SearchPhrase
ORDER BY c DESC
LIMIT 10;

I verified that the code path was not used by adding a `panic!`. This is the plan, BTW:

+---------------+-------------------------------+
| plan_type     | plan                          |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
|               | │  SortPreservingMergeExec  │ |
|               | │    --------------------   │ |
|               | │      c DESClimit: 10      │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       SortExec(TopK)      │ |
|               | │    --------------------   │ |
|               | │          c@3 DESC         │ |
|               | │                           │ |
|               | │         limit: 10         │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       ProjectionExec      │ |
|               | │    --------------------   │ |
|               | │       SearchPhrase:       │ |
|               | │        SearchPhrase       │ |
|               | │                           │ |
|               | │     c: count(Int64(1))    │ |
|               | │                           │ |
|               | │ count(DISTINCT hits.UserID│ |
|               | │             ):            │ |
|               | │count(DISTINCT hits.UserID)│ |
|               | │                           │ |
|               | │      min(hits.Title):     │ |
|               | │      min(hits.Title)      │ |
|               | │                           │ |
|               | │       min(hits.URL):      │ |
|               | │       min(hits.URL)       │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       AggregateExec       │ |
|               | │    --------------------   │ |
|               | │           aggr:           │ |
|               | │  min(hits.URL), min(hits  │ |
|               | │     .Title), count(1),    │ |
|               | │     count(DISTINCT hits   │ |
|               | │          .UserID)         │ |
|               | │                           │ |
|               | │         group_by:         │ |
|               | │        SearchPhrase       │ |
|               | │                           │ |
|               | │           mode:           │ |
|               | │      FinalPartitioned     │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │    CoalesceBatchesExec    │ |
|               | │    --------------------   │ |
|               | │     target_batch_size:    │ |
|               | │            8192           │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │      RepartitionExec      │ |
|               | │    --------------------   │ |
|               | │ partition_count(in->out): │ |
|               | │          14 -> 14         │ |
|               | │                           │ |
|               | │    partitioning_scheme:   │ |
|               | │ Hash([SearchPhrase@0], 14)│ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       AggregateExec       │ |
|               | │    --------------------   │ |
|               | │           aggr:           │ |
|               | │  min(hits.URL), min(hits  │ |
|               | │     .Title), count(1),    │ |
|               | │     count(DISTINCT hits   │ |
|               | │          .UserID)         │ |
|               | │                           │ |
|               | │         group_by:         │ |
|               | │        SearchPhrase       │ |
|               | │                           │ |
|               | │       mode: Partial       │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │    CoalesceBatchesExec    │ |
|               | │    --------------------   │ |
|               | │     target_batch_size:    │ |
|               | │            8192           │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │         FilterExec        │ |
|               | │    --------------------   │ |
|               | │         predicate:        │ |
|               | │  Title LIKE %Google% AND  │ |
|               | │    URL NOT LIKE %.google  │ |
|               | │   .% AND SearchPhrase !=  │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       DataSourceExec      │ |
|               | │    --------------------   │ |
|               | │         files: 113        │ |
|               | │      format: parquet      │ |
|               | │                           │ |
|               | │         predicate:        │ |
|               | │  Title LIKE %Google% AND  │ |
|               | │    URL NOT LIKE %.google  │ |
|               | │   .% AND SearchPhrase !=  │ |
|               | └───────────────────────────┘ |
|               |                               |
+---------------+-------------------------------+

alamb · 2025-09-16T20:14:15Z

Will rerun to see if we can reproduce the results

alamb · 2025-09-16T20:39:24Z

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize-for-unique-groups (a052b39) to 986cfcd diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

alamb · 2025-09-16T21:37:59Z

🤖: Benchmark completed

Details

Comparing HEAD and optimize-for-unique-groups
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ optimize-for-unique-groups ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  2779.99 ms │                 2744.00 ms │ no change │
│ QQuery 1     │  1377.86 ms │                 1406.89 ms │ no change │
│ QQuery 2     │  2575.17 ms │                 2592.62 ms │ no change │
│ QQuery 3     │  1147.53 ms │                 1177.08 ms │ no change │
│ QQuery 4     │  2298.05 ms │                 2276.63 ms │ no change │
│ QQuery 5     │ 27332.48 ms │                27435.99 ms │ no change │
│ QQuery 6     │  4262.29 ms │                 4214.97 ms │ no change │
│ QQuery 7     │  3712.11 ms │                 3566.74 ms │ no change │
└──────────────┴─────────────┴────────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                         ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                         │ 45485.47ms │
│ Total Time (optimize-for-unique-groups)   │ 45414.93ms │
│ Average Time (HEAD)                       │  5685.68ms │
│ Average Time (optimize-for-unique-groups) │  5676.87ms │
│ Queries Faster                            │          0 │
│ Queries Slower                            │          0 │
│ Queries with No Change                    │          8 │
│ Queries with Failure                      │          0 │
└───────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ optimize-for-unique-groups ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.16 ms │                    2.41 ms │  1.12x slower │
│ QQuery 1     │    51.32 ms │                   52.64 ms │     no change │
│ QQuery 2     │   134.05 ms │                  138.09 ms │     no change │
│ QQuery 3     │   164.53 ms │                  168.08 ms │     no change │
│ QQuery 4     │  1059.15 ms │                 1069.04 ms │     no change │
│ QQuery 5     │  1492.03 ms │                 1512.52 ms │     no change │
│ QQuery 6     │     2.20 ms │                    2.19 ms │     no change │
│ QQuery 7     │    56.75 ms │                   55.48 ms │     no change │
│ QQuery 8     │  1447.33 ms │                 1506.52 ms │     no change │
│ QQuery 9     │  1765.67 ms │                 1840.73 ms │     no change │
│ QQuery 10    │   389.93 ms │                  394.80 ms │     no change │
│ QQuery 11    │   445.84 ms │                  443.51 ms │     no change │
│ QQuery 12    │  1350.02 ms │                 1435.38 ms │  1.06x slower │
│ QQuery 13    │  2139.82 ms │                 2249.94 ms │  1.05x slower │
│ QQuery 14    │  1248.48 ms │                 1299.54 ms │     no change │
│ QQuery 15    │  1209.39 ms │                 1224.94 ms │     no change │
│ QQuery 16    │  2614.64 ms │                 2718.59 ms │     no change │
│ QQuery 17    │  2690.80 ms │                 2677.02 ms │     no change │
│ QQuery 18    │  5640.97 ms │                 5329.08 ms │ +1.06x faster │
│ QQuery 19    │   130.40 ms │                  129.25 ms │     no change │
│ QQuery 20    │  2030.74 ms │                 2033.11 ms │     no change │
│ QQuery 21    │  2357.74 ms │                 2359.23 ms │     no change │
│ QQuery 22    │  4106.57 ms │                 3974.35 ms │     no change │
│ QQuery 23    │ 15305.21 ms │                12909.29 ms │ +1.19x faster │
│ QQuery 24    │   211.86 ms │                  224.01 ms │  1.06x slower │
│ QQuery 25    │   506.21 ms │                  515.96 ms │     no change │
│ QQuery 26    │   228.53 ms │                  227.99 ms │     no change │
│ QQuery 27    │  2894.44 ms │                 2928.74 ms │     no change │
│ QQuery 28    │ 25113.46 ms │                24712.70 ms │     no change │
│ QQuery 29    │   968.04 ms │                  989.37 ms │     no change │
│ QQuery 30    │  1339.26 ms │                 1376.65 ms │     no change │
│ QQuery 31    │  1354.96 ms │                 1390.03 ms │     no change │
│ QQuery 32    │  5189.73 ms │                 4835.51 ms │ +1.07x faster │
│ QQuery 33    │  6139.17 ms │                 6037.26 ms │     no change │
│ QQuery 34    │  6252.61 ms │                 6170.29 ms │     no change │
│ QQuery 35    │  2096.93 ms │                 2093.04 ms │     no change │
│ QQuery 36    │   120.12 ms │                  122.87 ms │     no change │
│ QQuery 37    │    54.44 ms │                   54.08 ms │     no change │
│ QQuery 38    │   126.87 ms │                  122.23 ms │     no change │
│ QQuery 39    │   207.68 ms │                  200.99 ms │     no change │
│ QQuery 40    │    43.55 ms │                   45.15 ms │     no change │
│ QQuery 41    │    39.51 ms │                   39.51 ms │     no change │
│ QQuery 42    │    35.52 ms │                   33.72 ms │ +1.05x faster │
└──────────────┴─────────────┴────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary                         ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (HEAD)                         │ 100758.66ms │
│ Total Time (optimize-for-unique-groups)   │  97645.83ms │
│ Average Time (HEAD)                       │   2343.22ms │
│ Average Time (optimize-for-unique-groups) │   2270.83ms │
│ Queries Faster                            │           4 │
│ Queries Slower                            │           4 │
│ Queries with No Change                    │          35 │
│ Queries with Failure                      │           0 │
└───────────────────────────────────────────┴─────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ optimize-for-unique-groups ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 208.80 ms │                  178.99 ms │ +1.17x faster │
│ QQuery 2     │  31.58 ms │                   27.56 ms │ +1.15x faster │
│ QQuery 3     │  45.11 ms │                   46.61 ms │     no change │
│ QQuery 4     │  27.17 ms │                   27.46 ms │     no change │
│ QQuery 5     │  73.71 ms │                   74.51 ms │     no change │
│ QQuery 6     │  19.59 ms │                   20.09 ms │     no change │
│ QQuery 7     │ 151.52 ms │                  155.61 ms │     no change │
│ QQuery 8     │  31.41 ms │                   33.64 ms │  1.07x slower │
│ QQuery 9     │  83.51 ms │                   86.70 ms │     no change │
│ QQuery 10    │  57.83 ms │                   58.84 ms │     no change │
│ QQuery 11    │  41.13 ms │                   41.17 ms │     no change │
│ QQuery 12    │  52.54 ms │                   50.77 ms │     no change │
│ QQuery 13    │  47.04 ms │                   46.79 ms │     no change │
│ QQuery 14    │  13.80 ms │                   14.10 ms │     no change │
│ QQuery 15    │  24.12 ms │                   24.34 ms │     no change │
│ QQuery 16    │  24.57 ms │                   24.54 ms │     no change │
│ QQuery 17    │ 153.88 ms │                  149.93 ms │     no change │
│ QQuery 18    │ 337.89 ms │                  333.17 ms │     no change │
│ QQuery 19    │  37.47 ms │                   36.62 ms │     no change │
│ QQuery 20    │  49.28 ms │                   48.80 ms │     no change │
│ QQuery 21    │ 223.68 ms │                  227.82 ms │     no change │
│ QQuery 22    │  20.37 ms │                   19.97 ms │     no change │
└──────────────┴───────────┴────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                         ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                         │ 1756.00ms │
│ Total Time (optimize-for-unique-groups)   │ 1728.02ms │
│ Average Time (HEAD)                       │   79.82ms │
│ Average Time (optimize-for-unique-groups) │   78.55ms │
│ Queries Faster                            │         2 │
│ Queries Slower                            │         1 │
│ Queries with No Change                    │        19 │
│ Queries with Failure                      │         0 │
└───────────────────────────────────────────┴───────────┘

alamb · 2025-09-16T21:38:02Z

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize-for-unique-groups (a052b39) to 986cfcd diff using: clickbench_extended
Results will be posted here when complete

alamb · 2025-09-16T21:45:48Z

🤖: Benchmark completed

Details

Comparing HEAD and optimize-for-unique-groups
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ optimize-for-unique-groups ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │  2766.44 ms │                 2697.50 ms │     no change │
│ QQuery 1     │  1440.11 ms │                 1354.09 ms │ +1.06x faster │
│ QQuery 2     │  2574.85 ms │                 2524.95 ms │     no change │
│ QQuery 3     │  1185.21 ms │                 1134.59 ms │     no change │
│ QQuery 4     │  2290.05 ms │                 2271.87 ms │     no change │
│ QQuery 5     │ 27649.57 ms │                27470.27 ms │     no change │
│ QQuery 6     │  4192.77 ms │                 4178.30 ms │     no change │
│ QQuery 7     │  3644.90 ms │                 3554.50 ms │     no change │
└──────────────┴─────────────┴────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                         ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                         │ 45743.90ms │
│ Total Time (optimize-for-unique-groups)   │ 45186.07ms │
│ Average Time (HEAD)                       │  5717.99ms │
│ Average Time (optimize-for-unique-groups) │  5648.26ms │
│ Queries Faster                            │          1 │
│ Queries Slower                            │          0 │
│ Queries with No Change                    │          7 │
│ Queries with Failure                      │          0 │
└───────────────────────────────────────────┴────────────┘

rluvaton · 2025-09-16T22:12:52Z

Most of the benefit would come from bytes, but all bytes are treated as views by default when parsing from SQL due to:

feat: mapping sql Char/Text/String default to Utf8View #16290

so no query would show that unless `map_string_types_to_utf8view` is disabled.

this is done to make sure we don't allocate the indices again when it is not supported

rluvaton · 2025-09-17T09:32:39Z

datafusion/physical-plan/src/aggregates/group_values/multi_group_by/mod.rs

+    /// Whether this builder supports [`Self::append_array_slice`] optimization
+    /// In case it returns true, [`Self::append_array_slice`] must be implemented
+    fn support_append_array_slice(&self) -> bool {
+        false
+    }
+
+    /// Append slice of values from `array`, starting at `start` for `length` rows
+    ///
+    /// This is a special case of `vectorized_append` when the rows are continuous
+    ///
+    /// You should implement this to optimize large copies of contiguous values.
+    ///
+    /// This does not get the sliced array even though it would be more user-friendly
+    /// to allow optimization that avoid the additional computation that can happen in a slice
+    ///
+    /// Note: in order for this to be used, [`Self::support_append_array_slice`] must return true
+    fn append_array_slice(
+        &mut self,
+        _array: &ArrayRef,
+        _start: usize,
+        _length: usize,
+    ) -> Result<()> {
+        assert!(!self.support_append_array_slice(), "support_append_array_slice() return true while append_array_slice() is not implemented");
+        not_impl_err!(
+            "append_array_slice is not implemented for this GroupColumn, please implement it as well as support_append_array_slice"
+        )
+    }


Originally I did not have `support_append_array_slice`, and `append_array_slice` had this default implementation:

    fn append_array_slice(
        &mut self,
        array: &ArrayRef,
        start: usize,
        length: usize,
    ) -> Result<()> {
        let rows = (start..start + length).collect::<Vec<_>>();
        self.vectorized_append(array, &rows)
    }

but I moved away from it to avoid the extra allocation, since we already hold `append_row_indices`, and I saw in the benchmarks that it can be slower for implementations that do not support the optimization.
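The trade-off described above can be illustrated with a minimal sketch. It is not the PR's code: `GroupColumnSketch`, `PrimitiveBuilder`, `FallbackBuilder` and `append_rows` are hypothetical names, and plain `&[i64]` slices stand in for Arrow arrays. It only shows why an opt-in flag lets the caller reuse the indices it already holds instead of rebuilding them inside a default implementation.

```rust
/// Simplified stand-in for a group column builder (hypothetical, not the DataFusion trait).
trait GroupColumnSketch {
    /// Append the values found at the given row indices, one by one.
    fn vectorized_append(&mut self, array: &[i64], rows: &[usize]);

    /// Opt-in flag: return true only when `append_slice` is overridden.
    fn supports_append_slice(&self) -> bool {
        false
    }

    /// Append `length` contiguous values starting at `start`.
    fn append_slice(&mut self, _array: &[i64], _start: usize, _length: usize) {
        unimplemented!("append_slice is not supported by this builder")
    }
}

/// Builder with a fast path: contiguous rows are copied in one call.
struct PrimitiveBuilder {
    values: Vec<i64>,
}

impl GroupColumnSketch for PrimitiveBuilder {
    fn vectorized_append(&mut self, array: &[i64], rows: &[usize]) {
        self.values.extend(rows.iter().map(|&row| array[row]));
    }

    fn supports_append_slice(&self) -> bool {
        true
    }

    fn append_slice(&mut self, array: &[i64], start: usize, length: usize) {
        self.values.extend_from_slice(&array[start..start + length]);
    }
}

/// Builder without a fast path: it keeps the trait defaults.
struct FallbackBuilder {
    values: Vec<i64>,
}

impl GroupColumnSketch for FallbackBuilder {
    fn vectorized_append(&mut self, array: &[i64], rows: &[usize]) {
        self.values.extend(rows.iter().map(|&row| array[row]));
    }
}

/// Caller-side dispatch: the slice path is taken only when the builder opts in,
/// so unsupported builders reuse `rows` without allocating a new index vector.
fn append_rows<B: GroupColumnSketch>(
    builder: &mut B,
    array: &[i64],
    rows: &[usize],
    consecutive: bool,
) {
    if consecutive && !rows.is_empty() && builder.supports_append_slice() {
        builder.append_slice(array, rows[0], rows.len());
    } else {
        builder.vectorized_append(array, rows);
    }
}
```

Both builders produce the same output for the same rows; only the copy strategy differs.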

alamb · 2025-09-17T22:02:24Z

I ran out of time for a thorough review today -- I need to do this one with a clean 🧠 in the morning

alamb · 2025-09-18T15:16:36Z

Second run shows no real change for Query 22 🤔

│ QQuery 22 │ 4106.57 ms │ 3974.35 ms │ no change │

alamb · 2025-09-18T16:01:24Z

I see that the original PR that added it run clickbench but I don't know if clickbench have a lot of unique groups:

Avoid RowConverter for multi column grouping (10% faster clickbench queries) #12269

I tried one query that has a large number of groups:

SELECT "UserID", "SearchPhrase", COUNT(*)
FROM '/Users/andrewlamb/Software/datafusion/benchmarks/data/hits.parquet'
GROUP BY "UserID", "SearchPhrase"
ORDER BY COUNT(*) DESC
LIMIT 10;

and it didn't seem to show any difference

alamb

Thank you @rluvaton

I reviewed this code carefully and it makes a lot of sense to me. In general I think it is almost ready to merge.

I also verified that the new fuzz test covers the newly added code

The only thing I think is required is some sort of benchmark that shows this actually improves performance in some case (there are more guidelines here)

In general, the performance improvement from a change should be “enough” to justify any added code complexity. How much is “enough” is a judgement made by the committers, but generally means that the improvement should be noticeable in a real-world scenario and is greater than the noise of the benchmarking system.

Do you have any ideas / ways to create one? Perhaps a SELECT DISTINCT query with some strings 🤔 ?

alamb · 2025-09-18T15:38:04Z

datafusion/physical-plan/src/aggregates/group_values/multi_group_by/mod.rs

    /// The `vectorized append` row indices buffer
    append_row_indices: Vec<usize>,

+    /// If all the values in `append_row_indices` are consecutive


Suggested change

-    /// If all the values in `append_row_indices` are consecutive
+    /// If all the values in `append_row_indices` are consecutive.
+    /// This is updated by [`Self::add_append_row_index`]

alamb · 2025-09-18T15:41:00Z

datafusion/physical-plan/src/aggregates/group_values/multi_group_by/mod.rs

-                col,
-                &self.vectorized_operation_buffers.append_row_indices,
-            )?;
+        if self


I suggest putting this check into a function with a descriptive name, something like

if let Some((start, length)) = self.consecutive_row_indices() { ... } else { ... }

I think that would make the intent clearer

But this is not necessary for this PR
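One way such a helper could look is sketched below. This is an illustration only: `AppendBuffers` and the `all_consecutive` field are hypothetical names, and the sketch assumes the flag is maintained incrementally as indices are pushed, in the spirit of `add_append_row_index`.

```rust
/// Hypothetical stand-in for the buffers that hold `append_row_indices`.
struct AppendBuffers {
    append_row_indices: Vec<usize>,
    /// True while every pushed index is exactly one more than the previous one.
    all_consecutive: bool,
}

impl AppendBuffers {
    fn new() -> Self {
        Self {
            append_row_indices: Vec::new(),
            all_consecutive: true,
        }
    }

    /// Reset the buffers before processing the next batch.
    fn clear(&mut self) {
        self.append_row_indices.clear();
        self.all_consecutive = true;
    }

    /// Record a row to append and update the consecutive flag.
    fn add_append_row_index(&mut self, row: usize) {
        if let Some(&last) = self.append_row_indices.last() {
            self.all_consecutive &= last + 1 == row;
        }
        self.append_row_indices.push(row);
    }

    /// Return `(start, length)` when the recorded indices form one contiguous run,
    /// and `None` when they are empty or have gaps.
    fn consecutive_row_indices(&self) -> Option<(usize, usize)> {
        let first = *self.append_row_indices.first()?;
        self.all_consecutive
            .then_some((first, self.append_row_indices.len()))
    }
}
```

Returning an `Option<(usize, usize)>` keeps the caller to a single `if let`, with the fallback to the index-based append in the `else` branch.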

alamb · 2025-10-03T15:59:29Z

Any update on the benchmark, or on a query for which this improves performance?

rluvaton · 2025-10-08T11:35:35Z

Any update on the benchmark, or on a query for which this improves performance?

Sorry, we are in a holiday season; creating one now

Dandandan · 2025-10-08T12:17:47Z

datafusion/physical-plan/src/aggregates/group_values/multi_group_by/mod.rs

        self.remaining_row_indices.clear();
    }
+
+    fn add_append_row_index(&mut self, row: usize) {


It would probably be faster and somewhat cleaner to do this in a single check on append_row_indices in vectorized_append,

something like:

append_row_indices.windows(2).all(|w| w[0] + 1 == w[1])

will do.

I think even better, it should also be possible to just check the length of append_row_indices. If this is equal to the size of the incoming batch, all of the indices should be consecutive already (i.e. all values are unique), so it becomes a really cheap check.

I think even better, it should also be possible to just check the length of append_row_indices. If this is equal to the size of the incoming batch, all of the indices should be consecutive already (i.e. all values are unique), so it becomes a really cheap check.

But what if we only need to add the last half of the column? Then the length would not be equal to the batch size, but the indices would still be consecutive.
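The two checks discussed in this thread can be compared with a small sketch. The function names are hypothetical, and the length-only check relies on an assumption that is not stated in the thread: indices are recorded in increasing row order and each row at most once.

```rust
/// Single-pass check: every index is exactly one more than the previous one.
fn is_consecutive(indices: &[usize]) -> bool {
    indices.windows(2).all(|pair| pair[0] + 1 == pair[1])
}

/// Length-only check: true when every row of the incoming batch is appended.
/// Under the assumption above, this implies the indices are 0..batch_len.
fn covers_whole_batch(indices: &[usize], batch_len: usize) -> bool {
    indices.len() == batch_len
}
```

For a batch of 8 rows where only rows 4 to 7 are new groups, `is_consecutive(&[4, 5, 6, 7])` holds while `covers_whole_batch(&[4, 5, 6, 7], 8)` does not, which is the case raised in the reply above: the length check is cheaper but misses contiguous runs that do not span the whole batch.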

Also added a benchmark for testing pure grouping performance with more than one column.

----

I ran this query on the data:

```
SELECT
    COUNT(*) AS total_count,
    COUNT(DISTINCT u64_wide) AS unique_count,
    COUNT(DISTINCT u64_wide) * 1.0 / COUNT(*) AS cardinality
FROM t;
```

Before:

```
| total_count | unique_count | cardinality |
| ----------- | ------------ | ----------- |
| 65536       | 2048         | 0.03125     |
```

After:

```
| total_count | unique_count | cardinality |
| ----------- | ------------ | ----------- |
| 65536       | 65536        | 1.0         |
```

github-actions bot added the physical-plan Changes to the physical-plan crate label Sep 16, 2025

Merge branch 'main' into optimize-for-unique-groups

62ebc86

alamb added the performance Make DataFusion faster label Sep 16, 2025

ashdnazg reviewed Sep 16, 2025

datafusion/physical-plan/src/aggregates/group_values/multi_group_by/mod.rs (outdated)

ashdnazg reviewed Sep 16, 2025

datafusion/physical-plan/src/aggregates/group_values/multi_group_by/mod.rs (outdated)

rluvaton added 2 commits September 16, 2025 16:04

updated based on cr

3133196

update comment

00b5f07

rluvaton added 2 commits September 16, 2025 22:29

added fuzz test to test all values are unique in aggregate group by

9476184

Merge branch 'main' into optimize-for-unique-groups

bbc62dc

github-actions bot added the core Core DataFusion crate label Sep 16, 2025

added comment

a052b39

rluvaton added 2 commits September 17, 2025 12:22

don't always call append_array_slice if it is not supported

390bd68

this is done to make sure we don't allocate the indices again when it is not supported

add comment

0a78ede

rluvaton commented Sep 17, 2025

alamb reviewed Sep 18, 2025

rluvaton added 4 commits September 18, 2025 21:31

add benchmark

c1bcb17

try optimization

cc2e725

Revert "try optimization"

35c5a02

This reverts commit cc2e725.

reserve and use extend

093b069

rluvaton added 4 commits October 8, 2025 14:35

Merge branch 'refs/heads/main' into optimize-for-unique-groups

aa56172

Merge branch 'main' into optimize-for-unique-groups

cbd7fc0

support append_array_slice for boolean group values

1ce9617

avoid push and use extend instead to make the compiler vectorize the …

045020e

…offset extending

Dandandan reviewed Oct 8, 2025

rluvaton added 9 commits October 8, 2025 16:05

Merge branch 'main' into fix-benchmark-value-generation

42df966

Merge branch 'fix-benchmark-value-generation' into optimize-for-uniqu…

e690921

…e-groups

format

155868d

Merge branch 'fix-benchmark-value-generation' into optimize-for-uniqu…

ff1a744

…e-groups

added multi group by benchmark on primitive only columns

f4d0373

Merge branch 'fix-benchmark-value-generation' into optimize-for-uniqu…

8f2974d

…e-groups

Merge branch 'main' into optimize-for-unique-groups

c9cbe59

optimize bytes

7d8db1a
