@rluvaton
Member

@rluvaton rluvaton commented Sep 16, 2025

Which issue does this PR close?

N/A

Rationale for this change

I want grouping to be fast when there are many group-by columns and many unique groups.

What changes are included in this PR?

This optimization is fairly simple:
if the row indices to append are consecutive (i.e. `append_row_indices[i] + 1 == append_row_indices[i + 1]`), we call a function optimized for that case.

The optimized function copies all the data in a single pass, which should make it much faster than appending item by item.
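
A minimal sketch of the idea, not the PR's actual code: `is_contiguous` and `append_rows` are hypothetical names, and a plain `Vec<i64>` stands in for a group-values builder. When the indices form one consecutive run, the values are copied with a single `extend_from_slice`; otherwise the per-row path is used.

```rust
/// Returns true when every index is exactly one more than the previous one.
/// (Hypothetical helper; the name is not taken from the PR.)
fn is_contiguous(indices: &[usize]) -> bool {
    indices.windows(2).all(|w| w[0] + 1 == w[1])
}

/// Appends the selected rows of `source` to `dest`.
/// A plain `Vec<i64>` stands in for a real group-values builder here.
fn append_rows(dest: &mut Vec<i64>, source: &[i64], indices: &[usize]) {
    match indices.first() {
        // Fast path: one consecutive run, so copy it as a single slice.
        Some(&start) if is_contiguous(indices) => {
            dest.extend_from_slice(&source[start..start + indices.len()]);
        }
        // Slow path (also covers empty input): gather rows one at a time.
        _ => dest.extend(indices.iter().map(|&i| source[i])),
    }
}
```

For example, appending indices `[1, 2, 3]` from `[10, 20, 30, 40, 50]` takes the fast path and appends `[20, 30, 40]`, while `[4, 0]` falls back to the per-row path.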

I did not implement this for byte views for now, as I don't think it would be very beneficial: there is no single chunk of data that we can copy once and be done with.

I also added fuzz tests for grouping on multiple columns, where each column I tested uses a different optimized implementation. All rows in the test are unique, so we assert that we get the same output.

Are these changes tested?

Yes. I also tried inserting bugs (manual mutation testing) to check that the tests are solid.

Are there any user-facing changes?

Nope

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Sep 16, 2025
@rluvaton
Member Author

rluvaton commented Sep 16, 2025

@alamb Are there any benchmarks you can run that exercise this flow?

I only see the following, but it benchmarks byte views, which is the only non-optimized case:
https://github.com/apache/datafusion/blob/49d49fd92dddf55bfb22787fea17dda1a698dc4d/datafusion/physical-plan/benches/aggregate_vectorized.rs

I see that the original PR that added it ran ClickBench, but I don't know whether ClickBench has a lot of unique groups:

@alamb
Contributor

alamb commented Sep 16, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize-for-unique-groups (62ebc86) to 49d49fd diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb
Contributor

alamb commented Sep 16, 2025

> @alamb Are there any benchmarks you can run that exercise this flow?
>
> I only see the following, but it benchmarks byte views, which is the only non-optimized case: https://github.com/apache/datafusion/blob/49d49fd92dddf55bfb22787fea17dda1a698dc4d/datafusion/physical-plan/benches/aggregate_vectorized.rs
>
> I see that the original PR that added it ran ClickBench, but I don't know whether ClickBench has a lot of unique groups:

Several of the ClickBench queries have many distinct groups so hopefully that will cover it. I kicked off the run and will check back shortly.

Thank you @rluvaton -- this PR sounds quite cool.

@alamb alamb added the performance Make DataFusion faster label Sep 16, 2025
@alamb
Contributor

alamb commented Sep 16, 2025

🤖: Benchmark completed

Details

Comparing HEAD and optimize-for-unique-groups
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ optimize-for-unique-groups ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  2722.03 ms │                 2794.77 ms │ no change │
│ QQuery 1     │  1409.38 ms │                 1440.73 ms │ no change │
│ QQuery 2     │  2566.38 ms │                 2651.74 ms │ no change │
│ QQuery 3     │  1129.65 ms │                 1175.43 ms │ no change │
│ QQuery 4     │  2241.61 ms │                 2290.29 ms │ no change │
│ QQuery 5     │ 27576.04 ms │                27538.89 ms │ no change │
│ QQuery 6     │  4173.47 ms │                 4199.45 ms │ no change │
│ QQuery 7     │  3618.47 ms │                 3530.42 ms │ no change │
└──────────────┴─────────────┴────────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                         ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                         │ 45437.03ms │
│ Total Time (optimize-for-unique-groups)   │ 45621.73ms │
│ Average Time (HEAD)                       │  5679.63ms │
│ Average Time (optimize-for-unique-groups) │  5702.72ms │
│ Queries Faster                            │          0 │
│ Queries Slower                            │          0 │
│ Queries with No Change                    │          8 │
│ Queries with Failure                      │          0 │
└───────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ optimize-for-unique-groups ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.28 ms │                    2.55 ms │  1.12x slower │
│ QQuery 1     │    52.32 ms │                   53.72 ms │     no change │
│ QQuery 2     │   139.49 ms │                  135.95 ms │     no change │
│ QQuery 3     │   164.84 ms │                  168.49 ms │     no change │
│ QQuery 4     │  1197.55 ms │                 1144.75 ms │     no change │
│ QQuery 5     │  1637.89 ms │                 1537.58 ms │ +1.07x faster │
│ QQuery 6     │     2.29 ms │                    2.33 ms │     no change │
│ QQuery 7     │    57.34 ms │                   56.23 ms │     no change │
│ QQuery 8     │  1673.75 ms │                 1570.60 ms │ +1.07x faster │
│ QQuery 9     │  1908.51 ms │                 1898.76 ms │     no change │
│ QQuery 10    │   385.82 ms │                  401.86 ms │     no change │
│ QQuery 11    │   450.09 ms │                  443.56 ms │     no change │
│ QQuery 12    │  1517.20 ms │                 1468.64 ms │     no change │
│ QQuery 13    │  2314.56 ms │                 2240.68 ms │     no change │
│ QQuery 14    │  1366.00 ms │                 1311.77 ms │     no change │
│ QQuery 15    │  1383.56 ms │                 1285.60 ms │ +1.08x faster │
│ QQuery 16    │  2786.81 ms │                 2755.46 ms │     no change │
│ QQuery 17    │  2773.44 ms │                 2707.80 ms │     no change │
│ QQuery 18    │  5778.64 ms │                 5174.39 ms │ +1.12x faster │
│ QQuery 19    │   131.26 ms │                  126.86 ms │     no change │
│ QQuery 20    │  2066.77 ms │                 1955.42 ms │ +1.06x faster │
│ QQuery 21    │  2401.34 ms │                 2302.42 ms │     no change │
│ QQuery 22    │  6141.70 ms │                 3904.40 ms │ +1.57x faster │
│ QQuery 23    │ 12899.35 ms │                12909.65 ms │     no change │
│ QQuery 24    │   228.55 ms │                  217.73 ms │     no change │
│ QQuery 25    │   505.93 ms │                  517.77 ms │     no change │
│ QQuery 26    │   220.93 ms │                  214.14 ms │     no change │
│ QQuery 27    │  2869.58 ms │                 2812.20 ms │     no change │
│ QQuery 28    │ 23375.69 ms │                22868.16 ms │     no change │
│ QQuery 29    │   977.07 ms │                  969.30 ms │     no change │
│ QQuery 30    │  1471.08 ms │                 1351.87 ms │ +1.09x faster │
│ QQuery 31    │  1438.58 ms │                 1327.67 ms │ +1.08x faster │
│ QQuery 32    │  4834.82 ms │                 5019.46 ms │     no change │
│ QQuery 33    │  6235.27 ms │                 5806.49 ms │ +1.07x faster │
│ QQuery 34    │  6331.78 ms │                 5857.93 ms │ +1.08x faster │
│ QQuery 35    │  2288.27 ms │                 2069.91 ms │ +1.11x faster │
│ QQuery 36    │   126.56 ms │                  119.79 ms │ +1.06x faster │
│ QQuery 37    │    54.04 ms │                   54.19 ms │     no change │
│ QQuery 38    │   125.03 ms │                  121.10 ms │     no change │
│ QQuery 39    │   207.64 ms │                  195.97 ms │ +1.06x faster │
│ QQuery 40    │    45.03 ms │                   45.48 ms │     no change │
│ QQuery 41    │    43.67 ms │                   41.54 ms │     no change │
│ QQuery 42    │    35.11 ms │                   31.34 ms │ +1.12x faster │
└──────────────┴─────────────┴────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary                         ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (HEAD)                         │ 100647.45ms │
│ Total Time (optimize-for-unique-groups)   │  95201.52ms │
│ Average Time (HEAD)                       │   2340.64ms │
│ Average Time (optimize-for-unique-groups) │   2213.99ms │
│ Queries Faster                            │          14 │
│ Queries Slower                            │           1 │
│ Queries with No Change                    │          28 │
│ Queries with Failure                      │           0 │
└───────────────────────────────────────────┴─────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ optimize-for-unique-groups ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 179.03 ms │                  169.45 ms │ +1.06x faster │
│ QQuery 2     │  26.49 ms │                   25.82 ms │     no change │
│ QQuery 3     │  45.92 ms │                   45.74 ms │     no change │
│ QQuery 4     │  26.90 ms │                   26.87 ms │     no change │
│ QQuery 5     │  75.99 ms │                   76.58 ms │     no change │
│ QQuery 6     │  19.58 ms │                   19.50 ms │     no change │
│ QQuery 7     │ 150.54 ms │                  158.20 ms │  1.05x slower │
│ QQuery 8     │  34.10 ms │                   31.76 ms │ +1.07x faster │
│ QQuery 9     │  88.95 ms │                   89.12 ms │     no change │
│ QQuery 10    │  59.43 ms │                   59.19 ms │     no change │
│ QQuery 11    │  41.45 ms │                   41.67 ms │     no change │
│ QQuery 12    │  50.83 ms │                   51.81 ms │     no change │
│ QQuery 13    │  48.98 ms │                   46.55 ms │     no change │
│ QQuery 14    │  13.59 ms │                   14.11 ms │     no change │
│ QQuery 15    │  24.65 ms │                   24.83 ms │     no change │
│ QQuery 16    │  24.46 ms │                   25.42 ms │     no change │
│ QQuery 17    │ 150.75 ms │                  162.26 ms │  1.08x slower │
│ QQuery 18    │ 332.04 ms │                  337.59 ms │     no change │
│ QQuery 19    │  37.18 ms │                   36.85 ms │     no change │
│ QQuery 20    │  50.63 ms │                   51.92 ms │     no change │
│ QQuery 21    │ 230.26 ms │                  231.37 ms │     no change │
│ QQuery 22    │  20.40 ms │                   20.51 ms │     no change │
└──────────────┴───────────┴────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                         ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                         │ 1732.14ms │
│ Total Time (optimize-for-unique-groups)   │ 1747.09ms │
│ Average Time (HEAD)                       │   78.73ms │
│ Average Time (optimize-for-unique-groups) │   79.41ms │
│ Queries Faster                            │         2 │
│ Queries Slower                            │         2 │
│ Queries with No Change                    │        18 │
│ Queries with Failure                      │         0 │
└───────────────────────────────────────────┴───────────┘

@alamb
Contributor

alamb commented Sep 16, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize-for-unique-groups (00b5f07) to 49d49fd diff using: clickbench_extended
Results will be posted here when complete

@alamb
Contributor

alamb commented Sep 16, 2025

🤖: Benchmark completed

Details

Comparing HEAD and optimize-for-unique-groups
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ optimize-for-unique-groups ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  2701.79 ms │                 2755.58 ms │ no change │
│ QQuery 1     │  1428.12 ms │                 1423.31 ms │ no change │
│ QQuery 2     │  2565.84 ms │                 2612.89 ms │ no change │
│ QQuery 3     │  1178.71 ms │                 1218.91 ms │ no change │
│ QQuery 4     │  2229.11 ms │                 2294.06 ms │ no change │
│ QQuery 5     │ 28010.48 ms │                27439.24 ms │ no change │
│ QQuery 6     │  4203.36 ms │                 4244.84 ms │ no change │
│ QQuery 7     │  3458.36 ms │                 3372.46 ms │ no change │
└──────────────┴─────────────┴────────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                         ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                         │ 45775.76ms │
│ Total Time (optimize-for-unique-groups)   │ 45361.29ms │
│ Average Time (HEAD)                       │  5721.97ms │
│ Average Time (optimize-for-unique-groups) │  5670.16ms │
│ Queries Faster                            │          0 │
│ Queries Slower                            │          0 │
│ Queries with No Change                    │          8 │
│ Queries with Failure                      │          0 │
└───────────────────────────────────────────┴────────────┘

@github-actions github-actions bot added the core Core DataFusion crate label Sep 16, 2025
@rluvaton
Member Author

@alamb I added fuzz tests, and this PR is ready for review

@rluvaton
Member Author

@jayzhan211 would love your review as well

@rluvaton
Member Author

rluvaton commented Sep 16, 2025

I might be missing something, but the benchmark shows that q22 is now much faster:

Benchmark clickbench_partitioned.json

...

│ QQuery 22    │  6141.70 ms │                 3904.40 ms │ +1.57x faster │

But q22 has a single group-by column, so it should not go through the path I just changed:

SELECT
    SearchPhrase,
    MIN(URL),
    MIN(Title),
    COUNT(*) AS c,
    COUNT(DISTINCT UserID)
FROM hits
WHERE
    Title LIKE '%Google%' AND
    URL NOT LIKE '%.google.%' AND
    SearchPhrase <> ''
GROUP BY SearchPhrase
ORDER BY c DESC
LIMIT 10;

I verified that the code path was not used by adding a `panic!`. This is the plan, by the way:

+---------------+-------------------------------+
| plan_type     | plan                          |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
|               | │  SortPreservingMergeExec  │ |
|               | │    --------------------   │ |
|               | │      c DESClimit: 10      │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       SortExec(TopK)      │ |
|               | │    --------------------   │ |
|               | │          c@3 DESC         │ |
|               | │                           │ |
|               | │         limit: 10         │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       ProjectionExec      │ |
|               | │    --------------------   │ |
|               | │       SearchPhrase:       │ |
|               | │        SearchPhrase       │ |
|               | │                           │ |
|               | │     c: count(Int64(1))    │ |
|               | │                           │ |
|               | │ count(DISTINCT hits.UserID│ |
|               | │             ):            │ |
|               | │count(DISTINCT hits.UserID)│ |
|               | │                           │ |
|               | │      min(hits.Title):     │ |
|               | │      min(hits.Title)      │ |
|               | │                           │ |
|               | │       min(hits.URL):      │ |
|               | │       min(hits.URL)       │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       AggregateExec       │ |
|               | │    --------------------   │ |
|               | │           aggr:           │ |
|               | │  min(hits.URL), min(hits  │ |
|               | │     .Title), count(1),    │ |
|               | │     count(DISTINCT hits   │ |
|               | │          .UserID)         │ |
|               | │                           │ |
|               | │         group_by:         │ |
|               | │        SearchPhrase       │ |
|               | │                           │ |
|               | │           mode:           │ |
|               | │      FinalPartitioned     │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │    CoalesceBatchesExec    │ |
|               | │    --------------------   │ |
|               | │     target_batch_size:    │ |
|               | │            8192           │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │      RepartitionExec      │ |
|               | │    --------------------   │ |
|               | │ partition_count(in->out): │ |
|               | │          14 -> 14         │ |
|               | │                           │ |
|               | │    partitioning_scheme:   │ |
|               | │ Hash([SearchPhrase@0], 14)│ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       AggregateExec       │ |
|               | │    --------------------   │ |
|               | │           aggr:           │ |
|               | │  min(hits.URL), min(hits  │ |
|               | │     .Title), count(1),    │ |
|               | │     count(DISTINCT hits   │ |
|               | │          .UserID)         │ |
|               | │                           │ |
|               | │         group_by:         │ |
|               | │        SearchPhrase       │ |
|               | │                           │ |
|               | │       mode: Partial       │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │    CoalesceBatchesExec    │ |
|               | │    --------------------   │ |
|               | │     target_batch_size:    │ |
|               | │            8192           │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │         FilterExec        │ |
|               | │    --------------------   │ |
|               | │         predicate:        │ |
|               | │  Title LIKE %Google% AND  │ |
|               | │    URL NOT LIKE %.google  │ |
|               | │   .% AND SearchPhrase !=  │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       DataSourceExec      │ |
|               | │    --------------------   │ |
|               | │         files: 113        │ |
|               | │      format: parquet      │ |
|               | │                           │ |
|               | │         predicate:        │ |
|               | │  Title LIKE %Google% AND  │ |
|               | │    URL NOT LIKE %.google  │ |
|               | │   .% AND SearchPhrase !=  │ |
|               | └───────────────────────────┘ |
|               |                               |
+---------------+-------------------------------+

@alamb
Contributor

alamb commented Sep 16, 2025

Will rerun to see if we can reproduce the results

@alamb
Contributor

alamb commented Sep 16, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize-for-unique-groups (a052b39) to 986cfcd diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb
Contributor

alamb commented Sep 16, 2025

🤖: Benchmark completed

Details

Comparing HEAD and optimize-for-unique-groups
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ optimize-for-unique-groups ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  2779.99 ms │                 2744.00 ms │ no change │
│ QQuery 1     │  1377.86 ms │                 1406.89 ms │ no change │
│ QQuery 2     │  2575.17 ms │                 2592.62 ms │ no change │
│ QQuery 3     │  1147.53 ms │                 1177.08 ms │ no change │
│ QQuery 4     │  2298.05 ms │                 2276.63 ms │ no change │
│ QQuery 5     │ 27332.48 ms │                27435.99 ms │ no change │
│ QQuery 6     │  4262.29 ms │                 4214.97 ms │ no change │
│ QQuery 7     │  3712.11 ms │                 3566.74 ms │ no change │
└──────────────┴─────────────┴────────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                         ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                         │ 45485.47ms │
│ Total Time (optimize-for-unique-groups)   │ 45414.93ms │
│ Average Time (HEAD)                       │  5685.68ms │
│ Average Time (optimize-for-unique-groups) │  5676.87ms │
│ Queries Faster                            │          0 │
│ Queries Slower                            │          0 │
│ Queries with No Change                    │          8 │
│ Queries with Failure                      │          0 │
└───────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ optimize-for-unique-groups ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.16 ms │                    2.41 ms │  1.12x slower │
│ QQuery 1     │    51.32 ms │                   52.64 ms │     no change │
│ QQuery 2     │   134.05 ms │                  138.09 ms │     no change │
│ QQuery 3     │   164.53 ms │                  168.08 ms │     no change │
│ QQuery 4     │  1059.15 ms │                 1069.04 ms │     no change │
│ QQuery 5     │  1492.03 ms │                 1512.52 ms │     no change │
│ QQuery 6     │     2.20 ms │                    2.19 ms │     no change │
│ QQuery 7     │    56.75 ms │                   55.48 ms │     no change │
│ QQuery 8     │  1447.33 ms │                 1506.52 ms │     no change │
│ QQuery 9     │  1765.67 ms │                 1840.73 ms │     no change │
│ QQuery 10    │   389.93 ms │                  394.80 ms │     no change │
│ QQuery 11    │   445.84 ms │                  443.51 ms │     no change │
│ QQuery 12    │  1350.02 ms │                 1435.38 ms │  1.06x slower │
│ QQuery 13    │  2139.82 ms │                 2249.94 ms │  1.05x slower │
│ QQuery 14    │  1248.48 ms │                 1299.54 ms │     no change │
│ QQuery 15    │  1209.39 ms │                 1224.94 ms │     no change │
│ QQuery 16    │  2614.64 ms │                 2718.59 ms │     no change │
│ QQuery 17    │  2690.80 ms │                 2677.02 ms │     no change │
│ QQuery 18    │  5640.97 ms │                 5329.08 ms │ +1.06x faster │
│ QQuery 19    │   130.40 ms │                  129.25 ms │     no change │
│ QQuery 20    │  2030.74 ms │                 2033.11 ms │     no change │
│ QQuery 21    │  2357.74 ms │                 2359.23 ms │     no change │
│ QQuery 22    │  4106.57 ms │                 3974.35 ms │     no change │
│ QQuery 23    │ 15305.21 ms │                12909.29 ms │ +1.19x faster │
│ QQuery 24    │   211.86 ms │                  224.01 ms │  1.06x slower │
│ QQuery 25    │   506.21 ms │                  515.96 ms │     no change │
│ QQuery 26    │   228.53 ms │                  227.99 ms │     no change │
│ QQuery 27    │  2894.44 ms │                 2928.74 ms │     no change │
│ QQuery 28    │ 25113.46 ms │                24712.70 ms │     no change │
│ QQuery 29    │   968.04 ms │                  989.37 ms │     no change │
│ QQuery 30    │  1339.26 ms │                 1376.65 ms │     no change │
│ QQuery 31    │  1354.96 ms │                 1390.03 ms │     no change │
│ QQuery 32    │  5189.73 ms │                 4835.51 ms │ +1.07x faster │
│ QQuery 33    │  6139.17 ms │                 6037.26 ms │     no change │
│ QQuery 34    │  6252.61 ms │                 6170.29 ms │     no change │
│ QQuery 35    │  2096.93 ms │                 2093.04 ms │     no change │
│ QQuery 36    │   120.12 ms │                  122.87 ms │     no change │
│ QQuery 37    │    54.44 ms │                   54.08 ms │     no change │
│ QQuery 38    │   126.87 ms │                  122.23 ms │     no change │
│ QQuery 39    │   207.68 ms │                  200.99 ms │     no change │
│ QQuery 40    │    43.55 ms │                   45.15 ms │     no change │
│ QQuery 41    │    39.51 ms │                   39.51 ms │     no change │
│ QQuery 42    │    35.52 ms │                   33.72 ms │ +1.05x faster │
└──────────────┴─────────────┴────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary                         ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (HEAD)                         │ 100758.66ms │
│ Total Time (optimize-for-unique-groups)   │  97645.83ms │
│ Average Time (HEAD)                       │   2343.22ms │
│ Average Time (optimize-for-unique-groups) │   2270.83ms │
│ Queries Faster                            │           4 │
│ Queries Slower                            │           4 │
│ Queries with No Change                    │          35 │
│ Queries with Failure                      │           0 │
└───────────────────────────────────────────┴─────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ optimize-for-unique-groups ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 208.80 ms │                  178.99 ms │ +1.17x faster │
│ QQuery 2     │  31.58 ms │                   27.56 ms │ +1.15x faster │
│ QQuery 3     │  45.11 ms │                   46.61 ms │     no change │
│ QQuery 4     │  27.17 ms │                   27.46 ms │     no change │
│ QQuery 5     │  73.71 ms │                   74.51 ms │     no change │
│ QQuery 6     │  19.59 ms │                   20.09 ms │     no change │
│ QQuery 7     │ 151.52 ms │                  155.61 ms │     no change │
│ QQuery 8     │  31.41 ms │                   33.64 ms │  1.07x slower │
│ QQuery 9     │  83.51 ms │                   86.70 ms │     no change │
│ QQuery 10    │  57.83 ms │                   58.84 ms │     no change │
│ QQuery 11    │  41.13 ms │                   41.17 ms │     no change │
│ QQuery 12    │  52.54 ms │                   50.77 ms │     no change │
│ QQuery 13    │  47.04 ms │                   46.79 ms │     no change │
│ QQuery 14    │  13.80 ms │                   14.10 ms │     no change │
│ QQuery 15    │  24.12 ms │                   24.34 ms │     no change │
│ QQuery 16    │  24.57 ms │                   24.54 ms │     no change │
│ QQuery 17    │ 153.88 ms │                  149.93 ms │     no change │
│ QQuery 18    │ 337.89 ms │                  333.17 ms │     no change │
│ QQuery 19    │  37.47 ms │                   36.62 ms │     no change │
│ QQuery 20    │  49.28 ms │                   48.80 ms │     no change │
│ QQuery 21    │ 223.68 ms │                  227.82 ms │     no change │
│ QQuery 22    │  20.37 ms │                   19.97 ms │     no change │
└──────────────┴───────────┴────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                         ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                         │ 1756.00ms │
│ Total Time (optimize-for-unique-groups)   │ 1728.02ms │
│ Average Time (HEAD)                       │   79.82ms │
│ Average Time (optimize-for-unique-groups) │   78.55ms │
│ Queries Faster                            │         2 │
│ Queries Slower                            │         1 │
│ Queries with No Change                    │        19 │
│ Queries with Failure                      │         0 │
└───────────────────────────────────────────┴───────────┘

@alamb
Contributor

alamb commented Sep 16, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize-for-unique-groups (a052b39) to 986cfcd diff using: clickbench_extended
Results will be posted here when complete

@alamb
Contributor

alamb commented Sep 16, 2025

🤖: Benchmark completed

Details

Comparing HEAD and optimize-for-unique-groups
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ optimize-for-unique-groups ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │  2766.44 ms │                 2697.50 ms │     no change │
│ QQuery 1     │  1440.11 ms │                 1354.09 ms │ +1.06x faster │
│ QQuery 2     │  2574.85 ms │                 2524.95 ms │     no change │
│ QQuery 3     │  1185.21 ms │                 1134.59 ms │     no change │
│ QQuery 4     │  2290.05 ms │                 2271.87 ms │     no change │
│ QQuery 5     │ 27649.57 ms │                27470.27 ms │     no change │
│ QQuery 6     │  4192.77 ms │                 4178.30 ms │     no change │
│ QQuery 7     │  3644.90 ms │                 3554.50 ms │     no change │
└──────────────┴─────────────┴────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                         ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                         │ 45743.90ms │
│ Total Time (optimize-for-unique-groups)   │ 45186.07ms │
│ Average Time (HEAD)                       │  5717.99ms │
│ Average Time (optimize-for-unique-groups) │  5648.26ms │
│ Queries Faster                            │          1 │
│ Queries Slower                            │          0 │
│ Queries with No Change                    │          7 │
│ Queries with Failure                      │          0 │
└───────────────────────────────────────────┴────────────┘

@rluvaton
Member Author

Most of the benefit would come from bytes, but all bytes are treated as views by default when parsing from SQL due to:

so no query would show it unless `map_string_types_to_utf8view` is disabled.

This is done to make sure we don't allocate the indices again when the optimization is not supported.
Comment on lines +91 to +117
    /// Whether this builder supports [`Self::append_array_slice`] optimization
    /// In case it returns true, [`Self::append_array_slice`] must be implemented
    fn support_append_array_slice(&self) -> bool {
        false
    }

    /// Append slice of values from `array`, starting at `start` for `length` rows
    ///
    /// This is a special case of `vectorized_append` when the rows are continuous
    ///
    /// You should implement this to optimize large copies of contiguous values.
    ///
    /// This does not get the sliced array even though it would be more user-friendly
    /// to allow optimization that avoid the additional computation that can happen in a slice
    ///
    /// Note: in order for this to be used, [`Self::support_append_array_slice`] must return true
    fn append_array_slice(
        &mut self,
        _array: &ArrayRef,
        _start: usize,
        _length: usize,
    ) -> Result<()> {
        assert!(
            !self.support_append_array_slice(),
            "support_append_array_slice() return true while append_array_slice() is not implemented"
        );
        not_impl_err!(
            "append_array_slice is not implemented for this GroupColumn, please implement it as well as support_append_array_slice"
        )
    }
Member Author

Originally I did not have support_append_array_slice, and I had this default implementation for append_array_slice:

fn append_array_slice(
    &mut self,
    array: &ArrayRef,
    start: usize,
    length: usize,
) -> Result<()> {
    let rows = (start..start + length).collect::<Vec<_>>();
    self.vectorized_append(array, &rows)
}

but I moved away from this to avoid the extra allocation here, since we already hold append_row_indices.
I saw from the benchmarks that the default implementation can be slower for builders that do not support the optimization.
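
To illustrate the trade-off described above, here is a minimal sketch of the opt-in pattern. It assumes a simplified trait over `&[u64]` instead of Arrow's `ArrayRef`; `GroupColumnSketch`, `PrimitiveBuilder`, `ItemByItemBuilder` and `append_rows` are hypothetical names, not the PR's actual code. The point is that the caller already holds the row indices, so the fallback path needs no new `Vec`.

```rust
/// Simplified stand-in for the `GroupColumn` trait discussed in this thread.
/// Assumption: the real trait takes an Arrow `ArrayRef`; a `&[u64]` keeps
/// this sketch dependency-free.
trait GroupColumnSketch {
    /// Opt-in flag: `true` only when `append_slice` has a bulk implementation.
    fn supports_append_slice(&self) -> bool {
        false
    }

    /// Append the rows selected by `rows`, one at a time.
    fn vectorized_append(&mut self, values: &[u64], rows: &[usize]);

    /// Append `length` rows starting at `start` in a single copy.
    fn append_slice(&mut self, _values: &[u64], _start: usize, _length: usize) {
        unimplemented!("only called when supports_append_slice() returns true")
    }
}

/// Builder with a bulk path (hypothetical).
struct PrimitiveBuilder {
    out: Vec<u64>,
}

impl GroupColumnSketch for PrimitiveBuilder {
    fn supports_append_slice(&self) -> bool {
        true
    }

    fn vectorized_append(&mut self, values: &[u64], rows: &[usize]) {
        self.out.extend(rows.iter().map(|&r| values[r]));
    }

    fn append_slice(&mut self, values: &[u64], start: usize, length: usize) {
        // One contiguous copy instead of `length` indexed lookups.
        self.out.extend_from_slice(&values[start..start + length]);
    }
}

/// Builder without a bulk path (hypothetical; stands in for an unsupported case).
struct ItemByItemBuilder {
    out: Vec<u64>,
}

impl GroupColumnSketch for ItemByItemBuilder {
    fn vectorized_append(&mut self, values: &[u64], rows: &[usize]) {
        self.out.extend(rows.iter().map(|&r| values[r]));
    }
}

/// Caller-side dispatch: the fallback reuses the `rows` buffer that already
/// exists, so unsupported builders pay no extra allocation.
fn append_rows(
    col: &mut dyn GroupColumnSketch,
    values: &[u64],
    rows: &[usize],
    consecutive: bool,
) {
    if consecutive && !rows.is_empty() && col.supports_append_slice() {
        col.append_slice(values, rows[0], rows.len());
    } else {
        col.vectorized_append(values, rows);
    }
}
```

Both paths must produce the same output; only the copy strategy differs.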

@alamb
Contributor

alamb commented Sep 17, 2025

I ran out of time for a thorough review today -- I need to do this one with a clean 🧠 in the morning

@alamb
Contributor

alamb commented Sep 18, 2025

Second run shows no real change for Query 22 🤔

│ QQuery 22 │ 4106.57 ms │ 3974.35 ms │ no change │

@alamb
Contributor

alamb commented Sep 18, 2025

> I see that the original PR that added it run clickbench but I don't know if clickbench have a lot of unique groups:

I tried one query that has a large number of groups:

SELECT "UserID", "SearchPhrase", COUNT(*)
FROM '/Users/andrewlamb/Software/datafusion/benchmarks/data/hits.parquet'
GROUP BY "UserID", "SearchPhrase"
ORDER BY COUNT(*) DESC
LIMIT 10;

and it didn't seem to show any difference

Contributor

@alamb alamb left a comment

Thank you @rluvaton

I reviewed this code carefully and it makes a lot of sense to me. In general I think it is almost ready to merge.

I also verified that the new fuzz test covers the newly added code

The only thing I think is required is some sort of benchmark that shows this actually improves performance in some case (there are more guidelines here):

In general, the performance improvement from a change should be “enough” to justify any added code complexity. How much is “enough” is a judgement made by the committers, but generally means that the improvement should be noticeable in a real-world scenario and is greater than the noise of the benchmarking system.

Do you have any ideas / ways to create one? Perhaps a SELECT DISTINCT query with some strings 🤔 ?

/// The `vectorized append` row indices buffer
append_row_indices: Vec<usize>,

/// If all the values in `append_row_indices` are consecutive
Contributor

Suggested change
- /// If all the values in `append_row_indices` are consecutive
+ /// If all the values in `append_row_indices` are consecutive.
+ /// This is updated by [`Self::add_append_row_index`]

col,
&self.vectorized_operation_buffers.append_row_indices,
)?;
if self
Contributor

I suggest putting this check into a function with a descriptive name, something like

if let Some((start, length)) = self.consecutive_row_indices() {
    ...
} else {
    ...
}

I think that would make the intent clearer

But this is not necessary for this PR
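
As a rough sketch of what such a helper could look like: the version below is a free function over a slice rather than a method on the real buffers struct, and its name and signature are assumptions taken from the suggestion above, not code from the PR.

```rust
/// Returns `Some((start, length))` when `rows` is a non-empty run of
/// consecutive indices, and `None` otherwise.
/// Hypothetical helper; in the PR this would live on the struct that owns
/// `append_row_indices`.
fn consecutive_row_indices(rows: &[usize]) -> Option<(usize, usize)> {
    // An empty buffer has nothing to copy, so report `None`.
    let first = *rows.first()?;
    rows.windows(2)
        .all(|w| w[0] + 1 == w[1])
        .then_some((first, rows.len()))
}
```

Returning the `(start, length)` pair directly lets the caller pass both values straight to the slice-append path without re-reading the buffer.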

@alamb
Contributor

alamb commented Oct 3, 2025

Any update on the benchmark, or on a query showing that this improves performance?

@rluvaton
Member Author

rluvaton commented Oct 8, 2025

> Any update on the benchmark, or on a query showing that this improves performance?

Sorry, we are in a holiday season. Creating one now.

self.remaining_row_indices.clear();
}

fn add_append_row_index(&mut self, row: usize) {
Contributor

It would probably be faster and somewhat cleaner to do this in a single check on append_row_indices in vectorized_append,

something like:

append_row_indices.windows(2).all(|w| w[0] + 1 == w[1])

will do.

Contributor

I think even better, it should also be possible to just check the length of append_row_indices. If this is equal to the size of the incoming batch, all of the indices should be consecutive already (i.e. all values are unique), so it becomes a really cheap check.

Member Author

> I think even better, it should also be possible to just check the length of append_row_indices. If this is equal to the size of the incoming batch, all of the indices should be consecutive already (i.e. all values are unique), so it becomes a really cheap check.

But what if we only need to add the last half of the column? Then the length would not equal the batch size, but the indices are still consecutive.
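
A small sketch of that case, assuming a hypothetical `AppendIndices` buffer that tracks consecutiveness as each index is added (in the spirit of `add_append_row_index`, but not the PR's actual struct). The length comparison is cheaper, but it only detects the "every row is new" case.

```rust
/// Hypothetical buffer mirroring the `append_row_indices` bookkeeping.
struct AppendIndices {
    rows: Vec<usize>,
    /// True while every pushed index is exactly one more than the previous.
    all_consecutive: bool,
}

impl AppendIndices {
    fn new() -> Self {
        Self {
            rows: Vec::new(),
            all_consecutive: true,
        }
    }

    /// Record a row index, updating the flag incrementally.
    fn add(&mut self, row: usize) {
        if let Some(&last) = self.rows.last() {
            self.all_consecutive &= last + 1 == row;
        }
        self.rows.push(row);
    }

    /// The cheaper check from the suggestion: true only when every row of
    /// the batch is being appended.
    fn matches_batch_len(&self, batch_len: usize) -> bool {
        self.rows.len() == batch_len
    }
}
```

For a batch of 8 rows where only rows 4 through 7 are new groups, `all_consecutive` stays true while `matches_batch_len(8)` is false, so the length check alone would miss that run.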

Also added a benchmark for testing pure grouping performance for more than 1 column.

----

I ran this query on the data:
```
SELECT
    COUNT(*) AS total_count,
    COUNT(DISTINCT u64_wide) AS unique_count,
    COUNT(DISTINCT u64_wide) * 1.0 / COUNT(*) AS cardinality
FROM t;
```

Before:
```
| total_count | unique_count | cardinality |
| ----------- | ------------ | ----------- |
|    65536    |    2048      |   0.03125   |
```

After:
```
| total_count | unique_count | cardinality |
| ----------- | ------------ | ----------- |
|    65536    |    65536     |     1.0     |
```
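
For reference, the cardinality column above is just distinct count divided by total count: 2048 / 65536 = 0.03125 before, and 65536 / 65536 = 1.0 after. A minimal sketch of the same computation follows; the `cardinality` function is illustrative and says nothing about how the benchmark data is actually generated.

```rust
use std::collections::HashSet;

/// Fraction of distinct values in `values` (distinct count / total count).
/// Note: returns NaN for an empty slice, since that is 0.0 / 0.0.
fn cardinality(values: &[u64]) -> f64 {
    let unique: HashSet<u64> = values.iter().copied().collect();
    unique.len() as f64 / values.len() as f64
}
```

A cardinality of 1.0 means every row is its own group, which is the case the consecutive-indices fast path targets.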

Labels

core Core DataFusion crate performance Make DataFusion faster physical-plan Changes to the physical-plan crate
