# Add StructArray and RunArray benchmark tests to with_hashes #20182

adriangb merged 6 commits into apache:main

## Conversation
```rust
        do_hash_test(b, &arrays);
    });

    // Union arrays can't have null bitmasks
```
Mentioning union array when we don't implement that here?
I've copied that from the other PR verbatim 😅 (to avoid merge conflicts in the future?), but I'm getting a sense that it's the wrong approach here!
```rust
    .clone()
    .into_data()
    .into_builder()
    .nulls(Some(create_null_mask(values.len())))
```
Something to think about is how null density acts differently here for run arrays, since we'd apply null on entire runs 🤔
I was thinking about it for a while. It should probably come out around the same 3% zone, even though the variance could be a bit high. I've set the run_length to be within 1..50.
Let's say we have ~300 runs on average, with each one carrying ~25 elements. Nulling 3% of the runs (~10 of them) roughly translates to 10 * 25 = 250 null elements, which is about 3% of the total. But yes, that is probably our ideal scenario.
Let me know what you think? I'll try to do some testing regarding this.
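The back-of-envelope math above can be sanity-checked with a small stdlib-only simulation (hypothetical helper, not part of the benchmark: run lengths drawn uniformly from 1..50 via a simple LCG, with ~3% of whole runs nulled):

```rust
/// Null `run_null_pct`% of runs and report the resulting element-level
/// null fraction. Because a nulled run covers ~25 elements at once, the
/// per-seed spread is wider than for element-wise null masks, but the
/// expected density stays near `run_null_pct`%.
fn simulated_null_fraction(num_runs: u64, run_null_pct: u64, mut seed: u64) -> f64 {
    // Small LCG; deterministic and dependency-free (not for real benchmarks).
    let mut next = move || {
        seed = seed
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        seed >> 33
    };

    let (mut total_elems, mut null_elems) = (0u64, 0u64);
    for _ in 0..num_runs {
        let run_len = next() % 49 + 1; // run lengths in 1..50, mean ~25
        total_elems += run_len;
        if next() % 100 < run_null_pct {
            // Nulling a run nulls every element it covers at once.
            null_elems += run_len;
        }
    }
    null_elems as f64 / total_elems as f64
}

fn main() {
    // ~300 runs, 3% of runs nulled: density hovers near 0.03 but varies by seed.
    for seed in 0..5 {
        println!("seed {seed}: {:.3}", simulated_null_fraction(300, 3, seed));
    }
}
```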
```rust
    )
}

fn string_array(array_len: usize) -> ArrayRef {
```
Do we need this if we already have StringPool above?
Done! I don't think a different one offers any benefit. Both seem to give me close to a 10% speedup locally (with the struct_array optimization).
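For context, a benchmark string generator like the one discussed usually just cycles through a small pool of distinct values, so cardinality stays fixed while the array grows. A minimal stdlib sketch (the `StringPool` name and shape here are assumptions, not the actual helper in `with_hashes.rs`):

```rust
/// Hypothetical sketch of a reusable string pool for benchmark data;
/// the real helper in with_hashes.rs may differ.
struct StringPool {
    values: Vec<String>,
}

impl StringPool {
    fn new(cardinality: usize) -> Self {
        // Fixed set of distinct values, e.g. value_0000..value_0015.
        let values = (0..cardinality).map(|i| format!("value_{i:04}")).collect();
        Self { values }
    }

    /// Produce `len` strings by cycling through the pool: a longer array
    /// has more rows but the same distinct-value count.
    fn take(&self, len: usize) -> Vec<&str> {
        (0..len)
            .map(|i| self.values[i % self.values.len()].as_str())
            .collect()
    }
}

fn main() {
    let pool = StringPool::new(16);
    let data = pool.take(100);
    println!("first: {}, last: {}", data[0], data[99]);
    // prints "first: value_0000, last: value_0003"
}
```

Keeping one shared pool (rather than a second generator like `string_array`) means every benchmark hashes strings with the same distribution, which makes the numbers comparable.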
|
run benchmark with_hashes
|
🤖: Benchmark completed
|
Add StructArray and RunArray benchmark tests to with_hashes (apache#20182)

## Which issue does this PR close?

- Closes apache#20181

## Rationale for this change

Issue apache#20152 shows some areas of optimization for `RunArray` and `StructArray` hashing, but the existing `with_hashes` benchmark tests don't include coverage for these!

## What changes are included in this PR?

Added benchmarks to `with_hashes.rs`:

- **StructArray**: 4-column struct (bool, int32, int64, string)
- **RunArray**: Int32 run-encoded array
- Both include single/multiple columns and with/without nulls

## Are these changes tested?

No additional tests added, but the benchmarks both compile and run.

<details>
<summary>a sample run:</summary>

```
❯ cargo bench --features=parquet --bench with_hashes -- array
   Compiling datafusion-common v52.1.0 (/Users/notashes/dev/datafusion/datafusion/common)
    Finished `bench` profile [optimized] target(s) in 34.49s
     Running benches/with_hashes.rs (target/release/deps/with_hashes-2f180744d22084f3)
Gnuplot not found, using plotters backend

struct_array: single, no nulls
                        time:   [38.389 µs 38.437 µs 38.485 µs]
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild

struct_array: single, nulls
                        time:   [46.108 µs 46.197 µs 46.291 µs]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

struct_array: multiple, no nulls
                        time:   [114.64 µs 114.79 µs 114.93 µs]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild

struct_array: multiple, nulls
                        time:   [138.29 µs 138.62 µs 139.07 µs]
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

run_array_int32: single, no nulls
                        time:   [1.8777 µs 1.9098 µs 1.9457 µs]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

run_array_int32: single, nulls
                        time:   [2.0110 µs 2.0417 µs 2.0751 µs]
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe

run_array_int32: multiple, no nulls
                        time:   [5.0511 µs 5.0603 µs 5.0693 µs]
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild

run_array_int32: multiple, nulls
                        time:   [5.6052 µs 5.6201 µs 5.6353 µs]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
```

</details>

## Are there any user-facing changes?
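The StructArray benchmarks above stress row-wise hash combination across child columns. Conceptually, the hot loop looks like this stdlib-only sketch (a simplification for illustration, not DataFusion's actual `create_hashes` implementation, which uses ahash and arrow buffers and handles null bitmaps):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Fold one column's values into the running per-row hashes.
/// Each row's accumulated hash is combined with that row's value
/// from this column; repeating per column hashes the whole struct row.
fn combine_column<T: Hash>(column: &[T], hashes: &mut [u64]) {
    for (h, v) in hashes.iter_mut().zip(column) {
        let mut hasher = DefaultHasher::new();
        h.hash(&mut hasher); // hash accumulated from earlier columns
        v.hash(&mut hasher); // this column's value for the row
        *h = hasher.finish();
    }
}

fn main() {
    // A 2-column "struct": (bool, i32), three rows.
    let bools = [true, false, true];
    let ints = [1i32, 2, 3];

    let mut hashes = vec![0u64; 3];
    combine_column(&bools, &mut hashes);
    combine_column(&ints, &mut hashes);

    // Rows with different values end up with different combined hashes.
    assert_ne!(hashes[0], hashes[1]);
    println!("{hashes:?}");
}
```

The cost of this loop scales with the number of child columns, which is why the benchmarks split single-column and multiple-column cases.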