chore: Add existence (semi / anti ) benchmarks for hashjoinexec by coderfender · Pull Request #21821 · apache/datafusion

coderfender · 2026-04-24T06:55:20Z

Which issue does this PR close?

Add existence benchmarks

- Test the RightSemi and RightAnti join types
- Use Int32 keys (required for "perf: experiment roaring bitmap for int32 anti and semi joins", #21817)
- Vary density and hit rate to cover different workload patterns

  | Query | Join Type | Density | Hit Rate |
  |-------|-----------|---------|----------|
  | Q16 | RightSemi | 100% | 100% |
  | Q17 | RightSemi | 100% | 10% |
  | Q18 | RightSemi | 50% | 100% |
  | Q19 | RightSemi | 50% | 10% |
  | Q20 | RightSemi | 10% | 100% |
  | Q21 | RightSemi | 10% | 10% |
  | Q22 | RightAnti | 100% | 100% |
  | Q23 | RightAnti | 100% | 10% |
  | Q24 | RightAnti | 50% | 100% |
  | Q25 | RightAnti | 50% | 10% |
  | Q26 | RightAnti | 10% | 100% |
  | Q27 | RightAnti | 10% | 10% |
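
To make the two axes in the table concrete, here is a minimal sketch of how Int32 keys with a target density and hit rate could be generated. It is not the generator used in this PR. The function names `build_keys`, `probe_keys`, and `density` are hypothetical, and it assumes density means distinct key count divided by key range span, and that the density percentage divides 100 evenly (as 100, 50, and 10 do).

```rust
/// Build-side keys: `n` distinct Int32 keys spaced by a fixed stride.
/// ASSUMPTION: density = key count / (max - min + 1), so a stride of 1 gives
/// 100% density, a stride of 2 gives about 50%, and a stride of 10 about 10%.
fn build_keys(n: i32, density_pct: i32) -> Vec<i32> {
    assert!(n > 0 && density_pct > 0 && 100 % density_pct == 0);
    let stride = 100 / density_pct;
    (0..n).map(|i| i * stride).collect()
}

/// Probe-side keys: in every block of 100 rows, the first `hit_pct` rows reuse
/// a build key (a hit) and the rest use keys above the build range (a miss).
fn probe_keys(build: &[i32], m: usize, hit_pct: usize) -> Vec<i32> {
    assert!(!build.is_empty() && hit_pct <= 100);
    let miss_base = *build.iter().max().unwrap() + 1;
    (0..m)
        .map(|i| {
            if i % 100 < hit_pct {
                build[i % build.len()]
            } else {
                miss_base + i as i32
            }
        })
        .collect()
}

/// Measured density of a set of distinct keys.
fn density(keys: &[i32]) -> f64 {
    let min = *keys.iter().min().unwrap() as f64;
    let max = *keys.iter().max().unwrap() as f64;
    keys.len() as f64 / (max - min + 1.0)
}
```

Under these assumptions, `build_keys(1000, 50)` paired with `probe_keys(&build, 1000, 10)` would correspond to the 50% density, 10% hit rate rows of the table (Q19 and Q25).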

Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

coderfender · 2026-04-24T06:58:08Z

@Dandandan, @2010YOUY01, please take a look at these benchmarks, which I plan to refer to for the bitmap-based optimizations in #21817. This PR adds a cargo bench (for faster, simpler bench tests through critcmp) along with additional existence tests on the TPCH datasets. I also tried to cover various densities and match rates to replicate as many real-world scenarios as possible.

2010YOUY01 · 2026-04-25T07:10:56Z

Thank you for working on this! I have some suggestions for you to consider.

High-level issue

I think the main issue is using density as a primary axis when evaluating equi-join performance. While it was introduced in #21821 for perfect hash join experiments, it seems it is not a good axis for designing representative benchmarks.

A good benchmark should reflect realistic workloads. To achieve that, we should define a set of core axes and vary them systematically. For equi-joins, I think these could be:

Equi-join benchmark key axes:
- Build/probe side size
- Join type (inner, outer, semi, etc.)
- Number of join keys
- Join key data type
- Probe hit rate
- Fanout (average number of matches per probe key)
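
As an illustration of the last two axes, the following sketch measures probe hit rate and fanout for a given pair of key columns. The function name `probe_stats` is hypothetical, and it assumes fanout is averaged over probe rows that have at least one match; averaging over all probe rows is another plausible reading.

```rust
use std::collections::HashMap;

/// Returns `(hit_rate, fanout)` for a build/probe pair of key columns.
/// hit_rate = fraction of probe rows with at least one build-side match.
/// fanout   = average number of build-side matches per matching probe row
///            (ASSUMPTION: averaged over hits only, not over all probe rows).
fn probe_stats(build: &[i32], probe: &[i32]) -> (f64, f64) {
    assert!(!probe.is_empty(), "probe side must not be empty");

    // Count how many build rows carry each key.
    let mut counts: HashMap<i32, usize> = HashMap::new();
    for &k in build {
        *counts.entry(k).or_insert(0) += 1;
    }

    let mut hits = 0usize;
    let mut matches = 0usize;
    for k in probe {
        if let Some(&c) = counts.get(k) {
            hits += 1;
            matches += c;
        }
    }

    let hit_rate = hits as f64 / probe.len() as f64;
    let fanout = if hits == 0 { 0.0 } else { matches as f64 / hits as f64 };
    (hit_rate, fanout)
}
```

For a build side of `[1, 1, 2, 3]` and a probe side of `[1, 2, 9, 9]`, two of the four probe rows match and together they find three build rows, so the sketch reports a hit rate of 0.5 and a fanout of 1.5.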

In contrast, density (i.e., key count divided by key range span) is not representative of typical workloads. It is primarily useful for evaluating specific fast paths (e.g., perfect hash join), but making it a primary axis complicates the benchmark design and may mislead future optimization efforts.

I believe we should remove density from the key axes in the future. For fast paths like perfect hash join and semi/anti join, we could simply add a few queries where the fast path wins.

For this PR

For this PR, I suggest keeping the end-to-end hj benchmark simple. We don’t need to enumerate all density combinations here—a smaller set of representative queries should be enough to evaluate the optimization.

For the Criterion micro-benchmarks, it would be better to first focus on a few representative workloads (e.g., join size, type), and then optionally add a small number of targeted cases for specific fast paths, such as right semi/anti joins with Int32 keys. Otherwise, the suite would be hard to extend and maintain.

In short, fewer end-to-end queries should be sufficient for this PR. We could add criterion micro-benches later based on the above design.

2010YOUY01 · 2026-04-25T07:18:56Z

+    // RightSemi Join benchmarks with Int32 keys
+    // Q16: RightSemi, 100% Density, 100% Hit rate
+    HashJoinQuery {
+        sql: r###"SELECT l.k


It might be clearer to express these directly using RIGHT SEMI JOIN, for example:

DataFusion CLI v53.1.0
> select count(*) from generate_series(100) as t1(v1) right semi join generate_series(100000) as t2(v1) on t1.v1 > t2.v1;
+----------+
| count(*) |
+----------+
| 100      |
+----------+
1 row(s) fetched.
Elapsed 0.077 seconds.

> select count(*) from generate_series(100) as t1(v1) right anti join generate_series(100000) as t2(v1) on t1.v1 > t2.v1;
+----------+
| count(*) |
+----------+
| 99901    |
+----------+
1 row(s) fetched.
Elapsed 0.007 seconds.

I'm not sure whether this is standard SQL 🤔, but DataFusion supports it and it's easier to read.
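
For readers unfamiliar with these join types, here is a minimal reference sketch of the semantics in plain Rust. It is not DataFusion's implementation: `right_semi` and `right_anti` are hypothetical helper names, NULL handling is ignored, and the quadratic scan is for clarity only. It assumes `generate_series(n)` yields `0..=n` inclusive, which is consistent with the counts in the CLI output above.

```rust
/// RIGHT SEMI: keep each right row for which at least one left row satisfies
/// the join predicate. Each right row appears at most once in the output.
fn right_semi<F: Fn(i64, i64) -> bool>(left: &[i64], right: &[i64], pred: F) -> Vec<i64> {
    right
        .iter()
        .copied()
        .filter(|&r| left.iter().any(|&l| pred(l, r)))
        .collect()
}

/// RIGHT ANTI: keep each right row for which no left row satisfies the
/// join predicate (NULL semantics are not modeled in this sketch).
fn right_anti<F: Fn(i64, i64) -> bool>(left: &[i64], right: &[i64], pred: F) -> Vec<i64> {
    right
        .iter()
        .copied()
        .filter(|&r| !left.iter().any(|&l| pred(l, r)))
        .collect()
}
```

With `t1` as `0..=100`, `t2` as `0..=100_000`, and the predicate `t1.v1 > t2.v1`, the semi join keeps the 100 right rows with values 0 through 99 and the anti join keeps the remaining 99901, matching the two counts shown. Every right row lands in exactly one of the two results, which is why the counts sum to the size of `t2`.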

add_existence_benchmarks_hashjoin

a9d6d68

github-actions Bot added the physical-plan Changes to the physical-plan crate label Apr 24, 2026

add_existence_benchmarks_hashjoin

6eb40b7

coderfender force-pushed the implement_additional_benchmarks_hashjoin branch from 2fbe56f to 6eb40b7 Compare April 24, 2026 07:03

Merge branch 'main' into implement_additional_benchmarks_hashjoin

57cfc3e

coderfender changed the title ~~feat: Add existence (semi / anti ) benchmarks hashjoin~~ chore: Add existence (semi / anti ) benchmarks hashjoin Apr 24, 2026

coderfender changed the title ~~chore: Add existence (semi / anti ) benchmarks hashjoin~~ chore: Add existence (semi / anti ) benchmarks for hashjoinexec Apr 24, 2026

Merge branch 'main' into implement_additional_benchmarks_hashjoin

ad83914

2010YOUY01 reviewed Apr 25, 2026

coderfender wants to merge 4 commits into apache:main from coderfender:implement_additional_benchmarks_hashjoin
