chore: Add existence (semi / anti ) benchmarks for hashjoinexec #21821

Open
coderfender wants to merge 4 commits into apache:main from coderfender:implement_additional_benchmarks_hashjoin

Conversation

@coderfender
Contributor

@coderfender coderfender commented Apr 24, 2026

Which issue does this PR close?

Add existence benchmarks

  1. Test RightSemi and RightAnti join types
  2. Use Int32 keys (required for #21817, "perf: experiment roaring bitmap for int32 anti and semi joins")
  3. Vary density and hit rate to cover different workload patterns
  | Query | Join Type | Density | Hit Rate |
  |-------|-----------|---------|----------|
  | Q16 | RightSemi | 100% | 100% |
  | Q17 | RightSemi | 100% | 10% |
  | Q18 | RightSemi | 50% | 100% |
  | Q19 | RightSemi | 50% | 10% |
  | Q20 | RightSemi | 10% | 100% |
  | Q21 | RightSemi | 10% | 10% |
  | Q22 | RightAnti | 100% | 100% |
  | Q23 | RightAnti | 100% | 10% |
  | Q24 | RightAnti | 50% | 100% |
  | Q25 | RightAnti | 50% | 10% |
  | Q26 | RightAnti | 10% | 100% |
  | Q27 | RightAnti | 10% | 10% |

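A minimal sketch of how the two axes in the table above could be turned into key columns. It is not the generator used in this PR: the function names are hypothetical, density is read as key count divided by key range span (which matches the percentages in the table), and misses are modeled as negative keys that can never appear on the build side.

```rust
use std::collections::HashSet;

/// Build-side keys: `n` distinct non-negative Int32 keys.
/// `density_pct` is read as (key count / key range span) * 100, so
/// 100 gives stride 1, 50 gives stride 2, and 10 gives stride 10.
fn build_keys(n: usize, density_pct: u32) -> Vec<i32> {
    assert!((1..=100).contains(&density_pct), "density must be in 1..=100");
    // Integer division: exact for the 100 / 50 / 10 values in the table.
    let stride = (100 / density_pct) as i32;
    (0..n as i32).map(|i| i * stride).collect()
}

/// Probe-side keys: out of every 100 consecutive rows, the first `hit_pct`
/// reuse a build key and the rest use a negative key that cannot match.
fn probe_keys(n: usize, build: &[i32], hit_pct: u32) -> Vec<i32> {
    assert!(hit_pct <= 100 && !build.is_empty());
    (0..n)
        .map(|i| {
            if (i % 100) < hit_pct as usize {
                build[i % build.len()]
            } else {
                -1 - (i as i32)
            }
        })
        .collect()
}

/// Measured density, assuming the keys are distinct:
/// key count divided by the span of the key range.
fn density(keys: &[i32]) -> f64 {
    let min = *keys.iter().min().unwrap() as f64;
    let max = *keys.iter().max().unwrap() as f64;
    keys.len() as f64 / (max - min + 1.0)
}

/// Measured hit rate: fraction of probe keys present on the build side.
fn hit_rate(build: &[i32], probe: &[i32]) -> f64 {
    let set: HashSet<i32> = build.iter().copied().collect();
    probe.iter().filter(|k| set.contains(*k)).count() as f64 / probe.len() as f64
}
```

For example, `build_keys(1000, 50)` paired with `probe_keys(1000, &build, 10)` would correspond to the Q19 / Q25 row shape (50% density, 10% hit rate) under these assumptions.
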
  • Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions Bot added the physical-plan Changes to the physical-plan crate label Apr 24, 2026
@coderfender
Contributor Author

@Dandandan, @2010YOUY01, please take a look at these benchmarks, which I plan to use as a reference for the bitmap-based optimizations in #21817. This PR essentially adds a cargo bench (for faster, simpler bench tests through critcmp) along with additional existence tests on TPCH datasets. I also tried adding various densities and match rates to replicate as many real-world scenarios as possible.

@coderfender coderfender force-pushed the implement_additional_benchmarks_hashjoin branch from 2fbe56f to 6eb40b7 Compare April 24, 2026 07:03
@coderfender coderfender changed the title feat: Add existence (semi / anti ) benchmarks hashjoin chore: Add existence (semi / anti ) benchmarks hashjoin Apr 24, 2026
@coderfender coderfender changed the title chore: Add existence (semi / anti ) benchmarks hashjoin chore: Add existence (semi / anti ) benchmarks for hashjoinexec Apr 24, 2026
@2010YOUY01
Contributor

Thank you for working on this! I have some suggestions for you to consider.

High-level issue

I think the main issue is using density as a primary axis when evaluating equi-join performance. While it was introduced in #21821 for perfect hash join experiments, it seems it is not a good axis for designing representative benchmarks.

A good benchmark should reflect realistic workloads. To achieve that, we should define a set of core axes and vary them systematically. For equi-joins, I think these could be:

Equi-join benchmark key axes:
- Build/probe side size
- Join type (inner, outer, semi, etc.)
- Number of join keys
- Join key data type
- Probe hit rate
- Fanout (average number of matches per probe key)

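One way the axes listed above could be written down as data, so that a benchmark suite varies a few of them while holding the rest fixed. This is only a sketch: `EquiJoinCase`, `JoinType`, `KeyType`, and `enumerate_cases` are hypothetical names for illustration and do not exist in the benchmark code.

```rust
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum JoinType {
    Inner,
    Left,
    RightSemi,
    RightAnti,
}

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum KeyType {
    Int32,
    Int64,
    Utf8,
}

/// One benchmark case: a single point in the space spanned by the axes.
#[derive(Debug, Clone, PartialEq)]
struct EquiJoinCase {
    build_rows: usize,
    probe_rows: usize,
    join_type: JoinType,
    num_keys: usize,
    key_type: KeyType,
    /// Fraction of probe rows that find at least one match, in 0.0..=1.0.
    probe_hit_rate: f64,
    /// Average number of build-side matches per matching probe key.
    fanout: f64,
}

/// Vary join type, hit rate, and fanout around a fixed baseline case.
/// All other axes are copied from `base` unchanged.
fn enumerate_cases(
    base: &EquiJoinCase,
    join_types: &[JoinType],
    hit_rates: &[f64],
    fanouts: &[f64],
) -> Vec<EquiJoinCase> {
    let mut cases = Vec::new();
    for &join_type in join_types {
        for &probe_hit_rate in hit_rates {
            for &fanout in fanouts {
                cases.push(EquiJoinCase {
                    join_type,
                    probe_hit_rate,
                    fanout,
                    ..base.clone()
                });
            }
        }
    }
    cases
}
```

Keeping the grid explicit like this makes the size of the benchmark matrix visible: two join types, two hit rates, and one fanout give four cases.
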
In contrast, density (i.e., key count divided by key range span) is not representative of typical workloads. It is primarily useful for evaluating specific fast paths (e.g., perfect hash join), but making it a primary axis complicates the benchmark design and may mislead future optimization efforts.

I believe we should remove density from the key axes in the future. For fast paths such as perfect hash join and semi/anti join, we could simply add a few queries where the fast path wins.
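
To illustrate why density only matters for such fast paths, here is a hedged sketch of a dense-range existence check for Int32 keys. It is not DataFusion's implementation and not the roaring-bitmap approach from #21817: `DenseKeySet`, `right_semi`, and `right_anti` are hypothetical names, a plain `Vec<u64>` bitset stands in for a real bitmap, and the build side is assumed to be the left input.

```rust
/// Bitset over the key range [min, max]: one bit per possible key.
/// Memory grows with the range span, not the key count, so a sparse
/// (low-density) key set wastes most of the bits.
struct DenseKeySet {
    min: i32,
    bits: Vec<u64>,
}

impl DenseKeySet {
    /// Returns `None` for an empty build side.
    fn from_keys(keys: &[i32]) -> Option<Self> {
        let min = *keys.iter().min()?;
        let max = *keys.iter().max()?;
        // Widen to i64 so the span cannot overflow for extreme i32 keys.
        let span = (max as i64 - min as i64 + 1) as usize;
        let mut bits = vec![0u64; (span + 63) / 64];
        for &k in keys {
            let off = (k as i64 - min as i64) as usize;
            bits[off / 64] |= 1u64 << (off % 64);
        }
        Some(Self { min, bits })
    }

    fn contains(&self, key: i32) -> bool {
        let off = key as i64 - self.min as i64;
        if off < 0 {
            return false;
        }
        let off = off as usize;
        match self.bits.get(off / 64) {
            Some(word) => ((*word >> (off % 64)) & 1) == 1,
            None => false, // key lies above the build-side maximum
        }
    }
}

/// Right semi: keep probe keys that exist on the build side.
fn right_semi(build: &DenseKeySet, probe: &[i32]) -> Vec<i32> {
    probe.iter().copied().filter(|&k| build.contains(k)).collect()
}

/// Right anti: keep probe keys that do not exist on the build side.
fn right_anti(build: &DenseKeySet, probe: &[i32]) -> Vec<i32> {
    probe.iter().copied().filter(|&k| !build.contains(k)).collect()
}
```

With build keys `[0, 2, 4, 6]` and probe keys `[0, 1, 2, 7, -3, 6]`, the semi variant keeps `[0, 2, 6]` and the anti variant keeps `[1, 7, -3]`.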

For this PR

For this PR, I suggest keeping the end-to-end hj benchmark simple. We don’t need to enumerate all density combinations here—a smaller set of representative queries should be enough to evaluate the optimization.

For the Criterion micro-benchmarks, it would be better to first focus on a few representative workloads (e.g., join size, type), and then optionally add a small number of targeted cases for specific fast paths, such as right semi/anti joins with Int32 keys. Otherwise, the benchmarks would be hard to extend and maintain.

In short, fewer end-to-end queries should be sufficient for this PR. We could add criterion micro-benches later based on the above design.

Comment thread benchmarks/src/hj.rs
// RightSemi Join benchmarks with Int32 keys
// Q16: RightSemi, 100% Density, 100% Hit rate
HashJoinQuery {
sql: r###"SELECT l.k
Contributor

It might be clearer to express these directly using RIGHT SEMI JOIN, for example:

DataFusion CLI v53.1.0
> select count(*)
from generate_series(100) as t1(v1)
right semi join generate_series(100000) as t2(v1)
on t1.v1 > t2.v1;
+----------+
| count(*) |
+----------+
| 100      |
+----------+
1 row(s) fetched.
Elapsed 0.077 seconds.

> select count(*)
from generate_series(100) as t1(v1)
right anti join generate_series(100000) as t2(v1)
on t1.v1 > t2.v1;
+----------+
| count(*) |
+----------+
| 99901    |
+----------+
1 row(s) fetched.
Elapsed 0.007 seconds.

I'm not sure whether this is standard SQL 🤔, but DataFusion supports these join types and they are easier to read.
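
The semantics of the two queries above can be sketched as a reference implementation: a right semi join keeps each right-side row that has at least one matching left-side row, and a right anti join keeps the right-side rows that have none. The function names below are hypothetical, and the sketch assumes that `generate_series(n)` yields the integers 0 through n inclusive, which is consistent with the counts in the CLI output.

```rust
/// Count right-side rows for which some left-side row satisfies `pred(l, r)`.
fn right_semi_count<F>(left: &[i64], right: &[i64], pred: F) -> usize
where
    F: Fn(i64, i64) -> bool,
{
    right
        .iter()
        .filter(|&&r| left.iter().any(|&l| pred(l, r)))
        .count()
}

/// Count right-side rows for which no left-side row satisfies `pred(l, r)`.
fn right_anti_count<F>(left: &[i64], right: &[i64], pred: F) -> usize
where
    F: Fn(i64, i64) -> bool,
{
    right
        .iter()
        .filter(|&&r| !left.iter().any(|&l| pred(l, r)))
        .count()
}

/// Inclusive series 0..=n, mirroring the assumed generate_series(n) behavior.
fn series(n: i64) -> Vec<i64> {
    (0..=n).collect()
}
```

With `t1 = series(100)`, `t2 = series(100_000)`, and the predicate `l > r`, the semi count is 100 and the anti count is 99901, matching the two results shown above. Every right-side row lands in exactly one of the two results, so the counts always sum to the right-side row count.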

Labels

physical-plan Changes to the physical-plan crate

2 participants